
Showing papers on "Edit distance published in 2000"


Proceedings Article
01 May 2000
TL;DR: This paper defines evaluation criteria which are more adequate than pure edit distance and describes how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using this tool and the corresponding graphical user interface.
Abstract: In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface.

318 citations


Journal ArticleDOI
TL;DR: The paper develops the idea of edit-distance originally introduced for graph-matching by Sanfeliu and Fu (1983), and shows how the Levenshtein distance (1966) can be used to model the probability distribution for structural errors in the graph-matching problem.
Abstract: This paper describes a novel framework for comparing and matching corrupted relational graphs. The paper develops the idea of edit-distance originally introduced for graph-matching by Sanfeliu and Fu (1983). We show how the Levenshtein distance (1966) can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare the resulting graph-matching algorithm with that recently reported by Wilson and Hancock. The use of edit-distance offers an elegant alternative to the exhaustive compilation of label dictionaries. Moreover, the method is polynomial rather than exponential in its worst-case complexity. We support our approach with an experimental study on synthetic data and illustrate its effectiveness on an uncalibrated stereo correspondence problem. This demonstrates experimentally that the gain in efficiency is not at the expense of quality of match.

157 citations


Proceedings ArticleDOI
01 Feb 2000
TL;DR: The goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations.
Abstract: We have two users, A and B, who hold documents x and y respectively. Neither of the users has any information about the other's document. They exchange messages so that B computes x; it may be required that A compute y as well. Our goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how one measures the distance between x and y. We consider several metrics for measuring this distance, namely the Hamming metric, the Levenshtein metric (edit distance), and a new LZ metric, which is introduced in this paper. We show how to estimate the distance between x and y using a single message of logarithmic size. For each metric, we present the first communication-efficient protocols, which often match the corresponding lower bounds. These protocols yield error-correcting codes for these error models which correct up to d errors in n characters using O(d log n) bits. Our most interesting methods use a new histogram transformation that we introduce to convert edit distance to L1 distance.

137 citations
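The histogram transformation itself is not spelled out in the abstract; the following is only a minimal illustration, in Python with made-up inputs, of the general idea that character-count histograms relate edit distance to an L1 comparison: each insertion, deletion, or substitution changes the histogram by at most 2 in L1, so half the histogram distance is a cheap lower bound.

    from collections import Counter

    def histogram_l1(x: str, y: str) -> int:
        """L1 distance between the character-count histograms of x and y."""
        hx, hy = Counter(x), Counter(y)
        return sum(abs(hx[c] - hy[c]) for c in set(hx) | set(hy))

    def edit_distance_lower_bound(x: str, y: str) -> int:
        # Each edit operation changes the histogram L1 distance by at most 2,
        # so ceil(L1 / 2) never exceeds the true edit distance.
        return (histogram_l1(x, y) + 1) // 2

    print(edit_distance_lower_bound("kitten", "sitting"))  # 3 (equals the true distance here)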


Proceedings ArticleDOI
31 Jul 2000
TL;DR: A new type of grammar learning algorithm, inspired by string edit distance, that takes a corpus of flat sentences as input and returns a Corpus of labelled, bracketed sentences that works on pairs of unstructured sentences.
Abstract: This paper introduces a new type of grammar learning algorithm, inspired by string edit distance (Wagner and Fischer, 1974). The algorithm takes a corpus of flat sentences as input and returns a corpus of labelled, bracketed sentences. The method works on pairs of unstructured sentences that have one or more words in common. When two sentences are divided into parts that are the same in both sentences and parts that are different, this information is used to find parts that are interchangeable. These parts are taken as possible constituents of the same type. After this alignment learning step, the selection learning step selects the most probable constituents from all possible constituents.This method was used to bootstrap structure on the ATIS corpus (Marcus et. al., 1993) and on the OVIS! corpus (Bonnema et al., 1997). While the results are encouraging (we obtained up to 89.25% non-crossing brackets precision), this paper will point out some of the shortcomings of our approach and will suggest possible solutions.

114 citations
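As a toy illustration of the alignment step, using Python's difflib rather than the Wagner-Fischer alignment the paper builds on, and with invented example sentences: the spans where two sentences differ while their surroundings match are exactly the parts the method would hypothesize as interchangeable constituents.

    from difflib import SequenceMatcher

    def interchangeable_parts(sent1: str, sent2: str):
        """Align two tokenized sentences and return the paired spans where they
        differ; alignment-based learning treats these as candidate constituents."""
        a, b = sent1.split(), sent2.split()
        pairs = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
            if tag != "equal":            # the unequal parts line up with each other
                pairs.append((a[i1:i2], b[j1:j2]))
        return pairs

    print(interchangeable_parts("show me flights from Boston to Dallas",
                                "show me flights from Denver to Atlanta"))
    # [(['Boston'], ['Denver']), (['Dallas'], ['Atlanta'])]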


Journal ArticleDOI
TL;DR: It is shown that, in contrast to regular text, it makes a difference whether the errors occur in the hypertext or in the pattern, and a much simpler algorithm achieving the same complexity, which runs on any hypertext graph, is presented.

60 citations


Journal ArticleDOI
TL;DR: The usefulness of features derived from interval coding is demonstrated in a hidden Markov model based page layout classification system that is trainable and extendible.
Abstract: This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.

59 citations
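The comparison step reduces to an L1 (Manhattan) computation over fixed-length vectors. A minimal sketch, with invented vectors standing in for the real interval-encoding features:

    def manhattan(u, v):
        """Manhattan (L1) distance between two fixed-length feature vectors."""
        return sum(abs(a - b) for a, b in zip(u, v))

    def rank_by_layout(query_vec, pages):
        """Order (name, vector) pairs by layout similarity to the query page."""
        return sorted(pages, key=lambda page: manhattan(query_vec, page[1]))

    pages = [("invoice", [3, 0, 2, 5]), ("letter", [1, 1, 0, 2]), ("form", [4, 0, 3, 5])]
    print([name for name, _ in rank_by_layout([3, 0, 3, 5], pages)])
    # ['invoice', 'form', 'letter']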


Book ChapterDOI
31 May 2000
TL;DR: This work presents an approach to automatically create wrappers by means of an incremental grammar induction algorithm that uses an adaptation of the string edit distance.
Abstract: To facilitate effective search on the World Wide Web, meta search engines have been developed which do not search the Web themselves, but use available search engines to find the required information. By means of wrappers, meta search engines retrieve information from the pages returned by search engines. We present an approach to automatically create such wrappers by means of an incremental grammar induction algorithm. The algorithm uses an adaptation of the string edit distance. Our method performs well; it is quick, can be used for several types of result pages and requires a minimal amount of user interaction.

45 citations


Proceedings ArticleDOI
01 Sep 2000
TL;DR: Experimental results with synthetic cyclic strings and a handwritten digits recognition task show that the new algorithm is faster than Maes' and Gregor and Thomason's (1993) algorithms.
Abstract: A new algorithm to compute the edit distance between cyclic strings is presented. Experimental results with synthetic cyclic strings and a handwritten digits recognition task show that the new algorithm is faster than Maes' (1990) and Gregor and Thomason's (1993) algorithms.

39 citations
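The baseline that such algorithms improve on is simply minimizing the ordinary edit distance over all rotations of one of the strings. A brute-force reference version (a sketch, not the algorithm from the paper):

    def edit_distance(a: str, b: str) -> int:
        """Standard Wagner-Fischer dynamic program, one row at a time."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution / match
            prev = curr
        return prev[-1]

    def cyclic_edit_distance(a: str, b: str) -> int:
        """Minimum edit distance over all rotations of a, O(|a|^2 * |b|) overall."""
        if not a:
            return len(b)
        return min(edit_distance(a[i:] + a[:i], b) for i in range(len(a)))

    print(cyclic_edit_distance("abcde", "cdeab"))  # 0: the strings are rotations of each other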


01 Jan 2000
TL;DR: This work gives provably faster algorithms for normalized edit distance computation than the existing algorithms with proven complexity bounds: one for the case where the cost function is uniform, i.e., the weights of edit operations depend only on the type but not on the individual symbols involved, and one for the case where the weights are rational.
Abstract: A common model for computing the similarity of two strings X and Y, of lengths m and n respectively with m ≤ n, is to transform X into Y through a sequence of edit operations, called an edit sequence. The edit operations are of three types: insertion, deletion, and substitution. A given cost function assigns a weight to each edit operation. The amortized weight of a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst case. We give provably faster algorithms: one for the case where the cost function is uniform, i.e., the weights of edit operations depend only on their type but not on the individual symbols involved, and one for the case where the weights are rational.

34 citations
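Restating the definition from the abstract in symbols: if S(X, Y) denotes the set of edit sequences transforming X into Y, W(S) the total weight of a sequence S, and L(S) its number of operations, the quantity computed is

    \[
      d_{\mathrm{NED}}(X, Y) \;=\; \min_{S \in \mathcal{S}(X, Y)} \frac{W(S)}{L(S)} .
    \]

The minimum is in general not attained by a minimum-weight edit sequence, so the value cannot be obtained by computing the ordinary edit distance first and dividing afterwards; this is what makes the normalized problem harder than plain edit distance computation.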


Proceedings ArticleDOI
21 Dec 2000
TL;DR: Of the five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
Abstract: Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.

30 citations
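The "edit distance with a probabilistic substitution matrix" entry in the comparison can be sketched as a weighted dynamic program in which a substitution costs the negative log of its confusion probability. The probabilities below are invented placeholders, not values from the study:

    import math

    # Hypothetical OCR confusion probabilities; a real system would estimate
    # them from aligned OCR output and ground truth.
    P_SUB = {("1", "l"): 0.2, ("l", "1"): 0.2, ("0", "O"): 0.15, ("O", "0"): 0.15}
    P_RARE = 0.01                     # default probability for unlisted confusions
    INDEL_COST = -math.log(0.005)     # hypothetical insertion/deletion cost

    def sub_cost(a: str, b: str) -> float:
        return 0.0 if a == b else -math.log(P_SUB.get((a, b), P_RARE))

    def weighted_edit_distance(x: str, y: str) -> float:
        prev = [j * INDEL_COST for j in range(len(y) + 1)]
        for i, cx in enumerate(x, 1):
            curr = [i * INDEL_COST]
            for j, cy in enumerate(y, 1):
                curr.append(min(prev[j] + INDEL_COST,
                                curr[j - 1] + INDEL_COST,
                                prev[j - 1] + sub_cost(cx, cy)))
            prev = curr
        return prev[-1]

    # The OCR output "ce11" matches "cell" more cheaply than "celt" under these costs.
    print(weighted_edit_distance("ce11", "cell") < weighted_edit_distance("ce11", "celt"))  # True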


Patent
Key-Sun Choi, Byung-Ju Kang
17 Jan 2000
Abstract: A phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words. A system manager defines character element transformation patterns that can occur between phonetic transcriptions derived from the same foreign language. A system generates new phonetic transcriptions according to the defined character element transformation patterns and assigns a demerit mark to each of the generated phonetic transcriptions according to a phonetic distance. A minimum phonetic distance between each of the generated phonetic transcriptions and a given phonetic transcription is calculated on the basis of a minimum edit distance calculation method. The generated phonetic transcription with the smallest of the calculated minimum phonetic distances is determined to be most similar to the given phonetic transcription. Therefore, a document retrieval operation can be performed accurately in a document retrieval system and the document retrieval time can be reduced, resulting in a significant improvement in the performance of the document retrieval system.

Book ChapterDOI
21 Jun 2000
TL;DR: As a solution for the edit distance between A and B, the difference representation of the D-table is defined, which leads to a simple and intuitive algorithm for the incremental/decremental edit distance problem.
Abstract: In this paper we consider the incremental/decremental version of the edit distance problem: given a solution to the edit distance between two strings A and B, find a solution to the edit distance between A and B′, where B′ = aB (incremental) or bB′ = B (decremental). As a solution for the edit distance between A and B, we define the difference representation of the D-table, which leads to a simple and intuitive algorithm for the incremental/decremental edit distance problem.
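For concreteness, the D-table in question is the familiar (|A|+1) x (|B|+1) dynamic-programming table, and the observation behind a difference representation is that neighbouring entries differ by at most one. The sketch below only builds the table and its vertical differences; the incremental and decremental updates of the paper's actual data structure are not reproduced here.

    def d_table(a: str, b: str):
        """D[i][j] = edit distance between the prefixes a[:i] and b[:j]."""
        D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            for j in range(len(b) + 1):
                if i == 0 or j == 0:
                    D[i][j] = i + j
                else:
                    D[i][j] = min(D[i - 1][j] + 1,
                                  D[i][j - 1] + 1,
                                  D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D

    def vertical_differences(D):
        """Row-to-row differences of the D-table; each value lies in {-1, 0, 1},
        which is what makes a compact difference representation possible."""
        return [[D[i][j] - D[i - 1][j] for j in range(len(D[0]))]
                for i in range(1, len(D))]

    D = d_table("edit", "dist")
    print(D[-1][-1])                 # 2
    print(vertical_differences(D))   # only -1, 0, and 1 appear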

Proceedings ArticleDOI
10 Jul 2000
TL;DR: The algorithm and architecture of a processor for approximate string matching with a high throughput rate is presented, dedicated to multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary.
Abstract: In this paper we present the algorithm and architecture of a processor for approximate string matching with a high throughput rate. The processor is dedicated to multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary. The algorithm used for the approximate string matching is based on a dynamic programming procedure known as the string-to-string correction problem. It has been extended to fulfil the requirements of full text search in a database system, including string matching with wildcards and handling of idiomatic turns of some languages. The processor has been fabricated in a 0.6 μm CMOS technology. It performs a maximum of 8.5 billion character comparisons per second when operating at the specified clock frequency of 132 MHz.

Proceedings ArticleDOI
31 Jul 2000
TL;DR: This research looks at the effects of word order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system, and indicates that character-based indexing is consistently superior to word-based indexing, suggesting that segmentation is an unnecessary luxury in the given domain.
Abstract: This research looks at the effects of word order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system. We implement a number of both bag-of-words and word order-sensitive similarity metrics, and test each over character-based and word-based indexing. The translation retrieval performance of each system configuration is evaluated empirically through the notion of word edit distance between translation candidate outputs and the model translation. Our results indicate that character-based indexing is consistently superior to word-based indexing, suggesting that segmentation is an unnecessary luxury in the given domain. Word order-sensitive approaches are demonstrated to generally outperform bag-of-words methods, with source language segment-level edit distance proving the most effective similarity metric.

Journal ArticleDOI
TL;DR: This paper presents a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs and uses a speed-up technique based on Ukkonen's greedy algorithm which he presented for two sequences and simple costs.

Patent
Alexander Birman, Harry R. Gail, Sidney L. Hantler, George B. Leeman, Daniel Milch
20 Sep 2000
TL;DR: In this article, a very fast method for correcting the spelling of a word or phrase in a document proceeds in two steps: first applying a fast approximate method for eliminating most candidate words from consideration (without computing the exact edit distance between the given word whose spelling is to be corrected and any candidate word), followed by a "slow method" which computes the exact editing distance between a given word and each of the few remaining candidate words.
Abstract: A very fast method for correcting the spelling of a word or phrase in a document proceeds in two steps: first applying a very fast approximate method for eliminating most candidate words from consideration (without computing the exact edit distance between the given word whose spelling is to be corrected and any candidate word), followed by a “slow method” which computes the exact edit distance between the word whose spelling is to be corrected and each of the few remaining candidate words. The combination results in a method that is almost as fast as the fast approximate method and as exact as the slow method.
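The abstract does not spell out the fast approximate step, so the sketch below substitutes a generic cheap filter (the length difference, which never exceeds the edit distance) to show the two-step shape; step two would then run an exact edit-distance computation on the few survivors.

    def length_bound(w: str, cand: str) -> int:
        """A trivial lower bound on edit distance: each insertion or deletion
        changes the length by one, so the length difference never exceeds it."""
        return abs(len(w) - len(cand))

    def plausible_candidates(word: str, dictionary, max_dist: int):
        """Step 1: discard candidates whose cheap lower bound already rules them out.
        Step 2 (not shown) scores only the survivors with an exact edit distance."""
        return [c for c in dictionary if length_bound(word, c) <= max_dist]

    words = ["distance", "instance", "dance", "distant", "edit"]
    print(plausible_candidates("distanse", words, 1))   # ['distance', 'instance', 'distant']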

Proceedings ArticleDOI
08 Nov 2000
TL;DR: The authors present a theoretically founded framework for fuzzy unification and resolution based on edit distance over trees and develop the FURY system, which implements the framework efficiently using dynamic programming.
Abstract: The authors present a theoretically founded framework for fuzzy unification and resolution based on edit distance over trees. Their framework extends classical unification and resolution conservatively. They prove important properties of the framework and develop the FURY system, which implements the framework efficiently using dynamic programming. The authors evaluate the framework and system on a large problem in the bioinformatics domain, that of detecting typographical errors in an enzyme name database.

Patent
06 Jan 2000
TL;DR: A computer method of spelling correction comprises the steps of: a) storing a dictionary of valid words, b) checking each input string to identify input strings not in the dictionary, c) generating test words by a restricted set of edit operations which correct the most common errors comprising insertion, deletion, transposition and substitution, d) comparing the edited input string generated in the preceding step with words stored in a dictionary and e) generating a candidate word or candidate list of the words.
Abstract: A computer method of spelling correction comprises the steps of: a) storing a dictionary of valid words, b) for each input string to be checked comparing the input string to words in the stored dictionary to identify input strings not in the dictionary, c) for each input string not found in the preceding step, generating test words by a restricted set of edit operations which correct the most common errors comprising insertion, deletion, transposition and/or substitution, d) comparing the edited input string generated in the preceding step with words stored in the dictionary and e) generating a candidate word or candidate list of the words.
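Step (c), generating test words by a restricted set of single edits, can be sketched as follows; the four-word dictionary is an invented example, and the patent's procedure also covers the other steps listed above.

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def single_edit_candidates(s: str):
        """All strings reachable from s by one deletion, transposition,
        substitution, or insertion (the restricted edit operations of step c)."""
        splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
        deletes     = {a + b[1:] for a, b in splits if b}
        transposes  = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
        substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
        inserts     = {a + c + b for a, b in splits for c in ALPHABET}
        return deletes | transposes | substitutes | inserts

    dictionary = {"edit", "exit", "emit", "audit"}
    print(sorted(single_edit_candidates("edti") & dictionary))   # ['edit']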

Book ChapterDOI
TL;DR: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented, inspired by the quadratic time algorithm proposed by Bunke and Buhler, achieving even more accurate solutions.
Abstract: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented. They are inspired by the quadratic time algorithm proposed by Bunke and Buhler. The first technique completes pseudoalignments built by the Bunke and Buhler algorithm (BBA), obtaining full alignments between cyclic patterns. The edit cost of the minimum-cost alignment is given as an upper-bound estimation of the exact cyclic edit distance, which results in a more accurate bound than the lower one obtained by BBA. The second technique uses both bounds to compute a weighted average, achieving even more accurate solutions. Weights come from minimizing the sum of squared relative errors with respect to exact distance values on a training set of string pairs. Experiments were conducted on both artificial and real data to demonstrate the capabilities of the new techniques in both accuracy and quadratic computing time.

Book ChapterDOI
04 Sep 2000
TL;DR: This work proposes COFE, a method for sparse feature extraction which is based on novel random non-linear projections, and evaluates COFE on real data, finding that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time.
Abstract: Feature Extraction, also known as Multidimensional Scaling, is a basic primitive associated with indexing, clustering, nearest neighbor searching and visualization. We consider the problem of feature extraction when the data-points are complex and the distance evaluation function is very expensive to evaluate. Examples of expensive distance evaluations include those for computing the Hausdorff distance between polygons in a spatial database, or the edit distance between macromolecules in a DNA or protein database. We propose COFE, a method for sparse feature extraction which is based on novel random non-linear projections. We evaluate COFE on real data and find that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time. We further propose COFE-GR, which matches COFE in terms of distance evaluations and run-time, but outperforms it in terms of quality of features extracted.

Book ChapterDOI
01 Jan 2000
TL;DR: It is demonstrated that the problem of computing a shortest network interconnecting a set of points under a fixed tree topology is polynomial time solvable for some spaces and NP-hard for the others.
Abstract: We discuss the problem of computing a shortest network interconnecting a set of points under a fixed tree topology, and survey the recent algorithmic and complexity results in the literature covering a wide range of metric spaces, including Euclidean, rectilinear, spaces of sequences with Hamming and edit distances, communication networks, etc. It is demonstrated that the problem is polynomial time solvable for some spaces and NP-hard for others. When the problem is NP-hard, we attempt to give approximation algorithms with guaranteed relative errors.

Book ChapterDOI
21 Jun 2000
TL;DR: This paper presents a novel method for using suffix trees to greatly improve the performance of the Gibbs sampling approach.
Abstract: Gibbs sampling is a local search method that can be used to find novel motifs in a text string. In previous work [8], we have proposed a modified Gibbs sampler that can discover novel gapped motifs of varying lengths and occurrence rates in DNA or protein sequences. The Gibbs sampling method requires repeated searching of the text for the best match to a constantly evolving collection of aligned strings, and each search pass previously required Θ(nl) time, where l is the length of the motif and n the length of the original sequence. This paper presents a novel method for using suffix trees to greatly improve the performance of the Gibbs sampling approach.

Book ChapterDOI
TL;DR: This paper shows how graph edit-distance can be used to compute the correspondence probabilities more efficiently and shows that the edit distance method is not only more efficient, but also more accurate than the dictionary-based method.
Abstract: This paper presents work aimed at rendering the dual-step EM algorithm of Cross and Hancock more efficient. The original algorithm integrates the processes of point-set alignment and correspondence. The consistency of the pattern of correspondence matches on the Delaunay triangulation of the points is used to gate contributions to the expected log-likelihood function for point-set alignment parameters. However, in its original form the algorithm uses a dictionary of structure-preserving mappings to assess the consistency of match. This proves to be a serious computational bottleneck. In this paper, we show how graph edit-distance can be used to compute the correspondence probabilities more efficiently. In a sensitivity analysis, we show that the edit distance method is not only more efficient, but also more accurate than the dictionary-based method.

Proceedings ArticleDOI
03 Sep 2000
TL;DR: In this work, as an alternative for initial estimation of edit costs, character confusion probabilities are discussed in the context of edit distances and it is shown how improved estimations for them can be achieved.
Abstract: In this work, as an alternative for the initial estimation of edit costs, character confusion probabilities are discussed in the context of edit distances. In this setting, insertions have to be handled carefully, and it is shown how improved estimates for them can be achieved. Furthermore, some of the proposed solutions based on joint events, which lead to inferior models for retrieving the word corresponding to the recognized string at hand from a given lexicon, are discussed.

Proceedings ArticleDOI
13 Jun 2000
TL;DR: A method to capture lexical similarity of a lexicon and reliability of a character recognizer which serve to capture the dynamism of the environment.
Abstract: Recognition using only visual evidence cannot always be successful due to limitations of information and resources available during training. Considering relation among lexicon entries is sometimes useful for decision making. In this paper we present a method to capture lexical similarity of a lexicon and reliability of a character recognizer which serve to capture the dynamism of the environment. A parameter, lexical similarity, is defined by measuring these two factors as edit distance between lexicon entries and separability of each character's recognition results. Our experiments show that a utility function considering lexical similarity in a decision stage can enhance the performance of a conventional word recognizer.

Journal Article
01 Jan 2000
TL;DR: This study considers stroke direction and pressure sequence strings of a character as character level image signatures for writer identification and presents the newly defined and modified edit distances depending upon their measurement types.
Abstract: The problem of writer identification based on similarity is formalized by defining a distance between character or word level features and finding the most similar writings, or all writings which are within a certain threshold distance. Among many features, we consider stroke direction and pressure sequence strings of a character as character level image signatures for writer identification. As the conventional definition of edit distance is not directly applicable, we present newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke direction and pressure sequence string extractor used for writer identification. The importance of this study is the attempt to give a definition of distance between two characters based on the two types of strings.

Book
07 Jun 2000
TL;DR: Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts and Periods and Quasiperiods Characterization are studied.
Abstract: Invited Lectures.- Identifying and Filtering Near-Duplicate Documents.- Machine Learning for Efficient Natural-Language Processing.- Browsing around a Digital Library: Today and Tomorrow.- Summer School Lectures.- Algorithmic Aspects of Speech Recognition: A Synopsis.- Some Results on Flexible-Pattern Discovery.- Contributed Papers.- Explaining and Controlling Ambiguity in Dynamic Programming.- A Dynamic Edit Distance Table.- Parametric Multiple Sequence Alignment and Phylogeny Construction.- Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Sequence Alignment.- A Polynomial Time Approximation Scheme for the Closest Substring Problem.- Approximation Algorithms for Hamming Clustering Problems.- Approximating the Maximum Isomorphic Agreement Subtree Is Hard.- A Faster and Unifying Algorithm for Comparing Trees.- Incomplete Directed Perfect Phylogeny.- The Longest Common Subsequence Problem for Arc-Annotated Sequences.- Boyer-Moore String Matching over Ziv-Lempel Compressed Text.- A Boyer-Moore Type Algorithm for Compressed Pattern Matching.- Approximate String Matching over Ziv-Lempel Compressed Text.- Improving Static Compression Schemes by Alphabet Extension.- Genome Rearrangement by Reversals and Insertions/Deletions of Contiguous Segments.- A Lower Bound for the Breakpoint Phylogeny Problem.- Structural Properties and Tractability Results for Linear Synteny.- Shift Error Detection in Standardized Exams.- An Upper Bound for Number of Contacts in the HP-Model on the Face-Centered-Cubic Lattice (FCC).- The Combinatorial Partitioning Method.- Compact Suffix Array.- Linear Bidirectional On-Line Construction of Affix Trees.- Using Suffix Trees for Gapped Motif Discovery.- Indexing Text with Approximate q-Grams.- Simple Optimal String Matching Algorithm.- Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts.- Periods and Quasiperiods Characterization.- Finding Maximal Quasiperiodicities in Strings.- On the Complexity of Determining the Period of a String.

Book ChapterDOI
24 Jul 2000
TL;DR: This paper generalizes Myers' result, characterizes a class of automata for which there exist equivalent parallel, or vector, algorithms, and extends the technique to arbitrary weighted edit distances.
Abstract: In [6], G. Myers describes a bit-vector algorithm to compute the edit distance between strings. The algorithm converts an input sequence to an output sequence in a parallel way, using bit operations readily available in processors. In this paper, we generalize the technique and characterize a class of automata for which there exist equivalent parallel, or vector, algorithms. As an application, we extend Myers' result to arbitrary weighted edit distances, which are currently used to explore the vast databases generated by genetic sequencing.
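As background for the generalization, a minimal sketch of the unit-cost bit-vector computation in the style of Myers' algorithm; Python integers stand in for machine words, so the word-length restriction of a real implementation is ignored here.

    def bitvector_edit_distance(p: str, t: str) -> int:
        """Unit-cost edit distance via bit-parallel column updates (Myers-style)."""
        m = len(p)
        if m == 0:
            return len(t)
        peq = {}                         # bit mask of the positions of each character in p
        for i, c in enumerate(p):
            peq[c] = peq.get(c, 0) | (1 << i)
        pv, mv, score = (1 << m) - 1, 0, m
        high = 1 << (m - 1)
        for c in t:
            eq = peq.get(c, 0)
            xv = eq | mv
            xh = (((eq & pv) + pv) ^ pv) | eq
            ph = mv | ~(xh | pv)
            mh = pv & xh
            if ph & high:
                score += 1
            elif mh & high:
                score -= 1
            ph = (ph << 1) | 1
            mh = mh << 1
            pv = mh | ~(xv | ph)
            mv = ph & xv
        return score

    print(bitvector_edit_distance("kitten", "sitting"))   # 3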

Journal ArticleDOI
TL;DR: A monoid of strings (words) over a finite alphabet is considered, with the transformations being insertion and deletion of words of arbitrary length; a condition for the computability of the resulting distance is formulated, together with an algorithm for distance calculation which is polynomial in string length.
Abstract: A monoid of strings (words) over a finite alphabet is considered. The notion of distance on strings is important in the problem of inductive learning related to artificial intelligence, in cryptography, and in some other fields of mathematics. The distance is defined as the minimum length of a transformation path that transforms one string into another. One example is the Levenshtein distance, with the transformations being insertions, deletions, and substitutions of letters. A quadratic algorithm for calculating this distance is known to exist. In this paper, a more general case, insertion and deletion of words of arbitrary length, is considered. For this case, the problem of distance calculation turns out to be unsolvable in general. The basic results of this work are the formulation of a condition for the computability of the distance and an algorithm for distance calculation, which is polynomial in string length.