
Showing papers on "Edit distance published in 2000"


Proceedings Article
01 May 2000
TL;DR: This paper defines evaluation criteria which are more adequate than pure edit distance and describes how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using this tool and the corresponding graphical user interface.
Abstract: In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface.

318 citations


Journal ArticleDOI
TL;DR: The paper develops the idea of edit-distance originally introduced for graph-matching by Sanfeliu and Fu (1983), and shows how the Levenshtein distance (1966) can be used to model the probability distribution for structural errors in the graph-matching problem.
Abstract: This paper describes a novel framework for comparing and matching corrupted relational graphs. The paper develops the idea of edit-distance originally introduced for graph-matching by Sanfeliu and Fu (1983). We show how the Levenshtein distance (1966) can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare the resulting graph-matching algorithm with that recently reported by Wilson and Hancock. The use of edit-distance offers an elegant alternative to the exhaustive compilation of label dictionaries. Moreover, the method is polynomial rather than exponential in its worst-case complexity. We support our approach with an experimental study on synthetic data and illustrate its effectiveness on an uncalibrated stereo correspondence problem. This demonstrates experimentally that the gain in efficiency is not at the expense of quality of match.

157 citations


Proceedings ArticleDOI
01 Feb 2000
TL;DR: The goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations.
Abstract: We have two users, A and B, who hold documents x and y respectively. Neither of the users has any information about the other's document. They exchange messages so that B computes x; it may be required that A compute y as well. Our goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how one measures the distance between x and y. We consider several metrics for measuring this distance, namely the Hamming metric, the Levenshtein metric (edit distance), and a new LZ metric, which is introduced in this paper. We show how to estimate the distance between x and y using a single message of logarithmic size. For each metric, we present the first communication-efficient protocols, which often match the corresponding lower bounds. These protocols yield error-correcting codes for these error models which correct up to d errors in n characters using O(d log n) bits. Our most interesting methods use a new histogram transformation that we introduce to convert edit distance to L1 distance.

137 citations
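The histogram transformation itself is not spelled out in the abstract; the following is only a minimal illustration, in Python with made-up inputs, of the general idea that character-count histograms relate edit distance to an L1 comparison: each insertion, deletion, or substitution changes the histogram by at most 2 in L1, so half the histogram distance is a cheap lower bound.

    from collections import Counter

    def histogram_l1(x: str, y: str) -> int:
        """L1 distance between the character-count histograms of x and y."""
        hx, hy = Counter(x), Counter(y)
        return sum(abs(hx[c] - hy[c]) for c in set(hx) | set(hy))

    def edit_distance_lower_bound(x: str, y: str) -> int:
        # Each edit operation changes the histogram L1 distance by at most 2,
        # so ceil(L1 / 2) never exceeds the true edit distance.
        return (histogram_l1(x, y) + 1) // 2

    print(edit_distance_lower_bound("kitten", "sitting"))  # 3 (equals the true distance here)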


Proceedings ArticleDOI
31 Jul 2000
TL;DR: A new type of grammar learning algorithm, inspired by string edit distance, that takes a corpus of flat sentences as input and returns a Corpus of labelled, bracketed sentences that works on pairs of unstructured sentences.
Abstract: This paper introduces a new type of grammar learning algorithm, inspired by string edit distance (Wagner and Fischer, 1974). The algorithm takes a corpus of flat sentences as input and returns a corpus of labelled, bracketed sentences. The method works on pairs of unstructured sentences that have one or more words in common. When two sentences are divided into parts that are the same in both sentences and parts that are different, this information is used to find parts that are interchangeable. These parts are taken as possible constituents of the same type. After this alignment learning step, the selection learning step selects the most probable constituents from all possible constituents.This method was used to bootstrap structure on the ATIS corpus (Marcus et. al., 1993) and on the OVIS! corpus (Bonnema et al., 1997). While the results are encouraging (we obtained up to 89.25% non-crossing brackets precision), this paper will point out some of the shortcomings of our approach and will suggest possible solutions.

114 citations
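As a toy illustration of the alignment step, using Python's difflib rather than the Wagner-Fischer alignment the paper builds on, and with invented example sentences: the spans where two sentences differ while their surroundings match are exactly the parts the method would hypothesize as interchangeable constituents.

    from difflib import SequenceMatcher

    def interchangeable_parts(sent1: str, sent2: str):
        """Align two tokenized sentences and return the paired spans where they
        differ; alignment-based learning treats these as candidate constituents."""
        a, b = sent1.split(), sent2.split()
        pairs = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
            if tag != "equal":            # the unequal parts line up with each other
                pairs.append((a[i1:i2], b[j1:j2]))
        return pairs

    print(interchangeable_parts("show me flights from Boston to Dallas",
                                "show me flights from Denver to Atlanta"))
    # [(['Boston'], ['Denver']), (['Dallas'], ['Atlanta'])]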


Journal ArticleDOI
TL;DR: It is shown that, in contrast to regular text, it makes a difference whether the errors occur in the hypertext or in the pattern, and a much simpler algorithm achieving the same complexity, which runs on any hypertext graph, is presented.

60 citations


Journal ArticleDOI
TL;DR: The usefulness of features derived from interval coding is demonstrated in a hidden Markov model based page layout classification system that is trainable and extendible.
Abstract: This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.

59 citations
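The comparison step reduces to an L1 (Manhattan) computation over fixed-length vectors. A minimal sketch, with invented vectors standing in for the real interval-encoding features:

    def manhattan(u, v):
        """Manhattan (L1) distance between two fixed-length feature vectors."""
        return sum(abs(a - b) for a, b in zip(u, v))

    def rank_by_layout(query_vec, pages):
        """Order (name, vector) pairs by layout similarity to the query page."""
        return sorted(pages, key=lambda page: manhattan(query_vec, page[1]))

    pages = [("invoice", [3, 0, 2, 5]), ("letter", [1, 1, 0, 2]), ("form", [4, 0, 3, 5])]
    print([name for name, _ in rank_by_layout([3, 0, 3, 5], pages)])
    # ['invoice', 'form', 'letter']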


Book ChapterDOI
31 May 2000
TL;DR: This work presents an approach to automatically create wrappers by means of an incremental grammar induction algorithm that uses an adaptation of the string edit distance.
Abstract: To facilitate effective search on the World Wide Web, meta search engines have been developed which do not search the Web themselves, but use available search engines to find the required information. By means of wrappers, meta search engines retrieve information from the pages returned by search engines. We present an approach to automatically create such wrappers by means of an incremental grammar induction algorithm. The algorithm uses an adaptation of the string edit distance. Our method performs well; it is quick, can be used for several types of result pages and requires a minimal amount of user interaction.

45 citations


Proceedings ArticleDOI
01 Sep 2000
TL;DR: Experimental results with synthetic cyclic strings and a handwritten digits recognition task show that the new algorithm is faster than Maes' and Gregor and Thomason's (1993) algorithms.
Abstract: A new algorithm to compute the edit distance between cyclic strings is presented. Experimental results with synthetic cyclic strings and a handwritten digits recognition task show that the new algorithm is faster than Maes' (1990) and Gregor and Thomason's (1993) algorithms.

39 citations
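The baseline that such algorithms improve on is simply minimizing the ordinary edit distance over all rotations of one of the strings. A brute-force reference version (a sketch, not the algorithm from the paper):

    def edit_distance(a: str, b: str) -> int:
        """Standard Wagner-Fischer dynamic program, one row at a time."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution / match
            prev = curr
        return prev[-1]

    def cyclic_edit_distance(a: str, b: str) -> int:
        """Minimum edit distance over all rotations of a, O(|a|^2 * |b|) overall."""
        if not a:
            return len(b)
        return min(edit_distance(a[i:] + a[:i], b) for i in range(len(a)))

    print(cyclic_edit_distance("abcde", "cdeab"))  # 0: the strings are rotations of each other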


01 Jan 2000
TL;DR: This work gives provably faster algorithms for normalized edit distance computation than the existing algorithms with proven complexity bounds: one for the case where the cost function is uniform, i.e., the weights of edit operations depend only on the type but not on the individual symbols involved, and one for the case where the weights are rational.
Abstract: A common model for computing the similarity of two strings X and Y, of lengths m and n respectively with m ≤ n, is to transform X into Y through a sequence of edit operations, called an edit sequence. The edit operations are of three types: insertion, deletion, and substitution. A given cost function assigns a weight to each edit operation. The amortized weight of a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst case. We give provably faster algorithms: one for the case where the cost function is uniform, i.e., the weights of edit operations depend only on their type but not on the individual symbols involved, and one for the case where the weights are rational.

34 citations
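Restating the definition from the abstract in symbols: if S(X, Y) denotes the set of edit sequences transforming X into Y, W(S) the total weight of a sequence S, and L(S) its number of operations, the quantity computed is

    \[
      d_{\mathrm{NED}}(X, Y) \;=\; \min_{S \in \mathcal{S}(X, Y)} \frac{W(S)}{L(S)} .
    \]

The minimum is in general not attained by a minimum-weight edit sequence, so the value cannot be obtained by computing the ordinary edit distance first and dividing afterwards; this is what makes the normalized problem harder than plain edit distance computation.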


Proceedings ArticleDOI
21 Dec 2000
TL;DR: Of the five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
Abstract: Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.

30 citations
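The "edit distance with a probabilistic substitution matrix" entry in the comparison can be sketched as a weighted dynamic program in which a substitution costs the negative log of its confusion probability. The probabilities below are invented placeholders, not values from the study:

    import math

    # Hypothetical OCR confusion probabilities; a real system would estimate
    # them from aligned OCR output and ground truth.
    P_SUB = {("1", "l"): 0.2, ("l", "1"): 0.2, ("0", "O"): 0.15, ("O", "0"): 0.15}
    P_RARE = 0.01                     # default probability for unlisted confusions
    INDEL_COST = -math.log(0.005)     # hypothetical insertion/deletion cost

    def sub_cost(a: str, b: str) -> float:
        return 0.0 if a == b else -math.log(P_SUB.get((a, b), P_RARE))

    def weighted_edit_distance(x: str, y: str) -> float:
        prev = [j * INDEL_COST for j in range(len(y) + 1)]
        for i, cx in enumerate(x, 1):
            curr = [i * INDEL_COST]
            for j, cy in enumerate(y, 1):
                curr.append(min(prev[j] + INDEL_COST,
                                curr[j - 1] + INDEL_COST,
                                prev[j - 1] + sub_cost(cx, cy)))
            prev = curr
        return prev[-1]

    # The OCR output "ce11" matches "cell" more cheaply than "celt" under these costs.
    print(weighted_edit_distance("ce11", "cell") < weighted_edit_distance("ce11", "celt"))  # True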


Patent
Key-Sun Choi, Byung-Ju Kang
17 Jan 2000
Abstract: A phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words. A system manager defines character element transformation patterns that can occur between phonetic transcriptions derived from the same foreign language. A system generates new phonetic transcriptions according to the defined character element transformation patterns and assigns a demerit mark to each of the generated phonetic transcriptions according to a phonetic distance. A minimum phonetic distance between each of the generated phonetic transcriptions and a given phonetic transcription is calculated on the basis of a minimum edit distance calculation method. The generated phonetic transcription with the smallest of the calculated minimum phonetic distances is determined to be most similar to the given phonetic transcription. Therefore, a document retrieval operation can be performed accurately in a document retrieval system and the document retrieval time can be reduced, resulting in a significant improvement in the performance of the document retrieval system.

Book ChapterDOI
21 Jun 2000
TL;DR: As a solution for the edit distance between A and B, the difference representation of the D-table is defined, which leads to a simple and intuitive algorithm for the incremental/decremental edit distance problem.
Abstract: In this paper we consider the incremental/decremental version of the edit distance problem: given a solution to the edit distance between two strings A and B, find a solution to the edit distance between A and B′, where B′ = aB (incremental) or bB′ = B (decremental). As a solution for the edit distance between A and B, we define the difference representation of the D-table, which leads to a simple and intuitive algorithm for the incremental/decremental edit distance problem.
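For concreteness, the D-table in question is the familiar (|A|+1) x (|B|+1) dynamic-programming table, and the observation behind a difference representation is that neighbouring entries differ by at most one. The sketch below only builds the table and its vertical differences; the incremental and decremental updates of the paper's actual data structure are not reproduced here.

    def d_table(a: str, b: str):
        """D[i][j] = edit distance between the prefixes a[:i] and b[:j]."""
        D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            for j in range(len(b) + 1):
                if i == 0 or j == 0:
                    D[i][j] = i + j
                else:
                    D[i][j] = min(D[i - 1][j] + 1,
                                  D[i][j - 1] + 1,
                                  D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return D

    def vertical_differences(D):
        """Row-to-row differences of the D-table; each value lies in {-1, 0, 1},
        which is what makes a compact difference representation possible."""
        return [[D[i][j] - D[i - 1][j] for j in range(len(D[0]))]
                for i in range(1, len(D))]

    D = d_table("edit", "dist")
    print(D[-1][-1])                 # 2
    print(vertical_differences(D))   # only -1, 0, and 1 appear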

Proceedings ArticleDOI
10 Jul 2000
TL;DR: The algorithm and architecture of a processor for approximate string matching with a high throughput rate is presented, dedicated to multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary.
Abstract: In this paper we present the algorithm and architecture of a processor for approximate string matching with a high throughput rate. The processor is dedicated to multimedia and information retrieval applications working on huge amounts of mass data where short response times are necessary. The algorithm used for the approximate string matching is based on a dynamic programming procedure known as the string-to-string correction problem. It has been extended to fulfil the requirements of full text search in a database system, including string matching with wildcards and handling of idiomatic turns of some languages. The processor has been fabricated in a 0.6 μm CMOS technology. It performs a maximum of 8.5 billion character comparisons per second when operating at the specified clock frequency of 132 MHz.

Proceedings ArticleDOI
31 Jul 2000
TL;DR: This research looks at the effects of word order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system, and indicates that character-based indexing is consistently superior to word-based indexing, suggesting that segmentation is an unnecessary luxury in the given domain.
Abstract: This research looks at the effects of word order and segmentation on translation retrieval performance for an experimental Japanese-English translation memory system. We implement a number of both bag-of-words and word order-sensitive similarity metrics, and test each over character-based and word-based indexing. The translation retrieval performance of each system configuration is evaluated empirically through the notion of word edit distance between translation candidate outputs and the model translation. Our results indicate that character-based indexing is consistently superior to word-based indexing, suggesting that segmentation is an unnecessary luxury in the given domain. Word order-sensitive approaches are demonstrated to generally outperform bag-of-words methods, with source language segment-level edit distance proving the most effective similarity metric.

Journal ArticleDOI
TL;DR: This paper presents a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs and uses a speed-up technique based on Ukkonen's greedy algorithm which he presented for two sequences and simple costs.

Patent
Alexander Birman, Harry R. Gail, Sidney L. Hantler, George B. Leeman, Daniel Milch
20 Sep 2000
TL;DR: In this article, a very fast method for correcting the spelling of a word or phrase in a document proceeds in two steps: first applying a fast approximate method for eliminating most candidate words from consideration (without computing the exact edit distance between the given word whose spelling is to be corrected and any candidate word), followed by a "slow method" which computes the exact editing distance between a given word and each of the few remaining candidate words.
Abstract: A very fast method for correcting the spelling of a word or phrase in a document proceeds in two steps: first applying a very fast approximate method for eliminating most candidate words from consideration (without computing the exact edit distance between the given word whose spelling is to be corrected and any candidate word), followed by a “slow method” which computes the exact edit distance between the word whose spelling is to be corrected and each of the few remaining candidate words. The combination results in a method that is almost as fast as the fast approximate method and as exact as the slow method.
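The abstract does not spell out the fast approximate step, so the sketch below substitutes a generic cheap filter (the length difference, which never exceeds the edit distance) to show the two-step shape; step two would then run an exact edit-distance computation on the few survivors.

    def length_bound(w: str, cand: str) -> int:
        """A trivial lower bound on edit distance: each insertion or deletion
        changes the length by one, so the length difference never exceeds it."""
        return abs(len(w) - len(cand))

    def plausible_candidates(word: str, dictionary, max_dist: int):
        """Step 1: discard candidates whose cheap lower bound already rules them out.
        Step 2 (not shown) scores only the survivors with an exact edit distance."""
        return [c for c in dictionary if length_bound(word, c) <= max_dist]

    words = ["distance", "instance", "dance", "distant", "edit"]
    print(plausible_candidates("distanse", words, 1))   # ['distance', 'instance', 'distant']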

Proceedings ArticleDOI
08 Nov 2000
TL;DR: The authors present a theoretically founded framework for fuzzy unification and resolution based on edit distance over trees and develop the FURY system, which implements the framework efficiently using dynamic programming.
Abstract: The authors present a theoretically founded framework for fuzzy unification and resolution based on edit distance over trees. Their framework extends classical unification and resolution conservatively. They prove important properties of the framework and develop the FURY system, which implements the framework efficiently using dynamic programming. The authors evaluate the framework and system on a large problem in the bioinformatics domain, that of detecting typographical errors in an enzyme name database.

Patent
06 Jan 2000
TL;DR: A computer method of spelling correction comprises the steps of: a) storing a dictionary of valid words, b) checking each input string to identify input strings not in the dictionary, c) generating test words by a restricted set of edit operations which correct the most common errors comprising insertion, deletion, transposition and substitution, d) comparing the edited input string generated in the preceding step with words stored in a dictionary and e) generating a candidate word or candidate list of the words.
Abstract: A computer method of spelling correction comprises the steps of: a) storing a dictionary of valid words, b) for each input string to be checked comparing the input string to words in the stored dictionary to identify input strings not in the dictionary, c) for each input string not found in the preceding step, generating test words by a restricted set of edit operations which correct the most common errors comprising insertion, deletion, transposition and/or substitution, d) comparing the edited input string generated in the preceding step with words stored in the dictionary and e) generating a candidate word or candidate list of the words.
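Step (c), generating test words by a restricted set of single edits, can be sketched as follows; the four-word dictionary is an invented example, and the patent's procedure also covers the other steps listed above.

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def single_edit_candidates(s: str):
        """All strings reachable from s by one deletion, transposition,
        substitution, or insertion (the restricted edit operations of step c)."""
        splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
        deletes     = {a + b[1:] for a, b in splits if b}
        transposes  = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
        substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
        inserts     = {a + c + b for a, b in splits for c in ALPHABET}
        return deletes | transposes | substitutes | inserts

    dictionary = {"edit", "exit", "emit", "audit"}
    print(sorted(single_edit_candidates("edti") & dictionary))   # ['edit']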

Book ChapterDOI
TL;DR: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented, inspired by the quadratic time algorithm proposed by Bunke and Buhler, achieving even more accurate solutions.
Abstract: Two efficient approximate techniques for measuring dissimilarities between cyclic patterns are presented. They are inspired by the quadratic time algorithm proposed by Bunke and Buhler. The first technique completes pseudoalignments built by the Bunke and Buhler algorithm (BBA), obtaining full alignments between cyclic patterns. The edit cost of the minimum-cost alignment is given as an upper-bound estimation of the exact cyclic edit distance, which results in a more accurate bound than the lower one obtained by BBA. The second technique uses both bounds to compute a weighted average, achieving even more accurate solutions. Weights come from minimizing the sum of squared relative errors with respect to exact distance values on a training set of string pairs. Experiments were conducted on both artificial and real data to demonstrate the capabilities of the new techniques in both accuracy and quadratic computing time.

Book ChapterDOI
04 Sep 2000
TL;DR: This work proposes COFE, a method for sparse feature extraction which is based on novel random non-linear projections, and evaluates COFE on real data, finding that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time.
Abstract: Feature Extraction, also known as Multidimensional Scaling, is a basic primitive associated with indexing, clustering, nearest neighbor searching and visualization. We consider the problem of feature extraction when the data-points are complex and the distance evaluation function is very expensive to evaluate. Examples of expensive distance evaluations include those for computing the Hausdorff distance between polygons in a spatial database, or the edit distance between macromolecules in a DNA or protein database. We propose COFE, a method for sparse feature extraction which is based on novel random non-linear projections. We evaluate COFE on real data and find that it performs very well in terms of quality of features extracted, number of distances evaluated, number of database scans performed and total run time. We further propose COFE-GR, which matches COFE in terms of distance evaluations and run-time, but outperforms it in terms of quality of features extracted.

Book ChapterDOI
01 Jan 2000
TL;DR: It is demonstrated that the problem of computing a shortest network interconnecting a set of points under a fixed tree topology is polynomial time solvable for some spaces and NP-hard for the others.
Abstract: We discuss the problem of computing a shortest network interconnecting a set of points under a fixed tree topology, and survey the recent algorithmic and complexity results in the literature covering a wide range of metric spaces, including Euclidean, rectilinear, spaces of sequences with Hamming and edit distances, communication networks, etc. It is demonstrated that the problem is polynomial time solvable for some spaces and NP-hard for others. When the problem is NP-hard, we attempt to give approximation algorithms with guaranteed relative errors.

Book ChapterDOI
21 Jun 2000
TL;DR: This paper presents a novel method for using suffix trees to greatly improve the performance of the Gibbs sampling approach.
Abstract: Gibbs sampling is a local search method that can be used to find novel motifs in a text string. In previous work [8], we have proposed a modified Gibbs sampler that can discover novel gapped motifs of varying lengths and occurrence rates in DNA or protein sequences. The Gibbs sampling method requires repeated searching of the text for the best match to a constantly evolving collection of aligned strings, and each search pass previously required Θ(nl) time, where l is the length of the motif and n the length of the original sequence. This paper presents a novel method for using suffix trees to greatly improve the performance of the Gibbs sampling approach.

Book ChapterDOI
TL;DR: This paper shows how graph edit-distance can be used to compute the correspondence probabilities more efficiently and shows that the edit distance method is not only more efficient, but also more accurate than the dictionary-based method.
Abstract: This paper presents work aimed at rendering the dual-step EM algorithm of Cross and Hancock more efficient. The original algorithm integrates the processes of point-set alignment and correspondence. The consistency of the pattern of correspondence matches on the Delaunay triangulation of the points is used to gate contributions to the expected log-likelihood function for point-set alignment parameters. However, in its original form the algorithm uses a dictionary of structure-preserving mappings to assess the consistency of match. This proves to be a serious computational bottleneck. In this paper, we show how graph edit-distance can be used to compute the correspondence probabilities more efficiently. In a sensitivity analysis, we show that the edit distance method is not only more efficient, but also more accurate than the dictionary-based method.

Proceedings ArticleDOI
03 Sep 2000
TL;DR: In this work, as an alternative for initial estimation of edit costs, character confusion probabilities are discussed in the context of edit distances and it is shown how improved estimations for them can be achieved.
Abstract: In this work, as an alternative for the initial estimation of edit costs, character confusion probabilities are discussed in the context of edit distances. In this setting, insertions have to be handled carefully, and it is shown how improved estimates for them can be achieved. Furthermore, some of the proposed solutions based on joint events, which lead to inferior models for retrieving the word corresponding to the recognized string at hand from a given lexicon, are discussed.

Proceedings ArticleDOI
13 Jun 2000
TL;DR: A method to capture lexical similarity of a lexicon and reliability of a character recognizer which serve to capture the dynamism of the environment.
Abstract: Recognition using only visual evidence cannot always be successful due to limitations of information and resources available during training. Considering relation among lexicon entries is sometimes useful for decision making. In this paper we present a method to capture lexical similarity of a lexicon and reliability of a character recognizer which serve to capture the dynamism of the environment. A parameter, lexical similarity, is defined by measuring these two factors as edit distance between lexicon entries and separability of each character's recognition results. Our experiments show that a utility function considering lexical similarity in a decision stage can enhance the performance of a conventional word recognizer.

Journal Article
01 Jan 2000
TL;DR: This study considers stroke direction and pressure sequence strings of a character as character level image signatures for writer identification and presents the newly defined and modified edit distances depending upon their measurement types.
Abstract: The problem of writer identification based on similarity is formalized by defining a distance between character or word level features and finding the most similar writings, or all writings which are within a certain threshold distance. Among many features, we consider stroke direction and pressure sequence strings of a character as character level image signatures for writer identification. As the conventional definition of edit distance is not directly applicable, we present newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke direction and pressure sequence string extractor used for writer identification. The importance of this study is the attempt to give a definition of distance between two characters based on the two types of strings.

Book
07 Jun 2000
TL;DR: Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts and Periods and Quasiperiods Characterization are studied.
Abstract: Invited Lectures.- Identifying and Filtering Near-Duplicate Documents.- Machine Learning for Efficient Natural-Language Processing.- Browsing around a Digital Library: Today and Tomorrow.- Summer School Lectures.- Algorithmic Aspects of Speech Recognition: A Synopsis.- Some Results on Flexible-Pattern Discovery.- Contributed Papers.- Explaining and Controlling Ambiguity in Dynamic Programming.- A Dynamic Edit Distance Table.- Parametric Multiple Sequence Alignment and Phylogeny Construction.- Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Sequence Alignment.- A Polynomial Time Approximation Scheme for the Closest Substring Problem.- Approximation Algorithms for Hamming Clustering Problems.- Approximating the Maximum Isomorphic Agreement Subtree Is Hard.- A Faster and Unifying Algorithm for Comparing Trees.- Incomplete Directed Perfect Phylogeny.- The Longest Common Subsequence Problem for Arc-Annotated Sequences.- Boyer-Moore String Matching over Ziv-Lempel Compressed Text.- A Boyer-Moore Type Algorithm for Compressed Pattern Matching.- Approximate String Matching over Ziv-Lempel Compressed Text.- Improving Static Compression Schemes by Alphabet Extension.- Genome Rearrangement by Reversals and Insertions/Deletions of Contiguous Segments.- A Lower Bound for the Breakpoint Phylogeny Problem.- Structural Properties and Tractability Results for Linear Synteny.- Shift Error Detection in Standardized Exams.- An Upper Bound for Number of Contacts in the HP-Model on the Face-Centered-Cubic Lattice (FCC).- The Combinatorial Partitioning Method.- Compact Suffix Array.- Linear Bidirectional On-Line Construction of Affix Trees.- Using Suffix Trees for Gapped Motif Discovery.- Indexing Text with Approximate q-Grams.- Simple Optimal String Matching Algorithm.- Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts.- Periods and Quasiperiods Characterization.- Finding Maximal Quasiperiodicities in Strings.- On the Complexity of Determining the Period of a String.

Book ChapterDOI
24 Jul 2000
TL;DR: This paper generalizes Myers' result, characterizes a class of automata for which there exist equivalent parallel, or vector, algorithms, and extends the technique to arbitrary weighted edit distances.
Abstract: In [6], G. Myers describes a bit-vector algorithm to compute the edit distance between strings. The algorithm converts an input sequence to an output sequence in a parallel way, using bit operations readily available in processors. In this paper, we generalize the technique and characterize a class of automata for which there exist equivalent parallel, or vector, algorithms. As an application, we extend Myers' result to arbitrary weighted edit distances, which are currently used to explore the vast databases generated by genetic sequencing.
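As background for the generalization, a minimal sketch of the unit-cost bit-vector computation in the style of Myers' algorithm; Python integers stand in for machine words, so the word-length restriction of a real implementation is ignored here.

    def bitvector_edit_distance(p: str, t: str) -> int:
        """Unit-cost edit distance via bit-parallel column updates (Myers-style)."""
        m = len(p)
        if m == 0:
            return len(t)
        peq = {}                         # bit mask of the positions of each character in p
        for i, c in enumerate(p):
            peq[c] = peq.get(c, 0) | (1 << i)
        pv, mv, score = (1 << m) - 1, 0, m
        high = 1 << (m - 1)
        for c in t:
            eq = peq.get(c, 0)
            xv = eq | mv
            xh = (((eq & pv) + pv) ^ pv) | eq
            ph = mv | ~(xh | pv)
            mh = pv & xh
            if ph & high:
                score += 1
            elif mh & high:
                score -= 1
            ph = (ph << 1) | 1
            mh = mh << 1
            pv = mh | ~(xv | ph)
            mv = ph & xv
        return score

    print(bitvector_edit_distance("kitten", "sitting"))   # 3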

Journal ArticleDOI
TL;DR: A monoid of strings (words) over a finite alphabet is considered, with the transformations being insertion and deletion of words of arbitrary length; a condition for the computability of the resulting distance is formulated, together with an algorithm for distance calculation which is polynomial in string length.
Abstract: A monoid of strings (words) over a finite alphabet is considered. The notion of distance on strings is important in the problem of inductive learning related to artificial intelligence, in cryptography, and in some other fields of mathematics. The distance is defined as the minimum length of a transformation path that transforms one string into another. One example is the Levenshtein distance, with the transformations being insertions, deletions, and substitutions of letters. A quadratic algorithm for calculating this distance is known to exist. In this paper, a more general case, insertion and deletion of words of arbitrary length, is considered. For this case, the problem of distance calculation turns out to be unsolvable in general. The basic results of this work are the formulation of a condition for the computability of the distance and an algorithm for distance calculation, which is polynomial in string length.