Showing papers on "Edit distance" published in 1999

Journal Article•DOI•

Topology of strings: median string is NP-complete

[...]

06 Dec 1999-Theoretical Computer Science

TL;DR: It is proved that computing the median string corresponds to an NP-complete decision problem, which shows that the problem is NP-hard.

137 citations

Communication Complexity of Document Exchange

[...]

Graham Cormode¹, Mike Paterson², Süleyman Cenk Sahinalp¹, Uzi Vishkin³•Institutions (3)

Case Western Reserve University¹, University of Warwick², University of Maryland, College Park³

02 Aug 1999

TL;DR: In this paper, the authors consider the problem of minimizing the total number of bits exchanged between two users, while minimizing the number of rounds and the complexity of internal computations, and show how to estimate the distance between x and y using a single message of logarithmic size.

Abstract: We have two users, A and B, who hold documents x and y respectively. Neither of the users has any information about the other's document. They exchange messages so that B computes x; it may be required that A compute y as well. Our goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how one measures the distance between x and y. We consider several metrics for measuring this distance, namely the Hamming metric, the Levenshtein metric (edit distance), and a new LZ metric, which is introduced in this paper. We show how to estimate the distance between x and y using a single message of logarithmic size. For each metric, we present the first communication-efficient protocols, which often match the corresponding lower bounds. One consequence is error-correcting codes for these error models that correct up to d errors in n characters using O(d log n) bits. Our most interesting methods use a new histogram transformation that we introduce to convert edit distance to L1 distance.
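
To make two of the metrics named in this abstract concrete, here is a minimal Python sketch of the Hamming and Levenshtein distances. It is a textbook illustration only, not the communication protocols or the histogram transformation of the paper; the function names `hamming` and `levenshtein` are chosen here for illustration.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def levenshtein(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))      # row for the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]                      # delete the first i characters of x
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,            # delete a
                curr[j - 1] + 1,        # insert b
                prev[j - 1] + (a != b)  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

For the pair "abcdef" and "bcdefa", the Hamming distance is 6 while the Levenshtein distance is 2, which shows how a single shifted character separates the two metrics.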

133 citations

Patent•

A search system and method for retrieval of data, and the use thereof in a search engine

[...]

Knut Magne Risvik

09 Jul 1999

TL;DR: A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and/or symbols, or sequences thereof, in a text T and a query Q.

Abstract: A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and/or symbols, or sequences thereof, in a text T, a metric M including combined edit distance metrics for an approximate degree of matching respectively between words and/or symbols, or between sequences thereof, in the text T and a query Q, the latter distance metric including weighting cost functions for edit operations which transform a sequence S of the text into a sequence P of the query Q, and search algorithms for determining the degree of matching respectively between words and/or symbols, or between sequences thereof, in respectively the text T and the query Q, such that information R is retrieved with a specified degree of matching with the query Q. Optionally, the search system also includes algorithms for determining exact matching such that information R may be retrieved with an exact degree of matching with the query Q.

128 citations

Proceedings Article•DOI•

Bayesian graph edit distance

[...]

R. Myers¹, Richard C. Wilson, Edwin R. Hancock•Institutions (1)

Universities UK¹

27 Sep 1999

TL;DR: The normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem, and this probability distribution is used to locate matches using MAP label updates.

Abstract: This paper describes a novel framework for comparing and matching corrupted relational graphs. The normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare this criterion with that recently reported by Wilson and Hancock (1997). The use of edit distance offers an elegant alternative to the exhaustive compilation of label dictionaries. Moreover the method is polynomial rather than exponential in its worst-case complexity. We support our approach with an experimental study on synthetic data, and illustrate its effectiveness on an uncalibrated stereo correspondence problem.

126 citations

Journal Article•DOI•

A distance measure for video sequences

[...]

Donald A. Adjeroh¹, Moon-Chuen Lee¹, Irwin King¹•Institutions (1)

The Chinese University of Hong Kong¹

01 Jul 1999-Computer Vision and Image Understanding

TL;DR: The problem of video sequence-to-sequence matching as a pattern-matching problem is formulated and the vstring edit distance is proposed as a suitable distance measure for video sequences.

75 citations

Journal Article•DOI•

Matching for Run-Length Encoded Strings

[...]

Alberto Apostolico¹, Gad M. Landau², Steven Skiena³•Institutions (3)

University of Padua¹, University of Haifa², State University of New York System³

01 Mar 1999-Journal of Complexity

TL;DR: This paper develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems, and compares this run-length encoded string against the ith row or column of each of the character image-models.

71 citations

Finding color and shape patterns in images

[...]

Leonidas J. Guibas, Scott Cohen

01 Jan 1999

TL;DR: This thesis gives a couple of modifications which make the EMD more amenable to partial matching, including the partial EMD in which only a given fraction of the weight in one distribution is forced to match weight in the other, and presents algorithms that are guaranteed to find a globally optimal transformation when matching equal-weight distributions under translation.

Abstract: This thesis is devoted to the Earth Mover's Distance and its use within content-based image retrieval (CBIR). The major CBIR problem discussed is the pattern problem: Given an image and a query pattern, determine if the image contains a region which is visually similar to the pattern; if so, find at least one such image region. The Earth Mover's Distance (EMD) is an edit distance between distributions that allows for partial matching, and which has many applications in CBIR. We give a couple of modifications which make the EMD more amenable to partial matching, including the partial EMD in which only a given fraction of the weight in one distribution is forced to match weight in the other. An important issue addressed in this thesis is the use of efficient, effective lower bounds on the EMD to speed up retrieval times. We contribute lower bounds that are applicable in the partial matching case in which distributions do not have the same total weight. Another important problem in CBIR is the EMD under transformation (EMD_G) problem: find a transformation of one distribution which minimizes its EMD to another, where the set of allowable transformations G is given. The problem of estimating the size/scale at which a pattern occurs in an image is phrased and efficiently solved as an EMD_G problem. For EMD_G problems with transformations that modify the points of a distribution but not its weights, we present a very general, monotonically convergent iteration called the FT iteration. This iteration may, however, converge to only a locally optimal EMD value and transformation. We also present algorithms that are guaranteed to find a globally optimal transformation when matching equal-weight distributions under translation. Our pattern problem solution is the SEDL (Scale Estimation for Directed Location) content-based image retrieval system. Three important contributions of this system are (1) a general framework for finding both color and shape patterns, (2) the previously mentioned novel scale estimation algorithm using the EMD, and (3) a directed (as opposed to exhaustive) search strategy. We show that SEDL achieves excellent results for the color pattern problem on a database of product advertisements, and the shape pattern problem on a database of Chinese characters.

64 citations

Journal Article•DOI•

A simple algorithm for detecting circular permutations in proteins

[...]

Shai Uliel¹, Amit Fliess, Amihood Amir, Ron Unger¹•Institutions (1)

Bar-Ilan University¹

01 Nov 1999

TL;DR: A simple and efficient algorithm that runs in time N^2 is presented; it is based on duplicating one of the two sequences and then performing a modified version of the standard dynamic programming algorithm, and the reported data indicate that it performs very well in practice.

Abstract: Motivation: Circular permutation of a protein is a genetic operation in which part of the C-terminal of the protein is moved to its N-terminal. Recently, it has been shown that proteins that undergo engineered circular permutations generally maintain their three dimensional structure and biological function. This observation raises the possibility that circular permutation has occurred in Nature during evolution. In this scenario a protein underwent circular permutation into another protein, thereafter both proteins further diverged by standard genetic operations. To study this possibility one needs an efficient algorithm that for a given pair of proteins can detect the underlying event of circular permutations. A possible formal description of the question is: given two sequences, find a circular permutation of one of them under which the edit distance between the proteins is minimal. A naive algorithm might take time proportional to N^3 or even N^4, which is prohibitively slow for a large-scale survey. A sophisticated algorithm that runs in asymptotic time of N^2 was recently suggested, but it is not practical for a large-scale survey. Results: A simple and efficient algorithm that runs in time N^2 is presented. The algorithm is based on duplicating one of the two sequences, and then performing a modified version of the standard dynamic programming algorithm. While the algorithm is not guaranteed to find the optimal results, we present data that indicate that in practice the algorithm performs very well. Availability: A Fortran program that calculates the optimal edit distance under circular permutation is available upon request from the authors.
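
The formal question stated in this abstract (minimize the edit distance over all circular permutations of one sequence) can be sketched with the naive baseline, which tries every rotation and takes time proportional to N^3 for sequences of length N. This is only the slow reference approach, not the authors' N^2 heuristic, whose modified dynamic program is not described in this listing; the names `edit_distance` and `circular_edit_distance_naive` are illustrative and unit edit costs are an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def circular_edit_distance_naive(a: str, b: str) -> int:
    """Minimum edit distance between b and any rotation of a (one DP per rotation)."""
    if not a:
        return len(b)
    # Rotation k moves the first k characters of a to its end.
    return min(edit_distance(a[k:] + a[:k], b) for k in range(len(a)))
```

Every rotation of a sequence appears as a substring of the sequence concatenated with itself, which is the observation behind duplicating one of the two sequences.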

55 citations

Book Chapter•DOI•

A New Indexing Method for Approximate String Matching

[...]

Gonzalo Navarro¹, Ricardo Baeza-Yates¹•Institutions (1)

University of Chile¹

22 Jul 1999

TL;DR: A new indexing method based on a suffix tree combined with a partitioning of the pattern that outperforms by far all other algorithms for indexed approximate searching, and it is shown how this index can be implemented using much less space.

Abstract: We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is O(n^λ), for 0 < λ < 1, whenever α < 1 - e/√σ, where α is the error level tolerated and σ is the alphabet size. We experimentally show that this index outperforms by far all other algorithms for indexed approximate searching, also being the first experiments that compare the different existing schemes. We finally show how this index can be implemented using much less space.

43 citations

Proceedings Article•DOI•

Approximate stroke sequence string matching algorithm for character recognition and analysis

[...]

Sung-Hyuk Cha¹, Yong-Chul Shin², Sargur N. Srihari²•Institutions (2)

State University of New York System¹, University at Buffalo²

20 Sep 1999

TL;DR: The proposed method converts a two-dimensional image into a one-dimensional string and computes the edit distance by the modified approximate string matching algorithm and presents the details of applications in handwriting analysis and both online and offline character recognition.

Abstract: Given two character images, we would like to measure their similarity or difference. Such a similarity or difference measure facilitates the solution to character recognition and handwriting analysis problems. There is, however, no universal definition for similarity measure satisfying a wide range of characteristics such as the slant, deformation or other invariant constraints. For this reason, we propose a new definition for the character similarity measure. First, the proposed method converts a two-dimensional image into a one-dimensional string. Next, it computes the edit distance by the modified approximate string matching algorithm. We describe how to extract the string information and compute the distance and then present the details of applications in handwriting analysis and both online and offline character recognition.

27 citations

Proceedings Article•DOI•

An efficient uniform-cost normalized edit distance algorithm

[...]

Abdullah N. Arslan¹, Ömer Eğecioğlu•Institutions (1)

University of California, Santa Barbara¹

24 Apr 1999

TL;DR: An O(mn log n)-time algorithm for the problem of normalized edit distance computation when the cost function is uniform, except substitutions can have different weights depending on whether they are matching or non-matching.

Abstract: A common model for computing the similarity of two strings X and Y of lengths m and n, respectively, with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst-case. We give an O(mn log n)-time algorithm for the problem when the cost function is uniform, i.e., the weight of each edit operation is constant within the same type, except substitutions can have different weights depending on whether they are matching or non-matching.
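
The definition above (minimum over all edit sequences of weight divided by length) can be illustrated with a straightforward dynamic program that tracks the number of operations used so far. This is a simple sketch that runs in O(mn(m + n)) time, not the O(mn log n) algorithm of the paper; the function name, the parameter names, and the default weights are assumptions made for illustration, and a matching substitution is counted as an operation of the given `match` weight.

```python
import math


def normalized_edit_distance(x, y, ins=1.0, dele=1.0, sub=1.0, match=0.0):
    """Minimum of (total weight / number of operations) over all edit sequences."""
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0  # convention assumed here for two empty strings
    inf = math.inf
    # layer[i][j]: least weight to turn x[:i] into y[:j] using exactly k operations
    layer = [[inf] * (n + 1) for _ in range(m + 1)]
    layer[0][0] = 0.0
    best = inf
    for k in range(1, m + n + 1):  # an edit sequence has at most m + n operations
        nxt = [[inf] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                c = inf
                if i > 0:
                    c = min(c, layer[i - 1][j] + dele)       # delete x[i-1]
                if j > 0:
                    c = min(c, layer[i][j - 1] + ins)        # insert y[j-1]
                if i > 0 and j > 0:
                    w = match if x[i - 1] == y[j - 1] else sub
                    c = min(c, layer[i - 1][j - 1] + w)      # substitute or match
                nxt[i][j] = c
        layer = nxt
        if layer[m][n] < inf:
            best = min(best, layer[m][n] / k)  # amortized weight of the best k-step path
    return best
```

With the default weights, turning "ab" into "a" costs one deletion after one free match, so the normalized distance is 1/2 rather than the unnormalized distance of 1.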

Book Chapter•DOI•

GESTALT: Genomic Steiner Alignments

[...]

Giuseppe Lancia¹, R. Ravi²•Institutions (2)

University of Padua¹, Carnegie Mellon University²

22 Jul 1999

TL;DR: GESTALT (GEnomic sequences STeiner ALignmenT), a public-domain suite of programs for generating multiple alignments of a set of biosequences, is described, which includes the traditional space-saving ideas of Hirschberg as well as new data-packing techniques.

Abstract: We describe GESTALT (GEnomic sequences STeiner ALignmenT), a public-domain suite of programs for generating multiple alignments of a set of biosequences. We allow the use of either of the two popular objectives, Tree Alignment or Sum-of-Pairs. The main distinguishing feature of our method is that the alignment is obtained via a tree in which the internal nodes (ancestors) are labeled by Steiner sequences for triples of the input sequences. Given lists of candidate labels for the ancestral sequences, we use dynamic programming to choose an optimal labeling under either objective function. Finally, the fully labeled tree of sequences is turned into a multiple alignment. Enhancements in our implementation include the traditional space-saving ideas of Hirschberg as well as new data-packing techniques. The running-time bottleneck of computing exact Steiner sequences is handled by a highly effective but much faster heuristic alternative. Finally, other modules in the suite allow automatic generation of linear-program input files that can be used to compute new lower bounds on the optimal values. We also report on some preliminary computational experiments with GESTALT.

Journal Article•DOI•

A versatile divide and conquer technique for optimal string alignment

[...]

David R. Powell¹, Lloyd Allison¹, Trevor I. Dix¹•Institutions (1)

Monash University, Clayton campus¹

14 May 1999-Information Processing Letters

TL;DR: This new technique has several advantages over previous methods for determining alignments in linear space, such as: simplicity, the ability to use essentially the same technique when using different cost functions, and the practical advantage of easily being able to trade available memory for running time.

Journal Article•DOI•

Efficient Algorithms for Approximate String Matching with Swaps

[...]

Dong Kyue Kim¹, Jee-Soo Lee², Kunsoo Park¹, Yookun Cho¹•Institutions (2)

Seoul National University¹, Korea National Open University²

01 Mar 1999-Journal of Complexity

TL;DR: This paper includes the swap operation, which interchanges two adjacent characters, in the set of allowable edit operations, and presents an O(t min(m, n))-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings; it also considers the extended k-differences problem.
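
To illustrate what adding a swap of two adjacent characters to the edit operations means, here is a minimal Python sketch of the standard quadratic-time dynamic program for the restricted variant often called optimal string alignment. It is not the O(t min(m, n)) algorithm of the paper, and the restriction that no substring is edited more than once is an assumption of this sketch; the function name is illustrative.

```python
def edit_distance_with_swaps(x: str, y: str) -> int:
    """Unit-cost edit distance with insert, delete, substitute and adjacent swap."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Swap: the last two characters of x[:i] are those of y[:j] reversed.
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

Without the swap operation, "ab" and "ba" are at distance 2; with it, they are at distance 1.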

Journal Article•DOI•

Identifying approximately common substructures in trees based on a restricted edit distance

[...]

Jason T. L. Wang¹, Kaizhong Zhang², Chia-Yo Chang¹•Institutions (2)

New Jersey Institute of Technology¹, University of Western Ontario²

01 Dec 1999-Information Sciences

TL;DR: The proposed algorithm solves the problem in time O(d^2 × |T1| × |T2|), which is as fast as the best known algorithm for calculating Selkow's distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees.

Proceedings Article•DOI•

Region-based modeling and tree edit distance as a basis for gesture recognition

[...]

J. Bellando¹, R. Kothari•Institutions (1)

University of Cincinnati¹

27 Sep 1999

TL;DR: A method is proposed that represents the shape of a hand as the union of a set of identically parameterized regions; signatures for gestures are generated by minimizing the edit distance between the models obtained from image frames acquired at successive time instants.

Abstract: Towards developing a gesture-based interface for human computer interaction (HCI), we propose a method which represents the shape of a hand as the union of a set of identically parameterized (disk-shaped) regions. Connectivity between the disk shaped regions is then imposed to allow for the accurate modeling of the hand. Signatures for gestures are then generated based on minimizing the edit distance between the model (consisting of both the disks and a connectivity-preserving tree structure) obtained from image frames acquired at successive time instants.

Proceedings Article•DOI•

Cross-domain approximate string matching

[...]

Daniel P. Lopresti¹, Gordon Wilfong•Institutions (1)

Bell Labs¹

21 Sep 1999

TL;DR: This work introduces a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain.

Abstract: Approximate string matching is an important paradigm in domains ranging from speech recognition to information retrieval and molecular biology. We introduce a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain. This issue arises, for example, when searching multimedia databases built using imperfect recognition technologies (e.g., speech, optical character, and handwriting recognition). We present a polynomial time algorithm for solving the problem, and describe several variations that can also be solved efficiently.

Book Chapter•DOI•

Approximate Periods of Strings

[...]

Jeong Seop Sim¹, Costas S. Iliopoulos²,³, Kunsoo Park¹, William F. Smyth³,⁴•Institutions (4)

Seoul National University¹, King's College London², Curtin University³, McMaster University⁴

22 Jul 1999

TL;DR: The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis and it is shown that the third problem is NP-complete.

Abstract: The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis. Here we study different forms of approximate periodicity under a variety of distance rules. We consider three related problems, for two of which we derive polynomial-time algorithms; we then show that the third problem is NP-complete.

Book Chapter•DOI•

Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology

[...]

Carola Wenk¹•Institutions (1)

Free University of Berlin¹

22 Jul 1999

TL;DR: This work introduces an O(α^2 mn + α^4 (m + n)) algorithm for dating such a sample sequence against an error-free master sequence, where n and m are the lengths of the sequences.

Abstract: In dendrochronology wood samples are dated according to the tree rings they contain. The dating process consists of comparing the sequence of tree ring widths in the sample to a dated master sequence. Assuming that a tree forms exactly one ring per year, a simple sliding algorithm solves this matching task. But sometimes a tree produces no ring or even two rings in a year. If a sample sequence contains this kind of inconsistency, it cannot be dated correctly by the simple sliding algorithm. We therefore introduce an O(α^2 mn + α^4 (m + n)) algorithm for dating such a sample sequence against an error-free master sequence, where n and m are the lengths of the sequences. Our algorithm takes into account that the sample might contain up to α missing or double rings and suggests possible positions for these kinds of inconsistencies. This is done by employing an edit distance as the distance measure.

Journal Article•

A graph-edit algorithm for hand-drawn graphical document recognition and their automatic introduction into CAD systems

[...]

J. Lladós, E. Martí

01 Jan 1999-Machine graphics & vision

Proceedings Article•DOI•

Approximate string matching for stroke direction and pressure sequences

[...]

Sung-Hyuk Cha¹, Yong-Chul Shin¹, Sargur N. Srihari¹•Institutions (1)

University at Buffalo¹

22 Dec 1999

TL;DR: In this article, the problem of writer identification based on similarity is formalized by defining a distance between character or word level features and finding the most similar writings or all writings which are within a certain threshold distance.

Abstract: The problem of Writer Identification based on similarity is formalized by defining a distance between character or word level features and finding the most similar writings or all writings which are within a certain threshold distance. Among many features, we consider stroke direction and pressure sequence strings of a character as character level image signatures for writer identification. As the conventional definition of edit distance is not applicable in essence, we present the newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke directional and pressure sequence string extractor used on the writer identification. The importance of this study is the attempt to give a definition of distance between two characters based on the two types of strings.

Book•

Combinatorial pattern matching : 10th Annual Symposium, CPM 99, Warwick University, UK, July 22-24, 1999 : proceedings

[...]

Maxime Crochemore, Michael S. Paterson

01 Jan 1999

TL;DR: A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text and a Dynamic Data Structure for Reverse Lexicographically Sorted Prefixes for Approximate String Matching.

Abstract: Shift-And Approach to Pattern Matching in LZW Compressed Text.- A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text.- Pattern Matching in Text Compressed by Using Antidictionaries.- On the Structure of Syntenic Distance.- Physical Mapping with Repeated Probes: The Hypergraph Superstring Problem.- Hybridization and Genome Rearrangement.- On the Complexity of Positional Sequencing by Hybridization.- GESTALT: Genomic Steiner Alignments.- Bounds on the Number of String Subsequences.- Approximate Periods of Strings.- Finding Maximal Pairs with Bounded Gap.- A Dynamic Data Structure for Reverse Lexicographically Sorted Prefixes.- A New Indexing Method for Approximate String Matching.- The Compression of Subsegments of Images Described by Finite Automata.- Ziv Lempel Compression of Huge Natural Language Data Tries Using Suffix Arrays.- Matching of Spots in 2D Electrophoresis Images. Point Matching Under Non-uniform Distortions.- Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology.- Fast Multi-dimensional Approximate Pattern Matching.- Finding Common RNA Secondary Structures from RNA Sequences.- Finding Common Subsequences with Arcs and Pseudoknots.- Computing Similarity between RNA Structures.
