Showing papers on "Edit distance" published in 1999


Journal Article
TL;DR: It is proved that computing the median string corresponds to an NP-complete decision problem, thus proving that this problem is NP-hard.

137 citations


02 Aug 1999
TL;DR: In this paper, the authors consider the problem of minimizing the total number of bits exchanged between two users, while minimizing the number of rounds and the complexity of internal computations, and show how to estimate the distance between x and y using a single message of logarithmic size.
Abstract: We have two users, A and B, who hold documents x and y respectively. Neither of the users has any information about the other's document. They exchange messages so that B computes x; it may be required that A compute y as well. Our goal is to design communication protocols with the main objective of minimizing the total number of bits they exchange; other objectives are minimizing the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how one measures the distance between x and y. We consider several metrics for measuring this distance, namely the Hamming metric, the Levenshtein metric (edit distance), and a new LZ metric, which is introduced in this paper. We show how to estimate the distance between x and y using a single message of logarithmic size. For each metric, we present the first communication-efficient protocols, which often match the corresponding lower bounds. A consequence of these are error-correcting codes for these error models which correct up to d errors in n characters using O(d log n) bits. Our most interesting methods use a new histogram transformation that we introduce to convert edit distance to L1 distance.

133 citations
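
As a point of reference for the two classical metrics named in the abstract above, here is a minimal Python sketch of the Hamming distance and the unit-cost Levenshtein distance. The function names are illustrative assumptions, and this is the textbook dynamic program, not the communication protocols or the histogram transformation from the paper.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def levenshtein(x: str, y: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string costs i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]
```

A one-character shift shows why the choice of metric matters: "abcdef" and "bcdefa" differ in every position under the Hamming metric but are only two edits apart under the Levenshtein metric.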


Patent
09 Jul 1999
TL;DR: A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and/or symbols, or sequences thereof, in a text T and a query Q.
Abstract: A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and/or symbols, or sequences thereof, in a text T, a metric M including combined edit distance metrics for an approximate degree of matching respectively between words and/or symbols, or between sequences thereof, in the text T and a query Q, the latter distance metric including weighting cost functions for edit operations which transform a sequence S of the text into a sequence P of the query Q, and search algorithms for determining the degree of matching respectively between words and/or symbols, or between sequences thereof, in respectively the text T and the query Q, such that information R is retrieved with a specified degree of matching with the query Q. Optionally the search system also includes algorithms for determining exact matching such that information R may be retrieved with an exact degree of matching with the query Q.

128 citations


Proceedings Article
27 Sep 1999
TL;DR: The normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem, and this probability distribution is used to locate matches using MAP label updates.
Abstract: This paper describes a novel framework for comparing and matching corrupted relational graphs. The normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare this criterion with that recently reported by Wilson and Hancock (1997). The use of edit distance offers an elegant alternative to the exhaustive compilation of label dictionaries. Moreover the method is polynomial rather than exponential in its worst-case complexity. We support our approach with an experimental study on synthetic data, and illustrate its effectiveness on an uncalibrated stereo correspondence problem.

126 citations


Journal Article
TL;DR: The problem of video sequence-to-sequence matching as a pattern-matching problem is formulated and the vstring edit distance is proposed as a suitable distance measure for video sequences.

75 citations


Journal Article
TL;DR: This paper develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems, and compares this run-length encoded string against the ith row or column of each of the character image-models.

71 citations


01 Jan 1999
TL;DR: This thesis gives a couple of modifications which make the EMD more amenable to partial matching, including the partial EMD in which only a given fraction of the weight in one distribution is forced to match weight in the other, and presents algorithms that are guaranteed to find a globally optimal transformation when matching equal-weight distributions under translation.
Abstract: This thesis is devoted to the Earth Mover's Distance and its use within content-based image retrieval (CBIR). The major CBIR problem discussed is the pattern problem: Given an image and a query pattern, determine if the image contains a region which is visually similar to the pattern; if so, find at least one such image region. The Earth Mover's Distance (EMD) is an edit distance between distributions that allows for partial matching, and which has many applications in CBIR. We give a couple of modifications which make the EMD more amenable to partial matching, including the partial EMD in which only a given fraction of the weight in one distribution is forced to match weight in the other. An important issue addressed in this thesis is the use of efficient, effective lower bounds on the EMD to speed up retrieval times. We contribute lower bounds that are applicable in the partial matching case in which distributions do not have the same total weight. Another important problem in CBIR is the EMD under transformation (EMD_G) problem: find a transformation of one distribution which minimizes its EMD to another, where the set of allowable transformations G is given. The problem of estimating the size/scale at which a pattern occurs in an image is phrased and efficiently solved as an EMD_G problem. For EMD_G problems with transformations that modify the points of a distribution but not its weights, we present a very general, monotonically convergent iteration called the FT iteration. This iteration may, however, converge to only a locally optimal EMD value and transformation. We also present algorithms that are guaranteed to find a globally optimal transformation when matching equal-weight distributions under translation. Our pattern problem solution is the SEDL (Scale Estimation for Directed Location) content-based image retrieval system. Three important contributions of this system are (1) a general framework for finding both color and shape patterns, (2) the previously mentioned novel scale estimation algorithm using the EMD, and (3) a directed (as opposed to exhaustive) search strategy. We show that SEDL achieves excellent results for the color pattern problem on a database of product advertisements, and the shape pattern problem on a database of Chinese characters.

64 citations


Journal Article
01 Nov 1999
TL;DR: A simple and efficient algorithm that runs in time N^2 is presented; it duplicates one of the two sequences and then performs a modified version of the standard dynamic programming algorithm, and it performs very well in practice.
Abstract: Motivation: Circular permutation of a protein is a genetic operation in which part of the C-terminal of the protein is moved to its N-terminal. Recently, it has been shown that proteins that undergo engineered circular permutations generally maintain their three dimensional structure and biological function. This observation raises the possibility that circular permutation has occurred in Nature during evolution. In this scenario a protein underwent circular permutation into another protein, thereafter both proteins further diverged by standard genetic operations. To study this possibility one needs an efficient algorithm that for a given pair of proteins can detect the underlying event of circular permutations. A possible formal description of the question is: given two sequences, find a circular permutation of one of them under which the edit distance between the proteins is minimal. A naive algorithm might take time proportional to N^3 or even N^4, which is prohibitively slow for a large-scale survey. A sophisticated algorithm that runs in asymptotic time of N^2 was recently suggested, but it is not practical for a large-scale survey. Results: A simple and efficient algorithm that runs in time N^2 is presented. The algorithm is based on duplicating one of the two sequences, and then performing a modified version of the standard dynamic programming algorithm. While the algorithm is not guaranteed to find the optimal results, we present data that indicate that in practice the algorithm performs very well. Availability: A Fortran program that calculates the optimal edit distance under circular permutation is available upon request from the authors.

55 citations
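
To make the formal question in the abstract above concrete, the following Python sketch implements only the naive cubic-time baseline: it tries every rotation of one sequence and keeps the one with the smallest unit-cost edit distance to the other. The names `edit_distance` and `min_rotation_distance` are hypothetical, unit costs are an assumption, and this is not the duplicate-and-align N^2 heuristic that the paper presents.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (ca != cb)))  # match / substitution
        prev = curr
    return prev[-1]


def min_rotation_distance(a: str, b: str):
    """Return (distance, shift) minimising edit_distance(a[shift:] + a[:shift], b).

    Tries all len(a) rotations, so the total work is about N * N^2 = N^3.
    Ties are broken in favour of the smallest shift.
    """
    if not a:
        return edit_distance(a, b), 0
    best = None
    for shift in range(len(a)):
        rotated = a[shift:] + a[:shift]
        d = edit_distance(rotated, b)
        if best is None or d < best[0]:
            best = (d, shift)
    return best
```

For example, `min_rotation_distance("DEFABC", "ABCDEF")` finds that moving the first three characters to the end makes the two sequences identical.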


Book Chapter
22 Jul 1999
TL;DR: A new indexing method based on a suffix tree combined with a partitioning of the pattern that outperforms by far all other algorithms for indexed approximate searching, and it is shown how this index can be implemented using much less space.
Abstract: We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is O(n^λ), for 0 < λ < 1, whenever α < 1 - e/√σ, where α is the error level tolerated and σ is the alphabet size. We experimentally show that this index outperforms by far all other algorithms for indexed approximate searching, also being the first experiments that compare the different existing schemes. We finally show how this index can be implemented using much less space.

43 citations


Proceedings Article
20 Sep 1999
TL;DR: The proposed method converts a two-dimensional image into a one-dimensional string and computes the edit distance by the modified approximate string matching algorithm and presents the details of applications in handwriting analysis and both online and offline character recognition.
Abstract: Given two character images, we would like to measure their similarity or difference. Such a similarity or difference measure facilitates the solution to character recognition and handwriting analysis problems. There is, however, no universal definition for similarity measure satisfying a wide range of characteristics such as the slant, deformation or other invariant constraints. For this reason, we propose a new definition for the character similarity measure. First, the proposed method converts a two-dimensional image into a one-dimensional string. Next, it computes the edit distance by the modified approximate string matching algorithm. We describe how to extract the string information and compute the distance and then present the details of applications in handwriting analysis and both online and offline character recognition.

27 citations


Proceedings Article
24 Apr 1999
TL;DR: An O(mn log n)-time algorithm for the problem of normalized edit distance computation when the cost function is uniform, except substitutions can have different weights depending on whether they are matching or non-matching.
Abstract: A common model for computing the similarity of two strings X and Y of lengths m and n respectively, with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst-case. We give an O(mn log n)-time algorithm for the problem when the cost function is uniform, i.e., the weight of each edit operation is constant within the same type, except substitutions can have different weights depending on whether they are matching or non-matching.
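
The definition of normalized edit distance in the abstract above (minimum over edit sequences of weight divided by length) can be illustrated with a direct Python sketch. It is a straightforward cubic-time dynamic program that tracks the number of operations explicitly, not the O(mn log n) algorithm of the paper. The function name, the default weights, and the convention that a matching substitution counts as one operation of weight 0 are assumptions made for illustration.

```python
import math


def normalized_edit_distance(x, y, ins=1.0, dele=1.0, sub=1.0, match=0.0):
    """Minimum over edit sequences of (total weight) / (number of operations)."""
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0  # convention for two empty strings (assumption)
    inf = math.inf
    # prev[i][j]: least weight of a sequence of exactly k-1 operations
    # that turns x[:i] into y[:j]; initially k-1 = 0.
    prev = [[inf] * (n + 1) for _ in range(m + 1)]
    prev[0][0] = 0.0
    best = inf
    for k in range(1, m + n + 1):
        curr = [[inf] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                c = inf
                if i > 0:
                    c = min(c, prev[i - 1][j] + dele)      # delete x[i-1]
                if j > 0:
                    c = min(c, prev[i][j - 1] + ins)       # insert y[j-1]
                if i > 0 and j > 0:
                    w = match if x[i - 1] == y[j - 1] else sub
                    c = min(c, prev[i - 1][j - 1] + w)     # (non-)matching substitution
                curr[i][j] = c
        if curr[m][n] < inf:
            best = min(best, curr[m][n] / k)  # amortized weight for length k
        prev = curr
    return best
```

With these default weights, "ab" versus "ba" gives 2/3 (delete, match, insert) rather than 1 (two substitutions), which shows that the normalized distance is not simply the ordinary edit distance divided by a string length.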

Book Chapter
22 Jul 1999
TL;DR: GESTALT (GEnomic sequences STeiner ALignmenT), a public-domain suite of programs for generating multiple alignments of a set of biosequences, is described, which includes the traditional space-saving ideas of Hirschberg as well as new data-packing techniques.
Abstract: We describe GESTALT (GEnomic sequences STeiner ALignmenT), a public-domain suite of programs for generating multiple alignments of a set of biosequences. We allow the use of either of the two popular objectives, Tree Alignment or Sum-of-Pairs. The main distinguishing feature of our method is that the alignment is obtained via a tree in which the internal nodes (ancestors) are labeled by Steiner sequences for triples of the input sequences. Given lists of candidate labels for the ancestral sequences, we use dynamic programming to choose an optimal labeling under either objective function. Finally, the fully labeled tree of sequences is turned into a multiple alignment. Enhancements in our implementation include the traditional space-saving ideas of Hirschberg as well as new data-packing techniques. The running-time bottleneck of computing exact Steiner sequences is handled by a highly effective but much faster heuristic alternative. Finally, other modules in the suite allow automatic generation of linear-program input files that can be used to compute new lower bounds on the optimal values. We also report on some preliminary computational experiments with GESTALT.

Journal Article
TL;DR: This new technique has several advantages over previous methods for determining alignments in linear space, such as: simplicity, the ability to use essentially the same technique when using different cost functions, and the practical advantage of easily being able to trade available memory for running time.

Journal Article
TL;DR: This paper adds the swap operation, which interchanges two adjacent characters, to the set of allowable edit operations, and presents an O(t min(m, n))-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings; it also addresses the extended k-differences problem.
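
To illustrate what adding an adjacent-character swap to the edit operations means, here is a minimal Python sketch of the well-known restricted variant (often called optimal string alignment distance), in which a swapped pair is not edited again. It is the plain O(mn) dynamic program with unit costs, shown only as an assumption-laden illustration; it is not the faster algorithm of the paper, whose exact swap semantics are not given in this listing.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent swap (unit costs)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
            # Swap: the last two characters of a[:i] are those of b[:j] reversed.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

Under this variant "ab" and "ba" are one swap apart, whereas the swap-free edit distance between them is 2.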

Journal Article
TL;DR: The proposed algorithm solves the problem in time O(d^2 × |T1| × |T2|), which is as fast as the best known algorithm for calculating Selkow's distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees.

Proceedings Article
27 Sep 1999
TL;DR: A method is proposed that represents the shape of a hand as the union of a set of identically parameterized regions; gesture signatures are generated by minimizing the edit distance between the models obtained from image frames acquired at successive time instants.
Abstract: Towards developing a gesture-based interface for human computer interaction (HCI), we propose a method which represents the shape of a hand as the union of a set of identically parameterized (disk-shaped) regions. Connectivity between the disk shaped regions is then imposed to allow for the accurate modeling of the hand. Signatures for gestures are then generated based on minimizing the edit distance between the model (consisting of both the disks and a connectivity-preserving tree structure) obtained from image frames acquired at successive time instants.

Proceedings Article
21 Sep 1999
TL;DR: This work introduces a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain.
Abstract: Approximate string matching is an important paradigm in domains ranging from speech recognition to information retrieval and molecular biology. We introduce a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain. This issue arises, for example, when searching multimedia databases built using imperfect recognition technologies (e.g., speech, optical character, and handwriting recognition). We present a polynomial time algorithm for solving the problem, and describe several variations that can also be solved efficiently.

Book Chapter
22 Jul 1999
TL;DR: The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis and it is shown that the third problem is NP-complete.
Abstract: The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis. Here we study different forms of approximate periodicity under a variety of distance rules. We consider three related problems, for two of which we derive polynomial-time algorithms; we then show that the third problem is NP-complete.

Book Chapter
22 Jul 1999
TL;DR: This work introduces an O(α^2 mn + α^4 (m + n)) algorithm for dating such a sample sequence against an error-free master sequence, where n and m are the lengths of the sequences.
Abstract: In dendrochronology wood samples are dated according to the tree rings they contain. The dating process consists of comparing the sequence of tree ring widths in the sample to a dated master sequence. Assuming that a tree forms exactly one ring per year, a simple sliding algorithm solves this matching task. But sometimes a tree produces no ring or even two rings in a year. If a sample sequence contains these kinds of inconsistencies it cannot be dated correctly by the simple sliding algorithm. We therefore introduce an O(α^2 mn + α^4 (m + n)) algorithm for dating such a sample sequence against an error-free master sequence, where n and m are the lengths of the sequences. Our algorithm takes into account that the sample might contain up to α missing or double rings and suggests possible positions for these kinds of inconsistencies. This is done by employing an edit distance as the distance measure.
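
The "simple sliding algorithm" mentioned in the abstract above is not spelled out in this listing. One plausible reading, sketched here in Python, slides the sample along the master sequence and keeps the offset with the smallest mismatch. The function name and the use of the sum of absolute width differences as the mismatch score are assumptions, and the sketch does not handle missing or double rings, which is what the paper's edit-distance algorithm is for.

```python
def best_offset(master, sample):
    """Return (offset, cost) of the best alignment of sample inside master.

    cost is the sum of absolute differences between ring widths (an assumed
    score). Returns None when the sample is longer than the master.
    """
    best = None
    for offset in range(len(master) - len(sample) + 1):
        cost = sum(abs(master[offset + k] - width)
                   for k, width in enumerate(sample))
        if best is None or cost < best[1]:
            best = (offset, cost)
    return best
```

If the master sequence starts in a known year, the returned offset gives the year of the first ring of the sample under the one-ring-per-year assumption.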


Proceedings Article
22 Dec 1999
TL;DR: In this article, the problem of writer identification based on similarity is formalized by defining a distance between character or word level features and finding the most similar writings or all writings which are within a certain threshold distance.
Abstract: The problem of writer identification based on similarity is formalized by defining a distance between character or word level features and finding the most similar writings or all writings which are within a certain threshold distance. Among many features, we consider stroke direction and pressure sequence strings of a character as character level image signatures for writer identification. As the conventional definition of edit distance is not applicable in essence, we present the newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke directional and pressure sequence string extractor used for writer identification. The importance of this study is the attempt to give a definition of distance between two characters based on the two types of strings.

Book
01 Jan 1999
TL;DR: A proceedings volume whose papers include a general practical approach to pattern matching over Ziv-Lempel compressed text, a dynamic data structure for reverse lexicographically sorted prefixes, and a new indexing method for approximate string matching.
Abstract: Shift-And Approach to Pattern Matching in LZW Compressed Text.- A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text.- Pattern Matching in Text Compressed by Using Antidictionaries.- On the Structure of Syntenic Distance.- Physical Mapping with Repeated Probes: The Hypergraph Superstring Problem.- Hybridization and Genome Rearrangement.- On the Complexity of Positional Sequencing by Hybridization.- GESTALT: Genomic Steiner Alignments.- Bounds on the Number of String Subsequences.- Approximate Periods of Strings.- Finding Maximal Pairs with Bounded Gap.- A Dynamic Data Structure for Reverse Lexicographically Sorted Prefixes.- A New Indexing Method for Approximate String Matching.- The Compression of Subsegments of Images Described by Finite Automata.- Ziv Lempel Compression of Huge Natural Language Data Tries Using Suffix Arrays.- Matching of Spots in 2D Electrophoresis Images. Point Matching Under Non-uniform Distortions.- Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology.- Fast Multi-dimensional Approximate Pattern Matching.- Finding Common RNA Secondary Structures from RNA Sequences.- Finding Common Subsequences with Arcs and Pseudoknots.- Computing Similarity between RNA Structures.