
Showing papers on "Edit distance published in 1997"


Patent
25 Jan 1997
TL;DR: In this patent, a document browser for electronic filing systems that supports pen-based markup and annotation is described, in which the user may electronically write notes (60-64) anywhere on a page (32, 38) and later search for those notes using the approximate ink matching (AIM) technique.
Abstract: In summary there is disclosed a document browser for electronic filing systems, which supports pen-based markup and annotation. The user may electronically write notes (60-64) anywhere on a page (32, 38) and then later search for those notes using the approximate ink matching (AIM) technique. The technique segments (104) the user-drawn strokes, extracts (108) and vector quantizes (112) features contained in those strokes. An edit distance comparison technique (118) is used to query the database (120), rendering the system capable of performing approximate or partial matches to achieve fuzzy search capability.

299 citations
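A minimal Python sketch of the retrieval step summarized above. It assumes the user's strokes have already been segmented and vector-quantized into sequences of codebook indices, which is a simplification of the patent's pipeline; annotations whose code sequences fall within an edit-distance threshold of the query are returned as approximate matches. The database layout, codes and threshold are illustrative only, not taken from the patent.

def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences of VQ codes."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a code
                          d[i][j - 1] + 1,        # insert a code
                          d[i - 1][j - 1] + cost) # substitute / match
    return d[m][n]

def fuzzy_search(query_codes, database, max_dist=3):
    """Return all stored annotations whose code sequence is close to the query."""
    return [page for page, codes in database
            if edit_distance(query_codes, codes) <= max_dist]

# Example: two annotations stored as (page id, VQ code sequence).
db = [("page-32", [4, 7, 7, 2, 9]), ("page-38", [1, 1, 5, 3])]
print(fuzzy_search([4, 7, 2, 9], db))   # -> ['page-32']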


Journal ArticleDOI
TL;DR: This paper examines string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence; it shows that several variants are NP-complete and gives polynomial-time algorithms for solving the remainder.

118 citations


Proceedings ArticleDOI
01 Jan 1997
TL;DR: This paper proposes an indexing scheme which is totally based on lengths and relative distances between sequences, and uses vp-trees as the underlying distance-based index structures in its method.
Abstract: In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most of the previous research is concentrated on similarity matching and retrieval of sequences of the same length using Euclidean distance metric. For similarity matching of sequences, we use a modified version of the edit distance function, and consider two sequences matching if a majority of the elements in the sequences match. In the matching process a mapping among non-matching elements is created to check if there are unacceptable deviations among them. This means that two matching sequences should have lengths that are comparable. For efficient retrieval of matching sequences, we propose an indexing scheme which is totally based on lengths and relative distances between sequences. We use vp-trees as the underlying distance-based index structures in our method.

105 citations
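A rough sketch of the matching criterion described above, under our own simplified reading rather than the paper's exact definition: elements "match" when they align exactly in an optimal alignment, and two sequences are declared matching when a majority of the longer one is matched. Note that a majority match forces the two lengths to be comparable, which is what makes the paper's length-based, vp-tree indexing plausible.

def aligned_matches(a, b):
    """Number of exactly matching element pairs in an optimal alignment (LCS length)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]

def majority_match(a, b):
    """Two sequences 'match' when most of the longer sequence is accounted for."""
    return aligned_matches(a, b) * 2 > max(len(a), len(b))

print(majority_match([1, 2, 3, 4, 5], [1, 2, 9, 4, 5, 6]))  # True: 4 of 6 elements match
print(majority_match([1, 2, 3], [7, 8, 9, 10, 11, 12]))     # False: nothing matches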


Book ChapterDOI
01 Apr 1997
TL;DR: Serial and parallel algorithmic solutions are described for the string editing problem for input strings x and y, which models a variety of problems arising in such diverse areas as text and speech processing, geology and, last but not least, molecular biology.
Abstract: The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol of x with another symbol. String editing models a variety of problems arising in such diverse areas as text and speech processing, geology and, last but not least, molecular biology. Special cases of string editing include the longest common subsequence problem, local alignment and similarity searching in DNA and protein sequences, and approximate string searching. We describe serial and parallel algorithmic solutions for the problem and some of its basic variants.

88 citations
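A minimal dynamic-programming sketch of the weighted string editing problem defined above: x is transformed into y by deletions, insertions and substitutions, each with its own weight, and the table entry d[i][j] holds the minimum total cost of turning the first i symbols of x into the first j symbols of y. The weights below are illustrative.

def string_edit_cost(x, y, w_del=1.0, w_ins=1.0, w_sub=1.5):
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del                       # delete a prefix of x
    for j in range(1, n + 1):
        d[0][j] = j * w_ins                       # insert a prefix of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,    # delete x[i-1]
                          d[i][j - 1] + w_ins,    # insert y[j-1]
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[m][n]

print(string_edit_cost("geology", "genomics"))

Making substitutions prohibitively expensive (w_sub at least w_ins + w_del) forces the optimal script to use only insertions and deletions, which is exactly the longest common subsequence special case mentioned in the abstract.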


Proceedings ArticleDOI
12 Oct 1997
TL;DR: A distance metric called "edit" distance is described which quantifies the syntactic difference between two genetic programs; the relationships between these data and run performance are imprecise, but they are sufficiently interesting to encourage further investigation into the use of edit distance.
Abstract: I describe a distance metric called "edit" distance which quantifies the syntactic difference between two genetic programs. In the context of one specific problem, the 6 bit multiplexor, I use the metric to analyze the amount of new material introduced by different crossover operators, the difference among the best individuals of a population and the difference among the best individuals and the rest of the population. The relationships between these data and run performance are imprecise but they are sufficiently interesting to encourage further investigation into the use of edit distance.

80 citations
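One simple way to realize such a syntactic distance between genetic programs, sketched below, is to linearize each program tree into its prefix token sequence and take the Levenshtein distance between the token lists; this is an illustration only and not necessarily the exact metric used in the paper. The multiplexor primitives in the example are likewise illustrative.

def tokens(tree):
    """Flatten a nested-tuple program tree into its prefix token sequence."""
    if isinstance(tree, tuple):
        out = []
        for node in tree:
            out.extend(tokens(node))
        return out
    return [tree]

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (rolling single row)."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = min(d[j] + 1, d[j - 1] + 1,
                      prev + (a[i - 1] != b[j - 1]))
            prev, d[j] = d[j], cur
    return d[n]

parent = ('IF', 'a0', ('AND', 'd0', 'd1'), 'd2')
child  = ('IF', 'a0', ('OR', 'd0', 'd3'), 'd2')
print(edit_distance(tokens(parent), tokens(child)))   # 2 substituted tokens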


Journal ArticleDOI
TL;DR: It is shown that cost functions having the same ratio of the sum of the insertion and deletion costs divided by the substitution cost yield the same minimum cost sequences of edit operations, which leads to a partitioning of the universe of cost functions into equivalence classes.
Abstract: Finding a sequence of edit operations that transforms one string of symbols into another with the minimum cost is a well-known problem. The minimum cost, or edit distance, is a widely used measure of the similarity of two strings. An important parameter of this problem is the cost function, which specifies the cost of each insertion, deletion, and substitution. We show that cost functions having the same ratio of the sum of the insertion and deletion costs divided by the substitution cost yield the same minimum cost sequences of edit operations. This leads to a partitioning of the universe of cost functions into equivalence classes. Also, we show the relationship between a particular set of cost functions and the longest common subsequence of the input strings.

48 citations
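The equivalence-class claim can be checked empirically on short strings. The sketch below enumerates every edit script for one small example and verifies that the set of minimum-cost scripts is identical for two cost functions sharing the same (insertion + deletion) / substitution ratio, and generally different for a cost function with another ratio. The exhaustive enumeration is exponential and meant only as an illustration of the theorem, not as an algorithm.

def all_scripts(x, y):
    """Enumerate every edit script turning x into y (feasible only for short strings)."""
    def rec(i, j):
        if i == len(x) and j == len(y):
            yield ()
            return
        if i < len(x):
            for rest in rec(i + 1, j):
                yield (('del', x[i]),) + rest
        if j < len(y):
            for rest in rec(i, j + 1):
                yield (('ins', y[j]),) + rest
        if i < len(x) and j < len(y):
            op = 'match' if x[i] == y[j] else 'sub'
            for rest in rec(i + 1, j + 1):
                yield ((op, x[i], y[j]),) + rest
    return list(rec(0, 0))

def cost(script, c_ins, c_del, c_sub):
    table = {'ins': c_ins, 'del': c_del, 'sub': c_sub, 'match': 0.0}
    return sum(table[op[0]] for op in script)

def optimal_set(x, y, c_ins, c_del, c_sub):
    scripts = all_scripts(x, y)
    best = min(cost(s, c_ins, c_del, c_sub) for s in scripts)
    return {s for s in scripts if abs(cost(s, c_ins, c_del, c_sub) - best) < 1e-9}

x, y = "abcd", "acbd"
same_ratio_1 = optimal_set(x, y, 1.0, 1.0, 1.0)   # ratio (1 + 1) / 1 = 2
same_ratio_2 = optimal_set(x, y, 0.5, 1.5, 1.0)   # ratio (0.5 + 1.5) / 1 = 2
other_ratio  = optimal_set(x, y, 1.0, 1.0, 3.0)   # ratio 2 / 3
print(same_ratio_1 == same_ratio_2)   # True: same ratio, same optimal scripts
print(same_ratio_1 == other_ratio)    # False: substitutions are now too expensive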


Book ChapterDOI
17 Aug 1997
TL;DR: By systematic computer simulations, it is shown that the minimum output segment length required for a successful attack is linear in the total length of the two stop/go clocked shift registers.
Abstract: A novel edit distance between two binary input strings and one binary output string of appropriate lengths which incorporates the stop/go clocking in the alternating step generator is introduced. An efficient recursive algorithm for the edit distance computation is derived. The corresponding correlation attack on the two stop/go clocked shift registers is then proposed. By systematic computer simulations, it is shown that the minimum output segment length required for a successful attack is linear in the total length of the two stop/go clocked shift registers. This is verified by experimental attacks on relatively short shift registers.

42 citations


Journal Article
TL;DR: In this article, a novel edit distance between two binary input strings and one binary output string of appropriate length which incorporates the stop/go clocking in the alternating step generator is introduced.
Abstract: A novel edit distance between two binary input strings and one binary output string of appropriate lengths which incorporates the stop/go clocking in the alternating step generator is introduced. An efficient recursive algorithm for the edit distance computation is derived. The corresponding correlation attack on the two stop/go clocked shift registers is then proposed. By systematic computer simulations, it is shown that the minimum output segment length required for a successful attack is linear in the total length of the two stop/go clocked shift registers. This is verified by experimental attacks on relatively short shift registers.

40 citations


Book
29 May 1997
TL;DR: This chapter focuses on the problem of evaluating a longest common subsequence, which is expressively equivalent to the simple form of the Levenshtein distance.
Abstract: In the previous chapters, we discussed problems involving an exact match of string patterns. We now turn to problems involving similar but not necessarily exact pattern matches. There are a number of similarity or distance measures, and many of them are special cases or generalizations of the Levenshtein metric. The problem of evaluating the measure of string similarity has numerous applications, including one arising in the study of the evolution of long molecules such as proteins. In this chapter, we focus on the problem of evaluating a longest common subsequence, which is expressively equivalent to the simple form of the Levenshtein distance.

37 citations
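A compact illustration of the equivalence mentioned above: the length of a longest common subsequence determines the simple (insertion/deletion only) Levenshtein distance via |x| + |y| - 2*|LCS(x, y)|. The example strings are illustrative.

def lcs_length(x, y):
    """Length of a longest common subsequence by the standard dynamic program."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]

x, y = "ACCGGTC", "AGGTC"
k = lcs_length(x, y)                      # 5 ("AGGTC")
print(k, len(x) + len(y) - 2 * k)         # LCS length and the indel-only distance (2)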


Proceedings Article
08 Jul 1997
TL;DR: In this paper, a stochastic model for string-edit distance is proposed, which is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

28 citations
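The generative view behind a stochastic string-edit distance can be sketched with a forward-style dynamic program: the "distance" is the negative log-probability of jointly generating the two strings, summed over all edit sequences. The per-operation probabilities below are toy values fixed by hand; in the paper they are learned from a corpus, and that learning step is omitted here.

import math

def stochastic_edit_distance(x, y, p_sub, p_ins, p_del, p_end):
    """-log probability of jointly generating x and y under a memoryless edit model."""
    m, n = len(x), len(y)
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:
                f[i][j] += f[i - 1][j] * p_del(x[i - 1])
            if j > 0:
                f[i][j] += f[i][j - 1] * p_ins(y[j - 1])
            if i > 0 and j > 0:
                f[i][j] += f[i - 1][j - 1] * p_sub(x[i - 1], y[j - 1])
    return -math.log(f[m][n] * p_end)

# Toy parameters: a matching pair is much more likely than a mismatch,
# and insertions/deletions are rarer still.
p_sub = lambda a, b: 0.20 if a == b else 0.01
p_ins = lambda b: 0.005
p_del = lambda a: 0.005
print(stochastic_edit_distance("abca", "abba", p_sub, p_ins, p_del, 0.1))  # smaller
print(stochastic_edit_distance("abca", "dddd", p_sub, p_ins, p_del, 0.1))  # larger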


Journal ArticleDOI
TL;DR: A peptide matching approach to the multiple comparison of a set of protein sequences is presented, based on looking for all the words that are common to q of these sequences, where q is a parameter.

Book ChapterDOI
22 Aug 1997
TL;DR: This paper deals with the recognition of symbols and structural textures in architectural plans using string matching techniques, with a string codification representing the sequence of outlining edges of a region.
Abstract: This paper deals with the recognition of symbols and structural textures in architectural plans using string matching techniques. A plan is represented by an attributed graph whose nodes represent characteristic points and whose edges represent segments. Symbols and textures can be seen as a set of regions, i.e. closed loops in the graph, with a particular arrangement. The search for a symbol involves a graph matching between the regions of a model graph and the regions of the graph representing the document. Discriminating a texture means a clustering of neighbouring regions of this graph. Both procedures involve a similarity measure between graph regions. A string codification is used to represent the sequence of outlining edges of a region. Thus, the similarity between two regions is defined in terms of the string edit distance between their boundary strings. The use of string matching allows the recognition method to work also under presence of distortion.


Book ChapterDOI
30 Jun 1997
TL;DR: The unbiased estimator herein is shown to give good results in a matter of a thousand samples even for small probability patterns, which is expected to improve the performance of Anrep and may have utility in estimating the significance of similarity searches.
Abstract: While considerable effort and some progress have been made on developing an analytic formula for the probability of an approximate match, such work has not achieved fruition [4, 6, 2, 1]. Therefore, we consider here the development of an unbiased estimation procedure for determining said probability given a specific string P ∈ Σ* and a specific cost function δ for weighting edit operations. Problems of this type are of general interest; see, for example, a recent paper [5] giving an unbiased estimator for counting the words of a fixed length in a regular language. We were further motivated by a particular application arising in the pattern matching system Anrep designed by us for use in genomic sequence analysis [8, 11]. Anrep accomplishes a search for a complex pattern by backtracking over subprocedures that find approximate matches. The subpatterns are searched in an order that attempts to minimize the expected running time of the search. Determining this optimal backtrack order requires a reasonably accurate estimate of the probability with which one will find an approximate match to each subpattern. Given that the probabilities involved are frequently very small, the simple expedient of measuring match frequency over a random text of several thousand characters has been less than satisfactory. The unbiased estimator herein is shown to give good results in a matter of a thousand samples even for small probability patterns. Thus it is expected to improve the performance of Anrep and may have utility in estimating the significance of similarity searches. Proceeding formally, suppose we are given
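The abstract contrasts the desired estimator with the simple expedient of measuring match frequency over random text. The sketch below shows that naive Monte Carlo baseline for one simplified reading of the problem (the probability that a random string of the pattern's length lies within a unit-cost edit-distance threshold of the pattern); it is the slow-converging approach that the paper's unbiased estimator is designed to improve upon. Pattern, alphabet and threshold are illustrative.

import random

def edit_distance(a, b):
    """Unit-cost Levenshtein distance (rolling single row)."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

def naive_match_probability(pattern, alphabet, threshold, samples=2000, seed=1):
    """Fraction of random same-length strings within the edit-distance threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        s = "".join(rng.choice(alphabet) for _ in pattern)
        hits += edit_distance(pattern, s) <= threshold
    return hits / samples

# With threshold 1 the true probability is tiny (25 of the 65536 length-8 strings),
# so a run of a few thousand samples frequently reports 0.0; this slow convergence
# for rare events is exactly the weakness that motivates the paper's estimator.
print(naive_match_probability("ACGTACGT", "ACGT", threshold=1))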

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers the problem of finding the longest common subsequence of two strings, and develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.
Abstract: Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X|·|Y|) time. We develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string aaaabbbbcccabbbbcc can be encoded as a^4 b^4 c^3 a^1 b^4 c^2. Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, run-length coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.
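Both representations discussed above are easy to sketch: run-length encoding of a string (reproducing the example from the abstract) and the standard O(|X|·|Y|) dynamic program for LCS length, which is the baseline the paper's algorithms improve upon for run-length encoded inputs.

from itertools import groupby

def run_length_encode(s):
    """'aaaabbbbcccabbbbcc' -> [('a', 4), ('b', 4), ('c', 3), ('a', 1), ('b', 4), ('c', 2)]"""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def lcs_length(x, y):
    """Quadratic-time LCS length on the expanded (not run-length encoded) strings."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]

print(run_length_encode("aaaabbbbcccabbbbcc"))
print(lcs_length("aaabccc", "aabbbcc"))   # 5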

Proceedings ArticleDOI
18 Dec 1997
TL;DR: Three algorithms for string matching on reconfigurable mesh architectures are presented and the first algorithm finds the exact matching between T and P in O(1) time on a 2-dimensional RMESH of size (n-m+1)×m.
Abstract: The string matching problem has received much attention over the years due to its importance in various applications such as text/file comparison, DNA sequencing, search engines, and spelling correction. Especially with the introduction of search engines dealing with the tremendous amount of textual information on the World Wide Web and the research on DNA sequencing, this problem deserves special attention, and any algorithmic or hardware improvements to speed up the process will benefit these important applications. In this paper, we present three algorithms for string matching on reconfigurable mesh architectures. Given a text T of length n and a pattern P of length m, the first algorithm finds the exact matching between T and P in O(1) time on a 2-dimensional RMESH of size (n-m+1)×m. The second algorithm finds the approximate matching between T and P in O(k) time on a 2D RMESH, where k is the maximum edit distance between T and P. The third algorithm allows only the replacement operation in the calculation of the edit distance and finds an approximate matching between T and P in constant time on a 3D RMESH.

Proceedings ArticleDOI
01 Jul 1997
TL;DR: A new approach to measuring the similarity of 3D curves is presented, based on an extension of the classical string edit distance that allows strings whose elements are vectors rather than single symbols.
Abstract: In this paper a new approach to measuring the similarity of 3D curves is presented. This approach is based on an extension of the classical string edit distance in two ways. The first extension is the possibility to use strings where each element can be a vector rather than a single symbol, while the second extension is the use of fuzzy set based cost functions in the edit distance computation. These two extensions allow us to tackle various problems that cannot be solved by means of the "classical" string edit distance.
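A simplified sketch of the two extensions: the string elements are 3D vectors (for example, successive tangent directions along a curve) and the substitution cost is a soft function of the angle between them rather than a hard symbol comparison. The particular membership-style cost and the fixed insertion/deletion weight are illustrative, not the paper's fuzzy-set definitions.

import math

def angle(u, v):
    """Angle in radians between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def sub_cost(u, v, width=math.pi / 4):
    """Soft substitution cost: 0 for identical directions, saturating at 1 beyond `width`."""
    return min(1.0, angle(u, v) / width)

def vector_edit_distance(p, q, indel=1.0):
    m, n = len(p), len(q)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel,
                          d[i][j - 1] + indel,
                          d[i - 1][j - 1] + sub_cost(p[i - 1], q[j - 1]))
    return d[m][n]

curve_a = [(1, 0, 0), (1, 1, 0), (0, 1, 0)]
curve_b = [(1, 0, 0), (1, 1, 0.2), (0, 1, 0)]
print(vector_edit_distance(curve_a, curve_b))   # small: nearly identical directions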

Book ChapterDOI
30 Jun 1997
TL;DR: This paper includes the swap operation that interchanges two adjacent characters into the set of allowable edit operations, and presents an O(t min(m,n)-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings.
Abstract: Most research on the edit distance problem and the k-differences problem considered the set of edit operations consisting of changes, deletions, and insertions. In this paper we include the swap operation that interchanges two adjacent characters into the set of allowable edit operations, and we present an O(t min(m,n))-time algorithm for the extended edit distance problem, where t is the edit distance between the given strings, and an O(kn)-time algorithm for the extended k-differences problem. That is, we add swaps into the set of edit operations without increasing the time complexities of previous algorithms that consider only changes, deletions, and insertions for the edit distance and k-differences problems.
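The extended operation set is easy to state as a straightforward quadratic dynamic program, in the spirit of the restricted Damerau-Levenshtein distance; this is only the baseline formulation, whereas the paper's contribution is computing the same quantity in O(t min(m,n)) time and the k-differences variant in O(kn) time.

def edit_distance_with_swaps(x, y):
    """Edit distance with changes, insertions, deletions and adjacent swaps."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # change / match
            if (i > 1 and j > 1 and x[i - 1] == y[j - 2]
                    and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # swap of adjacent characters
    return d[m][n]

print(edit_distance_with_swaps("abcd", "acbd"))   # 1: one swap instead of two changes
print(edit_distance_with_swaps("abcd", "abcd"))   # 0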

Proceedings ArticleDOI
12 Oct 1997
TL;DR: It is shown how the classifier can be trained to get the optimal parametric distance using vector quantization in the meta-space, and report classification results after such a training process.
Abstract: Considers a fundamental problem in syntactic pattern recognition in which we are required to recognize a string from its noisy version. We assume that the system has a dictionary which is a collection of all the ideal representations of the objects in question. When a noisy sample has to be processed, the system compares it with every element in the dictionary based on a nearest-neighbor philosophy. This is typically achieved using three standard edit operations: substitution, insertion and deletion. To accomplish this, one usually assigns a distance for the elementary symbol operations, d(.,.), and the inter-pattern distance, D(.,.), is computed as a function of these symbol edit distances. In this paper, we consider the assignment of the inter-symbol distances in terms of the novel and interesting assignments, the parametric distances, introduced by Bunke et al. (1993). We show how the classifier can be trained to get the optimal parametric distance using vector quantization in the meta-space, and report classification results after such a training process. In all our experiments, the training was typically achieved in a very few iterations. The subsequent classification accuracy we obtained using this single-parameter scheme was 96.13%. The power of the scheme is obvious if we compare it to 96.67%, which is the accuracy of the scheme which uses the complete array of inter-symbol distances derived from a knowledge of all the confusion probabilities.
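A sketch of the nearest-neighbor recognition loop with a one-parameter family of inter-symbol costs: here every insertion and deletion costs 1 and every substitution costs a single tunable value r. This stands in for the parametric distances of Bunke et al. without claiming their exact form; in the paper the parameter is obtained by training (via vector quantization in the meta-space) rather than fixed by hand as it is below.

def parametric_edit_distance(x, y, r):
    """Edit distance with unit insertion/deletion cost and a single substitution cost r."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else r
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[m][n]

def classify(noisy, dictionary, r=1.2):
    """Return the dictionary entry nearest to the noisy string."""
    return min(dictionary, key=lambda w: parametric_edit_distance(noisy, w, r))

dictionary = ["recognition", "recursion", "reduction"]
print(classify("recogniton", dictionary))   # 'recognition'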


Proceedings ArticleDOI
20 Oct 1997
TL;DR: The systolic solution for approximate string matching is modified and extended for the OCS problem in this paper, and the architecture presented here can also be used to determine the minimum edit distance, the Longest Common Subsequence (LCS) and its length.
Abstract: The string matching problem arises in many fields of text analysis, image analysis and speech recognition. The computationally intensive nature of string matching makes it a candidate for VLSI implementation. Most of the existing algorithms and architectures for string matching consider strings that are from a finite alphabet set. The Optimal Correspondence of String Subsequence (OCS) problem, on the other hand, considers strings from an infinite alphabet set. This paper describes the design of a linear systolic array VLSI architecture for the OCS problem. The systolic solution for approximate string matching is modified and extended for the OCS problem in this paper. The architecture presented here can also be used to determine the minimum edit distance, the Longest Common Subsequence (LCS) and its length. The systolic architecture was simulated and verified using the Cadence design tools.


Journal ArticleDOI
TL;DR: The problem of finding the unrestricted modified edit distance, which is the minimum cost over all edit sequences (without these constraints) of converting X to Y, is undecidable.

Journal Article
TL;DR: The paper contains a brief description of Oommen's constrained edit distance algorithm and some results of a simulation experiment regarding the accuracy and execution speed of the algorithm depending on the probability of insertion, deletion and substitution errors.
Abstract: Searching a sequential file to detect a given substring is a common problem appearing in, among other applications, text processors and database search systems. One possible approach is Oommen's constrained edit distance algorithm. The paper contains a brief description of this algorithm and some results of a simulation experiment regarding the accuracy and execution speed of the algorithm depending on the probability of insertion, deletion and substitution errors. The paper also presents one possible practical application of the algorithm to search procedures. The application is based on reducing the dictionary size according to the probability of editing errors and organizing the contents of the dictionary so as to speed up the search process. Some comparative simulation results are presented, illustrating direct and suggested practical applications of the constrained edit distance algorithm to search procedures.