
Showing papers on "Edit distance published in 1994"


Proceedings ArticleDOI
24 May 1994
TL;DR: This paper presents an example of combinatorial pattern discovery, the discovery of patterns in protein databases; the discovered patterns give information that is complementary to that of the best protein classifier available today.
Abstract: Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric. This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representations we consider are strings, and the distance metric is string edit distance permitting variable-length don't-cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to protein classification, they give information that is complementary to that of the best protein classifier available today.

193 citations


Journal ArticleDOI
TL;DR: Given two rooted, labeled, and unordered trees, both problems considered are shown to be MAX SNP-hard, which means that neither problem has a polynomial-time approximation scheme (PTAS) unless P = NP.

159 citations


Book ChapterDOI
05 Jun 1994
TL;DR: This paper describes how the distance-based sublinear expected time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently, and presents a new theoretical result: polynomial-space, constant-fraction-error matching that is provably optimal.
Abstract: The best known rigorous method for biological sequence comparison has been the algorithm of Smith and Waterman. It computes in quadratic time the highest scoring local alignment of two sequences given a (nonmetric) similarity measure and gap penalty. In this paper, we describe how the distance-based sublinear expected time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently. We present both a new theoretical result, polynomial-space constant-fraction-error matching that is provably optimal, and a practical adaptation of it that produces results nearly identical to those of Smith-Waterman, at speedups of 2X (PAM 120, roughly corresponding to 33% identity) to 10X (PAM 90, 50% identity) or better. Further improvements are anticipated. What makes this possible is the addition of a new constraint on unit score (average score per residue), which filters out both very short alignments and very long alignments with unacceptably low averages. This program is part of a package called Genome Analyst that is being developed at CSHL.

89 citations


Patent
Richard Hull1
28 Oct 1994
TL;DR: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower-bound estimate.
Abstract: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate. The lower bound estimate of the string edit distance between the two strings is calculated by equalising the lengths of the two strings by adding padding elements to the shorter one. The elements of the strings are then sorted and the substitution costs between corresponding elements are summed.
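The three steps described above (equalise lengths by padding, sort, sum per-position substitution costs) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the function name, the padding symbol, and the 0/1 cost function are assumptions, and how tight (or even valid) the bound is depends on the substitution-cost matrix and the cost assigned to padding elements.

```python
def lower_bound_estimate(query, candidate, sub_cost, pad="\0"):
    """Cheap estimate of string edit distance, per the abstract's recipe.

    Illustrative sketch only; the quality of the bound depends on the
    chosen substitution-cost function.
    """
    n = max(len(query), len(candidate))
    # 1. Equalise lengths by padding the shorter string.
    a = query.ljust(n, pad)
    b = candidate.ljust(n, pad)
    # 2. Sort the elements of each string.
    a, b = sorted(a), sorted(b)
    # 3. Sum substitution costs between corresponding elements.
    return sum(sub_cost(x, y) for x, y in zip(a, b))
```

With a 0/1 cost function this reduces to counting positions where the sorted strings differ, so anagrams get estimate 0; the point of such an estimate is to serve as a fast filter before running the full dynamic program on surviving candidates.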

50 citations


Posted Content
TL;DR: An approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm and results indicate that the intended correct word can be found in 95% of the cases.
Abstract: This paper presents an approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm. Spelling correction in agglutinative languages is significantly different from that in languages like English. The concept of a word in such languages is much wider than the set of entries found in a dictionary, owing to productive word formation by derivational and inflectional affixations. After an overview of certain issues and relevant mathematical preliminaries, we formally present the problem and our solution. We then present results from our experiments with spelling correction in Turkish, a Ural-Altaic agglutinative language. Our results indicate that we can find the intended correct word in 95% of the cases and offer it as the first candidate in 74% of the cases, when the edit distance is 1.

46 citations


Journal ArticleDOI
TL;DR: An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word; for small alphabets on a 64-bit processor, a 21-fold parallelism over the conventional algorithm can be obtained.
Abstract: Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is at most k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64-bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.
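The problem statement above corresponds to the classic Sellers-style column dynamic program that the paper's word-packed implementation parallelizes. A plain, unpacked sketch (the function name is ours), processing one text character per column:

```python
def approx_match(text, pattern, k):
    """Report 1-based end positions j such that some substring of
    `text` ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # d[i] = min edit distance between pattern[:i] and the best
    # substring of text ending at the current column.
    d = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, tc in enumerate(text, 1):
        prev_diag, d[0] = d[0], 0  # a match may start at any text position
        for i, pc in enumerate(pattern, 1):
            cur = min(d[i] + 1,                # leave text char tc unmatched
                      d[i - 1] + 1,            # leave pattern char pc unmatched
                      prev_diag + (pc != tc))  # substitution or exact match
            prev_diag, d[i] = d[i], cur
        if d[m] <= k:
            ends.append(j)
    return ends
```

Each column costs O(m), so the whole scan is O(mn) time and O(m) space; the paper's contribution is evaluating many of these cells at once inside a single machine word.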

30 citations


Proceedings ArticleDOI
09 Oct 1994
TL;DR: A modified normalized edit distance is presented that expresses the edit distance between two strings X and Y in a more adequate and intuitive way, reflecting the human decision process during comparisons.
Abstract: In this paper, we discuss the weighted edit distance and two well known normalizations, one based on editing path lengths and one based on the string lengths. We investigate the limitations of these approaches as well as the restrictions on the associated weight function including the triangular inequality. As a solution to the problems pointed out, we present a modified normalized edit distance. The new approach expresses the edit distance between two strings X and Y in a more adequate and intuitive way, reflecting the human decision process during comparisons. A further advantage is that this new distance measure is efficiently computable in O(|X|×|Y|) instead of O(|X|×|Y|×min(|X|,|Y|)) for the other normalizations.
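For reference, the plain unit-cost edit distance and the string-length normalization discussed above can be sketched as follows (the paper's modified normalized distance itself is not reproduced here; the function names are ours):

```python
def edit_distance(x, y):
    """Plain unit-cost (Levenshtein) edit distance by dynamic programming."""
    m, n = len(x), len(y)
    d = list(range(n + 1))  # distances for the empty prefix of x
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell d[i-1][j-1]
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,       # delete x[i-1]
                                   d[j - 1] + 1,   # insert y[j-1]
                                   prev + (x[i - 1] != y[j - 1]))
    return d[n]

def length_normalized(x, y):
    """One classical normalization: divide by the longer string's length."""
    return edit_distance(x, y) / max(len(x), len(y), 1)
```

Dividing by max(|X|, |Y|) keeps the value in [0, 1], but such post-hoc normalizations have the kinds of limitations (e.g. with respect to the triangle inequality) that the paper investigates, which is what motivates its modified measure.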

25 citations


31 Dec 1994
TL;DR: The surprising result that OCR errors are not always uniformly distributed across a page is presented, and an algorithm for classifying OCR errors is described, based on a well-known dynamic programming approach for determining string edit distance, extended to handle the types of character segmentation errors inherent to OCR.
Abstract: In this paper we present the surprising result that OCR errors are not always uniformly distributed across a page. Under certain circumstances, 30% or more of the errors incurred can be attributed to a single, avoidable phenomenon in the scanning process. This observation has important ramifications for work that explicitly or implicitly assumes a uniform error distribution. In addition, our experiments show that not just the quantity but also the nature of the errors is affected. This could have an impact on strategies used for post-process error correction. Results such as these can be obtained only by analyzing large quantities of data in a controlled way. To this end, we also describe our algorithm for classifying OCR errors. This is based on a well-known dynamic programming approach for determining string edit distance which we have extended to handle the types of character segmentation errors inherent to OCR.

18 citations


Patent
30 Sep 1994
TL;DR: A VLSI circuit structure for computing the edit distance between two strings over a given alphabet is presented; it can perform approximate string matching for variable edit costs and does not place any constraint on the lengths of the strings that can be compared.
Abstract: The edit distance between two strings a1, ..., am and b1, ..., bn is the minimum cost of a sequence of editing operations (insertions, deletions, and substitutions) that converts one string into the other. This invention provides a VLSI circuit structure for computing the edit distance between two strings over a given alphabet. The circuit structure can perform approximate string matching for variable edit costs. More importantly, it does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires only regular nearest-neighbor communication, which makes it suitable for VLSI implementation.
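A sequential software sketch of the recurrence such a circuit evaluates, with the variable edit costs passed in as functions (the function and parameter names are our assumptions, not the patent's notation):

```python
def weighted_edit_distance(a, b, ins_cost, del_cost, sub_cost):
    """Minimum total cost of insertions, deletions and substitutions
    converting string a into string b, with variable operation costs."""
    n = len(b)
    d = [0.0] * (n + 1)
    for j in range(1, n + 1):               # first row: insert b[:j]
        d[j] = d[j - 1] + ins_cost(b[j - 1])
    for ca in a:                            # one row per character of a
        prev_diag, d[0] = d[0], d[0] + del_cost(ca)
        for j in range(1, n + 1):
            cur = min(d[j] + del_cost(ca),               # delete ca
                      d[j - 1] + ins_cost(b[j - 1]),     # insert b[j-1]
                      prev_diag + sub_cost(ca, b[j - 1]))
            prev_diag, d[j] = d[j], cur
    return d[n]
```

With unit costs this reduces to the Levenshtein distance. Each cell depends only on its left, upper, and upper-left neighbours, which is exactly the regular nearest-neighbor communication pattern the abstract says makes the structure suitable for VLSI.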

15 citations


Book ChapterDOI
10 Oct 1994
TL;DR: An efficient algorithm is presented for finding approximate repetitions in a given sequence of characters, along with an algorithm by which the underlying expression can be restored in time linear in the length of the example and at worst quadratic in the length of the expression.
Abstract: We present an efficient algorithm for finding approximate repetitions in a given sequence of characters. First, we define a class of simple regular expressions which are of star-height one and do not contain union operations, and a stochastic mutation process of a given length over a string of characters. Then, assuming that a given string of characters was obtained by corrupting, through the defined mutation process, some long-enough word generated by a simple regular expression, we try to restore the expression. We prove that, to within some reasonable accuracy, this is always possible if the length of the mutation process is bounded relative to the length of the example. We provide an algorithm by which the expression can be restored in time linear in the length of the example and at worst quadratic in the length of the expression. We discuss some extensions of the method and possible applications to bioinformatics.

9 citations


Journal ArticleDOI
TL;DR: A simple modification of the Hirschberg alignment algorithm can sample string alignments at random according to their probability distribution, which is useful for statistical estimation of evolutionary distances of a family of strings, e.g. DNA strings.