
Showing papers on "Edit distance published in 1994"


Proceedings ArticleDOI
24 May 1994
TL;DR: This paper presents an example of combinatorial pattern discovery, the discovery of patterns in protein databases; the discovered patterns give information that is complementary to that of the best protein classifier available today.
Abstract: Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric. This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representations we consider are strings, and the distance metric is string edit distance permitting variable-length don't-cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to protein classification, they give information that is complementary to that of the best protein classifier available today.

193 citations


Journal ArticleDOI
TL;DR: Given two rooted, labeled, and unordered trees, both problems considered are shown to be MAX SNP-hard, which means that neither problem has a polynomial-time approximation scheme (PTAS) unless P = NP.

159 citations


Book ChapterDOI
05 Jun 1994
TL;DR: This paper describes how the distance-based sublinear expected time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently, and presents a new theoretical result: polynomial-space, constant-fraction-error matching that is provably optimal.
Abstract: The best known rigorous method for biological sequence comparison has been the algorithm of Smith and Waterman. It computes in quadratic time the highest scoring local alignment of two sequences given a (nonmetric) similarity measure and gap penalty. In this paper, we describe how the distance-based sublinear expected time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently. We present both a new theoretical result, polynomial-space constant-fraction-error matching that is provably optimal, and a practical adaptation of it that produces results nearly identical to those of Smith-Waterman, at speedups of 2X (PAM 120, roughly corresponding to 33% identity) to 10X (PAM 90, 50% identity) or better. Further improvements are anticipated. What makes this possible is the addition of a new constraint on unit score (average score per residue), which filters out both very short alignments and very long alignments with unacceptably low averages. This program is part of a package called Genome Analyst that is being developed at CSHL.

89 citations


Patent
Richard Hull1
28 Oct 1994
TL;DR: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower-bound estimate.
Abstract: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate. The lower bound estimate of the string edit distance between the two strings is calculated by equalising the lengths of the two strings by adding padding elements to the shorter one. The elements of the strings are then sorted and the substitution costs between corresponding elements are summed.
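The three steps described above (equalise lengths by padding, sort, sum per-position substitution costs) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the function name, the padding symbol, and the 0/1 cost function are assumptions, and how tight (or even valid) the bound is depends on the substitution-cost matrix and the cost assigned to padding elements.

```python
def lower_bound_estimate(query, candidate, sub_cost, pad="\0"):
    """Cheap estimate of string edit distance, per the abstract's recipe.

    Illustrative sketch only; the quality of the bound depends on the
    chosen substitution-cost function.
    """
    n = max(len(query), len(candidate))
    # 1. Equalise lengths by padding the shorter string.
    a = query.ljust(n, pad)
    b = candidate.ljust(n, pad)
    # 2. Sort the elements of each string.
    a, b = sorted(a), sorted(b)
    # 3. Sum substitution costs between corresponding elements.
    return sum(sub_cost(x, y) for x, y in zip(a, b))
```

With a 0/1 cost function this reduces to counting positions where the sorted strings differ, so anagrams get estimate 0; the point of such an estimate is to serve as a fast filter before running the full dynamic program on surviving candidates.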

50 citations


Posted Content
TL;DR: An approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm and results indicate that the intended correct word can be found in 95% of the cases.
Abstract: This paper presents an approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic programming based search algorithm. Spelling correction in agglutinative languages is significantly different from that in languages like English. The concept of a word in such languages is much wider than the set of entries found in a dictionary, owing to productive word formation by derivational and inflectional affixations. After an overview of certain issues and relevant mathematical preliminaries, we formally present the problem and our solution. We then present results from our experiments with spelling correction in Turkish, a Ural-Altaic agglutinative language. Our results indicate that we can find the intended correct word in 95% of the cases and offer it as the first candidate in 74% of the cases, when the edit distance is 1.

46 citations


Journal ArticleDOI
TL;DR: An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word; for small alphabets on a 64-bit processor, a 21-fold parallelism over the conventional algorithm can be obtained.
Abstract: Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is at most k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64-bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.
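The problem statement above corresponds to the classic Sellers-style column dynamic program that the paper's word-packed implementation parallelizes. A plain, unpacked sketch (the function name is ours), processing one text character per column:

```python
def approx_match(text, pattern, k):
    """Report 1-based end positions j such that some substring of
    `text` ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # d[i] = min edit distance between pattern[:i] and the best
    # substring of text ending at the current column.
    d = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, tc in enumerate(text, 1):
        prev_diag, d[0] = d[0], 0  # a match may start at any text position
        for i, pc in enumerate(pattern, 1):
            cur = min(d[i] + 1,                # leave text char tc unmatched
                      d[i - 1] + 1,            # leave pattern char pc unmatched
                      prev_diag + (pc != tc))  # substitution or exact match
            prev_diag, d[i] = d[i], cur
        if d[m] <= k:
            ends.append(j)
    return ends
```

Each column costs O(m), so the whole scan is O(mn) time and O(m) space; the paper's contribution is evaluating many of these cells at once inside a single machine word.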

30 citations


Proceedings ArticleDOI
09 Oct 1994
TL;DR: A modified normalized edit distance is presented that expresses the edit distance between two strings X and Y in a more adequate and intuitive way, reflecting the human decision process during comparisons.
Abstract: In this paper, we discuss the weighted edit distance and two well known normalizations, one based on editing path lengths and one based on the string lengths. We investigate the limitations of these approaches as well as the restrictions on the associated weight function including the triangular inequality. As a solution to the problems pointed out, we present a modified normalized edit distance. The new approach expresses the edit distance between two strings X and Y in a more adequate and intuitive way, reflecting the human decision process during comparisons. A further advantage is that this new distance measure is efficiently computable in O(|X|×|Y|) instead of O(|X|×|Y|×min(|X|,|Y|)) for the other normalizations.
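For reference, the plain unit-cost edit distance and the string-length normalization discussed above can be sketched as follows (the paper's modified normalized distance itself is not reproduced here; the function names are ours):

```python
def edit_distance(x, y):
    """Plain unit-cost (Levenshtein) edit distance by dynamic programming."""
    m, n = len(x), len(y)
    d = list(range(n + 1))  # distances for the empty prefix of x
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell d[i-1][j-1]
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,       # delete x[i-1]
                                   d[j - 1] + 1,   # insert y[j-1]
                                   prev + (x[i - 1] != y[j - 1]))
    return d[n]

def length_normalized(x, y):
    """One classical normalization: divide by the longer string's length."""
    return edit_distance(x, y) / max(len(x), len(y), 1)
```

Dividing by max(|X|, |Y|) keeps the value in [0, 1], but such post-hoc normalizations have the kinds of limitations (e.g. with respect to the triangle inequality) that the paper investigates, which is what motivates its modified measure.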

25 citations


31 Dec 1994
TL;DR: The surprising result that OCR errors are not always uniformly distributed across a page is presented, and an algorithm for classifying OCR errors is described, based on a well-known dynamic programming approach for determining string edit distance, extended to handle the types of character segmentation errors inherent to OCR.
Abstract: In this paper we present the surprising result that OCR errors are not always uniformly distributed across a page. Under certain circumstances, 30% or more of the errors incurred can be attributed to a single, avoidable phenomenon in the scanning process. This observation has important ramifications for work that explicitly or implicitly assumes a uniform error distribution. In addition, our experiments show that not just the quantity but also the nature of the errors is affected. This could have an impact on strategies used for post-process error correction. Results such as these can be obtained only by analyzing large quantities of data in a controlled way. To this end, we also describe our algorithm for classifying OCR errors. This is based on a well-known dynamic programming approach for determining string edit distance which we have extended to handle the types of character segmentation errors inherent to OCR.

18 citations


Patent
30 Sep 1994
TL;DR: A VLSI circuit structure for computing the edit distance between two strings over a given alphabet is presented; it can perform approximate string matching for variable edit costs and does not place any constraint on the lengths of the strings that can be compared.
Abstract: The edit distance between two strings a1, ..., am and b1, ..., bn is the minimum cost of a sequence of editing operations (insertions, deletions, and substitutions) that converts one string into the other. This invention provides a VLSI circuit structure for computing the edit distance between two strings over a given alphabet. The circuit structure can perform approximate string matching for variable edit costs. More importantly, it does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires only regular nearest-neighbor communication, which makes it suitable for VLSI implementation.
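A sequential software sketch of the recurrence such a circuit evaluates, with the variable edit costs passed in as functions (the function and parameter names are our assumptions, not the patent's notation):

```python
def weighted_edit_distance(a, b, ins_cost, del_cost, sub_cost):
    """Minimum total cost of insertions, deletions and substitutions
    converting string a into string b, with variable operation costs."""
    n = len(b)
    d = [0.0] * (n + 1)
    for j in range(1, n + 1):               # first row: insert b[:j]
        d[j] = d[j - 1] + ins_cost(b[j - 1])
    for ca in a:                            # one row per character of a
        prev_diag, d[0] = d[0], d[0] + del_cost(ca)
        for j in range(1, n + 1):
            cur = min(d[j] + del_cost(ca),               # delete ca
                      d[j - 1] + ins_cost(b[j - 1]),     # insert b[j-1]
                      prev_diag + sub_cost(ca, b[j - 1]))
            prev_diag, d[j] = d[j], cur
    return d[n]
```

With unit costs this reduces to the Levenshtein distance. Each cell depends only on its left, upper, and upper-left neighbours, which is exactly the regular nearest-neighbor communication pattern the abstract says makes the structure suitable for VLSI.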

15 citations


Book ChapterDOI
10 Oct 1994
TL;DR: An efficient algorithm is presented for finding approximate repetitions in a given sequence of characters, along with an algorithm by which the underlying expression can be restored in time linear in the length of the example and at worst quadratic in the length of the expression.
Abstract: We present an efficient algorithm for finding approximate repetitions in a given sequence of characters. First, we define a class of simple regular expressions which are of star-height one and do not contain union operations, and a stochastic mutation process of a given length over a string of characters. Then, assuming that a given string of characters was obtained by corrupting, through the defined mutation process, some long-enough word generated by a simple regular expression, we try to restore the expression. We prove that, to within some reasonable accuracy, this is always possible if the length of the mutation process is bounded relative to the length of the example. We provide an algorithm by which the expression can be restored in time linear in the length of the example and at worst quadratic in the length of the expression. We discuss some extensions of the method and possible applications to bioinformatics.

9 citations


Journal ArticleDOI
TL;DR: A simple modification of the Hirschberg alignment algorithm can sample string alignments at random according to their probability distribution, which is useful for statistical estimation of evolutionary distances of a family of strings, e.g. DNA strings.