
Showing papers on "Edit distance published in 2001"


Journal ArticleDOI
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Abstract: We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.
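
For reference, the edit (Levenshtein) distance this survey centers on is defined by the classical dynamic-programming recurrence; the sketch below is a minimal illustrative Python implementation of that recurrence, not one of the optimized online algorithms the survey compares.

```python
def edit_distance(a: str, b: str) -> int:
    """Classical O(|a|*|b|) dynamic program for the Levenshtein distance."""
    prev = list(range(len(b) + 1))          # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance of a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1     # a match costs nothing
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

print(edit_distance("survey", "surgery"))   # -> 2
```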

2,723 citations


Proceedings Article
11 Sep 2001
TL;DR: In this article, the authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At its core, the technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches.
Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data, especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
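
As an illustration of the core idea only (the paper expresses this inside the DBMS rather than in application code, and derives a precise count filter from the fact that a single edit operation can affect at most q of the padded q-grams), the following sketch extracts positional q-grams and keeps candidate strings sharing enough q-grams with a query before any exact edit distance verification; the min_common threshold here is a placeholder parameter, not the paper's bound.

```python
def positional_qgrams(s: str, q: int = 3, pad: str = "#"):
    """Positional q-grams of s, padded with q-1 sentinel characters on each side."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def qgram_candidates(strings, query, q=3, min_common=2):
    """Keep strings sharing at least min_common q-grams with the query (positions
    ignored here); survivors would then be verified with an exact edit distance."""
    query_grams = {g for _, g in positional_qgrams(query, q)}
    return [s for s in strings
            if len({g for _, g in positional_qgrams(s, q)} & query_grams) >= min_common]

print(qgram_candidates(["Smith", "Smyth", "Jones"], "Smith"))   # ['Smith', 'Smyth']
```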

556 citations


Proceedings ArticleDOI
02 Jun 2001
TL;DR: Substantial portions of translation lexicons can be generated accurately for languages where no bilingual dictionary or parallel corpora may exist, with up to 95% exact match accuracy.
Abstract: This paper presents a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages. Bilingual lexicons within language families are induced using probabilistic string edit distance models. Translation lexicons for arbitrarily distant language pairs are then generated by a combination of these intra-family translation models and one or more cross-family on-line dictionaries. Up to 95% exact match accuracy is achieved on the target vocabulary (30-68% of inter-family test pairs). Thus substantial portions of translation lexicons can be generated accurately for languages where no bilingual dictionary or parallel corpora may exist.

196 citations


Journal ArticleDOI
TL;DR: This paper precisely defines approximate multiple repeats, and presents an algorithm that finds all repeats that concur with the definition; the time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log(n/k)), where a is the maximum number of periods in any reported repeat.
Abstract: A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = ūu, for which the Hamming distance of ū and u is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of ū and u is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u^a u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)), where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in humans.
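
To make the k-mismatch variant of the definition concrete, here is a naive quadratic-time sketch that checks every start position and period and reports adjacent blocks differing in at most k positions; it only illustrates the object being searched for and is in no way the paper's O(nka log(n/k)) algorithm.

```python
def approximate_tandem_repeats(s: str, k: int):
    """Naively report (start, period, mismatches) for every approximate single
    tandem repeat whose two halves are within Hamming distance k. O(n^2) time."""
    n, hits = len(s), []
    for period in range(1, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            u = s[start:start + period]
            v = s[start + period:start + 2 * period]
            mismatches = sum(a != b for a, b in zip(u, v))
            if mismatches <= k:
                hits.append((start, period, mismatches))
    return hits

# "abcdaacd" from the abstract: the halves "abcd" and "aacd" differ in one position.
print([h for h in approximate_tandem_repeats("abcdaacd", 1) if h[1] == 4])   # [(0, 4, 1)]
```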

165 citations


Proceedings Article
11 Sep 2001
TL;DR: This paper proposes to map the substrings of the data into an integer space with the help of wavelet coefficients, and defines a distance function which is a lower bound to the actual edit distance between strings.
Abstract: We consider the problem of substring searching in large databases. Typical applications of this problem are genetic data, web data, and event sequences. Since the size of such databases grows exponentially, it becomes impractical to use in-memory algorithms for these problems. In this paper, we propose to map the substrings of the data into an integer space with the help of wavelet coefficients. Later, we index these coefficients using MBRs (Minimum Bounding Rectangles). We define a distance function which is a lower bound to the actual edit distance between strings. We experiment with both nearest neighbor queries and range queries. The results show that our technique prunes a significant amount of the database (typically 50-95%), thus reducing both the disk I/O cost and the CPU cost significantly.
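
The pruning idea can be illustrated with a much simpler lower bound than the paper's wavelet-based one (this is an analogy, not the proposed method): every edit operation changes the string length by at most 1 and the character-frequency vector by at most 2 in L1 norm, so the quantity below never exceeds the true edit distance and can discard candidates cheaply before an exact distance computation.

```python
from collections import Counter

def frequency_lower_bound(s: str, t: str) -> int:
    """Cheap lower bound on edit distance from lengths and character frequencies."""
    cs, ct = Counter(s), Counter(t)
    l1 = sum(abs(cs[c] - ct[c]) for c in set(cs) | set(ct))
    return max(abs(len(s) - len(t)), (l1 + 1) // 2)

def prune(candidates, query, radius):
    """Keep only candidates whose lower bound does not already exceed the radius."""
    return [c for c in candidates if frequency_lower_bound(c, query) <= radius]

print(frequency_lower_bound("ACGTACGT", "ACGTTTTT"))   # 3 (the true edit distance here)
print(frequency_lower_bound("abcd", "dcba"))            # 0: it is only a lower bound
```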

137 citations


Journal Article
TL;DR: This paper develops a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them, relying on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS.
Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.

117 citations


Patent
31 Mar 2001
TL;DR: In this paper, the spell checking of a word corresponding to a typically numeric key sequence entered by the user using numeric keys or other reduced keyboards is disclosed, based on comparisons of the entered number sequences with number sequences within a dictionary, or number sequences for words within the dictionary.
Abstract: Spell checking of a word corresponding to a typically numeric key sequence entered by the user using numeric keys or other reduced keyboards is disclosed. The spell checking is based on comparisons of the entered number sequences with number sequences within a dictionary, or number sequences for words within a dictionary. For a given entered number sequence, the number sequences of words in a dictionary, or the number sequences in the dictionary, are compared. Those whose cost according to a metric is not greater than a maximum cost are presented as potential intended words of the user. The metric may be the minimum edit distance, for example.
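
To illustrate the idea (the keypad layout, the dictionary, and the cost threshold below are assumptions made for this sketch, not details taken from the patent), a dictionary word can be reduced to its digit sequence on a standard phone keypad and compared against the entered digits with an ordinary minimum edit distance.

```python
KEYPAD = {c: d for d, letters in {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
                                  "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}.items()
          for c in letters}

def digits(word: str) -> str:
    """Digit sequence a word produces on a standard phone keypad."""
    return "".join(KEYPAD[c] for c in word.lower())

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def suggest(entered: str, dictionary, max_cost: int = 1):
    """Dictionary words whose digit sequence is within max_cost edits of the input."""
    return [w for w in dictionary if edit_distance(digits(w), entered) <= max_cost]

# "hello" maps to 43556; the mistyped sequence 43656 still retrieves it.
print(suggest("43656", ["hello", "gopher", "jelly"]))   # ['hello']
```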

90 citations


01 Nov 2001
TL;DR: This paper describes a method of extracting katakana words and phrases, along with their English counterparts from non-aligned monolingual web search engine query logs, using a trainable edit distance function to find pairs that have a high probability of being equivalent.
Abstract: This paper describes a method of extracting katakana words and phrases, along with their English counterparts from non-aligned monolingual web search engine query logs. The method employs a trainable edit distance function to find pairs that have a high probability of being equivalent. These pairs can then be used to further bootstrap training of the edit distance function, resulting in improved back-transliteration from katakana to English. In addition, this is an effective method for mining large numbers of katakana strings to enhance a bilingual lexicon. The improved edit distance function and enhanced lexicon can be used for more accurate alignment of bitexts, and for application during runtime MT and multilingual IR.

75 citations


18 Sep 2001
TL;DR: An automatic ranking method that encodes machine-translated sentences with a rank assigned by humans into multi-dimensional vectors from which a classifier of ranks is learned in the form of a decision tree (DT).
Abstract: This paper addresses the challenging problem of automatically evaluating output from machine translation (MT) systems in order to support the developers of these systems. Conventional approaches to the problem include methods that automatically assign a rank such as A, B, C, or D to MT output according to a single edit distance between this output and a correct translation example. The single edit distance can be designed in different ways, but a design that makes the assignment of one rank more accurate tends to make the assignment of another rank less accurate, which inhibits improving the overall accuracy of rank assignment. To overcome this obstacle, this paper proposes an automatic ranking method that, by using multiple edit distances, encodes machine-translated sentences with a rank assigned by humans into multi-dimensional vectors from which a classifier of ranks is learned in the form of a decision tree (DT). The proposed method assigns a rank to MT output through the learned DT. The proposed method is evaluated using transcribed texts of real conversations in the travel arrangement domain. Experimental results show that the proposed method is more accurate than the single-edit-distance-based ranking methods, in both closed and open tests. Moreover, the proposed method could estimate MT quality within 3% error in some cases.

70 citations


Proceedings ArticleDOI
09 Jan 2001
TL;DR: This work defines costs for the edit-operations and gives an algorithm for computing them, and shows that this approach performs intuitively in categorization and indexing tasks, and its results are better than previous approaches.
Abstract: We report on our experience with the implementation of an algorithm for comparing shapes by computing the edit-distance between their medial axes. A shape-comparison method that is robust to various visual transformations has several applications in computer vision, including organizing and querying an image database, and object recognition. There are two components to research on this problem: mathematical formulation of the shape-comparison problem and the computational solution method. We have a clear, well-defined formulation and polynomial-time algorithms for solution. Previous research has involved either ill-defined formulations or heuristic methods for solution. Our starting point for the implementation is the edit-distance algorithm of Klein et al. [6]. We discuss how we altered that algorithm to handle rotation-invariance while keeping down the time and storage requirements. Most important, we define costs for the edit-operations and give an algorithm for computing them. We use a database of shapes to illustrate that our approach performs intuitively in categorization and indexing tasks, and our results are better than previous approaches.

66 citations


Journal ArticleDOI
TL;DR: A dynamic programming algorithm is presented to solve the problem based on the distance measure originating from Tanaka and Tanaka; it is as fast as the best-known algorithm for comparing two trees using Tanaka's distance measure when the allowed distance between the common substructures is a constant independent of the input trees.

Book ChapterDOI
11 Mar 2001
TL;DR: It is argued that with these new concepts various well-established techniques from statistical pattern recognition become applicable in the structural domain, particularly to graph representations, including k-means clustering, vector quantization, and Kohonen maps.
Abstract: Two novel concepts in structural pattern recognition are discussed in this paper. The first, median of a set of graphs, can be used to characterize a set of graphs by just a single prototype. Such a characterization is needed in various tasks, for example, in clustering. The second novel concept is weighted mean of a pair of graphs. It can be used to synthesize a graph that has a specified degree of similarity, or distance, to each of a pair of given graphs. Such an operation is needed in many machine learning tasks. It is argued that with these new concepts various well-established techniques from statistical pattern recognition become applicable in the structural domain, particularly to graph representations. Concrete examples include k-means clustering, vector quantization, and Kohonen maps.

Journal ArticleDOI
TL;DR: The weighted mean of G and G′ is a graph G″ that has edit distances d(G, G″) and d(G″, G′) to G and G′, respectively, such that d(G, G″) + d(G″, G′) = d(G, G′).
Abstract: Graph matching and graph edit distance are fundamental concepts in structural pattern recognition. In this paper, the weighted mean of a pair of graphs is introduced. Given two graphs, G and G', with d(G, G') being the edit distance of G and G', the weighted mean of G and G' is a graph G'' that has edit distances d(G, G'') and d(G'', G') to G and G', respectively, such that d(G, G'') + d(G'', G') = d(G, G'). We show formal properties of the weighted mean, describe a procedure for its computation, and give examples.
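
The weighted-mean idea is easiest to see on strings, where the edit distance is cheap to compute; the sketch below is a string analogue of the concept (not the paper's graph procedure): it builds one optimal alignment of s and t and applies only the first a edit columns, yielding an intermediate x with d(s, x) = a and d(x, t) = d(s, t) - a.

```python
def align(s: str, t: str):
    """Levenshtein DP plus backtracking; returns (distance, alignment columns)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)
    cols, i, j = [], n, m          # columns: (source char or None, target char or None)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            cols.append((s[i - 1], t[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            cols.append((s[i - 1], None)); i -= 1
        else:
            cols.append((None, t[j - 1])); j -= 1
    cols.reverse()
    return D[n][m], cols

def weighted_mean(s: str, t: str, a: int) -> str:
    """A string x with d(s, x) = a and d(x, t) = d(s, t) - a, for 0 <= a <= d(s, t)."""
    d, cols = align(s, t)
    assert 0 <= a <= d
    out, used = [], 0
    for cs, ct in cols:
        if cs == ct:                       # matched column: keep the common symbol
            out.append(cs)
        elif used < a:                     # apply the first a edit columns (target side)
            if ct is not None:
                out.append(ct)
            used += 1
        elif cs is not None:               # leave the remaining columns in source form
            out.append(cs)
    return "".join(out)

print(weighted_mean("kitten", "sitting", 1))   # 'sitten': 1 edit from the first string, 2 from the second
```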

Book ChapterDOI
05 Oct 2001
TL;DR: This chapter describes in detail one relational distance measure that has proven very successful in applications, and introduces three systems that actually carry out relational distance-based learning and clustering: RIBL2, RDBC and FORC.
Abstract: Within data analysis, distance-based methods have always been very popular. Such methods assume that it is possible to compute for each pair of objects in a domain their mutual distance (or similarity). In a distance-based setting, many of the tasks usually considered in data mining can be carried out in a surprisingly simple yet powerful way. In this chapter, we give a tutorial introduction to the use of distance-based methods for relational representations, concentrating in particular on predictive learning and clustering. We describe in detail one relational distance measure that has proven very successful in applications, and introduce three systems that actually carry out relational distance-based learning and clustering: RIBL2, RDBC and FORC. We also present a detailed case study of how these three systems were applied to a domain from molecular biology.

Proceedings ArticleDOI
05 Oct 2001
TL;DR: A new indexing and ranking scheme using metaphones and a Bayesian phonetic edit distance is presented, showing an improvement of up to 15% in precision compared to results obtained with speech recognition alone, at a processing time of 0.5 sec per query.
Abstract: Phonetic speech retrieval is used to augment word-based retrieval in spoken document retrieval systems, for in- and out-of-vocabulary words. In this paper, we present a new indexing and ranking scheme using metaphones and a Bayesian phonetic edit distance. We conduct an extensive set of experiments using a hundred hours of HUB4 data with ground-truth transcripts and twenty-four thousand query words. We show an improvement of up to 15% in precision compared to results obtained with speech recognition alone, at a processing time of 0.5 sec per query.

Patent
26 Jul 2001
TL;DR: In this article, a pattern is partitioned into context and value components, and candidate matches for each of the components are identified by calculating an edit distance between that component and each potentially matching set (sub-string) of symbols within the string.
Abstract: A system and method for examining a string of symbols and identifying portions of the string which match a predetermined pattern using adaptively weighted, partitioned context edit distances. A pattern is partitioned into context and value components, and candidate matches for each of the components are identified by calculating an edit distance between that component and each potentially matching set (sub-string) of symbols within the string. One or more candidate matches having the lowest edit distances are selected as matches for the pattern. The weighting of each of the component matches may be adapted to optimize the pattern matching and, in one embodiment, the context components may be heavily weighted to obtain matches of a value for which the corresponding pattern is not well defined. In one embodiment, an edit distance matrix is evaluated for each of a prefix component, a value component and a suffix component of a pattern. The evaluation of the prefix matrix provides a basis for identifying indicators of the beginning of a value window, while the evaluation of the suffix matrix provides a basis for identifying the alignment of the end of the value window. The value within the value window can then be evaluated via the value matrix to determine a corresponding value match score.

Journal ArticleDOI
TL;DR: This paper presents a parallel algorithm for computing the edit distance for the class of languages accepted by one-way nondeterministic auxiliary pushdown automata working in polynomial time, a class that strictly contains context?free languages.
Abstract: The notion of edit distance arises in very different fields such as self-correcting codes, parsing theory, speech recognition, and molecular biology. The edit distance between an input string and a language L is the minimum cost of a sequence of edit operations (substitution of a symbol by another, incorrect symbol; insertion of an extraneous symbol; deletion of a symbol) needed to change the input string into a sentence of L. In this paper we study the complexity of computing the edit distance, discovering sharp boundaries between classes of languages for which this function can be efficiently evaluated and classes of languages for which it seems to be difficult to compute. Our main result is a parallel algorithm for computing the edit distance for the class of languages accepted by one-way nondeterministic auxiliary pushdown automata working in polynomial time, a class that strictly contains context-free languages. Moreover, we show that this algorithm can be extended in order to find a sentence of the language from which the input string has minimum distance.

Proceedings ArticleDOI
22 Apr 2001
TL;DR: The notion of edit distance is proposed to measure the similarity between two RNA secondary and tertiary structures, by incorporating the various edit operations performed on both bases and arcs (base pairs).
Abstract: Arc-annotated sequences are useful in representing the structural information of RNA sequences. Typically, RNA secondary and tertiary structures can be represented by a set of nested arcs and a set of crossing arcs, respectively. As the specific RNA functions are determined by the specific molecular conformation, and therefore by the specific secondary and tertiary structures, the comparison of RNA secondary and tertiary structures has received much attention recently. In this paper, we propose the notion of edit distance to measure the similarity between two RNA secondary and tertiary structures, by incorporating the various edit operations performed on both bases and arcs (base pairs). Several algorithms are presented to compute the edit distance between two RNA sequences with various arc structures and under various score schemes, either exactly or approximately. Preliminary experimental tests confirm that our definition of edit distance and the computation model are among the most reasonable ones ever studied in the literature.

Book ChapterDOI
20 Aug 2001
TL;DR: A new measure of the edit distance between two rooted labeled trees, called the less-constrained edit distance, is defined by relaxing the restriction of constrained edit mapping; for unordered labeled trees the problem is shown to be NP-complete and to have no absolute approximation algorithm unless P = NP, which implies that it is impossible to have a PTAS for the problem, while for ordered labeled trees a polynomial-time algorithm is given.
Abstract: One of the most important problems in computational biology is the tree editing problem, which is to determine the edit distance between two rooted labeled trees. It has been shown to have significant applications in both RNA secondary structures and evolutionary trees. Another way of viewing this problem is to find an edit mapping with the minimum cost. By restricting the type of mapping, Zhang [7,8] and Richter [5] independently introduced the constrained edit distance and the structure-respecting distance, respectively. They are, in fact, the same concept. In this paper, we define a new measure of the edit distance between two rooted labeled trees, called the less-constrained edit distance, by relaxing the restriction of constrained edit mapping. Then we study the algorithmic complexities of computing the less-constrained edit distance between two rooted labeled trees. For unordered labeled trees, we show that this problem is NP-complete and even has no absolute approximation algorithm unless P = NP, which also implies that it is impossible to have a PTAS for the problem. For ordered labeled trees, we give a polynomial-time algorithm to solve the problem.

Book ChapterDOI
23 Jul 2001
TL;DR: A method of error-tolerant lookup in a finite-state lexicon is described, as well as its application to automatic spelling correction, to retain only the most similar corrections (nearest neighbours) and to reach the first correction as soon as possible.
Abstract: A method of error-tolerant lookup in a finite-state lexicon is described, as well as its application to automatic spelling correction. We compare our method to the algorithm by K. Oflazer [14]. While Oflazer's algorithm searches for all possible corrections of a misspelled word that are within a given similarity threshold, our approach is to retain only the most similar corrections (nearest neighbours), dynamically reducing the search space in the lexicon, and to reach the first correction as soon as possible.
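
A common way to realize this kind of nearest-neighbour lookup (a simplified sketch under our own assumptions, not Oflazer's algorithm nor the authors' exact method) is to walk a trie of the lexicon while carrying one row of the edit-distance table per node, shrinking the admissible distance as soon as a closer word is found and pruning any branch whose entire row already exceeds it.

```python
def nearest_in_lexicon(lexicon, word, max_dist=2):
    """Lexicon entries closest to `word` (within max_dist), via a trie walk that
    carries one edit-distance row per node and dynamically shrinks the radius."""
    trie = {}
    for w in lexicon:                          # assumes non-empty lexicon entries
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w                          # end-of-word marker

    best, best_d = [], max_dist

    def visit(node, ch, prev_row):
        nonlocal best, best_d
        row = [prev_row[0] + 1]                # cost of deleting the whole prefix
        for j in range(1, len(word) + 1):
            cost = 0 if word[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if "$" in node and row[-1] <= best_d:
            if row[-1] < best_d:               # closer word found: shrink the radius
                best, best_d = [], row[-1]
            best.append((node["$"], row[-1]))
        if min(row) <= best_d:                 # otherwise no extension can catch up
            for c, child in node.items():
                if c != "$":
                    visit(child, c, row)

    for c, child in trie.items():
        if c != "$":
            visit(child, c, list(range(len(word) + 1)))
    return best

print(nearest_in_lexicon(["correct", "corset", "carrot", "banana"], "corect"))   # [('correct', 1)]
```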

Proceedings ArticleDOI
29 Nov 2001
TL;DR: The FlExPat algorithm is designed to satisfactorily cope with the trade-off between flexibility, particularly in sequence data representation and in associated similarity metrics, and computational efficiency; some experimental results obtained with FlExPat on music data are presented and discussed.
Abstract: This paper addresses sequential data mining, a sub-area of data mining where the data to be analyzed is organized in sequences. In many problem domains a natural ordering exists over data. Examples of sequential databases (SDBs) include: (a) collections of temporal data sequences, such as chronological series of daily stock indices or multimedia data (sound, music, video, etc.); and (b) macromolecule banks, where amino acid or proteic sequences are represented as strings. In an SDB it is often valuable to detect regularities through one or several sequences. In particular, finding exact or approximate repetitions of segments can be utilized directly (e.g. for determining the biochemical activity of a protein region) or indirectly, e.g. for prediction in finance. To this end, we present concepts and an algorithm for automatically extracting sequential patterns from a sequential database. Such a pattern is defined as a group of significantly similar segments from one or several sequences. Appropriate functions for measuring similarity between sequence segments are proposed, generalizing the edit distance framework. There is a trade-off between flexibility, particularly in sequence data representation and in associated similarity metrics, and computational efficiency. We designed the FlExPat algorithm to satisfactorily cope with this trade-off. FlExPat's complexity is in practice less than quadratic in the total length of the SDB analyzed, while allowing high flexibility. Some experimental results obtained with FlExPat on music data are presented and discussed.

Journal ArticleDOI
TL;DR: This paper considers the edit distance between two RNA structures, a notion of similarity, introduced in [Proceedings of the Tenth Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1645], taking into account the primary, the secondary and the tertiary structures.

01 Jan 2001
TL;DR: A measure of the similarity of the long-term structure of musical pieces is presented: an abstract "texture score" representation, generated by unsupervised learning, that can be matched against other such scores using a generalized edit distance in order to assess structural similarity.
Abstract: We present a measure of the similarity of the long-term structure of musical pieces. The system deals with raw polyphonic data. Through unsupervised learning, we generate an abstract representation of music, the "texture score". This "texture score" can be matched to other similar scores using a generalized edit distance, in order to assess structural similarity. We notably apply this algorithm to the retrieval of different interpretations of the same song within a music database.

Proceedings ArticleDOI
13 Nov 2001
TL;DR: The main motivation for these methods is two- and higher-dimensional point-pattern matching, and therefore the methods are generalized to the 2D case, where it is shown that this generalization leads to an NP-complete problem.
Abstract: Edit distance is a powerful measure of similarity in string matching, measuring the minimum number of insertions, deletions, and substitutions needed to convert one string into another. This measure is often contrasted with time warping in speech processing, which measures how close two trajectories are by allowing compression and expansion operations on the time scale. Time warping can be easily generalized to measure the similarity between 1D point-patterns (ascending lists of real values), as the difference between the ith and (i-1)th points in a point-pattern can be considered as the value of a trajectory at time i. However, we show that edit distance is a more natural choice, and derive a measure by calculating the minimum amount of space that needs to be inserted and deleted between points to convert one point-pattern into another. We show that this measure defines a metric. We also define a substitution operation such that the distance calculation automatically separates the points into matching and mismatching points. The algorithms are based on dynamic programming. The main motivation for these methods is two- and higher-dimensional point-pattern matching, and therefore we generalize these methods to the 2D case, and show that this generalization leads to an NP-complete problem. There are also applications for the 1D case; we briefly discuss the matching of tree-ring sequences in dendrochronology.

Book ChapterDOI
TL;DR: This paper presents an elegant and very easy to implement bit-vector algorithm for answering the following incremental version of the approximate string matching problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B with equal efficiency?
Abstract: The approximate string matching problem is to find all locations at which a pattern of length m matches a substring of a text of length n with at most k differences. The program agrep implements a simple and practical bit-vector algorithm for this problem. In this paper we consider the following incremental version of the problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B, and the answer for A and Bc, with equal efficiency, where b and c are additional symbols? Here we present an elegant and very easy to implement bit-vector algorithm for answering these questions that requires only O(n⌈m/w⌉) time, where n is the length of A, m is the length of B and w is the number of bits in a machine word. We also present an O(nm⌈h/w⌉) algorithm for the fixed-length approximate string matching problem: given a text t, a pattern p and an integer h, compute the optimal alignment of all substrings of p of length h and a substring of t.
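
For flavour, here is a compact Python rendering of the classical Myers-style bit-parallel computation of the edit distance between two strings; it conveys the kind of bit-vector bookkeeping such algorithms build on, but it is not the incremental or fixed-length variants contributed by this paper. Each column of the dynamic-programming table is encoded as two bit vectors of vertical +1/-1 deltas, so a whole column is updated with a constant number of word operations.

```python
def bitparallel_edit_distance(pattern: str, text: str) -> int:
    """Myers-style bit-parallel Levenshtein distance. Python integers stand in for
    machine words, so patterns longer than one word also work in this sketch."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask, high = (1 << m) - 1, 1 << (m - 1)
    peq = {}                                   # peq[c]: bit i set where pattern[i] == c
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m                 # vertical delta vectors and current D[m][j]
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask          # positive / negative horizontal deltas
        mh = pv & xh
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask            # shift in the +1 delta of the top row
        pv = ((mh << 1) | ~(xv | ph)) & mask
        mv = ph & xv
    return score

print(bitparallel_edit_distance("kitten", "sitting"))   # -> 3
```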

Journal ArticleDOI
TL;DR: A constraint based approach for parametric sequence alignment is proposed which allows for more general string alignment queries where the alignment cost can itself be parameterized as a query with some initial constraints.
Abstract: Approximate matching techniques based on string alignment are important tools for investigating similarities between strings, such as those representing DNA and protein sequences. We propose a constraint-based approach for parametric sequence alignment which allows for more general string alignment queries where the alignment cost can itself be parameterized as a query with some initial constraints. Thus, the costs need not be fixed in a parametric alignment query, unlike the case in normal alignment. The basic dynamic programming string edit distance algorithm is generalized to a naive algorithm which uses inequalities to represent the alignment score. The naive algorithm is rather costly, and the remainder of the paper develops an improvement which prunes alternatives where it can and approximates the alternatives otherwise. This reduces the number of inequalities significantly and strengthens the constraint representation with equalities. We present some preliminary results using parametric alignment on some general alignment queries.

Book ChapterDOI
01 Jul 2001
TL;DR: In this paper, the authors define a fuzzy Hamming distance that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently.
Abstract: Many problems depend on a reliable measure of the distance or similarity between objects that, frequently, are represented as vectors. We consider here vectors that can be expressed as bit sequences. For such problems, the most heavily used measure is the Hamming distance, perhaps normalized. The value of the Hamming distance is limited by the fact that it counts only exact matches, whereas in various applications, corresponding bits that are close by, but not exactly matched, can still be considered to be almost identical. We here define a "fuzzy Hamming distance" that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure.
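
One simple way to realize the "partial credit for near misses" idea (a sketch under our own cost assumptions, not necessarily the paper's exact formulation) is to align the positions of the set bits of the two vectors with an edit-distance-style dynamic program in which moving a set bit a short distance is cheaper than removing it and adding one elsewhere.

```python
def fuzzy_hamming(a, b, shift_cost=0.5, flip_cost=1.0):
    """DP over the positions of 1-bits: pairing two set bits costs shift_cost per
    position of offset (capped at two flips); an unmatched set bit costs flip_cost."""
    p = [i for i, bit in enumerate(a) if bit]
    q = [i for i, bit in enumerate(b) if bit]
    n, m = len(p), len(q)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * flip_cost
    for j in range(1, m + 1):
        D[0][j] = j * flip_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = min(shift_cost * abs(p[i - 1] - q[j - 1]), 2 * flip_cost)
            D[i][j] = min(D[i - 1][j] + flip_cost,    # unmatched bit of a
                          D[i][j - 1] + flip_cost,    # unmatched bit of b
                          D[i - 1][j - 1] + pair)     # two nearby bits paired up
    return D[n][m]

# The plain Hamming distance of these vectors is 2; the near miss costs only 0.5 here.
print(fuzzy_hamming([1, 0, 0, 1], [0, 1, 0, 1]))   # 0.5
```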

Proceedings Article
01 Jan 2001
TL;DR: A comparison of two systems for correcting spelling errors resulting in non-existent words (i.e. not listed in any lexicon) shows the improvements brought by the second approach.
Abstract: We report on the comparison of two systems for correcting spelling errors resulting in non-existent words (i.e. not listed in any lexicon). Both systems aim at improving the editing of medical reports. Unlike traditional systems, based on word language models, both semantic and syntactic contexts are considered here. Both systems share the same string-to-string edit distance module, and the same contextual disambiguation principles. The differences between the two systems lie at the user-interaction level: while the first system uses exclusively the left context, simulating the underlining of every misspelling as soon as each word has been typed, the second system uses the left as well as the right context and simulates a post-editing correction when requested by the author. Our conclusion shows the improvements brought by the second approach.

Book ChapterDOI
05 Sep 2001
TL;DR: A new algorithm for pattern extraction from Stratified Ordered Trees (SOT) is proposed, aiming to detect recurrent syntactic motifs in texts drawn from classical literature.
Abstract: This paper proposes a new algorithm for pattern extraction from Stratified Ordered Trees (SOT). It first describes the SOT data structure that makes it possible to represent structured sequential data. Then it shows how it is possible to extract clusters of similar recurrent patterns from any SOT. The similarity on which our clustering algorithm is based is a generalized edit distance, also described in the paper. The algorithms presented have been tested on text mining: the aim was to detect recurrent syntactic motifs in texts drawn from classical literature. Hopefully, this algorithm can be applied to many different fields where data are naturally sequential (e.g. financial data, molecular biology, traces of computation, etc.)

Journal Article
TL;DR: Based on all of the experiments and analysis in this paper, the efficiency and soundness of the N-Gram-based approach for detecting approximately duplicate records are validated.
Abstract: Eliminating duplicates in large databases has drawn great attention, and it has gradually become a hot issue in research on data quality. In this paper, the problem is studied and an efficient N-Gram-based approach for detecting approximately duplicate database records is proposed. The contributions of this paper are: (1) An efficient N-Gram-based clustering algorithm is proposed, which can tolerate the most common types of spelling mistakes and cluster approximately duplicated records together. In addition, an improved N-Gram-based algorithm is proposed, which can not only detect these duplicates but also automatically revise most of the insertion and deletion spelling mistakes in the words of records. The advantage is that the N-Gram-based algorithm has a computing complexity of only O(N). (2) A very efficient, application-independent pairwise comparison algorithm based on the edit distance is exploited. It verifies whether two records are approximately duplicate records by computing the edit distance between each pair of words in the two records. (3) For detecting approximately duplicate records, an improved algorithm that employs a priority queue is presented. It scans all sorted records sequentially, and clusters approximately duplicate records together by comparing the distance between the current record and the corresponding record in the priority queue. Furthermore, an effective experimental environment is set up and many algorithm tests are carried out. A large number of results are produced from a wide range of actual experiments, and the corresponding detailed analysis is also presented. Based on all of the experiments and analysis in this paper, the efficiency and soundness of the N-Gram-based approach for detecting approximately duplicate records are validated.