Topic
Edit distance
About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.
Papers
30 Aug 2005
TL;DR: The pq-gram distance between ordered labeled trees is defined as an effective and efficient approximation of the well-known tree edit distance, and the properties of the pq-gram distance are analyzed to compare it with the edit distance and alternative approximations.
Abstract: When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.
87 citations
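As an illustration of the idea in the abstract above, here is a minimal sketch of a pq-gram profile and the resulting distance. It follows the construction as it is commonly described (p ancestors as the stem, a sliding window of q children as the base, null nodes padding the tree), not code from the paper. The `(label, children)` tuple representation, the function names, and the defaults p=2, q=3 are assumptions made for this example.

```python
from collections import Counter

NULL = "*"  # placeholder label for the null nodes that pad the tree


def pq_profile(tree, p=2, q=3):
    """Return the bag (Counter) of pq-grams of an ordered labeled tree.

    A tree is a (label, children) pair; children is a list of trees.
    This representation is an assumption made for the sketch.
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        anc = ancestors[1:] + (label,)   # shift this node into the stem
        sib = (NULL,) * q                # sliding window over the children
        if not children:
            profile[anc + sib] += 1      # a leaf contributes one pq-gram
            return
        for child in children:
            sib = sib[1:] + (child[0],)
            profile[anc + sib] += 1
            visit(child, anc)
        for _ in range(q - 1):           # slide the window past the last child
            sib = sib[1:] + (NULL,)
            profile[anc + sib] += 1

    visit(tree, (NULL,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """1 - 2 * |bag intersection| / |bag union| of the two profiles."""
    p1, p2 = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((p1 & p2).values())              # bag intersection size
    total = sum(p1.values()) + sum(p2.values())   # bag union size
    return 1.0 - 2.0 * shared / total


# Two hypothetical address hierarchies that differ in one street label.
addr_a = ("city", [("Main St", [("1", []), ("2", [])]), ("Oak Ave", [("7", [])])])
addr_b = ("city", [("Main Street", [("1", []), ("2", [])]), ("Oak Ave", [("7", [])])])
print(pq_gram_distance(addr_a, addr_b))
```

The distance is 0 for identical trees and 1 when the profiles share no pq-gram. Building a profile visits each node once, which is where the efficiency advantage over an exact tree edit distance comes from.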
14 Jul 2005
TL;DR: In this article, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema.
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
85 citations
25 Jul 2003
TL;DR: This paper extends El-Mabrouk's work to handle duplications as well as insertions and presents an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions.
Abstract: As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multi-set of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications). In this paper we extend El-Mabrouk's work to handle duplications as well as insertions and present an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions. We derive an error bound for our polynomial-time distance computation under various assumptions and present preliminary experimental results that suggest that performance in practice may be excellent, within a few percent of the actual distance.
84 citations
20 Oct 2004
TL;DR: An algorithm is presented that attempts to select the best choice among all possible corrections for a misspelled term, and its implementation based on a ternary search tree data structure is discussed.
Abstract: Search engines have become the primary means of accessing information on the Web. However, recent studies show that misspelled words are very common in queries to these systems. When users misspell a query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.
84 citations
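To make "selecting the best choice among all possible corrections" concrete, the sketch below ranks dictionary words by Levenshtein edit distance and breaks ties by word frequency. It is not the algorithm from the paper above and does not use a ternary search tree; it scans a plain dictionary. The vocabulary, its frequency counts, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance: unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete ca
                cur[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb) # substitute, free on a match
            ))
        prev = cur
    return prev[-1]


def best_correction(term, vocabulary):
    """Pick the closest word; on equal distance prefer the more frequent one.

    vocabulary maps word -> frequency (hypothetical counts for this sketch).
    """
    return min(vocabulary, key=lambda w: (levenshtein(term, w), -vocabulary[w]))


vocab = {"search": 100, "perch": 5, "march": 20}
print(best_correction("serch", vocab))
```

A linear scan like this costs one distance computation per dictionary word. An index structure such as the ternary search tree mentioned in the abstract is one way to avoid visiting words that cannot be close matches.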
TL;DR: The Spatio-temporal Edit Distance measure is developed, an extended algorithm to determine the similarity between user trajectories based on call detailed records (CDRs) and performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.
Abstract: The rapid development of information and communication technologies (ICTs) has provided rich data sources for analyzing, modeling, and interpreting human mobility patterns. This paper contributes to this research area by developing the Spatio-temporal Edit Distance measure, an extended algorithm to determine the similarity between user trajectories based on call detailed records (CDRs). We improve the traditional Edit Distance algorithm by incorporating both spatial and temporal information into the cost functions. The extended algorithm can preserve both space and time information from string-formatted CDR data. The novel method is applied to a large data set from Northeast China in order to test its effectiveness. Three types of analyses are presented for scenarios with and without the effect of time: (1) Edit Distance with spatial information; (2) Edit Distance with time as a factor in the cost function; and (3) Edit Distance with time as a constraint in partitioning trajectories. The outcomes of this research contribute to both methodological and empirical perspectives. The extended algorithm performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.
84 citations
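The abstract above describes folding spatial and temporal information into the cost functions of edit distance. The sketch below shows only that general mechanism: an edit distance whose substitution cost is a pluggable function. The specific cost used here (a weighted mix of capped spatial and temporal gaps), the `(x_km, y_km, hour)` point format, and every parameter value are hypothetical and are not taken from the paper.

```python
import math


def weighted_edit_distance(s, t, sub_cost, indel_cost=1.0):
    """Edit distance between sequences s and t with a custom substitution cost."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel_cost]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + indel_cost,          # delete a
                cur[j - 1] + indel_cost,       # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = cur
    return prev[-1]


def spatio_temporal_cost(a, b, max_km=10.0, max_hours=6.0, w_space=0.5):
    """Hypothetical cost in [0, 1] for two (x_km, y_km, hour) observations."""
    (xa, ya, ta), (xb, yb, tb) = a, b
    space = min(math.hypot(xa - xb, ya - yb) / max_km, 1.0)  # capped spatial gap
    time = min(abs(ta - tb) / max_hours, 1.0)                # capped temporal gap
    return w_space * space + (1.0 - w_space) * time


# Two made-up trajectories: the second visits a nearby cell one hour later.
traj_a = [(0.0, 0.0, 8.0), (3.0, 4.0, 12.0), (0.0, 0.0, 18.0)]
traj_b = [(0.0, 0.0, 8.0), (3.0, 5.0, 13.0), (0.0, 0.0, 18.0)]
print(weighted_edit_distance(traj_a, traj_b, spatio_temporal_cost))
```

With a substitution cost of 0 for equal symbols and 1 otherwise, the function reduces to the traditional edit distance, so nearby or near-simultaneous observations are penalized less than a plain mismatch would be.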