scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Proceedings ArticleDOI
30 Aug 2005
TL;DR: The pq-gram distance between ordered labeled trees is defined as an effective and efficient approximation of the well-known tree edit distance and the properties of the pq -gram distance are analyzed to compare it with the edit Distance and alternative approximations.
Abstract: When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.

87 citations

Patent
14 Jul 2005
TL;DR: In this article, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema.
Abstract: The invention concerns a detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key—foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.

85 citations

Journal ArticleDOI
25 Jul 2003
TL;DR: This paper extends El-Mabrouk's work to handle duplications as well as insertions and presents an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions.
Abstract: As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multi-set of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications). In this paper we extend El-Mabrouk's work to handle duplications as well as insertions and present an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions. We derive an error bound for our polynomial-time distance computation under various assumptions and present preliminary experimental results that suggest that performance in practice may be excellent, within a few percent of the actual distance.

84 citations

Book ChapterDOI
20 Oct 2004
TL;DR: An algorithm is presented that attempts to select the best choice among all possible corrections for a misspelled term, and its implementation based on a ternary search tree data structure is discussed.
Abstract: Search engines have become the primary means of accessing information on the Web. However, recent studies show misspelled words are very common in queries to these systems. When users misspell query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.

84 citations

Journal ArticleDOI
TL;DR: The Spatio-temporal Edit Distance measure is developed, an extended algorithm to determine the similarity between user trajectories based on call detailed records (CDRs) and performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.
Abstract: The rapid development of information and communication technologies ICTs has provided rich data sources for analyzing, modeling, and interpreting human mobility patterns. This paper contributes to this research area by developing the Spatio-temporal Edit Distance measure, an extended algorithm to determine the similarity between user trajectories based on call detailed records CDRs. We improve the traditional Edit Distance algorithm by incorporating both spatial and temporal information into the cost functions. The extended algorithm can preserve both space and time information from string-formatted CDR data. The novel method is applied to a large data set from Northeast China in order to test its effectiveness. Three types of analyses are presented for scenarios with and without the effect of time: 1 Edit Distance with spatial information; 2 Edit Distance with time as a factor in the cost function; and 3 Edit Distance with time as a constraint in partitioning trajectories. The outcomes of this research contribute to both methodological and empirical perspectives. The extended algorithm performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.

84 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139