Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published on this topic, receiving 71,491 citations.

Papers

Proceedings Article

Approximate matching of hierarchical data using pq-grams

Nikolaus Augsten¹, Michael H. Böhlen¹, Johann Gamper¹

Free University of Bozen-Bolzano¹

30 Aug 2005

TL;DR: The pq-gram distance between ordered labeled trees is defined as an effective and efficient approximation of the well-known tree edit distance, and its properties are analyzed and compared with the edit distance and alternative approximations.

Abstract: When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.

87 citations

Patent

Detecting duplicate records in databases

Surajit Chaudhuri¹, Venkatesh Ganti¹, Rohit Ananthakrishna¹

Microsoft¹

14 Jul 2005

TL;DR: A duplicate-detection process is implemented by interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema.

Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.

85 citations

Journal Article

Genomic distances under deletions and insertions

Mark Marron¹, Krister M. Swenson¹, Bernard M. E. Moret¹

University of New Mexico¹

25 Jul 2003

TL;DR: This paper extends El-Mabrouk's work to handle duplications as well as insertions and presents an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions.

Abstract: As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multi-set of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications). In this paper we extend El-Mabrouk's work to handle duplications as well as insertions and present an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions. We derive an error bound for our polynomial-time distance computation under various assumptions and present preliminary experimental results that suggest that performance in practice may be excellent, within a few percent of the actual distance.

84 citations

Book Chapter

Spelling Correction for Search Engine Queries

Bruno Martins¹, Mário J. Silva¹

University of Lisbon¹

20 Oct 2004

TL;DR: An algorithm is presented that attempts to select the best choice among all possible corrections for a misspelled term, and its implementation based on a ternary search tree data structure is discussed.

Abstract: Search engines have become the primary means of accessing information on the Web. However, recent studies show that misspelled words are very common in queries to these systems. When users misspell a query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.

84 citations
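
As a rough illustration of how edit distance can be used to rank candidate corrections for a misspelled query term, the sketch below applies plain unit-cost Levenshtein distance to a small in-memory word list. It is not the algorithm from the chapter above, which relies on a ternary search tree; the function names `levenshtein` and `best_corrections`, the `max_distance` threshold, and the sample vocabulary are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def best_corrections(term, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance of term, closest first.

    Assumption: ties are broken alphabetically; a real system would more
    likely use word frequency or query logs.
    """
    scored = sorted((levenshtein(term, word), word) for word in vocabulary)
    return [word for dist, word in scored if dist <= max_distance]


if __name__ == "__main__":
    print(best_corrections("serch", ["search", "perch", "church"]))
```

Scanning the whole vocabulary is linear in its size, which is one reason index structures such as tries or ternary search trees are used in practice.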

Journal Article

Yihong Yuan¹, Martin Raubal²

University of California, Santa Barbara¹, ETH Zurich²

01 Mar 2014 - International Journal of Geographical Information Science

TL;DR: The Spatio-temporal Edit Distance measure is developed, an extended algorithm to determine the similarity between user trajectories based on call detailed records (CDRs) and performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.

Abstract: The rapid development of information and communication technologies (ICTs) has provided rich data sources for analyzing, modeling, and interpreting human mobility patterns. This paper contributes to this research area by developing the Spatio-temporal Edit Distance measure, an extended algorithm to determine the similarity between user trajectories based on call detailed records (CDRs). We improve the traditional Edit Distance algorithm by incorporating both spatial and temporal information into the cost functions. The extended algorithm can preserve both space and time information from string-formatted CDR data. The novel method is applied to a large data set from Northeast China in order to test its effectiveness. Three types of analyses are presented for scenarios with and without the effect of time: (1) Edit Distance with spatial information; (2) Edit Distance with time as a factor in the cost function; and (3) Edit Distance with time as a constraint in partitioning trajectories. The outcomes of this research contribute to both methodological and empirical perspectives. The extended algorithm performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.

84 citations
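
The general idea of folding spatial information into the edit-distance cost function can be sketched as follows. This is only a minimal illustration under stated assumptions, not the measure defined in the paper above: trajectories are assumed to be sequences of planar `(x, y)` points, the substitution cost is a capped, scaled Euclidean distance, and temporal information is left out entirely. The names `weighted_edit_distance`, `spatial_cost`, and the `scale` parameter are hypothetical.

```python
import math


def weighted_edit_distance(seq_a, seq_b, sub_cost, indel_cost=1.0):
    """Edit distance with a pluggable substitution cost and a fixed insert/delete cost."""
    prev = [j * indel_cost for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(seq_b, start=1):
            curr.append(min(
                prev[j] + indel_cost,         # delete a
                curr[j - 1] + indel_cost,     # insert b
                prev[j - 1] + sub_cost(a, b),  # replace a with b
            ))
        prev = curr
    return prev[-1]


def spatial_cost(p, q, scale=10.0):
    """Hypothetical substitution cost: Euclidean distance over `scale`, capped at 1.

    Nearby locations are cheap to substitute; distant ones cost as much
    as a full mismatch.
    """
    return min(1.0, math.dist(p, q) / scale)


if __name__ == "__main__":
    traj_a = [(0, 0), (3, 4), (10, 10)]
    traj_b = [(0, 0), (10, 10)]
    print(weighted_edit_distance(traj_a, traj_b, spatial_cost))
```

Passing the cost as a function keeps the dynamic program unchanged while the notion of similarity varies, so a time-aware cost could be swapped in without touching the recurrence.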

Metrics

Papers: 3,030

Citations: 78,281

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139