Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have appeared within this topic, receiving 71,491 citations.

Journal Article

A Normalized Levenshtein Distance Metric

Li Yujian¹, Liu Bo¹

Beijing University of Technology¹

01 Jun 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Experiments using the AESA algorithm in handwritten digit recognition show that the new normalized edit distance between X and Y can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.

Abstract: Although a number of normalized edit distances presented so far may offer good performance in some applications, none of them can be regarded as a genuine metric between strings because they do not satisfy the triangle inequality. Given two strings X and Y over a finite alphabet, this paper defines a new normalized edit distance between X and Y as a simple function of their lengths (|X| and |Y|) and the Generalized Levenshtein Distance (GLD) between them. The new distance can be easily computed through GLD with a complexity of O(|X| · |Y|) and it is a metric valued in [0, 1] under the condition that the weight function is a metric over the set of elementary edit operations with all costs of insertions/deletions having the same weight. Experiments using the AESA algorithm in handwritten digit recognition show that the new distance can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.
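
The abstract does not state the formula itself. As a minimal sketch, assuming the form commonly cited for this paper, 2·GLD(X, Y) / (α·(|X| + |Y|) + GLD(X, Y)), and further assuming unit costs for every operation (so α = 1 and GLD reduces to the plain Levenshtein distance), the normalized distance could be computed as follows. The function names are illustrative and not taken from the paper.

```python
def levenshtein(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance, O(|X| * |Y|) time, O(|Y|) space."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # deleting the first i characters of x
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(x: str, y: str) -> float:
    """Assumed form with unit costs: 2*d / (|x| + |y| + d), valued in [0, 1]."""
    if not x and not y:
        return 0.0  # two empty strings are identical
    d = levenshtein(x, y)
    return 2.0 * d / (len(x) + len(y) + d)
```

Under these assumptions, "kitten" and "sitting" have Levenshtein distance 3 and lengths 6 and 7, so the normalized value is 6 / 16 = 0.375. A string compared with the empty string gives 1.0.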

624 citations

Proceedings Article

Approximate String Joins in a Database (Almost) for Free

Luis Gravano¹, Panagiotis G. Ipeirotis¹, H. V. Jagadish, Nick Koudas², S. Muthukrishnan², Divesh Srivastava²

Columbia University¹, AT&T²

11 Sep 2001

TL;DR: The authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.

Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
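
The filtering idea behind q-gram matching can be illustrated with a short sketch. The paper expresses its filters as relational queries; the Python below is only an in-memory illustration, and the function names, the padding characters `#` and `$`, and the default q = 3 are assumptions made for this example. It applies a length filter and a count filter: a single edit can destroy at most q of the padded q-grams, so two strings within edit distance k must share at least max(|s1|, |s2|) - 1 - (k - 1)·q of them. The position-based filter mentioned in the abstract is left out for brevity.

```python
from collections import Counter


def positional_qgrams(s: str, q: int = 3) -> list[tuple[int, str]]:
    """Return (position, q-gram) pairs for s, padded with q-1 sentinels per side."""
    # '#' and '$' are assumed not to occur in the data.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def may_match(s1: str, s2: str, k: int, q: int = 3) -> bool:
    """Necessary (not sufficient) condition for edit_distance(s1, s2) <= k."""
    # Length filter: k edits change the length by at most k.
    if abs(len(s1) - len(s2)) > k:
        return False
    # Count filter: compare the bags of q-grams, ignoring positions.
    bag1 = Counter(gram for _, gram in positional_qgrams(s1, q))
    bag2 = Counter(gram for _, gram in positional_qgrams(s2, q))
    shared = sum((bag1 & bag2).values())
    return shared >= max(len(s1), len(s2)) - 1 - (k - 1) * q
```

Pairs that pass `may_match` are only candidates and would still need an exact edit distance check; pairs that fail can be discarded without computing it.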

556 citations

A Comparison of String Metrics for Matching Names and Records

W. W. Cohen and P. Ravikumar and S. Fienberg

01 Jan 2003

TL;DR: An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.

Abstract: We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss some issues involved in performing a similar comparison for record-matching techniques, and finally present results for some baseline record-matching algorithms that aggregate string comparisons between fields.

552 citations

Proceedings Article

Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

Ji Gao¹, Jack Lanchantin¹, Mary Lou Soffa¹, Yanjun Qi¹

University of Virginia¹

24 May 2018

TL;DR: DeepWordBug generates small text perturbations in a black-box setting that force a deep-learning classifier to misclassify a text input, using scoring strategies to find the most important words to modify.

Abstract: Although various techniques have been proposed to generate adversarial samples for white-box attacks on text, little attention has been paid to a black-box attack, which is a more realistic scenario. In this paper, we present a novel algorithm, DeepWordBug, to effectively generate small text perturbations in a black-box setting that forces a deep-learning classifier to misclassify a text input. We develop novel scoring strategies to find the most important words to modify such that the deep classifier makes a wrong prediction. Simple character-level transformations are applied to the highest-ranked words in order to minimize the edit distance of the perturbation. We evaluated DeepWordBug on two real-world text datasets: Enron spam emails and IMDB movie reviews. Our experimental results indicate that DeepWordBug can reduce the classification accuracy from 99% to 40% on Enron and from 87% to 26% on IMDB. Our results strongly demonstrate that the generated adversarial sequences from a deep-learning model can similarly evade other deep models.
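
The abstract does not list the character-level transformations. As a hedged illustration of what such low-edit-distance perturbations might look like, the sketch below assumes four operations on a single word: swapping two adjacent characters, substituting one character, deleting one character, and inserting one character. The function names and signatures are hypothetical, and the word-scoring step is omitted because it depends on the target classifier.

```python
def swap(word: str, i: int) -> str:
    """Swap the characters at positions i and i + 1 (no-op if out of range)."""
    if i < 0 or i + 1 >= len(word):
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute(word: str, i: int, ch: str) -> str:
    """Replace the character at position i with ch; assumes 0 <= i < len(word)."""
    return word[:i] + ch + word[i + 1:]


def delete(word: str, i: int) -> str:
    """Remove the character at position i; assumes 0 <= i < len(word)."""
    return word[:i] + word[i + 1:]


def insert(word: str, i: int, ch: str) -> str:
    """Insert ch before position i; assumes 0 <= i <= len(word)."""
    return word[:i] + ch + word[i:]
```

Substitution, deletion, and insertion each change the word by one edit under the Levenshtein distance. An adjacent swap of two different characters counts as two edits under plain Levenshtein, or one under variants that treat transposition as a single operation.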

516 citations

Book Chapter

Eliminating fuzzy duplicates in data warehouses

Rohit Ananthakrishna¹, Surajit Chaudhuri², Venkatesh Ganti²

Cornell University¹, Microsoft²

20 Aug 2002

TL;DR: An algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies, is developed and evaluated on real datasets from an operational data warehouse.

Abstract: The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.

465 citations

Performance

Metrics

Papers: 3,030
Citations: 78,281

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139