Edit distance

About: Edit distance is a research topic. To date, 2887 publications have appeared within this topic, receiving 71491 citations in total.


Papers
30 Jan 2007
TL;DR: This paper explores methods for making pass-phrases suitable for use with password-based authentication and key-exchange (PAKE) protocols, and in particular, with schemes resilient to server-file compromise.
Abstract: It is well understood that passwords must be very long and complex to have sufficient entropy for security purposes. Unfortunately, these passwords tend to be hard to memorize, and so alternatives are sought. Smart Cards, Biometrics, and Reverse Turing Tests (human-only solvable puzzles) are options, but another option is to use pass-phrases. This paper explores methods for making pass-phrases suitable for use with password-based authentication and key-exchange (PAKE) protocols, and in particular, with schemes resilient to server-file compromise. In particular, the Ω-method of Gentry, MacKenzie and Ramzan, is combined with the Bellovin-Merritt protocol to provide mutual authentication (in the random oracle model (Canetti, Goldreich & Halevi 2004, Bellare, Boldyreva & Palacio 2004, Maurer, Renner & Holenstein 2004)). Furthermore, since common password-related problems are typographical errors, and the CAPSLOCK key, we show how a dictionary can be used with the Damerau-Levenshtein string-edit distance metric to construct a case-insensitive pass-phrase system that can tolerate zero, one, or two spelling-errors per word, with no loss in security. Furthermore, we show that the system can be made to accept pass-phrases that have been arbitrarily reordered, with a security cost that can be calculated. While a pass-phrase space of 2^128 is not achieved by this scheme, sizes in the range of 2^52 to 2^112 result from various selections of parameter sizes. An attacker who has acquired the server-file must exhaust over this space, while an attacker without the server-file cannot succeed with non-negligible probability.

82 citations
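
The dictionary-based tolerance of spelling errors described in the abstract above can be illustrated with a minimal sketch. It is not the paper's construction: it assumes the optimal-string-alignment variant of the Damerau-Levenshtein distance, and the names `osa_distance`, `correct_word`, and the sample word list are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance, optimal-string-alignment variant (assumed)."""
    a, b = a.lower(), b.lower()  # case-insensitive, so CAPSLOCK slips cost nothing
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def correct_word(word, dictionary, max_errors=2):
    """Return the only dictionary word within max_errors of word, else None."""
    matches = [w for w in dictionary if osa_distance(word, w) <= max_errors]
    return matches[0] if len(matches) == 1 else None


words = ["correct", "horse", "battery", "staple"]  # hypothetical dictionary
print(correct_word("HOSRE", words))
```

Requiring a unique match is a design choice of this sketch: a typed word that is equally close to two dictionary entries is rejected rather than guessed.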

Book ChapterDOI
05 Jul 1995
TL;DR: This work focuses on the case in which T is fixed and preprocessed in linear time, while P and k vary over consecutive searches, and gives an O(mq + t_occ) time and O(q) space algorithm, where q ≤ n depends on the problem instance, and t_occ is the size of the output.
Abstract: Let T be a text of length n and P a pattern of length m, both strings over a fixed finite alphabet Σ. We wish to find all approximate occurrences of P in T having weighted edit distance at most k from P: this is the approximate substring matching problem. We focus on the case in which T is fixed and preprocessed in linear time, while P and k vary over consecutive searches. We give an O(mq + t_occ) time and O(q) space algorithm, where q ≤ n depends on the problem instance, and t_occ is the size of the output. The running time is proportional to the amount of matching, in the worst case as fast as standard dynamic programming. The algorithm uses the suffix tree representation of the text. The best previous algorithm requires O(mq log q + t_occ) time and O(mq) space.

81 citations
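
For orientation, the "standard dynamic programming" baseline that the abstract above compares against can be sketched as follows. This is not the suffix-tree algorithm of the paper: it assumes unit edit costs rather than general weights, and the function name `approximate_occurrences` is hypothetical.

```python
def approximate_occurrences(text: str, pattern: str, k: int) -> list:
    """Return 1-based end positions j such that some substring of text
    ending at j has unit-cost edit distance at most k from pattern."""
    m = len(pattern)
    # col[i] holds D[i][j]: the best distance between pattern[:i] and any
    # substring of text ending at position j. Before any text, D[i][0] = i.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]  # D[0][j-1]
        col[0] = 0          # an occurrence may start at any text position
        for i in range(1, m + 1):
            above_left = prev_diag
            prev_diag = col[i]  # save D[i][j-1] before overwriting it
            col[i] = min(
                above_left + (pattern[i - 1] != tc),  # match or substitution
                prev_diag + 1,                        # skip a text character
                col[i - 1] + 1,                       # skip a pattern character
            )
        if col[m] <= k:
            ends.append(j)
    return ends


print(approximate_occurrences("abcdefg", "cxe", 1))
```

The sketch uses O(m) space and O(mn) time by keeping a single column of the table; setting the top row to zero is what turns whole-string edit distance into substring matching.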

Journal ArticleDOI
TL;DR: An efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure is described and will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.
Abstract: Motivation: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. Results: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. Availability: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules. Contact: sokol@sci.brooklyn.cuny.edu

81 citations

Journal ArticleDOI
01 Sep 2010
TL;DR: An analysis on existing ER measures is conducted, showing that they can often conflict with each other by ranking the results of ER algorithms differently, and an efficient linear-time algorithm is presented that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.
Abstract: Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an analysis on existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or GMD) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F1 measure can be directly computed using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.

81 citations
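
The idea of measuring the distance between two clusterings by cluster splits and merges, as described in the abstract above, can be illustrated with a minimal sketch. It assumes unit cost for every split and every merge and assumes that splitting first and merging afterwards gives the count; the configurable cost functions of GMD are not modelled, and the name `unit_merge_distance` is hypothetical.

```python
from collections import Counter


def unit_merge_distance(result, truth):
    """Count unit-cost splits and merges that turn `result` into `truth`.

    Both arguments are lists of non-empty sets that partition the same records
    (an assumption of this sketch).
    """
    # Map every record to the index of its cluster in the target clustering.
    label = {rec: idx for idx, cluster in enumerate(truth) for rec in cluster}
    splits = 0
    pieces = Counter()  # number of fragments that each target cluster receives
    for cluster in result:
        touched = {label[rec] for rec in cluster}
        splits += len(touched) - 1  # cut the cluster into one piece per target
        pieces.update(touched)
    merges = sum(n - 1 for n in pieces.values())  # glue fragments back together
    return splits + merges


result = [{"a", "b"}, {"c"}]
truth = [{"a"}, {"b", "c"}]
print(unit_merge_distance(result, truth))
```

Each record is visited once, so the count runs in time linear in the number of records, which matches the spirit of the linear-time claim without reproducing the paper's algorithm.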

Proceedings ArticleDOI
01 Jan 2010
TL;DR: The lower bound is the first to expose hardness of edit distance stemming from the input strings being "repetitive", which means that many of their substrings are approximately identical, and provides the first rigorous separation between edit distance and Ulam distance.
Abstract: We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor. For strings of length $n$ and every fixed $\varepsilon > 0$, the algorithm computes a $(\log n)^{O(1/\varepsilon)}$ approximation in $n^{1+\varepsilon}$ time. This is an exponential improvement over the previously known approximation factor, $2^{\tilde{O}(\sqrt{\log n})}$, with a comparable running time [Ostrovsky and Rabani, J. ACM 2007, Andoni and Onak, STOC 2009]. This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings $x$ and $y$, and an algorithm can access $y$ in an unrestricted manner, while being charged for querying every symbol of $x$. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being "repetitive", which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance.

80 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139