Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Patent
06 Apr 2015
TL;DR: The syntactic edit distance between each domain name and every other domain name is determined from the syntax strings of the corresponding domain names, the domain names are clustered by these distances, and the client device is identified as a potential source of malicious software based on the clusters.
Abstract: Examples relate to determining string similarity using syntactic edit distance. In one example, a computing device may: receive domain name system (DNS) packets that were sent by a client device, each DNS packet specifying a domain name; generate, for each domain name, a syntax string by replacing each character of the domain name with one of a plurality of metacharacters, each metacharacter representing a category of characters that is different from each other category of characters represented by each other metacharacter; determine, for each domain name, a syntactic edit distance between the domain name and each other domain name, the syntactic edit distance between domain names being determined based on syntax strings of the corresponding domain names; cluster each domain name into one of a plurality of clusters based on the syntactic edit distances; and identify the client device as a potential source of malicious software based on the clusters.

21 citations
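
The syntax-string idea in the patent abstract above can be sketched as follows. This is a minimal illustration, not the patented method: the three character categories (letter, digit, other) and the metacharacters `a`, `d`, `p` are assumptions chosen for the example, and the helper names are hypothetical. The distance itself is the ordinary Levenshtein distance applied to the syntax strings.

```python
def syntax_string(domain: str) -> str:
    """Replace each character with a metacharacter for its category.

    Assumed categories (not taken from the source):
    'a' = letter, 'd' = digit, 'p' = anything else (dots, hyphens, ...).
    """
    out = []
    for ch in domain:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append("p")
    return "".join(out)


def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,          # delete cs
                           cur[j - 1] + 1,       # insert ct
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]


def syntactic_edit_distance(d1: str, d2: str) -> int:
    """Edit distance between the syntax strings of two domain names."""
    return levenshtein(syntax_string(d1), syntax_string(d2))
```

Under these assumed categories, two domain names with the same letter/digit/punctuation layout are at syntactic distance 0 even when they share no characters. The pairwise distances could then be passed to any clustering routine, which is left out of this sketch.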

Book Chapter
22 Aug 2005
TL;DR: A new algorithm to compute a similarity measure between two cyclic sequences based on Dynamic Time Warping is presented, which computes the optimal alignment between both sequences and is based on the cyclic edit distance algorithm proposed by Maes.
Abstract: Cyclic strings are strings with no starting or ending point, such as those describing a closed contour. We present a new algorithm to compute a similarity measure between two cyclic sequences based on Dynamic Time Warping. The algorithm computes the optimal alignment between both sequences and is based on the cyclic edit distance algorithm proposed by Maes. The algorithm runs in O(mn lg m) time, where m and n are the lengths of the compared strings. Experiments on shape classification and shape retrieval with a public database are presented.

21 citations
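
To make the notion of a cyclic string concrete, here is a brute-force sketch of the cyclic edit distance that the abstract above builds on. It is neither Maes's algorithm nor the Dynamic Time Warping variant from the paper: it tries every rotation of one string, which costs O(m^2 n) time instead of O(mn lg m). The function names are hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def cyclic_edit_distance(a: str, b: str) -> int:
    """Minimum edit distance over all rotations of `a` against `b`.

    Rotating only one of the two strings is sufficient for edit distance.
    """
    if not a:
        return len(b)
    return min(levenshtein(a[s:] + a[:s], b) for s in range(len(a)))
```

For example, "abcd" and "cdab" describe the same cyclic string, so their cyclic edit distance is 0 even though their ordinary edit distance is not.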

Proceedings Article
03 Jun 2012
TL;DR: In this article, the authors proposed a new segmentation evaluation metric, called segmentation similarity (S), which quantifies the similarity between two segmentations as the proportion of boundaries that are not transformed when comparing them using edit distance.
Abstract: We propose a new segmentation evaluation metric, called segmentation similarity (S), that quantifies the similarity between two segmentations as the proportion of boundaries that are not transformed when comparing them using edit distance, essentially using edit distance as a penalty function and scaling penalties by segmentation size. We propose several adapted inter-annotator agreement coefficients which use S that are suitable for segmentation. We show that S is configurable enough to suit a wide variety of segmentation evaluations, and is an improvement upon the state of the art. We also propose using inter-annotator agreement coefficients to evaluate automatic segmenters in terms of human performance.

21 citations

Posted Content
TL;DR: A unified framework for approximate pattern matching under both the Hamming distance and the edit distance is obtained, with meta-algorithms that rely only on a small set of primitive operations; its generality is shown with results for the fully compressed setting, the dynamic setting, and the standard setting.
Abstract: Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at most $k$ mismatches with $P$, while under the edit distance, we search for substrings of $T$ that can be transformed to $P$ with at most $k$ edits. Exact occurrences of $P$ in $T$ have a very simple structure: If we assume for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to $k$ mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common period $O(m/k)$. We tighten this characterization by showing that there are $O(k)$ $k$-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where $T$ and $P$ are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).

21 citations
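
The problem statement in the abstract above can be illustrated with two textbook baselines. These are not the paper's algorithms: the first is a naive O(|T||P|) scan for $k$-mismatch occurrences, and the second is the classical column-by-column dynamic program (usually attributed to Sellers) for $k$-edit occurrences. Note that the second reports end positions rather than starting positions, which is an assumption made to keep the sketch short. Function names are hypothetical.

```python
def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Starting positions i where text[i:i+m] has at most k mismatches
    with pattern (Hamming distance), by direct comparison."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits


def k_edit_end_positions(text: str, pattern: str, k: int) -> list[int]:
    """Positions j >= 1 such that some substring of text ending just
    before index j is within edit distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))      # column j = 0: D[i][0] = i
    ends = []
    for j, ch in enumerate(text, start=1):
        new = [0] * (m + 1)       # D[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new[i] = min(col[i - 1] + cost,  # match or substitute
                         new[i - 1] + 1,     # skip a pattern character
                         col[i] + 1)         # skip a text character
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

The only difference from the ordinary edit distance table is the zero top row, which lets an occurrence begin at any position of the text at no cost.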

Proceedings Article
02 Oct 2019
TL;DR: An algorithm for distinguishing whether the edit distance is at most t or at least t^2 (the quadratic gap problem) in time Õ(n/t+t^3).
Abstract: The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. A simple dynamic programming algorithm computes the edit distance between two strings of length n in O(n^2) time, and a more sophisticated algorithm runs in time O(n + t^2) when the edit distance is t [Landau, Myers and Schmidt, SICOMP 1998]. In pursuit of faster running times, the last couple of decades have seen a flurry of research on approximating edit distance, including polylogarithmic approximation in near-linear time [Andoni, Krauthgamer and Onak, FOCS 2010], and a constant-factor approximation in subquadratic time [Chakrabarty, Das, Goldenberg, Koucký and Saks, FOCS 2018]. We study sublinear-time algorithms for small edit distance, which was investigated extensively because of its numerous applications. Our main result is an algorithm for distinguishing whether the edit distance is at most t or at least t^2 (the quadratic gap problem) in time Õ(n/t + t^3). This time bound is sublinear roughly for all t in [ω(1), o(n^(1/3))], which was not known before. The best previous algorithms solve this problem in sublinear time only for t = ω(n^(1/3)) [Andoni and Onak, STOC 2009]. Our algorithm is based on a new approach that adaptively switches between uniform sampling and reading contiguous blocks of the input strings. In contrast, all previous algorithms choose which coordinates to query non-adaptively. Moreover, it can be extended to solve the t vs t^(2-ε) gap problem in time Õ(n/t^(1-ε) + t^3).

21 citations
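
The "simple dynamic programming" baseline mentioned in the abstract above is the textbook quadratic-time table. A minimal sketch, assuming unit costs for insertion, deletion, and substitution; it is the O(n^2) baseline only, not the sublinear-time gap algorithm from the paper, and the function name is hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Textbook O(len(s) * len(t)) dynamic program for edit distance."""
    n, m = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                  # delete all i characters of s[:i]
    for j in range(m + 1):
        dp[0][j] = j                  # insert all j characters of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    return dp[n][m]
```

Every cell of the table is filled, so the running time is quadratic regardless of how small the true distance is. That cost is what the threshold-sensitive and sublinear-time algorithms cited in the abstract aim to avoid.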


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139