Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Journal Article
TL;DR: The authors define an ensemble of elastic distance measures that they believe is the first classifier to significantly outperform DTW, raising the bar for future work in this area, and they show that the ensemble is more accurate than approaches not based in the time domain.
Abstract: Several alternative distance measures for comparing time series have recently been proposed and evaluated on time series classification (TSC) problems. These include variants of dynamic time warping (DTW), such as weighted and derivative DTW, and edit distance-based measures, including longest common subsequence, edit distance with real penalty, time warp with edit, and move-split-merge. These measures have the common characteristic that they operate in the time domain and compensate for potential localised misalignment through some elastic adjustment. Our aim is to experimentally test two hypotheses related to these distance measures. Firstly, we test whether there is any significant difference in accuracy for TSC problems between nearest neighbour classifiers using these distance measures. Secondly, we test whether combining these elastic distance measures through simple ensemble schemes gives significantly better accuracy. We test these hypotheses by carrying out one of the largest experimental studies ever conducted into time series classification. Our first key finding is that there is no significant difference between the elastic distance measures in terms of classification accuracy on our data sets. Our second finding, and the major contribution of this work, is to define an ensemble classifier that significantly outperforms the individual classifiers. We also demonstrate that the ensemble is more accurate than approaches not based in the time domain. Nearly all TSC papers in the data mining literature cite DTW (with warping window set through cross validation) as the benchmark for comparison. We believe that our ensemble is the first ever classifier to significantly outperform DTW and as such raises the bar for future work in this area.

443 citations
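
The entry above names DTW with a warping window as the usual benchmark. As a rough illustration only (not the authors' code), a windowed DTW distance between two numeric sequences might be sketched in Python as follows. The function name `dtw_distance`, the squared-difference local cost, and the band-style `window` parameter are assumptions made for this sketch.

```python
import math


def dtw_distance(a, b, window=None):
    """Dynamic time warping cost between two numeric sequences.

    `window` limits how far index i in `a` may be matched from index j
    in `b` (a band constraint). None means no constraint.
    Hypothetical helper for illustration; local cost is (x - y) ** 2.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be at least |n - m| wide, or no warping path exists.
    window = max(window, abs(n - m))

    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

With `window=0` and equal-length inputs the measure reduces to a point-by-point squared distance; widening the window lets the alignment absorb local shifts, which is the "elastic adjustment" the abstract refers to.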

Journal Article
TL;DR: An algorithm is presented to construct a deterministic finite-state automaton that solves the problem of locating in any string a substring whose edit distance from p is at most a given constant t.

413 citations
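
The problem stated in the entry above (find substrings of a text within edit distance t of a pattern p) can also be illustrated without building an automaton. The sketch below uses a plain column-by-column dynamic program instead, which is a different and simpler technique than the automaton construction the paper describes; the name `approx_find` and the choice to report end positions are assumptions of this example.

```python
def approx_find(pattern, text, t):
    """Return every end position j (1-based, 0 = before the first
    character) such that some substring of `text` ending at j is
    within edit distance `t` of `pattern`.

    Illustrative dynamic-programming sketch, O(len(pattern) * len(text)).
    """
    m = len(pattern)
    # prev[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the previous position.
    prev = list(range(m + 1))
    ends = [0] if m <= t else []
    for j, c in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cur[i] = min(prev[i - 1] + (pattern[i - 1] != c),  # match/substitute
                         prev[i] + 1,                           # extra text character
                         cur[i - 1] + 1)                        # missing pattern character
        if cur[m] <= t:
            ends.append(j)
        prev = cur
    return ends
```

For example, `approx_find("abc", "xxabcxx", 1)` reports positions 4, 5 and 6, corresponding to the substrings "ab", "abc" and "abcx".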

Journal Article
01 Aug 2009
TL;DR: Three novel methods are introduced to compute upper and lower bounds for the edit distance between two graphs in polynomial time, and experimental results show that these methods scale well in both the number of graphs and the size of the graphs.
Abstract: Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision, etc. Unfortunately, the problem of graph edit distance computation is NP-Hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms, AppFull and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.

413 citations

Proceedings Article
01 Nov 2010
TL;DR: This paper develops a methodology to detect domain fluxing, as used by the Conficker botnet, with minimal false positives and applies it to packet traces collected at a Tier-1 ISP.
Abstract: Recent botnets such as Conficker, Kraken and Torpig have used DNS-based "domain fluxing" for command-and-control, where each bot queries for the existence of a series of domain names and the owner has to register only one such domain name. In this paper, we develop a methodology to detect such "domain fluxes" in DNS traffic by looking for patterns inherent to domain names that are generated algorithmically, in contrast to those generated by humans. In particular, we look at the distribution of alphanumeric characters as well as bigrams in all domains that are mapped to the same set of IP addresses. We present and compare the performance of several distance metrics, including KL-distance, edit distance and the Jaccard measure. We train by using a good data set of domains obtained via a crawl of domains mapped to all IPv4 address space and modeling bad data sets based on behaviors seen so far and expected. We also apply our methodology to packet traces collected at a Tier-1 ISP and show we can automatically detect domain fluxing as used by the Conficker botnet with minimal false positives.

405 citations
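
To make one of the metrics named in the entry above concrete, here is a small sketch of a Jaccard measure computed over character bigrams of domain labels. It is only an illustration of the general idea, not the paper's procedure: the function names, the use of bigram sets rather than distributions, and the averaging over all pairs in a group are assumptions of this example.

```python
from itertools import combinations


def bigrams(label):
    """Set of adjacent character pairs in a domain label."""
    return {label[i:i + 2] for i in range(len(label) - 1)}


def bigram_jaccard(a, b):
    """Jaccard index |A & B| / |A | B| over the bigram sets of a and b.

    Returns 0.0 when neither label has any bigram (assumed convention).
    """
    set_a, set_b = bigrams(a), bigrams(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def mean_pairwise_jaccard(labels):
    """Average bigram Jaccard index over all pairs in a group of labels,
    e.g. labels that resolve to the same set of IP addresses."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(bigram_jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, a group of human-chosen names that share common substrings would tend to score higher than a group of names drawn at random from the alphabet, which is the kind of contrast the abstract describes.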

Journal Article
TL;DR: The set of allowable edit operations is extended to include interchanging the positions of two adjacent characters, and it is shown that, under certain restrictions on edit-operation costs, the extended problem can still be solved in time proportional to the product of the lengths of the given strings.
Abstract: The string-to-string correction problem asks for a sequence S of "edit operations" of minimal cost such that S(A) = B, for given strings A and B. The edit operations previously investigated allow changing one symbol of a string into another single symbol, deleting one symbol from a string, or inserting a single symbol into a string. This paper extends the set of allowable edit operations to include the operation of interchanging the positions of two adjacent characters. Under certain restrictions on edit-operation costs, it is shown that the extended problem can still be solved in time proportional to the product of the lengths of the given strings.

350 citations
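
As an illustration of the extended problem in the entry above, the following sketch computes an edit distance that allows insertion, deletion, substitution and interchange of two adjacent characters. It is a commonly used dynamic program for this setting, written here with unit costs for every operation; the unit costs and the name `damerau_levenshtein` are assumptions of this example, not details taken from the paper. The running time is proportional to the product of the two string lengths.

```python
def damerau_levenshtein(a, b):
    """Edit distance with adjacent transpositions, all costs equal to 1.

    Illustrative sketch. The table d is shifted by one row and one
    column so that index 0 holds a sentinel "infinity".
    """
    n, m = len(a), len(b)
    inf = n + m  # no edit script needs more than n + m operations
    d = [[inf] * (m + 2) for _ in range(n + 2)]
    for i in range(n + 1):
        d[i + 1][1] = i  # delete the first i characters of a
    for j in range(m + 1):
        d[1][j + 1] = j  # insert the first j characters of b

    last_row = {}  # last position in a (1-based) where each character occurred
    for i in range(1, n + 1):
        last_col = 0  # last position in b that matched a[i-1] in this row
        for j in range(1, m + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,       # match or substitute
                d[i + 1][j] + 1,      # insert b[j-1]
                d[i][j + 1] + 1,      # delete a[i-1]
                # transpose a[k-1] and a[i-1], deleting the characters
                # between them and inserting those between l and j in b
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),
            )
        last_row[a[i - 1]] = i
    return d[n + 1][m + 1]
```

For instance, "ab" and "ba" are at distance 1 (one interchange), and "ca" and "abc" are at distance 2 (interchange to "ac", then insert "b").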


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139