Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Proceedings ArticleDOI
13 May 2012
TL;DR: This paper introduces a new algorithm which modifies the dynamic programming method to reduce its amount of data storage and eliminate control flow divergences and shows that the GPU implementation is up to 8x faster when operating on a large number of sequences.
Abstract: Finding the edit distance between two sequences, and the closely related problem of finding their longest common subsequence, are important problems with applications in many domains such as virus scanners, security kernels, natural language translation and genome sequence alignment. The traditional dynamic-programming based algorithm is hard to parallelize on SIMD processors as the algorithm is memory intensive and has many divergent control paths. In this paper we introduce a new algorithm which modifies the dynamic programming method to reduce its amount of data storage and eliminate control flow divergences. Our algorithm divides the problem into independent ‘quadrants’ and makes efficient use of shared memory and registers available in GPUs to store data between different phases of the algorithm. Further, we eliminate any control flow divergences by embedding condition variables in the program logic to ensure all the threads execute the same instructions even though they work on different data items. We present an implementation of this algorithm on an NVIDIA GeForce GTX 275 GPU and compare against an optimized multi-threaded implementation on an Intel Core i7-920 quad core CPU with hyper-threading support. Our results show that our GPU implementation is up to 8x faster when operating on a large number of sequences.

21 citations

Posted Content
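TL;DR: Using straight-line programs as a generic representation of compressed strings (covering the LZ-family, Run-Length Encoding, Byte-Pair Encoding and dictionary methods), the authors give an O(nN log(N/n))-time edit-distance algorithm for rational scoring functions and an O(n^{2/3}N^{4/3})-time algorithm for arbitrary scoring functions.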
Abstract: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N^2) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN log(N/n)) time for computing the edit-distance of these two strings under any rational scoring function, and an O(n^{2/3}N^{4/3}) time algorithm for arbitrary scoring functions. Our new result, while providing a significant speed up for highly compressible strings, does not surpass the quadratic time bound even in the worst case scenario.

21 citations
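
The "standard dynamic programming solution" that the abstract above takes as its quadratic baseline can be sketched as follows. This is a minimal illustration assuming unit costs for insertion, deletion and substitution; the function name `edit_distance` is hypothetical, and the code is not the compressed-string algorithm from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the classic quadratic-time DP.

    Only two rows of the DP table are kept, so memory is linear in the
    length of the shorter string while time stays O(len(a) * len(b)).
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Every cell depends only on its left, upper and upper-left neighbours, which is why two rows suffice and why the running time is proportional to the product of the two lengths.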

Journal ArticleDOI
TL;DR: A new distance measure, called the Operation edit distance, is proposed that enables RFID customer shopping path data to be processed effectively using clustering algorithms; applied to real-world data from a retail store, it effectively determined customers' shopping patterns.
Abstract: Radio frequency identification (RFID) technology has been successfully applied to gather customers' shopping habits from their motion paths and other behavioral data. The customers' behavioral data can be used for marketing purposes, such as improving the store layout or optimizing targeted promotions to specific customers. Some data mining techniques, such as clustering algorithms, can be used to discover customers' hidden behaviors from their shopping paths. However, shopping path data poses peculiar challenges, including variable length, sequential structure, and the need for a special distance measure. Due to these challenges, traditional clustering algorithms cannot be applied to shopping path data. In this paper, we analyze customer behavior from their shopping path data by using a clustering algorithm. We propose a new distance measure for shopping path data, called the Operation edit distance, to solve the aforementioned problems. The proposed distance method enables the RFID customer shopping path data to be processed effectively using clustering algorithms. We collected real-world shopping path data from a retail store and applied our method to the dataset. The proposed method effectively determined customers' shopping patterns from the data.

21 citations

Proceedings ArticleDOI
24 Apr 1999
TL;DR: An O(mn log n)-time algorithm for the problem of normalized edit distance computation when the cost function is uniform, except substitutions can have different weights depending on whether they are matching or non-matching.
Abstract: A common model for computing the similarity of two strings X and Y of lengths m and n, respectively, with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst-case. We give an O(mn log n)-time algorithm for the problem when the cost function is uniform, i.e., the weight of each edit operation is constant within the same type, except substitutions can have different weights depending on whether they are matching or non-matching.

21 citations
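
The definition of normalized edit distance in the abstract above (the minimum, over all edit sequences, of weight divided by length) can be illustrated with a simple baseline that tracks the cheapest edit sequence for every possible length. This is only a sketch under stated assumptions: it is a straightforward, roughly cubic-time dynamic program, not the O(mn log n) algorithm of the paper; matching substitutions are assumed to count toward the sequence length; and the function name and the default weights are hypothetical.

```python
import math


def normalized_edit_distance(x, y, w_ins=1.0, w_del=1.0, w_sub=1.0, w_match=0.0):
    """Minimum over all edit sequences of (total weight / number of operations).

    Uniform cost function: one weight per operation type, with separate
    weights for matching and non-matching substitutions.
    """
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0  # convention assumed here for two empty strings

    max_len = m + n  # the longest edit sequence uses only insertions and deletions
    inf = math.inf

    # prev[j][k]: minimum weight to turn x[:i-1] into y[:j] using exactly k operations
    prev = [[inf] * (max_len + 1) for _ in range(n + 1)]
    prev[0][0] = 0.0
    for j in range(1, n + 1):
        prev[j][j] = j * w_ins  # row 0 is reachable only through j insertions

    for i in range(1, m + 1):
        curr = [[inf] * (max_len + 1) for _ in range(n + 1)]
        curr[0][i] = i * w_del  # column 0 is reachable only through i deletions
        for j in range(1, n + 1):
            w_diag = w_match if x[i - 1] == y[j - 1] else w_sub
            for k in range(1, i + j + 1):
                curr[j][k] = min(
                    prev[j][k - 1] + w_del,       # delete x[i-1]
                    curr[j - 1][k - 1] + w_ins,   # insert y[j-1]
                    prev[j - 1][k - 1] + w_diag,  # substitute or match
                )
        prev = curr

    # Divide each feasible total weight by its sequence length and keep the best ratio.
    return min(prev[n][k] / k for k in range(1, max_len + 1) if prev[n][k] < inf)


if __name__ == "__main__":
    print(normalized_edit_distance("ab", "a"))
```

Keeping one table entry per sequence length is what makes this baseline slower than the plain edit distance computation: the best ratio cannot be recovered from the minimum-weight sequence alone, since a longer sequence with slightly higher weight may have a lower amortized weight.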

Book ChapterDOI
29 Jun 2015
TL;DR: This work extends a fast algorithm for approximate circular string matching to approximate circular dictionary matching, proposes an alternative method suited to more divergent sequences, and implements both in BEAR, a programme for improving multiple circular sequence alignment.
Abstract: Multiple sequence alignment is a core computational task in bioinformatics and has been extensively studied over the past decades. This computation requires an implicit assumption on the input data: the left- and right-most position for each sequence is relevant. However, this is not the case for circular structures; for instance, MtDNA. Efforts have been made to address this issue but it is far from being solved. We have very recently introduced a fast algorithm for approximate circular string matching (Barton et al., Algo Mol Biol, 2014). Here, we first show how to extend this algorithm for approximate circular dictionary matching; and, then, apply this solution with agglomerative hierarchical clustering to find a sufficiently good rotation for each sequence. Furthermore, we propose an alternative method that is suitable for more divergent sequences. We implemented these methods in BEAR, a programme for improving multiple circular sequence alignment. Experimental results, using real and synthetic data, show the high accuracy and efficiency of these new methods in terms of the inferred likelihood-based phylogenies.

21 citations
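
The underlying difficulty with circular sequences described in the abstract above, namely that the starting position of a sequence is arbitrary, can be illustrated with a brute-force search for the rotation that minimises the edit distance to a reference. This is a naive sketch only: it is not the approximate circular matching algorithm used in BEAR, it assumes unit edit costs, and the function names `edit_distance` and `best_rotation` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def best_rotation(x: str, ref: str) -> tuple[int, int]:
    """Return (offset, distance) for the rotation of x closest to ref.

    Tries every cyclic rotation x[r:] + x[:r]; ties are broken in favour
    of the smallest offset. The cost is len(x) full edit distance runs.
    """
    if not x:
        return 0, len(ref)
    candidates = (
        (r, edit_distance(x[r:] + x[:r], ref)) for r in range(len(x))
    )
    return min(candidates, key=lambda t: (t[1], t[0]))


if __name__ == "__main__":
    offset, dist = best_rotation("cdeab", "abcde")
    print(offset, dist)
```

Trying every rotation multiplies the quadratic cost of each comparison by the sequence length, which is the kind of overhead that dedicated circular matching methods aim to avoid.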


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139