Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.

Papers

Proceedings Article•DOI•

An algorithm for fast edit distance computation on GPUs

Reza Farivar¹, Harshit Kharbanda¹, Shivaram Venkataraman², Roy H. Campbell¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of California, Berkeley²

13 May 2012

TL;DR: This paper introduces a new algorithm which modifies the dynamic programming method to reduce its amount of data storage and eliminate control flow divergences and shows that the GPU implementation is up to 8x faster when operating on a large number of sequences.

Abstract: Finding the edit distance between two sequences, and the closely related problem of finding their longest common subsequence, are important problems with applications in many domains, such as virus scanners, security kernels, natural language translation and genome sequence alignment. The traditional dynamic-programming based algorithm is hard to parallelize on SIMD processors as the algorithm is memory intensive and has many divergent control paths. In this paper we introduce a new algorithm which modifies the dynamic programming method to reduce its amount of data storage and eliminate control flow divergences. Our algorithm divides the problem into independent ‘quadrants’ and makes efficient use of shared memory and registers available in GPUs to store data between different phases of the algorithm. Further, we eliminate any control flow divergences by embedding condition variables in the program logic to ensure all the threads execute the same instructions even though they work on different data items. We present an implementation of this algorithm on an NVIDIA GeForce GTX 275 GPU and compare against an optimized multi-threaded implementation on an Intel Core i7-920 quad core CPU with hyper-threading support. Our results show that our GPU implementation is up to 8x faster when operating on a large number of sequences.

21 citations
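
For orientation, the traditional dynamic-programming method that the abstract above takes as its starting point can be sketched as follows. This is a minimal unit-cost sketch in Python, not the paper's quadrant-based GPU algorithm: it keeps only two rows of the table (one simple way to cut storage) and folds the character comparison into arithmetic rather than an if/else, which loosely mirrors the idea of giving every cell the same instruction sequence. The function name `edit_distance` is an illustrative assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            # The comparison result is used as a number (0 or 1), so every
            # cell is updated by the same expression with no branching.
            mismatch = int(ca != cb)
            curr[j] = min(prev[j] + 1,             # delete ca
                          curr[j - 1] + 1,         # insert cb
                          prev[j - 1] + mismatch)  # match or substitute
        prev = curr
    return prev[len(b)]


print(edit_distance("kitten", "sitting"))
```

Each cell depends on its left, upper and upper-left neighbours, which is the dependency pattern that makes naive parallelization of this table difficult.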

Posted Content•

Unified Compression-Based Acceleration of Edit-Distance Computation

Danny Hermelin¹, Gad M. Landau², Shir Landau³, Oren Weimann²•Institutions (3)

Max Planck Society¹, University of Haifa², Tel Aviv University³

07 Apr 2010-arXiv: Data Structures and Algorithms

TL;DR: In this article, the authors present a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding and dictionary methods.

Abstract: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N^2) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN log(N/n)) time for computing the edit-distance of these two strings under any rational scoring function, and an O(n^{2/3}N^{4/3}) time algorithm for arbitrary scoring functions. Our new result, while providing a significant speed up for highly compressible strings, does not surpass the quadratic time bound even in the worst case scenario.

21 citations
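
To make the notion of a straight-line program in the abstract above concrete, here is a minimal Python sketch, assuming a toy representation in which each rule is either a terminal string or a pair of rule names to concatenate. The grammar for Fibonacci strings and the helper names (`fibonacci_slp`, `expanded_length`, `expand`) are illustrative assumptions, and the sketch shows only the representation, not the paper's edit-distance algorithm.

```python
def fibonacci_slp(k: int) -> dict:
    """Toy straight-line program: X1 -> "b", X2 -> "a", Xi -> X(i-1) X(i-2)."""
    rules = {"X1": "b", "X2": "a"}
    for i in range(3, k + 1):
        rules[f"X{i}"] = (f"X{i - 1}", f"X{i - 2}")
    return rules


def expanded_length(rules: dict, name: str, memo=None) -> int:
    """Length of the string a rule derives, computed without expanding it."""
    memo = {} if memo is None else memo
    if name not in memo:
        rule = rules[name]
        if isinstance(rule, str):
            memo[name] = len(rule)
        else:
            left, right = rule
            memo[name] = (expanded_length(rules, left, memo)
                          + expanded_length(rules, right, memo))
    return memo[name]


def expand(rules: dict, name: str) -> str:
    """Fully decompress a rule; only sensible for small examples."""
    rule = rules[name]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(rules, left) + expand(rules, right)


rules = fibonacci_slp(30)
print(len(rules), expanded_length(rules, "X30"))  # n rules versus N characters
print(expand(rules, "X5"))
```

Here 30 rules describe a string of 832,040 characters, which is the kind of gap between n and N that a compression-aware algorithm can exploit.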

Journal Article•DOI•

Analyzing customer behavior from shopping path data using operation edit distance

M. Alex Syaekhoni¹, Chanseung Lee, Young S. Kwon¹•Institutions (1)

Dongguk University¹

01 Aug 2018-Applied Intelligence

TL;DR: A new distance measure, called the Operation edit distance, is proposed; it enables RFID customer shopping path data to be processed effectively using clustering algorithms and, applied to real-world data from a retail store, effectively determined customers’ shopping patterns.

Abstract: Radio frequency identification (RFID) technology has been successfully applied to gather customers' shopping habits from their motion paths and other behavioral data. The customers' behavioral data can be used for marketing purposes, such as improving the store layout or optimizing targeted promotions to specific customers. Some data mining techniques, such as clustering algorithms, can be used to discover customers' hidden behaviors from their shopping paths. However, shopping path data poses particular challenges, including variable length, sequential structure, and the need for a special distance measure. Due to these challenges, traditional clustering algorithms cannot be applied to shopping path data. In this paper, we analyze customer behavior from their shopping path data by using a clustering algorithm. We propose a new distance measure for shopping path data, called the Operation edit distance, to solve the aforementioned problems. The proposed distance method enables the RFID customer shopping path data to be processed effectively using clustering algorithms. We have collected real-world shopping path data from a retail store and applied our method to the dataset. The proposed method effectively determined customers' shopping patterns from the data.

21 citations

Proceedings Article•DOI•

An efficient uniform-cost normalized edit distance algorithm

Abdullah N. Arslan¹, Ömer Eğecioğlu•Institutions (1)

University of California, Santa Barbara¹

24 Apr 1999

TL;DR: An O(mn log n)-time algorithm for the problem of normalized edit distance computation when the cost function is uniform, except substitutions can have different weights depending on whether they are matching or non-matching.

Abstract: A common model for computing the similarity of two strings X and Y of lengths m and n, respectively, with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst-case. We give an O(mn log n)-time algorithm for the problem when the cost function is uniform, i.e., the weight of each edit operation is constant within the same type, except substitutions can have different weights depending on whether they are matching or non-matching.

21 citations
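
The definition in the abstract above (minimum over all edit sequences of weight divided by length) can be illustrated with a direct dynamic program that tracks the number of operations as a third index. This Python sketch is a straightforward, slower baseline and not the O(mn log n) algorithm the paper describes. The function name, the default weights, and the choice to count a matching substitution as one operation of weight `w_match` are assumptions made for illustration.

```python
from math import inf


def normalized_edit_distance(x, y, w_ins=1.0, w_del=1.0, w_sub=1.0, w_match=0.0):
    """Minimum over edit sequences of (total weight) / (number of operations)."""
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0  # no operations needed; define the ratio as 0
    max_len = m + n  # no edit sequence needs more than m + n operations
    # best[i][j][k]: least weight to turn x[:i] into y[:j] in exactly k operations
    best = [[[inf] * (max_len + 1) for _ in range(n + 1)] for _ in range(m + 1)]
    best[0][0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            for k in range(1, i + j + 1):
                cost = inf
                if i > 0:
                    cost = min(cost, best[i - 1][j][k - 1] + w_del)
                if j > 0:
                    cost = min(cost, best[i][j - 1][k - 1] + w_ins)
                if i > 0 and j > 0:
                    step = w_match if x[i - 1] == y[j - 1] else w_sub
                    cost = min(cost, best[i - 1][j - 1][k - 1] + step)
                best[i][j][k] = cost
    return min(best[m][n][k] / k
               for k in range(1, max_len + 1) if best[m][n][k] < inf)


print(normalized_edit_distance("ab", "a"))
```

The table uses memory proportional to mn(m + n), so this version is only practical for short strings; it is meant to make the definition checkable by hand.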

Book Chapter•DOI•

Accurate and Efficient Methods to Improve Multiple Circular Sequence Alignment

Carl Barton¹, Costas S. Iliopoulos², Ritu Kundu², Solon P. Pissis², Ahmad Retha², Fatima Vayani²•Institutions (2)

Queen Mary University of London¹, King's College London²

29 Jun 2015

TL;DR: This work shows how to extend a fast algorithm for approximate circular string matching to approximate circular dictionary matching, proposes an alternative method that is suitable for more divergent sequences, and implements these methods in BEAR, a programme for improving multiple circular sequence alignment.

Abstract: Multiple sequence alignment is a core computational task in bioinformatics and has been extensively studied over the past decades. This computation requires an implicit assumption on the input data: the left- and right-most position for each sequence is relevant. However, this is not the case for circular structures; for instance, MtDNA. Efforts have been made to address this issue but it is far from being solved. We have very recently introduced a fast algorithm for approximate circular string matching (Barton et al., Algo Mol Biol, 2014). Here, we first show how to extend this algorithm for approximate circular dictionary matching; and, then, apply this solution with agglomerative hierarchical clustering to find a sufficiently good rotation for each sequence. Furthermore, we propose an alternative method that is suitable for more divergent sequences. We implemented these methods in BEAR, a programme for improving multiple circular sequence alignment. Experimental results, using real and synthetic data, show the high accuracy and efficiency of these new methods in terms of the inferred likelihood-based phylogenies.

21 citations
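
The rotation problem described in the abstract above can be illustrated with a naive baseline: try every rotation of a circular sequence and keep the one closest, in unit-cost edit distance, to a reference. This Python sketch is an assumption-laden illustration of the problem only; it is not the method implemented in BEAR, and the names `edit_distance` and `best_rotation` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete ca
                            curr[j - 1] + 1,          # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def best_rotation(seq: str, ref: str) -> tuple:
    """Return (r, d): the shift r whose rotation seq[r:] + seq[:r] has the
    smallest edit distance d to ref. Ties go to the smallest shift."""
    best = (0, edit_distance(seq, ref))
    for r in range(1, len(seq)):
        d = edit_distance(seq[r:] + seq[:r], ref)
        if d < best[1]:
            best = (r, d)
    return best


print(best_rotation("cdeab", "abcde"))
```

Trying all rotations multiplies the cost of one comparison by the sequence length, which is why faster approaches to choosing rotations are of interest.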

Metrics

Papers: 3,030

Citations: 78,281

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139
