Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
Journal Article
TL;DR: This work considers the problem for which an error threshold, k, is given, and the goal is to find all locations in t for which there exists a bijection π which maps p into the appropriate |p|-length substring of t with at most k mismatched mapped elements.
Abstract: Two equal length strings s and s′, over alphabets Σs and Σs′, parameterize match if there exists a bijection π : Σs → Σs′ such that π(s) = s′, where π(s) is the renaming of each character of s via π. Parameterized matching is the problem of finding all parameterized matches of a pattern string p in a text t, and approximate parameterized matching is the problem of finding at each location a bijection π that maximizes the number of characters that are mapped from p to the appropriate |p|-length substring of t. Parameterized matching was introduced as a model for software duplication detection in software maintenance systems and also has applications in image processing and computational biology. For example, approximate parameterized matching models image searching with variable color maps in the presence of errors. We consider the problem for which an error threshold, k, is given, and the goal is to find all locations in t for which there exists a bijection π which maps p into the appropriate |p|-length substring of t with at most k mismatched mapped elements. Our main result is an algorithm for this problem with O(nk^1.5 + mk log m) time complexity, where m = |p| and n = |t|. We also show that when |p| = |t| = m, the problem is equivalent to the maximum matching problem on graphs, yielding an O(m + k^1.5) solution.
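The exact (error-free) case defined in this abstract can be illustrated with a short sketch. It checks only whether two equal-length strings parameterize match, by building the character renaming in both directions. It is not the paper's approximate algorithm, and the function name `parameterized_match` is a hypothetical choice made for illustration.

```python
def parameterized_match(s: str, t: str) -> bool:
    """Return True if a one-to-one renaming of characters turns s into t.

    Illustrative sketch only: it covers the exact case (zero mismatches)
    and considers just the characters that occur in the two strings.
    """
    if len(s) != len(t):
        return False
    forward = {}   # character of s -> character of t
    backward = {}  # character of t -> character of s
    for a, b in zip(s, t):
        # setdefault records the pairing on first sight and returns the
        # stored partner afterwards; any disagreement breaks the bijection.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

For instance, `parameterized_match("abab", "cdcd")` holds because a maps to c and b maps to d, while `parameterized_match("ab", "cc")` fails because two different characters would have to map to the same one.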

66 citations

Journal Article
TL;DR: The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers and is based solely on determining the similarity in code and data area positions which makes the algorithm effective against many ways of protecting executable code.
Abstract: One of the main trends in the modern anti-virus industry is the development of algorithms that help estimate the similarity of files. Since malware writers tend to use increasingly complex techniques to protect their code such as obfuscation and polymorphism, anti-virus software vendors face problems of the increasing difficulty of file scanning, the considerable growth of anti-virus databases, and file storages overgrowth. For solving such problems, a static analysis of files appears to be of some interest. Its use helps determine those file characteristics that are necessary for their comparison without executing malware samples within a protected environment. The solution provided in this article is based on the assumption that different samples of the same malicious program have a similar order of code and data areas. Each such file area may be characterized not only by its length, but also by its homogeneity. In other words, the file may be characterized by the complexity of its data order. Our approach consists of using wavelet analysis for the segmentation of files into segments of different entropy levels and using edit distance between sequence segments to determine the similarity of the files. The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers. First, this comparison does not take into account the functionality of analysed files and is based solely on determining the similarity in code and data area positions which makes the algorithm effective against many ways of protecting executable code. On the other hand, such a comparison may result in false alarms. Therefore, our solution is useful as a preliminary test that triggers the running of additional checks. Second, the method is relatively easy to implement and does not require code disassembly or emulation. And, third, the method makes the malicious file record compact which is significant when compiling anti-virus databases.
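The comparison step described in this abstract, edit distance between sequences of file segments, can be sketched as follows. The sketch assumes that each file has already been reduced to a sequence of segment labels (for example, entropy levels such as "low" or "high") and uses the classic unit-cost Levenshtein recurrence. The wavelet segmentation itself and the article's actual cost model are not reproduced here, and the name `edit_distance` is a hypothetical choice.

```python
from typing import Hashable, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Unit-cost Levenshtein distance between two sequences of labels.

    Assumption: insert, delete, and substitute each cost 1. The article
    may weight operations differently (for example, by segment length).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a to reach an empty b
        for j, y in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if x == y else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[len(b)]
```

Under these assumptions, two files whose segment sequences are `["low", "high", "low"]` and `["low", "mid", "high", "low"]` are at distance 1, since a single inserted segment separates them.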

65 citations

Posted Content
TL;DR: This work shows how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.
Abstract: We study 4 problems in string matching, namely, regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. We show how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.

65 citations

Proceedings Article
18 Jun 2014
TL;DR: This work proposes a novel pivotal prefix filter which significantly reduces the number of signatures and develops a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query.
Abstract: We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the similar strings to the query. Existing algorithms use a signature-based framework. They first generate signatures for each string and then prune the dissimilar strings which have no common signatures to the query. However existing methods involve large numbers of signatures and many signatures are unnecessary. Reducing the number of signatures not only increases the pruning power but also decreases the filtering cost. To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. We prove the pivotal filter achieves larger pruning power and less filtering cost than state-of-the-art filters. We develop a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query. We propose an alignment filter that considers the alignments between signatures to prune large numbers of dissimilar pairs with consecutive errors to the query. Experimental results on three real datasets show that our method achieves high performance and outperforms the state-of-the-art methods by an order of magnitude.
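The filter-then-verify idea behind this problem can be illustrated with a much simpler filter than the one the paper proposes. The sketch below swaps in a plain length filter (two strings whose lengths differ by more than k cannot be within edit distance k) followed by a dynamic-programming verification that stops early. It is not the pivotal prefix filter or the alignment filter, and the names `within_distance` and `similarity_search` are hypothetical.

```python
from typing import Iterable, List


def within_distance(a: str, b: str, k: int) -> bool:
    """Return True if the unit-cost edit distance of a and b is at most k."""
    if abs(len(a) - len(b)) > k:
        return False  # length filter: at least |len(a) - len(b)| edits needed
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        if min(curr) > k:
            return False  # the final distance is never below a row minimum
        prev = curr
    return prev[-1] <= k


def similarity_search(data: Iterable[str], query: str, k: int) -> List[str]:
    """Return the data strings within edit distance k of the query."""
    return [s for s in data if within_distance(s, query, k)]
```

This version still scans every data string, so it shows only the pruning principle. Signature-based methods such as the one in the abstract aim to avoid touching most dissimilar strings at all.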

65 citations

Journal Article
TL;DR: A new metric for sequence comparison that emphasizes global similarity over sequential matching at the local level is described, which has the advantage over the Levenshtein metric that strings of lengths n and m can be compared in time proportional to n + m instead of nm.

65 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139