Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.

Papers

Journal Article

Approximate parameterized matching

Carmit Hazay¹, Moshe Lewenstein¹, Dina Sokol²

Bar-Ilan University¹, City University of New York²

01 Aug 2007-ACM Transactions on Algorithms

TL;DR: This work considers the problem in which an error threshold, k, is given, and the goal is to find all locations in t for which there exists a bijection π that maps p into the appropriate |p|-length substring of t with at most k mismatched mapped elements.

Abstract: Two equal-length strings s and s′, over alphabets Σs and Σs′, parameterize match if there exists a bijection π : Σs → Σs′ such that π(s) = s′, where π(s) is the renaming of each character of s via π. Parameterized matching is the problem of finding all parameterized matches of a pattern string p in a text t, and approximate parameterized matching is the problem of finding at each location a bijection π that maximizes the number of characters that are mapped from p to the appropriate |p|-length substring of t. Parameterized matching was introduced as a model for software duplication detection in software maintenance systems and also has applications in image processing and computational biology. For example, approximate parameterized matching models image searching with variable color maps in the presence of errors. We consider the problem for which an error threshold, k, is given, and the goal is to find all locations in t for which there exists a bijection π which maps p into the appropriate |p|-length substring of t with at most k mismatched mapped elements. Our main result is an algorithm for this problem with O(nk^1.5 + mk log m) time complexity, where m = |p| and n = |t|. We also show that when |p| = |t| = m, the problem is equivalent to the maximum matching problem on graphs, yielding an O(m + k^1.5) solution.

66 citations
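
The exact (error-free) notion of a parameterized match defined in the abstract above can be illustrated with a short Python sketch. It is not the paper's algorithm: the function names are hypothetical, the scan is the naive quadratic one, and the approximate variant with an error threshold k is not attempted here.

```python
def parameterized_match(s: str, t: str) -> bool:
    """Return True if some bijection on characters renames s into t."""
    if len(s) != len(t):
        return False
    forward = {}   # character of s -> character of t
    backward = {}  # character of t -> character of s
    for a, b in zip(s, t):
        # setdefault records the pairing on first sight and returns the
        # existing partner afterwards, so any conflict breaks the bijection.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def find_parameterized_matches(p: str, t: str) -> list[int]:
    """Naive scan: start positions i where p parameterize-matches t[i:i+len(p)]."""
    m = len(p)
    return [i for i in range(len(t) - m + 1)
            if parameterized_match(p, t[i:i + m])]
```

For instance, "abab" and "cdcd" match under the renaming a→c, b→d, while "abc" and "xxy" do not, because two different characters would have to map to x.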

Journal Article

Comparing files using structural entropy

Ivan Sorokin

01 Nov 2011-Journal of Computer Virology and Hacking Techniques

TL;DR: The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers and is based solely on determining the similarity in code and data area positions which makes the algorithm effective against many ways of protecting executable code.

Abstract: One of the main trends in the modern anti-virus industry is the development of algorithms that help estimate the similarity of files. Since malware writers tend to use increasingly complex techniques to protect their code such as obfuscation and polymorphism, anti-virus software vendors face problems of the increasing difficulty of file scanning, the considerable growth of anti-virus databases, and file storages overgrowth. For solving such problems, a static analysis of files appears to be of some interest. Its use helps determine those file characteristics that are necessary for their comparison without executing malware samples within a protected environment. The solution provided in this article is based on the assumption that different samples of the same malicious program have a similar order of code and data areas. Each such file area may be characterized not only by its length, but also by its homogeneity. In other words, the file may be characterized by the complexity of its data order. Our approach consists of using wavelet analysis for the segmentation of files into segments of different entropy levels and using edit distance between sequence segments to determine the similarity of the files. The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers. First, this comparison does not take into account the functionality of analysed files and is based solely on determining the similarity in code and data area positions which makes the algorithm effective against many ways of protecting executable code. On the other hand, such a comparison may result in false alarms. Therefore, our solution is useful as a preliminary test that triggers the running of additional checks. Second, the method is relatively easy to implement and does not require code disassembly or emulation. And, third, the method makes the malicious file record compact which is significant when compiling anti-virus databases.

65 citations
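
The idea in the abstract above, describing a file as a sequence of segments with different entropy levels and comparing two such sequences by edit distance, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's method: fixed-size blocks with quantized entropy stand in for the wavelet-based segmentation, segments are compared by exact equality with unit costs, and all names and parameters (`block_size`, `levels`) are hypothetical.

```python
import math
from collections import Counter


def block_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(block).values())


def entropy_segments(data: bytes, block_size: int = 256,
                     levels: int = 4) -> list[tuple[int, int]]:
    """Quantize per-block entropy into `levels` bins and merge equal neighbours.

    Returns (level, run_length_in_blocks) pairs. Fixed-size blocks are an
    assumption replacing the wavelet analysis described in the abstract.
    """
    segments: list[tuple[int, int]] = []
    for start in range(0, len(data), block_size):
        h = block_entropy(data[start:start + block_size])
        level = min(int(h / 8.0 * levels), levels - 1)
        if segments and segments[-1][0] == level:
            segments[-1] = (level, segments[-1][1] + 1)
        else:
            segments.append((level, 1))
    return segments


def edit_distance(a, b) -> int:
    """Unit-cost Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]
```

Two files would then be compared with `edit_distance(entropy_segments(a), entropy_segments(b))`; a small distance suggests a similar layout of low-entropy and high-entropy areas, which, as the abstract notes, is better treated as a preliminary test than as a verdict.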

Posted Content

Fast and Compact Regular Expression Matching

Philip Bille¹, Martin Farach-Colton²
IT University of Copenhagen¹, Rutgers University²

22 Sep 2005-arXiv: Data Structures and Algorithms

TL;DR: This work shows how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.

Abstract: We study 4 problems in string matching, namely, regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. We show how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.

65 citations

Proceedings Article

A pivotal prefix based filtering algorithm for string similarity search

Dong Deng¹, Guoliang Li¹, Jianhua Feng¹
Tsinghua University¹

18 Jun 2014
TL;DR: This work proposes a novel pivotal prefix filter which significantly reduces the number of signatures and develops a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query.
Abstract: We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the similar strings to the query. Existing algorithms use a signature-based framework. They first generate signatures for each string and then prune the dissimilar strings which have no common signatures to the query. However existing methods involve large numbers of signatures and many signatures are unnecessary. Reducing the number of signatures not only increases the pruning power but also decreases the filtering cost. To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. We prove the pivotal filter achieves larger pruning power and less filtering cost than state-of-the-art filters. We develop a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query. We propose an alignment filter that considers the alignments between signatures to prune large numbers of dissimilar pairs with consecutive errors to the query. Experimental results on three real datasets show that our method achieves high performance and outperforms the state-of-the-art methods by an order of magnitude.
65 citations
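
The string similarity search problem stated in the abstract above (given data strings, a query, and an edit-distance threshold, return the strings within that threshold) follows a filter-then-verify pattern. The sketch below is a plain baseline, not the pivotal prefix filter or the alignment filter from the paper: the only filter is a length check, verification is a banded dynamic program with early termination, and the names `within_edit_distance`, `similarity_search`, and `tau` are hypothetical.

```python
def within_edit_distance(a: str, b: str, tau: int) -> bool:
    """True iff the unit-cost edit distance between a and b is at most tau."""
    if abs(len(a) - len(b)) > tau:      # length filter: cheap pruning step
        return False
    inf = tau + 1                       # any value above tau acts as infinity
    prev = [j if j <= tau else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Only cells with |i - j| <= tau can hold a value of at most tau.
        lo, hi = max(1, i - tau), min(len(b), i + tau)
        curr = [inf] * (len(b) + 1)
        if i <= tau:
            curr[0] = i
        for j in range(lo, hi + 1):
            curr[j] = min(prev[j] + 1,                           # deletion
                          curr[j - 1] + 1,                       # insertion
                          prev[j - 1] + (a[i - 1] != b[j - 1]),  # substitution
                          inf)
        if min(curr) > tau:             # the whole row exceeds the threshold
            return False
        prev = curr
    return prev[len(b)] <= tau


def similarity_search(data: list[str], query: str, tau: int) -> list[str]:
    """Return every data string within edit distance tau of the query."""
    return [s for s in data if within_edit_distance(s, query, tau)]
```

Signature-based methods such as the one in the abstract aim to discard most candidates before this verification step runs; here every string that passes the length check is verified.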

Journal Article

A new distance metric on strings computable in linear time

Andrzej Ehrenfeucht¹, David Haussler²
University of Colorado Boulder¹, University of California, Santa Cruz²

01 Jul 1988-Discrete Applied Mathematics

TL;DR: A new metric for sequence comparison that emphasizes global similarity over sequential matching at the local level is described, which has the advantage over the Levenshtein metric that strings of lengths n and m can be compared in time proportional to n + m instead of nm.

65 citations

Metrics

Papers: 3,030
Citations: 78,281

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139
