
Efficient and Effective KNN Sequence Search with
Approximate n-grams
Xiaoli Wang (1), Xiaofeng Ding (2,3), Anthony K.H. Tung (1), Zhenjie Zhang (4)

(1) Dept. of Computer Science, National University of Singapore, {xiaoli,atung}@comp.nus.edu.sg
(2) Dept. of Computer Science, Huazhong University of Sci. & Tech., xfding@hust.edu.cn
(3) Dept. of Computer Science, University of South Australia
(4) Advanced Digital Sciences Center, zhenjie@adsc.com.sg
ABSTRACT
In this paper, we address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance. Unlike most existing works, which use short, exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, our new approach allows us to use longer but approximate n-gram matchings as a basis for KNN candidate pruning. Based on this new idea, we devise a pipeline framework over a two-level index for searching KNN in the sequence database. By coupling this framework with several efficient filtering strategies, i.e., the frequency queue and the well-known Combined Algorithm (CA), our proposal brings various enticing advantages over existing works, including 1) a huge reduction in false positive candidates, avoiding large overheads on candidate verification; 2) progressive result update and early termination; and 3) good extensibility to parallel computation. We conduct extensive experiments on three real datasets to verify the superiority of the proposed framework.
1. INTRODUCTION
Given a query sequence, the goal of KNN sequence search is to find the k sequences in the database that are most similar to the query sequence. KNN search on sequences has applications in a variety of areas, including DNA/protein sequence search [13], approximate keyword search [1], and plagiarism detection [17, 30].
Our study here is also motivated by a real application in ebook social annotation systems. In our system (http://readpeer.com), a large number of paragraphs are annotated and associated with comments and discussions. For those who own a physical copy of the book, our aim is to allow them to retrieve these annotations on their mobile devices using query by snapping. As shown in Figure 1, queries are generated by
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 40th International Conference on Very Large Data Bases,
September 1st - 5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 1
Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
Figure 1: An example of the book annotation search (a user's mobile client sends a query to the server, which returns the matching annotations)
users when they use mobile devices to snap a photo of a page in the book. The query photo is then processed by an optical character recognition (OCR) program, which extracts the text from the photo as a sequence. Since the OCR program might generate errors within the sequence, we need to perform an approximate query against the paragraphs on the server to retrieve those paragraphs that have been annotated. Since the range of error in such cases is hard to determine, a k-nearest neighbor (KNN) search is naturally preferred, avoiding the need to estimate how good the results generated from the OCR are.
This paper uses edit distance to evaluate the similarity between two sequences. Edit distance is commonly used in similarity search on large sequence databases, due to its robustness to typical errors in sequences such as misspellings [13]. Existing edit distance algorithms for sequence search have focused on either approximate searching (e.g., [3, 9, 10, 16, 18, 20, 23]) or KNN similarity search [21, 28, 29]. Although range queries have been extensively studied, KNN search remains a challenging issue. Many efforts on answering KNN search utilize the filter-and-refine framework [21, 28, 29]. The main idea is to prune off candidates by utilizing the number of exact matches on a set of n-grams that are generated from the sequences. An n-gram (also called a q-gram) is a contiguous subsequence of a particular sequence. Although such approaches are effective for short sequence searches, they are less effective when there is a need to process longer sequences, such as a page of text in a book. In this paper, we further investigate the KNN search problem from the viewpoint of enhancing efficiency.
In this paper, we develop a novel search framework which uses approximate n-grams as the filtering signatures. This allows us to use longer n-grams compared to exact matches, which in turn gives more accurate pruning since such matching is less likely to be random. We introduce two novel filtering techniques based on approximate n-grams by relaxing the filtering conditions. To ensure efficiency, we employ several strategies. First, we use a frequency queue (f-queue) to buffer the frequency of the approximate n-grams to support candidate selection. This can help to avoid frequent candidate verification. Second, we develop a novel search strategy by employing the paradigm of the CA method [6]. By using the summation of gram edit distances as the aggregation function, the CA strategy can enhance the KNN search by avoiding access to sequences with high dissimilarity. Third, we design a pipeline framework to support simple parallel processing. These strategies are implemented over a two-level inverted index. In the upper-level index, n-grams that are derived from the sequence database are stored in an inverted file with their references to the original sequences. In the lower-level index, each distinct n-gram from the upper level is further decomposed into smaller sub-units, and inverted lists are constructed to store the references to the upper-level grams for each sub-unit. Based on the index, the search framework has two steps.
In the first step, given a query sequence and its n-grams, similar n-grams within a range will be quickly returned using the lower-level index. In the second step, the n-grams returned from the lower level can be automatically used as the input to construct the sorted lists in the upper level. With the sorted lists, our proposed filtering strategies are employed to enhance the search procedure. Our contributions in this paper are summarized as follows:

• We introduce novel bounds for the sequence edit distance based on approximate n-grams. These bounds offer new opportunities for improving pruning effectiveness in sequence matching.

• We propose a novel KNN sequence search framework using several efficient strategies. The f-queue supports our proposed filtering techniques with a sequence buffer for candidate selection. The well-known CA strategy has an excellent property of early termination for scanning the inverted lists, and the pipeline strategy can effectively make use of parallel processing to speed up our search.

• We propose a pipeline search framework based on a two-level inverted index. By adopting a carefully staged processing that starts from searching the lower-level n-gram index and ends at the upper-level sorted list processing, we are able to find KNN for long sequences in an easily parallelizable manner.

• We conduct a series of experiments to compare our proposed filtering strategies with existing methods. The results show that our proposed filtering techniques have better pruning power, and the new filtering strategies can enhance existing filtering techniques.
The rest of this paper is organized as follows. Section 2 discusses related studies. Section 3 provides preliminary concepts and basic principles for the KNN search. Section 4 introduces the proposed filtering techniques. Section 5 illustrates several efficient strategies to support the KNN search. Section 6 presents the pipeline search framework with a two-level inverted index. We evaluate the proposed approaches with experimental results in Section 7 and conclude the paper in Section 8.
2. RELATED WORK
Similarity query based on edit distance is a well-studied problem (e.g., [12, 15, 26]). An extensive survey was conducted early on in [13]. Early algorithms are based on online sequential search, and mainly focus on speeding up the exact sequence edit distance (SED) computation. Among them, the most efficient algorithm requires O(|s|^2 / log |s|) time [12] for computing the SED, and only O(τ|s|) time for testing whether the SED is within some threshold τ [29]. However, online search algorithms still suffer from poor scalability in terms of string length or database size, since they need a full scan of the whole database. To overcome this drawback, most recent works follow a filter-and-refine framework. Many indexing techniques have been proposed to prune off most of the sequences before verifying the exact edit distances for a small set of candidates [14]. There are three main indexing ideas: enumerating, backtracking and partitioning.
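The O(τ|s|) threshold test mentioned above can be sketched with a banded dynamic program in the spirit of the verifier in [29]; this is an illustrative version under our own naming, not the authors' implementation. Only cells within a diagonal band of half-width τ are filled, so a row whose minimum already exceeds τ terminates the test early.

```python
def within_edit_distance(s1, s2, tau):
    """Return True iff the edit distance between s1 and s2 is <= tau.

    Only the band |i - j| <= tau of the DP table is computed,
    giving O(tau * |s1|) time instead of O(|s1| * |s2|).
    """
    if abs(len(s1) - len(s2)) > tau:              # length filtering
        return False
    INF = tau + 1
    prev = {j: j for j in range(min(len(s2), tau) + 1)}
    for i in range(1, len(s1) + 1):
        curr = {}
        lo, hi = max(0, i - tau), min(len(s2), i + tau)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i
                continue
            best = prev.get(j - 1, INF) + (s1[i - 1] != s2[j - 1])  # match/substitute
            best = min(best, prev.get(j, INF) + 1)                  # delete from s1
            best = min(best, curr.get(j - 1, INF) + 1)              # insert into s1
            curr[j] = best
        if min(curr.values()) > tau:              # early termination
            return False
        prev = curr
    return prev.get(len(s2), INF) <= tau
```

Any alignment path that leaves the band must cost more than τ, so restricting computation to the band cannot change the yes/no answer.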
The first idea is introduced for supporting specific queries when strings are very short or the edit distance threshold is small (e.g., [2, 24]). It is clear that enumeration usually has high space complexity and is often impractical in real query systems.
The second idea is based on branch-and-bound techniques on tree index structures. In [4, 22], a trie is used to index all strings in a dictionary. With a trie, all shared prefixes in the dictionary are collapsed into a single path, so they can be processed in the best order for computing the exact SEDs. Sub-trie pruning is employed to enhance the efficiency of computing the edit distance. However, building a trie for all strings is expensive in terms of both time and space complexity. In [29], a B+-tree index structure called the Bed-tree is proposed to support similarity queries based on edit distance. Although this index can be implemented on most modern database systems, it suffers from poor query performance since it has very weak filtering power.
To improve filtering effectiveness, most existing works employ the third idea, which splits original strings into several smaller signatures to reduce the approximate search problem to an exact signature match problem (e.g., [3, 7, 9, 10, 11, 16, 18, 20, 23, 27]). We further classify these methods, based on their preprocessing, into threshold-aware approaches and threshold-free approaches. The threshold-aware approaches have been developed mainly based on the prefix-filtering framework. Recent work in [23] performed a detailed study of these methods [11, 16, 23] and concluded that the prefix-filtering framework can be enhanced with an adaptive framework. These methods typically work well only for a fixed similarity threshold. If the threshold is not fixed, two choices exist. First, the index has to be built online for each query with a distinct threshold. This can be time consuming and is often impractical in real systems. Second, multiple indexes are constructed offline for all possible thresholds. This choice has high space complexity, especially for databases with long sequences, since there can be many distinct edit distance thresholds. The threshold-free approaches generally employ various n-gram based signatures. The basic idea is that if two strings are similar, they should share sufficiently many common signatures. Compared to the threshold-aware approaches, these methods generally have much less preprocessing time and space overhead for storing indexes. However, if we ignore the preprocessing phase, these methods have been shown to have worse performance for supporting edit distance similarity search [16]. This is because they often suffer from poor filtering effectiveness through the use of loose bounds.
Although most of such approaches have been shown to be efficient for approximate searching with a predefined threshold, limited progress has been made in addressing the KNN search problem. Existing efforts utilize two kinds of index mechanisms [21, 28, 29, 5]. The first index mechanism is adapted from inverted list based indexes [21, 28]. The KNN search algorithm employs the same intuition by selecting candidates with a sufficient number of common n-grams. The difference between them is the list merging technique. In [21], the MergeSkip algorithm is employed to reduce the inverted list processing time. A predefined threshold based algorithm is also proposed, which repeats the approximate string query multiple times to support KNN search. In [28], basic length filtering is used to improve list processing. The other index mechanism is based on tree structures [29, 5]. In [29], a B+-tree based index is proposed to index database sequences based on some sequence order. The tree nodes are iteratively traversed to update the lower bound of the edit distance, and the nodes beyond the bound are pruned. In the most recent work [5], an in-memory trie structure is used to index strings and share computations on common prefixes of strings. A range-based method is proposed that groups the pivotal entries to avoid duplicated computations in the dynamic programming matrix when the edit distance is computed. Although such approaches are effective for short sequence search, their performance degrades for long sequences, since the length of the common prefix is relatively short for long sequences, and the large number of long, single branches in the trie brings about large space and computation overhead.
3. PRELIMINARIES
Let Σ be a set of elements, e.g., a finite alphabet of characters in a string database or an infinite set of latitudes and longitudes in a trajectory database. We use s to denote a sequence in Σ* of length |s|, s[i] to denote the ith element, and s[i, j] to denote the subsequence of s from the ith element to the jth element. The common notations used in the rest of the paper are summarized in Table 1. In this paper, we employ edit distance as the measure of the dissimilarity between two sequences, which is formalized as follows.
Definition 1. (Sequence Edit Distance, SED) Given two sequences s1 and s2, the edit distance between them, denoted by λ(s1, s2), is the minimum number of primitive edit operations (i.e., insertions, deletions, and substitutions) on s1 that is necessary for transforming s1 into s2.
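Definition 1 corresponds to the textbook dynamic program below (a minimal reference sketch under our own naming; the optimized SED algorithms discussed in Section 2 are those of [12, 29]):

```python
def sed(s1, s2):
    """Sequence edit distance λ(s1, s2): the minimum number of
    insertions, deletions and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution / match
        prev = curr
    return prev[-1]

print(sed("introduction", "intraduction"))  # one substitution -> 1
```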
We focus on k-nearest neighbor (KNN) search based on the edit distance, following the formal definition below.

Problem 1. Given a query sequence q and a sequence database D = {s1, s2, ..., s|D|}, find k sequences {a1, a2, ..., ak} in D which are more similar to q than the other sequences; that is, for all si in D \ {aj | 1 ≤ j ≤ k}, λ(si, q) ≥ λ(aj, q).
3.1 KNN Sequence Search Using N-grams
In this section, we introduce important concepts and principles of sequence similarity search using n-grams, which is a common technique exploited in existing studies.
Table 1: Notations

Notation      Description
D             the sequence database
q             the query sequence
|s|           the length of sequence s
s[i]          the ith element of sequence s
Gs            the n-gram set of a sequence s
λ(s1, s2)     the edit distance between two sequences s1 and s2
λ(g1, g2)     the edit distance between two n-grams g1 and g2
µ(s1, s2)     the gram mapping distance between two sequences s1 and s2
ϕ             the frequency threshold value of n-grams
k             the k value for the KNN search
τ             the edit distance threshold
τ(t)          the threshold value computed by the aggregation function in the CA method
η(τ, t, n)    the number of n-grams affected by τ edit operations with gram edit distance > t
Definition 2. (n-gram) Given a sequence s and a positive integer n, a positional n-gram of s is a pair (i, g), where g is a subsequence of length n starting at the ith element, i.e., g = s[i, i + n − 1]. The set G(s, n) consists of all n-grams of s, obtained by sliding a window of length n over sequence s. In particular, there are |s| − n + 1 n-grams in G(s, n).
In this paper, we skip the positional information of the n-grams. The simplified 5-gram set of the sequence "introduction", for example, is {intro, ntrod, trodu, roduc, oduct, ducti, uctio, ction}. The n-gram set is useful in edit distance similarity evaluation, based on the following observation: if a sequence s2 can be transformed into s1 by τ primitive edit operations, then s1 and s2 must share at least ϕ = (max{|s1|, |s2|} − n + 1) − n × τ common n-grams [18].
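The observation above can be reproduced in a few lines (a sketch; the helper names are ours). With s2 obtained from s1 = "introduction" by one substitution (τ = 1, n = 5), the two gram multisets must share at least ϕ = (12 − 5 + 1) − 5 × 1 = 3 n-grams:

```python
from collections import Counter

def ngrams(s, n):
    """The n-gram multiset of s (positional information dropped)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def common_ngrams(s1, s2, n):
    """Number of common n-grams, counted as a multiset intersection."""
    return sum((ngrams(s1, n) & ngrams(s2, n)).values())

s1, s2, n, tau = "introduction", "intraduction", 5, 1  # one substitution
phi = (max(len(s1), len(s2)) - n + 1) - n * tau        # (12 - 5 + 1) - 5 = 3
assert common_ngrams(s1, s2, n) >= phi                 # shares ducti, uctio, ction
```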
Algorithm 1 A Simple KNN Sequence Search Algorithm
Require: The n-gram lists LG for q, and k
1:  Initialize a max-heap H using the first k visited sequences;
2:  for Li ∈ LG do
3:    for all unprocessed sj ∈ Li do
4:      frequency[sj]++;
5:      τ = max{λs | s ∈ H};
6:      ϕ = max{|sj|, |q|} − n + 1 − n × τ;
7:      if frequency[sj] ≥ ϕ then
8:        Compute the edit distance λ(sj, q);
9:        if λ(sj, q) < τ then
10:         Update and maintain the max-heap H;
11:       Mark sj as a processed sequence;
12: Output the k sequences in H;
Inverted indexes on the n-grams of the sequences are commonly used, such that references to the original locations of the same n-gram are kept in a list structure. Algorithm 1 shows a typical threshold-based algorithm using the inverted index on the n-grams as well as an auxiliary heap structure. This algorithm dynamically updates the frequency threshold using the maximum edit distance maintained in a max-heap H (lines 6 - 7). The query performance depends on the efficiency of two operations: the inverted list scan and the edit distance computation for the candidate verification (lines 3 - 11).
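Algorithm 1 can be sketched in a few lines of Python. This is a toy, in-memory rendition: the inverted lists are rebuilt per query here, whereas the paper assumes a prebuilt index, and all helper names are ours.

```python
import heapq

def sed(s1, s2):
    """Plain DP for the sequence edit distance λ."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def gram_lists(db, q, n):
    """One inverted list per query n-gram: ids of sequences containing it."""
    return [[sid for sid in sorted(db) if q[i:i + n] in db[sid]]
            for i in range(len(q) - n + 1)]

def knn_search(db, q, k, n):
    """Count filtering with a dynamically updated threshold (Algorithm 1)."""
    ids = sorted(db)
    H = [(-sed(db[sid], q), sid) for sid in ids[:k]]  # max-heap via negation
    heapq.heapify(H)
    processed, freq = set(ids[:k]), {}
    for L in gram_lists(db, q, n):                    # line 2: scan each list
        for sid in L:                                 # line 3: unprocessed entries
            if sid in processed:
                continue
            freq[sid] = freq.get(sid, 0) + 1          # line 4
            tau = -H[0][0]                            # line 5: worst distance in H
            phi = max(len(db[sid]), len(q)) - n + 1 - n * tau   # line 6
            if freq[sid] >= phi:                      # line 7: count filtering
                lam = sed(db[sid], q)                 # line 8: verify
                if lam < tau:
                    heapq.heapreplace(H, (-lam, sid)) # line 10: update H
                processed.add(sid)                    # line 11
    return sorted((-d, sid) for d, sid in H)          # line 12: (λ, id) pairs

db = {0: "introduction", 1: "introspection", 2: "zzzzzzzz", 3: "intraduction"}
print(knn_search(db, "introduction", k=2, n=3))  # [(0, 0), (1, 3)]
```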
Algorithm 1 could be improved by using optimization strategies, such as length filtering [28] and MergeSkip [21]. The intuition behind length filtering is as follows: if two sequences are within an edit distance of τ, their length difference is no larger than τ. Therefore, the inverted list scan is restricted to the sequences within the length constraint. Inverted lists are thus sorted in ascending order of the sequence length. On the other hand, the MergeSkip strategy preprocesses inverted lists such that the references are sorted in ascending order of the sequence identification number. When the maximum entry in the max-heap H is updated, it is used to compute a new frequency threshold ϕ, and those unprocessed sequences with frequencies less than ϕ are skipped. As an example, in Figure 2, sequence no. 10 is first visited and pushed onto the top-1 heap. The temporary frequency threshold is computed as ϕ = 3, and the candidate for the next visit is sequence no. 35. In this way, sequences 20 and 30 are skipped as their frequencies are less than 3.
Figure 2: Illustration of the MergeSkip strategy
Although such approaches may somewhat improve the efficiency of list processing, they may have limited performance since they rely strictly on the efficient processing of inverted lists. For example, length filtering can be useless in a database where most sequences are around the same length. In Figure 2, the top-1 heap is updated when sequence no. 50 is visited, the new frequency threshold is ϕ = 5, and the next visiting candidate is sequence no. 45. In this case, no sequence may be skipped. The reason is that the sequences from 35 to 45, located in the grey area, may have been processed, as the frequencies of their matched n-grams are larger than 3. As the frequency threshold is a loose bound that can generate too many false positives, candidate verification becomes the most time consuming step.
4. NEW FILTERING THEORY
Due to the limited pruning effectiveness of exact n-gram matching, we aim to develop new theories for sequence search filtering by using approximate matching between the n-grams of the two sequences. This is motivated by the observation that using exact n-gram matching typically requires n to be small (so that the probability of having an exact match is not too low), which in turn lowers the selectivity of the n-grams. By allowing approximate matching for these n-grams, we can increase the size of n without compromising the chance of a match taking place, thereby increasing the selectivity of the n-grams and reducing the length of the inverted lists to be scanned. As shown in Definition 2, an n-gram is a subsequence of the original sequence. Consequently, the gram edit distance is computed as the sequence edit distance between two n-grams.
Count filtering is the first pruning strategy we design based on the gram edit distance. It is an extension of the existing count filtering on exact n-gram matchings. Basically, we want to estimate the maximal number of n-grams modified by τ edit operations such that the gram edit distance between the affected n-gram and the queried n-gram is larger than a certain value t (t ≥ 0). This leads to the new count filtering using approximate n-grams, as shown in the following proposition.
Proposition 1. Consider two sequences s1 and s2. If s1 and s2 are within an edit distance of τ, then s1 and s2 must have at most η(τ, t, n) = max{1, n − 2 × t} + (n − t)(τ − 1) n-grams with gram edit distance > t, where t < n.
Proof. Let t = 0. Then η(τ, 0, n) = max{1, n − 2 × 0} + (n − 0) × (τ − 1) = n × τ. Intuitively, this holds because one edit operation can modify at most n n-grams. Consequently, τ edit operations can modify at most n × τ n-grams (i.e., there are at most n × τ n-grams between s1 and s2 with gram edit distance > 0).
Let t ≥ 1. We first analyze the effect of edit operations on the n-grams with a certain gram edit distance (GED). We consider the first edit operation in two cases: it is applied at a position not within the first or last n − 1 n-grams (Case 1), or it is applied within the first or last n − 1 n-grams (Case 2). As shown in Figure 3, in Case 1, one edit operation is applied at the position in the pink box. The two types of edit operations affect n-grams with different distance distributions. Obviously, one substitution will cause n n-grams to have GED = 1, while one insertion or deletion will cause one new n-gram and n − 1 n-grams of various GEDs. Consequently, the upper bound value of η(τ, t, n) must account for at least one n-gram with GED = n. We now show the distribution of the GEDs. As shown in the figure, the two 5-grams g1 and g5 have GED = 1 in Figure 3(a). However, the two 5-grams g2 and g4 can have GED ≥ 2. Generally, one such operation can cause at most n − 2 × t n-grams to have GED > t. Remember that there is at least one newly derived n-gram of GED = n. Therefore, an upper bound on the number of affected n-grams with GED > t is max{1, n − 2 × t}. In Case 2, one edit operation is applied to the first or last n − 1 n-grams. The total number of affected n-grams, denoted by n', is less than n, and the number of affected n-grams with GED > t is less than that of Case 1. Hence Case 1 yields an upper bound on the number of affected n-grams when an insertion or deletion operation is applied.
We now show how the distribution of edit operations affects the maximum number of n-grams with GED > t. Suppose E = {e1, e2, ..., eτ} is a series of edit operations needed to transform one sequence into the other. Suppose the τ edit operations are evenly distributed in the sequence, so that no n-gram is simultaneously affected by multiple edit operations. In this case, the number of affected n-grams is maximized. As analyzed above, one edit operation will affect at most max{1, n − 2 × t} n-grams to have GED > t. This is the boundary case where the edit operation is the first or the last operation. It is clear that the number of affected n-grams with GED > t, to the left of the first edit operation and to the right of the last edit operation, is at most max{1, n − 2 × t}. For each of the remaining τ − 1 edit operations, one new operation will cause n − t newly affected

Figure 3: Effect of edit operations (substitutions and insertions) on the n-grams g1, ..., g5; panels (a)-(f) cover Case 1 and Case 2
Figure 4: An example of the count filtering (inverted lists L0-L7 over sequences 10, 20, 30, 40, with frequency tables for GED = 0, 1, 2)
n-grams ahead of its previous edit position, as the boundary position will be affected only in this case. Consequently, the maximum number of affected n-grams with GED > t would be η(τ, t, n) = max{1, n − 2 × t} + (n − t)(τ − 1).
Lemma 1. Consider two sequences s1 and s2. If s1 and s2 are within an edit distance of τ, then s1 and s2 must share at least ϕt(s1, s2) = |s| − n + 1 − η(τ, t, n) n-grams with gram edit distance ≤ t. Here, |s| is equal to max{|s1|, |s2|}.
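Proposition 1 and Lemma 1 reduce to two one-line helpers (a sketch; the function names are ours). Setting t = 0 recovers the classical exact-matching bound n × τ from Section 3:

```python
def eta(tau, t, n):
    """η(τ, t, n): max number of n-grams with gram edit distance > t
    after τ edit operations (Proposition 1)."""
    return max(1, n - 2 * t) + (n - t) * (tau - 1)

def phi_t(len1, len2, tau, t, n):
    """ϕ_t: min number of n-grams two sequences within edit distance τ
    must share with gram edit distance <= t (Lemma 1)."""
    return max(len1, len2) - n + 1 - eta(tau, t, n)

print(eta(3, 0, 5))            # max(1, 5) + 5 * 2 = 15, i.e. n * tau
print(phi_t(12, 12, 1, 0, 5))  # 8 - 5 = 3
```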
The proposed count filtering offers new opportunities to improve the search performance, as it has stronger filtering ability. As shown in Figure 4, no sequence is pruned using the count filtering with common n-grams of ϕ0 = 0. By using the count filtering with n-grams of GED = 1, sequence no. 40 can be pruned by ϕ1, as its frequency (i.e., Freq.) of n-grams with GED ≤ 1 is less than ϕ1. Similarly, sequence no. 10 is pruned by using the count filtering of ϕ2.
Mapping filtering is a more complicated pruning strategy, but provides more effective pruning based on the gram edit distance. To begin with, we first define the distance between two multi-sets of n-grams.
Definition 3. (Gram Mapping Distance, GMD) Given two gram multi-sets Gs1 and Gs2 of s1 and s2, respectively, with the same cardinality, the mapping distance between s1 and s2 is defined as the sum of distances of the optimal mapping between their gram multi-sets, and is computed as

µ(s1, s2) = min_P Σ_{gi ∈ Gs1} λ(gi, P(gi)),  where P : Gs1 → Gs2 is a bijection.
The computation of the gram mapping distance is accomplished by finding an optimal mapping between the two gram multi-sets. Similar to the work in [25], we can construct a weighted matrix for each pair of grams from the two sequences, and apply the Hungarian algorithm [8, 19]. Based on the gram mapping distance, we show how a tighter lower bound on the edit distance between two sequences can be achieved.
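For tiny gram sets, the optimal mapping of Definition 3 can be found by brute force over all bijections (illustration only, with our own naming; the Hungarian algorithm [8, 19] replaces this factorial enumeration in practice):

```python
from itertools import permutations

def ged(g1, g2):
    """Gram edit distance: plain edit-distance DP between two n-grams."""
    prev = list(range(len(g2) + 1))
    for i, c1 in enumerate(g1, 1):
        curr = [i]
        for j, c2 in enumerate(g2, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def gmd(grams1, grams2):
    """Gram mapping distance µ: cost of the best bijection between two
    equal-cardinality gram multisets (O(m!) -- toy sizes only)."""
    assert len(grams1) == len(grams2)
    return min(sum(ged(g, p) for g, p in zip(grams1, perm))
               for perm in permutations(grams2))

print(gmd(["intro", "ntrod"], ["intro", "ntrad"]))  # identity mapping costs 1
```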
Lemma 2. Given two sequences s1 and s2, the gram mapping distance µ(s1, s2) between s1 and s2 satisfies

µ(s1, s2) ≤ (3n − 2) × λ(s1, s2).
Proof. Let E = {e1, e2, ..., eK} be a minimum series of edit operations transforming s1 into s2, so that K = λ(s1, s2). Accordingly, there is a set of sequences s1 = M0 → M1 → ... → MK = s2, where Mi−1 → Mi indicates that Mi is derived from Mi−1 by performing ei, for 1 ≤ i ≤ K. Assume there are K1 insertion operations, K2 deletion operations and K3 substitution operations; then K = K1 + K2 + K3. We analyze the detailed influence of each type of edit operation as follows.
Insertion operation: When a character is inserted into the sequence Mi−1, at most n n-grams are affected. The edit distance is at most 2 for the (n − 1) changed n-grams, and n for the one newly inserted n-gram. Thus, we conclude that µ(Mi−1, Mi) ≤ 2(n − 1) + n = 3n − 2.
Deletion operation: When one character is deleted from the sequence Mi−1, a total of n n-grams may be affected. The edit distance is at most 2 for the (n − 1) changed n-grams, and n for the one deleted n-gram. Thus, in the case of deleting one character, µ(Mi−1, Mi) ≤ 2(n − 1) + n = 3n − 2.
Substitution operation: When a character in sequence Mi−1 is substituted by another character, a total of n n-grams are affected. The edit distance for each affected n-gram is equal to 1, and thus we have µ(Mi−1, Mi) ≤ n.
By analyzing the effect of the above three operations, we conclude that the GMD and the SED have the following relationship:

µ(s1, s2) ≤ (3n − 2) × K1 + (3n − 2) × K2 + n × K3
          ≤ (3n − 2) × (K1 + K2 + K3)
          = (3n − 2) × λ(s1, s2)
Lemma 2 naturally gives us a new lower bound estimation method for the sequence edit distance. Given two sequences s1 and s2, and an edit distance threshold τ, if µ(s1, s2)/(3n − 2) > τ, then λ(s1, s2) > τ. While the bound is effective, it remains computationally expensive if we directly apply it for pre-pruning. In this work, we instead employ this bound function to compute the aggregation value in the CA filtering algorithms. That is, we use the summation of gram edit distances as the aggregation function, instead of directly computing the mapping distance. We will introduce new filtering strategies and algorithmic frameworks to make these theories practical.
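The resulting pre-pruning rule is simple: a candidate can be discarded whenever µ(s1, s2)/(3n − 2) > τ. A self-contained sketch (brute-force GMD over equal-length toy sequences; all helper names are ours):

```python
from itertools import permutations

def ged(g1, g2):
    """Edit distance between two n-grams (plain DP)."""
    prev = list(range(len(g2) + 1))
    for i, c1 in enumerate(g1, 1):
        curr = [i]
        for j, c2 in enumerate(g2, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def grams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def gmd(G1, G2):
    """Brute-force gram mapping distance (toy sizes only)."""
    return min(sum(ged(g, p) for g, p in zip(G1, perm))
               for perm in permutations(G2))

def can_prune(s1, s2, n, tau):
    """Lemma 2 pruning rule: if µ/(3n - 2) > τ then λ(s1, s2) > τ."""
    return gmd(grams(s1, n), grams(s2, n)) / (3 * n - 2) > tau

print(can_prune("abcdef", "uvwxyz", 3, 1))  # far apart: pruned
print(can_prune("abcdef", "abcdxf", 3, 1))  # one substitution: kept
```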
5. FILTERING ALGORITHMS
Based on the filtering theories derived in the previous section, we introduce new algorithms to support efficient filtering. Given a query sequence q, we assume that there are existing inverted lists that support efficient search on the n-grams under a specified edit distance constraint, as shown in Figures 5 and 6, with LG = {L0, L1, ..., L|q|−n}.

References
Journal ArticleDOI

The Hungarian method for the assignment problem

TL;DR: This paper has always been one of my favorite children, combining as it does elements of the duality of linear programming and combinatorial tools from graph theory, and it may be of some interest to tell the story of its origin in this article.

Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Proceedings ArticleDOI

Optimal aggregation algorithms for middleware

TL;DR: An elegant and remarkably simple algorithm is analyzed that is optimal in a much stronger sense than FA, and is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability sense, but over every database.
Journal ArticleDOI

A faster algorithm computing string edit distances

TL;DR: An algorithm is described for computing the edit distance between two strings of length n and m, n ≥ m, which requires O(n · max(1, m/log n)) steps whenever the costs of edit operations are integral multiples of a single positive real number and the alphabet for the strings is finite.
Frequently Asked Questions (16)
Q1. What contributions have the authors mentioned in the paper "Efficient and effective knn sequence search with approximate n-grams" ?

In this paper, the authors address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance. Unlike most existing works, which use short and exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, their new approach allows longer but approximate n-gram matchings to serve as a basis for pruning KNN candidates. Based on this new idea, the authors devise a pipeline framework over a two-level index for searching KNN in the sequence database. By coupling this framework with several efficient filtering strategies, i.e. the frequency queue and the well-known Combined Algorithm (CA), their proposal brings various enticing advantages over existing works, including 1) a huge reduction in false positive candidates, avoiding large overheads on candidate verification; 2) progressive result update and early termination; and 3) good extensibility to parallel computation. The authors conduct extensive experiments on three real datasets to verify the superiority of the proposed framework.

As a conclusion, their proposed filtering strategies show excellent performance on the KNN search, and the pipeline framework is easy to extend to parallel computation. 

The f-queue can be used to improve the performance of existing algorithms based on the length filtering or the MergeSkip strategy. 

Edit distance is commonly used in similarity search on large sequence databases, due to its robustness to typical errors in sequences like misspelling [13]. 

To avoid large overheads in list processing, the authors use the CA-based strategy [6] with the summation of gram edit distances as the aggregation function.

With a trie, all shared prefixes in the dictionary are collapsed into a single path, so the sequences can be processed in the best order for computing the exact SEDs.

When the maximum entry in the max-heap H is updated, it is used to compute a new frequency threshold ϕ, and those unprocessed sequences with frequencies less than ϕ are skipped. 
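The threshold update described above can be sketched with a max-heap over the current top-k distances, combined with the classic count-filtering bound phi = |G(q)| − n·τ (an edit operation destroys at most n grams). The function name, the candidate-list layout, and dist_fn below are illustrative, not the paper's exact routine.

```python
import heapq

def topk_with_count_filter(candidates, k, num_grams, n, dist_fn):
    """Maintain the current top-k edit distances in a max-heap
    (heapq is a min-heap, so distances are negated).  Whenever the
    k-th distance tau changes, a new frequency threshold
    phi = num_grams - n * tau is derived, and remaining candidates
    with fewer shared grams are skipped.  `candidates` is a list of
    (shared_gram_freq, seq_id) pairs sorted by decreasing frequency."""
    heap = []   # entries are (-distance, seq_id)
    phi = 0     # frequency threshold; 0 means no pruning yet
    for freq, sid in candidates:
        if len(heap) >= k and freq < phi:
            break  # all remaining candidates have freq < phi
        d = dist_fn(sid)
        if len(heap) < k:
            heapq.heappush(heap, (-d, sid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, sid))
        if len(heap) == k:
            tau = -heap[0][0]                  # current k-th distance
            phi = max(0, num_grams - n * tau)  # count-filtering bound
    return sorted((-nd, sid) for nd, sid in heap)
```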

When the authors do sorted access to the list L4, each value of ti with i = 0, 1, ..., 4 is set to 2, as no entry has a distance of 1, and 2 is the smallest score that can be obtained for unseen elements.

The CA strategy is used to terminate the whole process if the CA threshold value of the gram edit distance summation is larger than the temporary threshold computed from the top-k heap. 

The results indicate that combining the MergeSkip with the length filter can help to reduce the candidate size and improve the query performance. 

Given two sequences s1 and s2, the edit distance between them, denoted by λ(s1, s2), is the minimum number of primitive edit operations (i.e., insertion, deletion, and substitution) on s1 that are necessary for transforming s1 into s2.
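This definition corresponds to the standard dynamic-programming computation; a minimal sketch (not the paper's optimized verification routine):

```python
def edit_distance(s1, s2):
    """Standard DP for the edit distance lambda(s1, s2), keeping
    only the previous row to use O(len(s2)) space."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from s1[:0] to every prefix of s2
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n]
```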

While existing approaches often suffer from poor filtering power and low query performance when sequences in the database are long, the authors tackle the problem by designing a novel filter-and-refine pipeline approach utilizing approximate n-gram matchings.

Although this index can be implemented on most modern database systems, it suffers from poor query performance since it has a very weak filtering power. 

When the k value is as small as 1, Flamingo can run efficiently as it only needs to execute a range query once to obtain the top-1 result. 

This algorithm dynamically updates the frequency threshold using the maximum edit distance maintained in a max-heap H (lines 6 - 7). 

The intuition behind length filtering is as follows: if two sequences are within an edit distance of τ, their length difference is no larger than τ.
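The length filter follows because each edit operation changes a sequence's length by at most one; as a one-line predicate (names are illustrative):

```python
def length_filter(s1, s2, tau):
    """Keep a pair only if it can possibly be within edit distance tau:
    each edit operation changes the length by at most one, so a length
    difference greater than tau already rules the pair out."""
    return abs(len(s1) - len(s2)) <= tau
```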