
Efficient and Effective KNN Sequence Search with
Approximate n-grams
Xiaoli Wang (1), Xiaofeng Ding (2,3), Anthony K.H. Tung (1), Zhenjie Zhang (4)

(1) Dept. of Computer Science, National University of Singapore, {xiaoli,atung}@comp.nus.edu.sg
(2) Dept. of Computer Science, Huazhong University of Sci. & Tech., xfding@hust.edu.cn
(3) Dept. of Computer Science, University of South Australia
(4) Advanced Digital Sciences Center, zhenjie@adsc.com.sg
ABSTRACT
In this paper, we address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance. Unlike most existing works, which use short, exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, our new approach allows us to use longer but approximate n-gram matchings as a basis for KNN candidate pruning. Based on this new idea, we devise a pipeline framework over a two-level index for searching KNN in the sequence database. By coupling this framework with several efficient filtering strategies, i.e., the frequency queue and the well-known Combined Algorithm (CA), our proposal brings various enticing advantages over existing works, including 1) a huge reduction in false positive candidates, avoiding large overheads on candidate verification; 2) progressive result update and early termination; and 3) good extensibility to parallel computation. We conduct extensive experiments on three real datasets to verify the superiority of the proposed framework.
1. INTRODUCTION
Given a query sequence, the goal of KNN sequence search is to find the k sequences in the database that are most similar to the query sequence. KNN search on sequences has applications in a variety of areas, including DNA/protein sequence search [13], approximate keyword search [1], and plagiarism detection [17, 30].
Our study here is also motivated by a real application in ebook social annotation systems. In our system (http://readpeer.com), a large number of paragraphs are annotated and associated with comments and discussions. For those who own a physical copy of the book, our aim is to allow them to retrieve these annotations on their mobile devices using query by snapping. As shown in Figure 1, queries are generated by
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 40th International Conference on Very Large Data Bases,
September 1st - 5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 1
Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
Figure 1: An example of the book annotation search (a user's mobile client sends a query to the server, which returns the matching annotations)
users when they use mobile devices to snap a photo of a page in the book. The query photo is then processed by an optical character recognition (OCR) program, which extracts the text from the photo as a sequence. Since the OCR program might generate errors within the sequence, we need to perform an approximate query against the paragraphs on the server to retrieve those paragraphs that have been annotated. Since the range of error in such cases is hard to determine, a k-nearest neighbor (KNN) search is naturally preferred, avoiding the need to estimate how good the results generated from the OCR are.
This paper uses edit distance to evaluate the similarity between two sequences. Edit distance is commonly used in similarity search on large sequence databases, due to its robustness to typical errors in sequences such as misspellings [13]. Existing edit distance algorithms for sequence search have focused on either approximate searching (e.g., [3, 9, 10, 16, 18, 20, 23]) or KNN similarity search [21, 28, 29]. Although range queries have been extensively studied, KNN search remains a challenging issue. Many efforts on answering KNN search utilize the filter-and-refine framework [21, 28, 29]. The main idea is to prune off candidates by utilizing the number of exact matches on a set of n-grams that are generated from the sequences. An n-gram (also called a q-gram) is a contiguous subsequence of a particular sequence. Although such approaches are effective for short sequence searches, they are less effective when there is a need to process longer sequences, such as a page of text in a book. In this paper, we further investigate the KNN search problem from the viewpoint of enhancing efficiency.
In this paper, we develop a novel search framework which uses approximate n-grams as the filtering signatures. This allows us to use longer n-grams compared to exact matches, which in turn gives more accurate pruning since such matching is less likely to be random. We introduce two novel filtering techniques based on approximate n-grams by relaxing the filtering conditions. To ensure efficiency, we employ several strategies. First, we use a frequency queue (f-queue) to buffer the frequency of the approximate n-grams to support candidate selection. This can help to avoid frequent candidate verification. Second, we develop a novel search strategy by employing the paradigm of the CA method [6]. By using the summation of gram edit distances as the aggregation function, the CA strategy can enhance the KNN search by avoiding access to sequences with high dissimilarity. Third, we design a pipeline framework to support simple parallel processing. These strategies are implemented over a two-level inverted index. In the upper-level index, n-grams that are derived from the sequence database are stored in an inverted file with their references to the original sequences. In the lower-level index, each distinct n-gram from the upper level is further decomposed into smaller sub-units, and inverted lists are constructed to store the references to the upper-level grams for each sub-unit. Based on the index, the search framework has two steps.
In the first step, given a query sequence and its n-grams, similar n-grams within a range will be quickly returned using the lower-level index. In the second step, the n-grams returned from the lower level can be automatically used as the input to construct the sorted lists in the upper level. With the sorted lists, our proposed filtering strategies are employed to enhance the search procedure. Our contributions in this paper are summarized as follows:

• We introduce novel bounds for the sequence edit distance based on approximate n-grams. These bounds offer new opportunities for improving pruning effectiveness in sequence matching.

• We propose a novel KNN sequence search framework using several efficient strategies. The f-queue supports our proposed filtering techniques with a sequence buffer for candidate selection. The well-known CA strategy has an excellent property of early termination for scanning the inverted lists, and the pipeline strategy can effectively make use of parallel processing to speed up our search.

• We propose a pipeline search framework based on a two-level inverted index. By adopting a carefully staged processing that starts from searching the lower-level n-gram index and ends at the upper-level sorted list processing, we are able to find KNN for long sequences in an easily parallelizable manner.

• We conduct a series of experiments to compare our proposed filtering strategies with existing methods. The results show that our proposed filtering techniques have better pruning power, and the new filtering strategies can enhance existing filtering techniques.
The rest of this paper is organized as follows. Section 2 discusses related studies. Section 3 provides preliminary concepts and basic principles for the KNN search. Section 4 introduces the proposed filtering techniques. Section 5 illustrates several efficient strategies to support the KNN search. Section 6 presents the pipeline search framework with a two-level inverted index. We evaluate the proposed approaches with experimental results in Section 7 and conclude the paper in Section 8.
2. RELATED WORK
Similarity query based on edit distance is a well-studied problem (e.g., [12, 15, 26]). An extensive survey was conducted early on in [13]. Early algorithms are based on online sequential search, and mainly focus on speeding up the exact sequence edit distance (SED) computation. Among them, the most efficient algorithm requires O(|s|^2 / log |s|) time [12] for computing the SED, and only O(τ|s|) time for testing whether the SED is within some threshold τ [29]. However, online search algorithms still suffer from poor scalability in terms of string length or database size, since they need a full scan of the whole database. To overcome this drawback, most recent works follow a filter-and-refine framework. Many indexing techniques have been proposed to prune off most of the sequences before verifying the exact edit distances for a small set of candidates [14]. There are three main indexing ideas: enumerating, backtracking and partitioning.
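The O(τ|s|) threshold test mentioned above can be sketched with a banded dynamic program in the spirit of the verifier in [29]; this is an illustrative version under our own naming, not the authors' implementation. Only cells within a diagonal band of half-width τ are filled, so a row whose minimum already exceeds τ terminates the test early.

```python
def within_edit_distance(s1, s2, tau):
    """Return True iff the edit distance between s1 and s2 is <= tau.

    Only the band |i - j| <= tau of the DP table is computed,
    giving O(tau * |s1|) time instead of O(|s1| * |s2|).
    """
    if abs(len(s1) - len(s2)) > tau:              # length filtering
        return False
    INF = tau + 1
    prev = {j: j for j in range(min(len(s2), tau) + 1)}
    for i in range(1, len(s1) + 1):
        curr = {}
        lo, hi = max(0, i - tau), min(len(s2), i + tau)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i
                continue
            best = prev.get(j - 1, INF) + (s1[i - 1] != s2[j - 1])  # match/substitute
            best = min(best, prev.get(j, INF) + 1)                  # delete from s1
            best = min(best, curr.get(j - 1, INF) + 1)              # insert into s1
            curr[j] = best
        if min(curr.values()) > tau:              # early termination
            return False
        prev = curr
    return prev.get(len(s2), INF) <= tau
```

Any alignment path that leaves the band must cost more than τ, so restricting computation to the band cannot change the yes/no answer.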
The first idea is introduced for supporting specific queries when strings are very short or the edit distance threshold is small (e.g., [2, 24]). It is clear that enumeration usually has high space complexity and is often impractical in real query systems.
The second idea is based on branch-and-bound techniques on tree index structures. In [4, 22], a trie is used to index all strings in a dictionary. With a trie, all shared prefixes in the dictionary are collapsed into a single path, so they can be processed in the best order for computing the exact SEDs. Sub-trie pruning is employed to enhance the efficiency of computing the edit distance. However, building a trie for all strings is expensive in terms of both time and space complexity. In [29], a B+-tree index structure called the Bed-tree is proposed to support similarity queries based on edit distance. Although this index can be implemented on most modern database systems, it suffers from poor query performance since it has very weak filtering power.
To improve filtering effectiveness, most existing works employ the third idea, which splits original strings into several smaller signatures to reduce the approximate search problem to an exact signature match problem (e.g., [3, 7, 9, 10, 11, 16, 18, 20, 23, 27]). We further classify these methods, based on their preprocessing, into threshold-aware approaches and threshold-free approaches. The threshold-aware approaches have been developed mainly based on the prefix-filtering framework. Recent work in [23] performed a detailed study of these methods [11, 16, 23] and concluded that the prefix-filtering framework can be enhanced with an adaptive framework. These methods typically work well only for a fixed similarity threshold. If the threshold is not fixed, two choices exist. First, the index has to be built online for each query with a distinct threshold. This can be time consuming and is often impractical in real systems. Second, multiple indexes are constructed offline for all possible thresholds. This choice has high space complexity, especially for databases with long sequences, since there can be many distinct edit distance thresholds. The threshold-free approaches generally employ various n-gram based signatures. The basic idea is that if two strings are similar, they should share sufficiently many common signatures. Compared to the threshold-aware approaches, these methods generally have much less preprocessing time and space overhead for storing indexes. However, if we ignore the preprocessing phase, these methods have been shown to have worse performance for supporting edit distance similarity search [16]. This is because they often suffer from poor filtering effectiveness through the use of loose bounds.
Although most of such approaches have been shown to be efficient for approximate searching with a predefined threshold, limited progress has been made in addressing the KNN search problem. Existing efforts utilize two kinds of index mechanisms [21, 28, 29, 5]. The first index mechanism is adapted from inverted list based indexes [21, 28]. The KNN search algorithm employs the same intuition by selecting candidates with a sufficient number of common n-grams. The difference between them is the list merging technique. In [21], the MergeSkip algorithm is employed to reduce the inverted list processing time. A predefined threshold based algorithm is also proposed, which repeats the approximate string query multiple times to support KNN search. In [28], basic length filtering is used to improve list processing. The other index mechanism is based on tree structures [29, 5]. In [29], a B+-tree based index is proposed to index database sequences based on some sequence order. The tree nodes are iteratively traversed to update the lower bound of the edit distance, and the nodes beyond the bound are pruned. In the most recent work [5], an in-memory trie structure is used to index strings and share computations on common prefixes of strings. A range-based method is proposed that groups the pivotal entries to avoid duplicated computations in the dynamic programming matrix when the edit distance is computed. Although such approaches are effective for short sequence search, their performance degrades for long sequences, since the length of the common prefix is relatively short for long sequences, and the large number of long, single branches in the trie brings about large space and computation overhead.
3. PRELIMINARIES
Let Σ be a set of elements, e.g., a finite alphabet of characters in a string database or an infinite set of latitudes and longitudes in a trajectory database. We use s to denote a sequence in Σ* of length |s|, s[i] to denote the ith element, and s[i, j] to denote the subsequence of s from the ith element to the jth element. The common notations used in the rest of the paper are summarized in Table 1. In this paper, we employ edit distance as the measure of the dissimilarity between two sequences, which is formalized as follows.
Definition 1. (Sequence Edit Distance, SED) Given two sequences s1 and s2, the edit distance between them, denoted by λ(s1, s2), is the minimum number of primitive edit operations (i.e., insertions, deletions, and substitutions) on s1 that is necessary for transforming s1 into s2.
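Definition 1 corresponds to the textbook dynamic program below (a minimal reference sketch under our own naming; the optimized SED algorithms discussed in Section 2 are those of [12, 29]):

```python
def sed(s1, s2):
    """Sequence edit distance λ(s1, s2): the minimum number of
    insertions, deletions and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution / match
        prev = curr
    return prev[-1]

print(sed("introduction", "intraduction"))  # one substitution -> 1
```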
We focus on k-nearest neighbor (KNN) search based on the edit distance, following the formal definition below.

Problem 1. Given a query sequence q and a sequence database D = {s1, s2, ..., s|D|}, find k sequences {a1, a2, ..., ak} in D which are more similar to q than the other sequences; that is, for all si in D \ {aj | 1 ≤ j ≤ k}, λ(si, q) ≥ λ(aj, q).
3.1 KNN Sequence Search Using N-grams
In this section, we introduce important concepts and principles of sequence similarity search using n-grams, which is a common technique exploited in existing studies.
Table 1: Notations

Notation      Description
D             the sequence database
q             the query sequence
|s|           the length of sequence s
s[i]          the ith element of sequence s
Gs            the n-gram set of a sequence s
λ(s1, s2)     the edit distance between two sequences s1 and s2
λ(g1, g2)     the edit distance between two n-grams g1 and g2
µ(s1, s2)     the gram mapping distance between two sequences s1 and s2
ϕ             the frequency threshold value of n-grams
k             the k value for the KNN search
τ             the edit distance threshold
τ(t)          the threshold value computed by the aggregation function in the CA method
η(τ, t, n)    the number of n-grams affected by τ edit operations with gram edit distance > t
Definition 2. (n-gram) Given a sequence s and a positive integer n, a positional n-gram of s is a pair (i, g), where g is a subsequence of length n starting at the ith element, i.e., g = s[i, i + n − 1]. The set G(s, n) consists of all n-grams of s, obtained by sliding a window of length n over sequence s. In particular, there are |s| − n + 1 n-grams in G(s, n).
In this paper, we skip the positional information of the n-grams. The simplified 5-gram set of the sequence "introduction", for example, is {intro, ntrod, trodu, roduc, oduct, ducti, uctio, ction}. The n-gram set is useful in edit distance similarity evaluation, based on the following observation: if a sequence s2 can be transformed into s1 by τ primitive edit operations, then s1 and s2 must share at least ϕ = (max{|s1|, |s2|} − n + 1) − n × τ common n-grams [18].
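The observation above can be reproduced in a few lines (a sketch; the helper names are ours). With s2 obtained from s1 = "introduction" by one substitution (τ = 1, n = 5), the two gram multisets must share at least ϕ = (12 − 5 + 1) − 5 × 1 = 3 n-grams:

```python
from collections import Counter

def ngrams(s, n):
    """The n-gram multiset of s (positional information dropped)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def common_ngrams(s1, s2, n):
    """Number of common n-grams, counted as a multiset intersection."""
    return sum((ngrams(s1, n) & ngrams(s2, n)).values())

s1, s2, n, tau = "introduction", "intraduction", 5, 1  # one substitution
phi = (max(len(s1), len(s2)) - n + 1) - n * tau        # (12 - 5 + 1) - 5 = 3
assert common_ngrams(s1, s2, n) >= phi                 # shares ducti, uctio, ction
```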
Algorithm 1 A Simple KNN Sequence Search Algorithm
Require: The n-gram lists LG for q, and k
1:  Initialize a max-heap H using the first k visited sequences;
2:  for Li ∈ LG do
3:    for all unprocessed sj ∈ Li do
4:      frequency[sj]++;
5:      τ = max{λs | s ∈ H};
6:      ϕ = max{|sj|, |q|} − n + 1 − n × τ;
7:      if frequency[sj] ≥ ϕ then
8:        Compute the edit distance λ(sj, q);
9:        if λ(sj, q) < τ then
10:         Update and maintain the max-heap H;
11:       Mark sj as a processed sequence;
12: Output the k sequences in H;
Inverted indexes on the n-grams of the sequences are commonly used, such that references to the original locations of the same n-gram are kept in a list structure. Algorithm 1 shows a typical threshold-based algorithm using the inverted index on the n-grams as well as an auxiliary heap structure. This algorithm dynamically updates the frequency threshold using the maximum edit distance maintained in a max-heap H (lines 6 - 7). The query performance depends on the efficiency of two operations: the inverted list scan and the edit distance computation for the candidate verification (lines 3 - 11).
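Algorithm 1 can be sketched in a few lines of Python. This is a toy, in-memory rendition: the inverted lists are rebuilt per query here, whereas the paper assumes a prebuilt index, and all helper names are ours.

```python
import heapq

def sed(s1, s2):
    """Plain DP for the sequence edit distance λ."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def gram_lists(db, q, n):
    """One inverted list per query n-gram: ids of sequences containing it."""
    return [[sid for sid in sorted(db) if q[i:i + n] in db[sid]]
            for i in range(len(q) - n + 1)]

def knn_search(db, q, k, n):
    """Count filtering with a dynamically updated threshold (Algorithm 1)."""
    ids = sorted(db)
    H = [(-sed(db[sid], q), sid) for sid in ids[:k]]  # max-heap via negation
    heapq.heapify(H)
    processed, freq = set(ids[:k]), {}
    for L in gram_lists(db, q, n):                    # line 2: scan each list
        for sid in L:                                 # line 3: unprocessed entries
            if sid in processed:
                continue
            freq[sid] = freq.get(sid, 0) + 1          # line 4
            tau = -H[0][0]                            # line 5: worst distance in H
            phi = max(len(db[sid]), len(q)) - n + 1 - n * tau   # line 6
            if freq[sid] >= phi:                      # line 7: count filtering
                lam = sed(db[sid], q)                 # line 8: verify
                if lam < tau:
                    heapq.heapreplace(H, (-lam, sid)) # line 10: update H
                processed.add(sid)                    # line 11
    return sorted((-d, sid) for d, sid in H)          # line 12: (λ, id) pairs

db = {0: "introduction", 1: "introspection", 2: "zzzzzzzz", 3: "intraduction"}
print(knn_search(db, "introduction", k=2, n=3))  # [(0, 0), (1, 3)]
```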
Algorithm 1 could be improved by using optimization strategies, such as length filtering [28] and MergeSkip [21]. The intuition behind length filtering is as follows: if two sequences are within an edit distance of τ, their length difference is no larger than τ. Therefore, the inverted list scan is restricted to the sequences within the length constraint. Inverted lists are thus sorted in ascending order of the sequence length. On the other hand, the MergeSkip strategy preprocesses inverted lists such that the references are sorted in ascending order of the sequence identification number. When the maximum entry in the max-heap H is updated, it is used to compute a new frequency threshold ϕ, and those unprocessed sequences with frequencies less than ϕ are skipped. As an example, in Figure 2, sequence no. 10 is first visited and pushed onto the top-1 heap. The temporary frequency threshold is computed as ϕ = 3, and the candidate for the next visit is sequence no. 35. In this way, sequences 20 and 30 are skipped as their frequencies are less than 3.
Figure 2: Illustration of the MergeSkip strategy
Although such approaches may somewhat improve the efficiency of list processing, they may have limited performance since they rely strictly on the efficient processing of inverted lists. For example, length filtering can be useless in a database where most sequences are around the same length. In Figure 2, the top-1 heap is updated when sequence no. 50 is visited, the new frequency threshold is ϕ = 5, and the next visiting candidate is sequence no. 45. In this case, no sequence may be skipped. The reason is that the sequences from 35 to 45, located in the grey area, may have been processed, as the frequencies of their matched n-grams are larger than 3. As the frequency threshold is a loose bound that can generate too many false positives, candidate verification becomes the most time consuming step.
4. NEW FILTERING THEORY
Due to the limited pruning effectiveness of exact n-gram matching, we aim to develop new theories for sequence search filtering by using approximate matching between the n-grams of the two sequences. This is motivated by the observation that using exact n-gram matching typically requires n to be small (so that the probability of having an exact match is not too low), which in turn lowers the selectivity of the n-grams. By allowing approximate matching for these n-grams, we can increase the size of n without compromising the chance of a match taking place, thereby increasing the selectivity of the n-grams and reducing the length of the inverted lists to be scanned. As shown in Definition 2, an n-gram is a subsequence of the original sequence. Consequently, the gram edit distance is computed as the sequence edit distance between two n-grams.
Count filtering is the first pruning strategy we design based on the gram edit distance. It is an extension of the existing count filtering on exact n-gram matchings. Basically, we want to estimate the maximal number of n-grams modified by τ edit operations such that the gram edit distance between the affected n-gram and the queried n-gram is larger than a certain value t (t ≥ 0). This leads to the new count filtering using approximate n-grams, as shown in the following proposition.
Proposition 1. Consider two sequences s1 and s2. If s1 and s2 are within an edit distance of τ, then s1 and s2 must have at most η(τ, t, n) = max{1, n − 2 × t} + (n − t)(τ − 1) n-grams with gram edit distance > t, where t < n.
Proof. Let t = 0. Then η(τ, 0, n) = max{1, n − 2 × 0} + (n − 0) × (τ − 1) = n × τ. Intuitively, this holds because one edit operation can modify at most n n-grams. Consequently, τ edit operations can modify at most n × τ n-grams (i.e., there are at most n × τ n-grams between s1 and s2 with gram edit distance > 0).
Let t ≥ 1. We first analyze the effect of edit operations on the n-grams with a certain gram edit distance (GED). We consider the first edit operation in two cases: it is applied at a position not within the first or last n − 1 n-grams (Case 1), or it is applied within the first or last n − 1 n-grams (Case 2). As shown in Figure 3, in Case 1, one edit operation is applied at the position in the pink box. The two types of edit operations affect n-grams with different distance distributions. Obviously, one substitution will cause n n-grams to have GED = 1, while one insertion or deletion will cause one new n-gram and n − 1 n-grams of various GEDs. Consequently, the upper bound value of η(τ, t, n) must account for at least one n-gram with GED = n. We now show the distribution of the GEDs. As shown in the figure, the two 5-grams g1 and g5 have GED = 1 in Figure 3(a). However, the two 5-grams g2 and g4 can have GED ≥ 2. Generally, one such operation can cause at most n − 2 × t n-grams to have GED > t. Remember that there is at least one newly derived n-gram of GED = n. Therefore, an upper bound on the number of affected n-grams with GED > t is max{1, n − 2 × t}. In Case 2, one edit operation is applied to the first or last n − 1 n-grams. The total number of affected n-grams, denoted by n', is less than n, and the number of affected n-grams with GED > t is less than that of Case 1. Hence Case 1 yields an upper bound on the number of affected n-grams when an insertion or deletion operation is applied.
We now show how the distribution of edit operations affects the maximum number of n-grams with GED > t. Suppose E = {e1, e2, ..., eτ} is a series of edit operations needed to transform one sequence into the other. Suppose the τ edit operations are evenly distributed in the sequence, so that no n-gram is simultaneously affected by multiple edit operations. In this case, the number of affected n-grams is maximized. As analyzed above, one edit operation will affect at most max{1, n − 2 × t} n-grams to have GED > t. This is the boundary case where the edit operation is the first or the last operation. It is clear that the number of affected n-grams with GED > t, to the left of the first edit operation and to the right of the last edit operation, is at most max{1, n − 2 × t}. For each of the remaining τ − 1 edit operations, one new operation will cause n − t newly affected

Figure 3: Effect of edit operations (substitutions and insertions) on the n-grams g1, ..., g5; panels (a)-(f) cover Case 1 and Case 2
Figure 4: An example of the count filtering (inverted lists L0-L7 over sequences 10, 20, 30, 40, with frequency tables for GED = 0, 1, 2)
n-grams ahead of its previous edit position, as the boundary position will be affected only in this case. Consequently, the maximum number of affected n-grams with GED > t would be η(τ, t, n) = max{1, n − 2 × t} + (n − t)(τ − 1).
Lemma 1. Consider two sequences s1 and s2. If s1 and s2 are within an edit distance of τ, then s1 and s2 must share at least ϕt(s1, s2) = |s| − n + 1 − η(τ, t, n) n-grams with gram edit distance ≤ t. Here, |s| is equal to max{|s1|, |s2|}.
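Proposition 1 and Lemma 1 reduce to two one-line helpers (a sketch; the function names are ours). Setting t = 0 recovers the classical exact-matching bound n × τ from Section 3:

```python
def eta(tau, t, n):
    """η(τ, t, n): max number of n-grams with gram edit distance > t
    after τ edit operations (Proposition 1)."""
    return max(1, n - 2 * t) + (n - t) * (tau - 1)

def phi_t(len1, len2, tau, t, n):
    """ϕ_t: min number of n-grams two sequences within edit distance τ
    must share with gram edit distance <= t (Lemma 1)."""
    return max(len1, len2) - n + 1 - eta(tau, t, n)

print(eta(3, 0, 5))            # max(1, 5) + 5 * 2 = 15, i.e. n * tau
print(phi_t(12, 12, 1, 0, 5))  # 8 - 5 = 3
```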
The proposed count filtering offers new opportunities to improve the search performance, as it has stronger filtering ability. As shown in Figure 4, no sequence is pruned using the count filtering with common n-grams of ϕ0 = 0. By using the count filtering with n-grams of GED = 1, sequence no. 40 can be pruned by ϕ1, as its frequency (i.e., Freq.) of n-grams with GED ≤ 1 is less than ϕ1. Similarly, sequence no. 10 is pruned by using the count filtering of ϕ2.
Mapping filtering is a more complicated pruning strategy, but provides more effective pruning based on the gram edit distance. To begin with, we first define the distance between two multi-sets of n-grams.
Definition 3. (Gram Mapping Distance, GMD) Given two gram multi-sets Gs1 and Gs2 of s1 and s2, respectively, with the same cardinality, the mapping distance between s1 and s2 is defined as the sum of distances of the optimal mapping between their gram multi-sets, and is computed as

µ(s1, s2) = min_P Σ_{gi ∈ Gs1} λ(gi, P(gi)),  where P : Gs1 → Gs2 is a bijection.
The computation of the gram mapping distance is accomplished by finding an optimal mapping between the two gram multi-sets. Similar to the work in [25], we can construct a weighted matrix for each pair of grams from the two sequences, and apply the Hungarian algorithm [8, 19]. Based on the gram mapping distance, we show how a tighter lower bound on the edit distance between two sequences can be achieved.
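For tiny gram sets, the optimal mapping of Definition 3 can be found by brute force over all bijections (illustration only, with our own naming; the Hungarian algorithm [8, 19] replaces this factorial enumeration in practice):

```python
from itertools import permutations

def ged(g1, g2):
    """Gram edit distance: plain edit-distance DP between two n-grams."""
    prev = list(range(len(g2) + 1))
    for i, c1 in enumerate(g1, 1):
        curr = [i]
        for j, c2 in enumerate(g2, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def gmd(grams1, grams2):
    """Gram mapping distance µ: cost of the best bijection between two
    equal-cardinality gram multisets (O(m!) -- toy sizes only)."""
    assert len(grams1) == len(grams2)
    return min(sum(ged(g, p) for g, p in zip(grams1, perm))
               for perm in permutations(grams2))

print(gmd(["intro", "ntrod"], ["intro", "ntrad"]))  # identity mapping costs 1
```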
Lemma 2. Given two sequences s1 and s2, the gram mapping distance µ(s1, s2) between s1 and s2 satisfies

µ(s1, s2) ≤ (3n − 2) × λ(s1, s2).
Proof. Let E = {e1, e2, ..., eK} be a minimum series of edit operations transforming s1 into s2, so that K = λ(s1, s2). Accordingly, there is a set of sequences s1 = M0 → M1 → ... → MK = s2, where Mi−1 → Mi indicates that Mi is derived from Mi−1 by performing ei, for 1 ≤ i ≤ K. Assume there are K1 insertion operations, K2 deletion operations and K3 substitution operations; then K = K1 + K2 + K3. We analyze the detailed influence of each type of edit operation as follows.
Insertion operation: When a character is inserted into the sequence Mi−1, at most n n-grams are affected. The edit distance is at most 2 for the (n − 1) changed n-grams, and n for the one newly inserted n-gram. Thus, we conclude that µ(Mi−1, Mi) ≤ 2(n − 1) + n = 3n − 2.
Deletion operation: When one character is deleted from the sequence Mi−1, a total of n n-grams may be affected. The edit distance is at most 2 for the (n − 1) changed n-grams, and n for the one deleted n-gram. Thus, in the case of deleting one character, µ(Mi−1, Mi) ≤ 2(n − 1) + n = 3n − 2.
Substitution operation: When a character in sequence Mi−1 is substituted by another character, a total of n n-grams are affected. The edit distance for each affected n-gram is equal to 1, and thus we have µ(Mi−1, Mi) ≤ n.
By analyzing the effect of the above three operations, we conclude that the GMD and the SED have the following relationship:

µ(s1, s2) ≤ (3n − 2) × K1 + (3n − 2) × K2 + n × K3
          ≤ (3n − 2) × (K1 + K2 + K3)
          = (3n − 2) × λ(s1, s2)
Lemma 2 naturally gives us a new lower bound estimation method for the sequence edit distance. Given two sequences s1 and s2, and an edit distance threshold τ, if µ(s1, s2)/(3n − 2) > τ, then λ(s1, s2) > τ. While the bound is effective, it remains computationally expensive if we directly apply it for pre-pruning. In this work, we instead employ this bound function to compute the aggregation value in the CA filtering algorithms. That is, we use the summation of gram edit distances as the aggregation function, instead of directly computing the mapping distance. We will introduce new filtering strategies and algorithmic frameworks to make these theories practical.
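The resulting pre-pruning rule is simple: a candidate can be discarded whenever µ(s1, s2)/(3n − 2) > τ. A self-contained sketch (brute-force GMD over equal-length toy sequences; all helper names are ours):

```python
from itertools import permutations

def ged(g1, g2):
    """Edit distance between two n-grams (plain DP)."""
    prev = list(range(len(g2) + 1))
    for i, c1 in enumerate(g1, 1):
        curr = [i]
        for j, c2 in enumerate(g2, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def grams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def gmd(G1, G2):
    """Brute-force gram mapping distance (toy sizes only)."""
    return min(sum(ged(g, p) for g, p in zip(G1, perm))
               for perm in permutations(G2))

def can_prune(s1, s2, n, tau):
    """Lemma 2 pruning rule: if µ/(3n - 2) > τ then λ(s1, s2) > τ."""
    return gmd(grams(s1, n), grams(s2, n)) / (3 * n - 2) > tau

print(can_prune("abcdef", "uvwxyz", 3, 1))  # far apart: pruned
print(can_prune("abcdef", "abcdxf", 3, 1))  # one substitution: kept
```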
5. FILTERING ALGORITHMS
Based on the filtering theories derived in the previous section, we introduce new algorithms to support efficient filtering. Given a query sequence q, we assume that there are existing inverted lists that support efficient search on the n-grams under a specified edit distance constraint, as shown in Figures 5 and 6, with LG = {L0, L1, ..., L|q|−n}.

References
Journal ArticleDOI

The Hungarian method for the assignment problem

TL;DR: This paper has always been one of my favorite children, combining as it does elements of the duality of linear programming and combinatorial tools from graph theory, and it may be of some interest to tell the story of its origin in this article.

Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Proceedings ArticleDOI

Optimal aggregation algorithms for middleware

TL;DR: An elegant and remarkably simple algorithm is analyzed that is optimal in a much stronger sense than FA, and is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability sense, but over every database.
Journal ArticleDOI

A faster algorithm computing string edit distances

TL;DR: An algorithm is described for computing the edit distance between two strings of length n and m, n ≥ m, which requires O(n · max(1, m/log n)) steps whenever the costs of edit operations are integral multiples of a single positive real number and the alphabet for the strings is finite.
Frequently Asked Questions (16)
Q1. What contributions have the authors mentioned in the paper "Efficient and effective knn sequence search with approximate n-grams" ?

In this paper, the authors address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance. Unlike most existing works, which use short and exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, their new approach allows longer but approximate n-gram matchings to serve as a basis for pruning KNN candidates. Based on this new idea, the authors devise a pipeline framework over a two-level index for searching KNN in the sequence database. By coupling this framework with several efficient filtering strategies, i.e. the frequency queue and the well-known Combined Algorithm (CA), their proposal brings various enticing advantages over existing works, including 1) a huge reduction in false positive candidates, avoiding large overheads on candidate verification; 2) progressive result update and early termination; and 3) good extensibility to parallel computation. The authors conduct extensive experiments on three real datasets to verify the superiority of the proposed framework.

As a conclusion, their proposed filtering strategies show excellent performance on the KNN search, and the pipeline framework is easy to extend to parallel computation. 

The f-queue can be used to improve the performance of existing algorithms based on the length filtering or the MergeSkip strategy. 

Edit distance is commonly used in similarity search on large sequence databases, due to its robustness to typical errors in sequences like misspelling [13]. 

To avoid large overheads in list processing, the authors use the CA-based strategy [6] with the summation of gram edit distances as the aggregation function.

With a trie, all shared prefixes in the dictionary are collapsed into a single path, so the sequences can be processed in the best order for computing the exact SEDs.

When the maximum entry in the max-heap H is updated, it is used to compute a new frequency threshold ϕ, and those unprocessed sequences with frequencies less than ϕ are skipped. 
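The threshold update described above can be sketched with a max-heap over the current top-k distances, combined with the classic count-filtering bound phi = |G(q)| − n·τ (an edit operation destroys at most n grams). The function name, the candidate-list layout, and dist_fn below are illustrative, not the paper's exact routine.

```python
import heapq

def topk_with_count_filter(candidates, k, num_grams, n, dist_fn):
    """Maintain the current top-k edit distances in a max-heap
    (heapq is a min-heap, so distances are negated).  Whenever the
    k-th distance tau changes, a new frequency threshold
    phi = num_grams - n * tau is derived, and remaining candidates
    with fewer shared grams are skipped.  `candidates` is a list of
    (shared_gram_freq, seq_id) pairs sorted by decreasing frequency."""
    heap = []   # entries are (-distance, seq_id)
    phi = 0     # frequency threshold; 0 means no pruning yet
    for freq, sid in candidates:
        if len(heap) >= k and freq < phi:
            break  # all remaining candidates have freq < phi
        d = dist_fn(sid)
        if len(heap) < k:
            heapq.heappush(heap, (-d, sid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, sid))
        if len(heap) == k:
            tau = -heap[0][0]                  # current k-th distance
            phi = max(0, num_grams - n * tau)  # count-filtering bound
    return sorted((-nd, sid) for nd, sid in heap)
```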

When the authors do sorted access to the list L4, each value of ti with i = 0, 1, ..., 4 is set to 2, as no entry has a distance of 1, and 2 is the smallest score that can be obtained for unseen elements.

The CA strategy is used to terminate the whole process if the CA threshold value of the gram edit distance summation is larger than the temporary threshold computed from the top-k heap. 

The results indicate that combining the MergeSkip with the length filter can help to reduce the candidate size and improve the query performance. 

Given two sequences s1 and s2, the edit distance between them, denoted by λ(s1, s2), is the minimum number of primitive edit operations (i.e., insertion, deletion, and substitution) on s1 that are necessary for transforming s1 into s2.
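This definition corresponds to the standard dynamic-programming computation; a minimal sketch (not the paper's optimized verification routine):

```python
def edit_distance(s1, s2):
    """Standard DP for the edit distance lambda(s1, s2), keeping
    only the previous row to use O(len(s2)) space."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from s1[:0] to every prefix of s2
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n]
```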

While existing approaches often suffer from poor filtering power and low query performance when sequences in the database are long, the authors tackle the problem by designing a novel filter-and-refine pipeline approach utilizing approximate n-gram matchings.

Although this index can be implemented on most modern database systems, it suffers from poor query performance since it has a very weak filtering power. 

When the k value is as small as 1, Flamingo can run efficiently as it only needs to execute a range query once to obtain the top-1 result. 

This algorithm dynamically updates the frequency threshold using the maximum edit distance maintained in a max-heap H (lines 6 - 7). 

The intuition behind length filtering is as follows: if two sequences are within an edit distance of τ, their length difference is no larger than τ.
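The length filter follows because each edit operation changes a sequence's length by at most one; as a one-line predicate (names are illustrative):

```python
def length_filter(s1, s2, tau):
    """Keep a pair only if it can possibly be within edit distance tau:
    each edit operation changes the length by at most one, so a length
    difference greater than tau already rules the pair out."""
    return abs(len(s1) - len(s2)) <= tau
```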