Efficient and effective KNN sequence search with approximate n-grams
Citations
String similarity search and join: a survey
Approximate keyword search in semantic trajectory database
Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search
A Transformation-Based Framework for KNN Set Similarity Search
A unified framework for string similarity search with edit-distance constraint
References
The Hungarian method for the assignment problem
A guided tour to approximate string matching
Optimal aggregation algorithms for middleware
A faster algorithm computing string edit distances
Frequently Asked Questions (16)
Q2. What future work have the authors mentioned in the paper "Efficient and effective KNN sequence search with approximate n-grams"?
In conclusion, the proposed filtering strategies show excellent performance on KNN search, and the pipeline framework is easy to extend to parallel computation.
Q3. What is the purpose of the f-queue?
The f-queue can be used to improve the performance of existing algorithms based on the length filtering or the MergeSkip strategy.
Q4. What is the common use of edit distance?
Edit distance is commonly used in similarity search on large sequence databases, due to its robustness to typical errors in sequences like misspelling [13].
Q5. How do the authors avoid having large overhead on list processing?
To avoid having large overhead on list processing, the authors use the CA based strategy [6] and use the summation of gram edit distances as the aggregation function.
Q6. What is the way to prune off all the prefixes in a dictionary?
With a trie, all shared prefixes in the dictionary are collapsed into a single path, so the authors can process them in the best order for computing the exact SEDs.
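The prefix sharing described above can be illustrated with a small sketch. It is not the authors' procedure: it only shows the general idea of computing plain edit distances along a trie so that each shared prefix is processed once. The names `build_trie`, `trie_edit_distances`, and the `END` marker are hypothetical and chosen for illustration.

```python
END = "$end"  # hypothetical marker key; longer than one character, so it cannot clash with a symbol


def build_trie(words):
    """Build a nested-dict trie; shared prefixes collapse into one path."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word  # remember which word ends at this node
    return root


def trie_edit_distances(query, words):
    """Return {word: edit distance to query}, reusing DP rows along shared prefixes."""
    results = {}

    def visit(node, row):
        # row[j] is the edit distance between the current trie prefix and query[:j]
        if END in node:
            results[node[END]] = row[-1]
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]  # prefix grew by one symbol, query part is empty
            for j, q in enumerate(query, start=1):
                new_row.append(min(
                    row[j] + 1,              # drop the new trie symbol
                    new_row[j - 1] + 1,      # add the query symbol
                    row[j - 1] + (ch != q),  # match or substitute
                ))
            visit(child, new_row)

    visit(build_trie(words), list(range(len(query) + 1)))
    return results
```

In this sketch the DP row for a prefix such as "kit" is computed once and reused by every dictionary entry that starts with it.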
Q7. What is the frequency threshold used to update the max-heap H?
When the maximum entry in the max-heap H is updated, it is used to compute a new frequency threshold ϕ, and those unprocessed sequences with frequencies less than ϕ are skipped.
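A minimal sketch of this kind of dynamic thresholding follows. It makes several assumptions that do not come from the paper: the frequency of a sequence is taken to be the number of n-grams it shares with the query, ϕ is computed with the classical count-filter bound ϕ = |q| - n + 1 - n·τ (where τ is the current maximum distance in H), and the function names are hypothetical. The authors' exact formula and list processing may differ.

```python
import heapq
from collections import Counter


def ngrams(s, n):
    """Multiset of the n-grams of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def edit_distance(s1, s2):
    """Two-row dynamic program for the edit distance."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def knn_search(query, database, k, n=2):
    """Return the k nearest sequences as sorted (distance, sequence) pairs."""
    q_grams = ngrams(query, n)
    # Assumed frequency: size of the multiset intersection of n-grams with the query.
    scored = [(sum((q_grams & ngrams(s, n)).values()), s) for s in database]
    scored.sort(key=lambda pair: -pair[0])  # most promising sequences first

    heap = []  # max-heap H, simulated with negated distances
    for freq, s in scored:
        if len(heap) == k:
            tau = -heap[0][0]                     # maximum distance currently in H
            phi = len(query) - n + 1 - n * tau    # assumed count-filter threshold
            if freq < phi:
                break  # every remaining sequence has an even lower frequency
        d = edit_distance(query, s)
        if len(heap) < k:
            heapq.heappush(heap, (-d, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, s))  # tighter maximum, so phi rises next round
    return sorted((-neg_d, s) for neg_d, s in heap)
```

Each time a closer sequence replaces the maximum entry of the heap, τ shrinks and ϕ grows, so more of the unprocessed sequences can be skipped without computing their edit distance.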
Q8. What is the smallest score that can be obtained for unseen elements?
When the authors perform sorted access on the list L4, each value of ti with i = 0, 1, ..., 4 is set to 2, since no entry has a distance of 1 and 2 is therefore the smallest score that can be obtained for unseen elements.
Q9. What is the CA strategy used to terminate the whole process?
The CA strategy is used to terminate the whole process if the CA threshold value of the gram edit distance summation is larger than the temporary threshold computed from the top-k heap.
Q10. What is the way to reduce the candidate size?
The results indicate that combining the MergeSkip with the length filter can help to reduce the candidate size and improve the query performance.
Q11. What is the edit distance between s1 and s2?
Given two sequences s1 and s2, the edit distance between them, denoted by λ(s1, s2), is the minimum number of primitive edit operations (i.e., insertion, deletion, and substitution) on s1 that is necessary for transforming s1 into s2.
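The definition above can be made concrete with the standard dynamic program for edit distance. This is a textbook sketch with unit costs for all three operations, not code from the paper, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty sequence takes i deletions
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            curr.append(min(
                prev[j] + 1,         # delete a from s1
                curr[j - 1] + 1,     # insert b into s1
                prev[j - 1] + cost,  # substitute a with b, or keep a match
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3: two substitutions and one insertion.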
Q12. How does the pipeline approach solve the problem of k-nearest neighbor sequence search?
While existing approaches often suffer from poor filtering power and low query performance when sequences in the database are long, the authors tackle the problem by designing a novel filter-and-refine pipeline approach utilizing approximate n-gram matchings.
Q13. What is the main reason why the Bed-tree index is not implemented?
Although this index can be implemented on most modern database systems, it suffers from poor query performance since it has a very weak filtering power.
Q14. How many times can Flamingo run a range query?
When the k value is as small as 1, Flamingo can run efficiently as it only needs to execute a range query once to obtain the top-1 result.
Q15. How does the algorithm update the frequency threshold?
This algorithm dynamically updates the frequency threshold using the maximum edit distance maintained in a max-heap H (lines 6 - 7).
Q16. What is the intuition behind length filtering?
The intuition behind length filtering is as follows: if two sequences are within an edit distance of τ, their length difference is no larger than τ.
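As a minimal sketch of this idea, the filter below discards every candidate whose length differs from the query's by more than τ, since each edit operation changes the length by at most one. The function name `length_filter` and its parameters are hypothetical, not taken from the paper.

```python
def length_filter(query, candidates, tau):
    """Keep only candidates that could still be within edit distance tau of query."""
    return [s for s in candidates if abs(len(s) - len(query)) <= tau]
```

With `query = "abcd"` and `tau = 2`, a candidate of length 1 or 8 is pruned without any edit distance computation, while candidates of length 3 or 6 survive and still need to be verified.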