Open Access Proceedings Article

Fast error-tolerant search on very large texts

Marjan Celikik, +1 more
- pp 1724-1731
Abstract
We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem arises naturally in error-tolerant full-text search of the following kind: for a given query, return not only documents matching the query words exactly but also those matching their spelling variants. This is the inverse of the well-known "Did you mean: … ?" web search engine feature, where the error tolerance is on the side of the query and not on the side of the documents.

We combine various ideas from the large body of literature on approximate string searching and spelling correction techniques into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
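
To make the problem statement concrete, here is a minimal Python sketch of a naive baseline. It is not the algorithm from the paper, which the abstract does not describe. It assumes that two words count as spelling variants when their Levenshtein distance is at most a small threshold; that similarity measure, the threshold `max_dist`, and the function names `edit_distance` and `spelling_clusters` are all illustrative assumptions.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spelling_clusters(lexicon: List[str], max_dist: int = 1) -> Dict[str, List[str]]:
    """Map each word to the (possibly overlapping) cluster of its variants.

    Naive quadratic baseline for illustration only (assumed similarity
    measure: Levenshtein distance <= max_dist).
    """
    clusters: Dict[str, List[str]] = {}
    for w in lexicon:
        clusters[w] = [
            v for v in lexicon
            # Cheap length filter: the distance is at least the length gap.
            if abs(len(v) - len(w)) <= max_dist
            and edit_distance(w, v) <= max_dist
        ]
    return clusters


lexicon = ["algorithm", "algoritm", "algorithms", "search", "serach"]
clusters = spelling_clusters(lexicon, max_dist=1)
```

Under these assumptions, an error-tolerant query for "algorithm" would be expanded to every word in `clusters["algorithm"]`. The clusters overlap: "algorithm" belongs to the clusters of both "algoritm" and "algorithms", although those two words are not within distance 1 of each other. Comparing all pairs is quadratic in the lexicon size, which is the kind of cost a lexicon of roughly 10 million words cannot afford.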


Citations
Journal Article

Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences

TL;DR: In this article, the authors survey recent research on extending document retrieval techniques to a broader class of sequence collections and uncover a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Journal Article

Trie-join: a trie-based method for efficient string similarity joins

TL;DR: This paper designs efficient trie-join algorithms and pruning techniques to achieve high performance and shows that these algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.
Journal Article

Efficient fuzzy full-text type-ahead search

TL;DR: This paper studies a new information-access paradigm, called “type-ahead search”, in which the system searches the underlying data “on the fly” as the user types in query keywords, and develops novel techniques to support fuzzy search by allowing mismatches between query keywords and answers.
Book Chapter

Efficient similarity search in very large string sets

TL;DR: The State Set Index (SSI) is introduced, based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton, and implements a novel state labeling strategy making the index highly space-efficient.
Journal Article

Managing misspelled queries in IR applications

TL;DR: A comparative analysis of the efficacy of two possible strategies based on the use of character n-grams as the basic indexing unit, which guarantees the robustness of the information retrieval process whilst at the same time eliminating the need for a specific query correction stage.
References
Journal Article

Techniques for automatically correcting words in text

Karen Kukich
TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. The article also surveys documented findings on spelling error patterns.
Journal Article

Searching in metric spaces

TL;DR: A unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework, and presents a quantitative definition of the elusive concept of "intrinsic dimensionality".
Journal Article

Learning string-edit distance

TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Proceedings Article

Scaling up all pairs similarity search

TL;DR: This work proposes a simple algorithm based on novel indexing and optimization strategies that solves the problem of finding all pairs of vectors whose similarity score is above a given threshold without relying on approximation methods or extensive parameter tuning.
Proceedings Article

A Primitive Operator for Similarity Joins in Data Cleaning

TL;DR: This paper proposes a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity.