Fast error-tolerant search on very large texts

doi:10.1145/1529282.1529669

Open Access Proceedings Article

Marjan Celikik, +1 more

- pp 1724-1731

TLDR

This work combines various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space.

Abstract:

We consider the following spelling variants clustering problem: Given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words which are spelling variants of each other. This problem naturally arises in the context of error-tolerant full-text search of the following kind: For a given query, return not only documents matching the query words exactly but also those matching their spelling variants. This is the inverse of the well-known "Did you mean: … ?" web search engine feature, where the error tolerance is on the side of the query, and not on the side of the documents.

We combine various ideas from the large body of literature on approximate string searching and spelling correction techniques into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
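To make the problem statement concrete, here is a minimal brute-force sketch of spelling variants clustering and of the query expansion it enables. It is not the algorithm from the paper, whose details the abstract does not give. It assumes that "spelling variant" means "within a small Levenshtein distance", and the names `edit_distance`, `variant_clusters`, `expand_query`, and the `max_dist` parameter are hypothetical. It compares all pairs of words, so it would be far too slow for a lexicon of 10 million words.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def variant_clusters(lexicon, max_dist=1):
    """Map each word to the set of lexicon words within max_dist edits.

    Assumption: closeness in edit distance stands in for "spelling variant".
    Each word's cluster contains the word itself, and clusters may overlap.
    """
    words = sorted(set(lexicon))
    clusters = {w: {w} for w in words}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            # Cheap filter: a length gap larger than max_dist rules out a match.
            if abs(len(u) - len(v)) > max_dist:
                continue
            if edit_distance(u, v) <= max_dist:
                clusters[u].add(v)
                clusters[v].add(u)
    return clusters


def expand_query(query, clusters):
    """Replace each query word by the sorted list of its spelling variants."""
    return [sorted(clusters.get(w, {w})) for w in query.split()]


if __name__ == "__main__":
    lexicon = ["algorithm", "algoritm", "algorithms", "search", "serach"]
    clusters = variant_clusters(lexicon)
    print(clusters["algorithm"])
    print(expand_query("algoritm search", clusters))
```

In this toy lexicon the clusters overlap: "algoritm" and "algorithms" both belong to the cluster of "algorithm", yet they are two edits apart and so are not in each other's cluster. Plain Levenshtein distance also counts the transposition in "serach" as two edits, so it is not matched to "search" at `max_dist=1`.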

Citations

Journal Article

Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences

Gonzalo Navarro

- 01 Mar 2014 -

ACM Computing Surveys

TL;DR: In this article, the authors cover the recent research in extending the document retrieval techniques to a broader class of sequence collections and uncover a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.

Journal Article

Trie-join: a trie-based method for efficient string similarity joins

Jianhua Feng, +2 more

TL;DR: This paper designs efficient trie-join algorithms and pruning techniques to achieve high performance and shows that these algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.

Journal Article

Efficient fuzzy full-text type-ahead search

Guoliang Li, +3 more

TL;DR: This paper studies a new information-access paradigm, called “type-ahead search” in which the system searches the underlying data “on the fly” as the user types in query keywords, and develops novel techniques to support fuzzy search by allowing mismatches between query keywords and answers.

Book Chapter

Efficient similarity search in very large string sets

Dandy Fenz, +4 more

TL;DR: The State Set Index (SSI) is introduced, based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton, and implements a novel state labeling strategy making the index highly space-efficient.

Journal Article

Managing misspelled queries in IR applications

Jesús Vilares, +2 more

- 01 Mar 2011 -

Information Processing and Management

TL;DR: A comparative analysis of the efficacy of two possible strategies based on the use of character n-grams as the basic indexing unit, which guarantees the robustness of the information retrieval process whilst at the same time eliminating the need for a specific query correction stage.

References

Journal Article

Techniques for automatically correcting words in text

Karen Kukich

- 01 Dec 1992 -

ACM Computing Surveys

TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. The article also surveys documented findings on spelling error patterns.

Journal Article

Searching in metric spaces

Edgar Chávez, +3 more

- 01 Sep 2001 -

ACM Computing Surveys

TL;DR: A unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework, and presents a quantitative definition of the elusive concept of "intrinsic dimensionality".

Journal Article

Learning string-edit distance

Eric Sven Ristad, +1 more

- 01 May 1998 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

Proceedings Article

Scaling up all pairs similarity search

Roberto J. Bayardo, +2 more

TL;DR: This work proposes a simple algorithm based on novel indexing and optimization strategies that solves the problem of finding all pairs of vectors whose similarity score is above a given threshold without relying on approximation methods or extensive parameter tuning.

Proceedings Article

A Primitive Operator for Similarity Joins in Data Cleaning

Surajit Chaudhuri, +2 more

TL;DR: This paper proposes a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity.

Related Papers (5)

The CompleteSearch Engine: Interactive, Efficient, and Towards IR & DB integration

Type less, find more: fast autocompletion search with a succinct index

Efficient interactive fuzzy keyword search

Tries for approximate string matching

Search improvements for electronic spelling machine