Using an N-Gram-based document representation with a vector processing retrieval model

Open Access Proceedings Article

- Iss: 500225, pp 269-277

TLDR

This work uses a vector processing model for documents and queries, with N-gram frequencies rather than the more traditional term frequencies as the basis for the vector element values. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal.

Abstract:

N-gram-based representations for documents have several distinct advantages for various document processing tasks. First, they provide a more robust representation in the face of grammatical and typographical errors in the documents. Second, N-gram representations require no linguistic preparations such as word stemming or stopword removal. Thus they are ideal in situations requiring multi-language operations. Vector processing retrieval models also have some unique advantages for information retrieval tasks. In particular, they provide a simple, uniform representation for documents and queries, and an intuitively appealing document similarity measure. Also, modern vector space models have good retrieval performance characteristics. In this work, we combine these two ideas by using a vector processing model for documents and queries, but using N-gram frequencies as the basis for the vector element values instead of more traditional term frequencies. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal. We have also begun testing the system on Spanish language documents.
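The idea in the abstract, vectors whose elements are N-gram frequencies compared by a document similarity measure, can be illustrated with a minimal sketch. The abstract does not state the N-gram length, the weighting scheme, or the exact similarity function, so the choices below (character 4-grams, raw counts, cosine similarity) and the names `ngram_counts`, `cosine_similarity`, and `rank` are assumptions made for illustration, not the authors' system.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n=4):
    """Count overlapping character n-grams (n=4 is an assumed default).

    Text is lowercased, whitespace is collapsed, and one space is added
    at each end so that word boundaries appear inside n-grams.
    No stemming or stopword removal is applied.
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents, n=4):
    """Return (score, document index) pairs, highest score first."""
    q = ngram_counts(query, n)
    scored = [
        (cosine_similarity(q, ngram_counts(doc, n)), i)
        for i, doc in enumerate(documents)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because the vectors are built from character sequences rather than whole words, a misspelled query such as "retreival of informaton" still shares several 4-grams with a document containing "information retrieval", which is the robustness to typographical errors that the abstract describes.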
