Open Access Proceedings Article

Using an N-Gram-based document representation with a vector processing retrieval model

William B. Cavnar
- Iss: 500225, pp 269-277
TLDR
This work uses a vector processing model for documents and queries, with N-gram frequencies rather than the more traditional term frequencies as the basis for the vector element values. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal.
Abstract
N-gram based representations for documents have several distinct advantages for various document processing tasks. First, they provide a more robust representation in the face of grammatical and typographical errors in the documents. Secondly, N-gram representations require no linguistic preparations such as word-stemming or stopword removal. Thus they are ideal in situations requiring multi-language operations. Vector processing retrieval models also have some unique advantages for information retrieval tasks. In particular, they provide a simple, uniform representation for documents and queries, and an intuitively appealing document similarity measure. Also, modern vector space models have good retrieval performance characteristics. In this work, we combine these two ideas by using a vector processing model for documents and queries, but using N-gram frequencies as the basis for the vector element values instead of more traditional term frequencies. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal. We also have begun testing the system on Spanish language documents.
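The idea in the abstract can be illustrated with a minimal sketch: represent each text as a vector of character N-gram counts and compare vectors with cosine similarity. The abstract does not give the system's N-gram length, weighting, or normalization, so the choices here (N = 4, raw counts, lowercasing, space padding) and the function names `ngram_counts`, `cosine_similarity`, and `rank` are assumptions made for illustration, not the author's implementation.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams (assumed: lowercase, space-padded)."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query: str, documents: list[str], n: int = 4) -> list[tuple[float, str]]:
    """Score every document against the query and sort best-first."""
    q = ngram_counts(query, n)
    scored = [(cosine_similarity(q, ngram_counts(d, n)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["information retrieval with vectors", "cooking pasta at home"]
    # The query is deliberately misspelled; no stemming or stopword list is used.
    for score, doc in rank("informaton retreival", docs):
        print(f"{score:.3f}  {doc}")
```

Because a misspelled word still shares most of its N-grams with the correct form, the misspelled query above still overlaps with the first document, which is the robustness property the abstract describes.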

Citations
Proceedings Article

Text Classification using String Kernels

TL;DR: In this article, an inner product in the feature space consisting of all subsequences of length k was introduced for comparing two text documents, where a subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously.
Journal Article

Text classification using string kernels

TL;DR: A novel kernel for comparing two text documents is introduced. It is an inner product in the feature space consisting of all subsequences of length k, and it can be efficiently evaluated by a dynamic programming technique.
Patent

Multiple engine information retrieval and visualization system

TL;DR: In this paper, an information retrieval and visualization system utilizes multiple search engines for retrieving documents from a document database based upon user input queries, including an n-gram search engine and a vector space model search engine using a neural network training algorithm.
Journal Article

Character N-Gram Tokenization for European Language Text Retrieval

TL;DR: It is demonstrated empirically that overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages, making it a good choice for those languages despite the increased storage and time requirements of the technique.
Journal Article

Automatic text summarization: A comprehensive survey

TL;DR: This survey presents the different aspects of automatic text summarization (ATS) for researchers: approaches, methods, building blocks, techniques, datasets, evaluation methods, and future research directions.
References

N-gram-based text categorization

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
Journal Article

Parallel text search methods

TL;DR: A comparison of recently proposed parallel text search methods to alternative available search strategies that use serial processing machines suggests parallel methods do not provide large-scale gains in either retrieval effectiveness or efficiency.
Proceedings Article

N-gram-based text filtering for TREC-2

TL;DR: An experimental text filtering system that uses N-gram-based matching for document retrieval and routing tasks, pointing the way for several types of enhancements, both for speed and effectiveness.

Text retrieval using the vector processing model

Gerard Salton, +1 more
TL;DR: A collection of 46,000 documents from the Federal Register is used as a test database to demonstrate the usefulness of vector processing methods and to illustrate the text analysis and retrieval operations.