Open Access Proceedings Article
Using an N-Gram-based document representation with a vector processing retrieval model
William B. Cavnar
- Iss: 500225, pp 269-277
TLDR
This work uses a vector processing model for documents and queries, with N-gram frequencies rather than more traditional term frequencies as the basis for the vector element values. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal.
Abstract:
N-gram-based representations for documents have several distinct advantages for various document processing tasks. First, they provide a more robust representation in the face of grammatical and typographical errors in the documents. Second, N-gram representations require no linguistic preparations such as word stemming or stopword removal. Thus they are ideal in situations requiring multi-language operations. Vector processing retrieval models also have some unique advantages for information retrieval tasks. In particular, they provide a simple, uniform representation for documents and queries, and an intuitively appealing document similarity measure. Also, modern vector space models have good retrieval performance characteristics. In this work, we combine these two ideas by using a vector processing model for documents and queries, with N-gram frequencies as the basis for the vector element values instead of more traditional term frequencies. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any kind of word stemming or stopword removal. We have also begun testing the system on Spanish language documents.
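The combination described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the N-gram length, the weighting scheme, or the normalization, so the sketch assumes overlapping character 4-grams, raw counts as vector element values, and cosine similarity as the document similarity measure. The function names ngram_counts, cosine_similarity, and rank are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams (n=4 is an assumption, not from the paper)."""
    # Lowercase and collapse whitespace so word boundaries become single spaces.
    normalized = " ".join(text.lower().split())
    # Pad with one space on each side so word-initial and word-final n-grams are captured.
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: list[str], n: int = 4) -> list[tuple[float, str]]:
    """Score every document against the query and sort by descending similarity."""
    q = ngram_counts(query, n)
    scored = [(cosine_similarity(q, ngram_counts(d, n)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["document retreival system", "stock market prices"]
    for score, doc in rank("retrieval", docs):
        print(f"{score:.3f}  {doc}")
```

In this sketch the query "retrieval" still shares some 4-grams with the misspelled "retreival", so that document receives a nonzero score and ranks first without any stemming, stopword list, or spelling correction. A whole-word term vector would treat the two spellings as unrelated terms.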
Citations
Proceedings Article
Text Classification using String Kernels
TL;DR: In this article, an inner product in the feature space consisting of all subsequences of length k was introduced for comparing two text documents, where a subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously.
Journal Article
Text classification using string kernels
TL;DR: A novel kernel for comparing two text documents is introduced: an inner product in the feature space of all subsequences of length k, which can be efficiently evaluated by a dynamic programming technique.
Patent
Multiple engine information retrieval and visualization system
Kevin L. Fox, Ophir Frieder, Margaret M. Knepper, Robert A. Killam, Joseph M. Nemethy, Gregory J. Cusick, Eric J. Snowberg
TL;DR: In this paper, an information retrieval and visualization system utilizes multiple search engines for retrieving documents from a document database based upon user input queries, including an n-gram search engine and a vector space model search engine using a neural network training algorithm.
Journal Article
Character N-Gram Tokenization for European Language Text Retrieval
Paul McNamee, James Mayfield
TL;DR: It is demonstrated empirically that overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages, making it a good choice for those languages; the increased storage and time requirements of the technique are also noted.
Journal Article
Automatic text summarization: A comprehensive survey
TL;DR: This research provides a comprehensive survey for researchers by presenting the different aspects of ATS: approaches, methods, building blocks, techniques, datasets, evaluation methods, and future research directions.
References
N-gram-based text categorization
W.B. Cavnar, John M. Trenkle
TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
Journal Article
Parallel text search methods
Gerard Salton, Chris Buckley
TL;DR: A comparison of recently proposed parallel text search methods to alternative available search strategies that use serial processing machines suggests parallel methods do not provide large-scale gains in either retrieval effectiveness or efficiency.
Proceedings Article
N-gram-based text filtering for TREC-2
TL;DR: An experimental text filtering system that uses N-gram-based matching for document retrieval and routing tasks, pointing the way for several types of enhancements, both for speed and effectiveness.
Text retrieval using the vector processing model
Gerard Salton, James Allan
TL;DR: A collection of 46,000 documents from the Federal Register is used as a test database to demonstrate the usefulness of vector processing methods and to illustrate the text analysis and retrieval operations.