A vector space model for automatic indexing

doi:10.1145/361219.361220

Gerard Salton, +2 more

- 01 Nov 1975 -

Communications of The ACM

- Vol. 18, Iss: 11, pp 613-620

TLDR

An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.

Abstract:

In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.
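
The idea of space density in the abstract can be illustrated with a minimal sketch. The abstract does not spell out a formula, so this example assumes density is measured as the mean pairwise cosine similarity between document term vectors; the function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def space_density(doc_vectors):
    """Assumed density measure: mean cosine similarity over all document pairs.

    A higher value means the documents sit closer together in the space.
    """
    pairs = list(combinations(doc_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy collection: three documents, one vector per document.
# In the first vocabulary, the first term occurs in every document.
with_common_term = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
# In the second vocabulary, that shared term has been dropped.
without_common_term = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(space_density(with_common_term))     # denser space
print(space_density(without_common_term))  # more spread-out space
```

Under this assumed measure, dropping the term shared by every document lowers the density, which is the direction the abstract associates with better retrieval performance.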

Citations

Journal Article

Machine learning in automated text categorization

Fabrizio Sebastiani

- 01 Mar 2002 -

ACM Computing Surveys

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

Software Framework for Topic Modelling with Large Corpora

Radim Řehůřek, +1 more

TL;DR: This work describes a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion, and implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation in a way that makes them completely independent of the training corpus size.

Journal Article

From frequency to meaning: vector space models of semantics

Peter D. Turney, +1 more

- 01 Jan 2010 -

Journal of Artificial Intelligence Resea...

TL;DR: The goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs, and to provide pointers into the literature for those who are less familiar with the field.

Journal Article

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

ChengXiang Zhai, +1 more

TL;DR: This paper examines the sensitivity of retrieval performance to the smoothing parameters and compares several popular smoothing methods on different test collections.

Proceedings Article

Text Classification using String Kernels

Huma Lodhi, +3 more

TL;DR: In this article, an inner product in the feature space consisting of all subsequences of length k was introduced for comparing two text documents, where a subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously.

References

Journal Article

A statistical interpretation of term specificity and its application in retrieval

Karen Sparck Jones

- 01 Jan 1972 -

Journal of Documentation

TL;DR: It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.

Journal Article

On the Specification of Term Values in Automatic Indexing

Gerard Salton, +1 more

- 01 Apr 1973 -

Journal of Documentation

TL;DR: It is shown that the standard theories for the specification of term values (or weights) are not adequate, and new techniques are introduced for the assignment of weights to index terms, based on the characteristics of individual document collections.

Monograph

Theory of Indexing

Gerard Salton

Proceedings Article

Contribution to the Theory of Indexing

Gerard Salton, +2 more

TL;DR: An attempt is made to characterize the usefulness of terms occurring in stored documents and user queries as a function of their frequency characteristics across the documents of a collection, and an indexing theory is described based on term frequency considerations.
