Journal Article

Understanding inverse document frequency: on theoretical arguments for IDF

Stephen Robertson
01 Oct 2004, Vol. 60, Iss. 5, pp. 503-520
Abstract
The term‐weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
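For readers unfamiliar with the weighting function the abstract refers to, the following is a minimal sketch of the commonly cited textbook form, IDF(t) = log(N / n_t), where N is the number of documents in the collection and n_t is the number of documents containing term t, together with a plain TF*IDF product. The function names, the toy collection, and the choice to give unseen terms a weight of zero are illustrative assumptions, not taken from the paper, which may analyse other variants.

```python
import math
from collections import Counter


def idf(term, docs):
    """Textbook IDF: log(N / n_t), with n_t = number of documents containing `term`."""
    n_docs = len(docs)
    n_t = sum(1 for doc in docs if term in doc)
    if n_t == 0:
        # Assumption for this sketch: a term absent from the collection gets weight 0.
        return 0.0
    return math.log(n_docs / n_t)


def tf_idf(term, doc, docs):
    """Raw term frequency in `doc` multiplied by the term's IDF over `docs`."""
    return Counter(doc)[term] * idf(term, docs)


# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["information", "retrieval", "model"],
    ["information", "theory"],
    ["probabilistic", "model", "of", "information", "retrieval"],
    ["term", "weighting", "term", "frequency"],
]

rare = idf("theory", docs)         # appears in 1 of 4 documents
common = idf("information", docs)  # appears in 3 of 4 documents
weighted = tf_idf("term", docs[3], docs)
```

In this toy collection the rarer term receives the larger weight, which is the behaviour the original 1972 proposal argues for.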


Citations
Proceedings Article

CIDEr: Consensus-based image description evaluation

TL;DR: A novel paradigm for evaluating image descriptions that uses human consensus is proposed and a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources is evaluated.
Book

Search Engines: Information Retrieval in Practice

TL;DR: This text provides the background and tools needed to evaluate, compare and modify search engines and numerous programming exercises make extensive use of Galago, a Java-based open source search engine.
Journal Article

Interpreting TF-IDF term weights as making relevance decisions

TL;DR: A novel probabilistic retrieval model forms a basis to interpret the TF-IDF term weights as making relevance decisions, and it is shown that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems.
Proceedings Article

Location-based and preference-aware recommendation using sparse geo-social networking data

TL;DR: A location-based and preference-aware recommender system that offers a particular user a set of venues within a geospatial range with the consideration of both: user preferences and social opinions, which are automatically learned from her location history.
Journal Article

A Comprehensive Survey of Deep Learning for Image Captioning

TL;DR: This article comprehensively reviews deep learning-based image captioning techniques, discussing their foundations and analyzing their performance, strengths, and limitations.
References
Journal Article

A mathematical theory of communication

TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
Journal Article

A statistical interpretation of term specificity and its application in retrieval

TL;DR: It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.
Journal Article

A language modeling approach to information retrieval

TL;DR: It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection and provide further proof of concept for the use of language models for retrieval tasks.
Book

Relevance weighting of search terms

TL;DR: This paper examines statistical techniques for exploiting relevance information to weight search terms using information about the distribution of index terms in documents in general and shows that specific weighted search methods are implied by a general probabilistic theory of retrieval.