Relevance weighting of search terms

doi:10.1002/ASI.4630270302

Home
/
Papers
/
Relevance weighting of search terms

Journal Article•DOI•

Relevance weighting of search terms

Stephen Robertson¹, K. Sparck Jones²•Institutions (2)

University College London¹, University of Cambridge²

01 May 1976-Journal of the Association for Information Science and Technology (John Wiley & Sons, Ltd)-Vol. 27, Iss: 3, pp 129-146

TL;DR: In this article, a series of relevance weighting functions is derived and is justified by theoretical considerations, in particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval.

read less

Abstract: This paper examines statistical techniques for exploiting relevance information to weight search terms. These techniques are presented as a natural extension of weighting methods using information about the distribution of index terms in documents in general. A series of relevance weighting functions is derived and is justified by theoretical considerations. In particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval. Different applications of relevance weighting are illustrated by experimental results for test collections.

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Term Weighting Approaches in Automatic Text Retrieval

[...]

Gerard Salton¹, Chris Buckley¹•Institutions (1)

Cornell University¹

01 Aug 1988-Information Processing and Management

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.

...read moreread less

Abstract: The experimental evidence accumulated over the past 20 years indicates that textindexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.

...read moreread less

9,460 citations

Journal Article•DOI•

Machine learning in automated text categorization

[...]

Fabrizio Sebastiani

01 Mar 2002-ACM Computing Surveys

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

...read moreread less

Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

...read moreread less

7,539 citations

Proceedings Article•

A comparison of event models for naive bayes text classification

[...]

Andrew McCallum, Kamal Nigam¹•Institutions (1)

Carnegie Mellon University¹

01 Jan 1998

TL;DR: It is found that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial performs usually performs even better at larger vocabulary sizes--providing on average a 27% reduction in error over the multi -variateBernoulli model at any vocabulary size.

...read moreread less

Abstract: Recent work in text classification has used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian Network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a uni-gram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial performs usually performs even better at larger vocabulary sizes--providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.

...read moreread less

3,601 citations

Journal Article•DOI•

Text Classification from Labeled and Unlabeled Documents using EM

[...]

Kamal Nigam¹, Andrew McCallum², Sebastian Thrun¹, Tom M. Mitchell¹•Institutions (2)

Carnegie Mellon University¹, Jordan University of Science and Technology²

01 May 2000-Machine Learning

TL;DR: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.

...read moreread less

Abstract: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.

...read moreread less

3,123 citations

Journal Article•DOI•

A language modeling approach to information retrieval

[...]

Jay Ponte¹, W. Bruce Croft¹•Institutions (1)

University of Massachusetts Amherst¹

01 Aug 1998

TL;DR: It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection and provide further proof of concept for the use of language models for retrieval tasks.

...read moreread less

Abstract: In today's world, there is no shortage of information. However, for a specific information need, only a small subset of all of the available information will be useful. The field of information retrieval (IR) is the study of methods to provide users with that small subset of information relevant to their needs and to do so in a timely fashion. Information sources can take many forms, but this thesis will focus on text based information systems and investigate problems germane to the retrieval of written natural language documents. Central to these problems is the notion of "topic." In other words, what are documents about? However, topics depend on the semantics of documents and retrieval systems are not endowed with knowledge of the semantics of natural language. The approach taken in this thesis will be to make use of probabilistic language models to investigate text based information retrieval and related problems. One such problem is the prediction of topic shifts in text, the topic segmentation problem. It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection. Two complementary sets of features are studied individually and then combined into a single language model. The language modeling approach allows this problem to be approached in a principled way without complex semantic modeling. Next, the problem of document retrieval in response to a user query will be investigated. Models of document indexing and document retrieval have been extensively studied over the past three decades. The integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. Much of the reason for this is that the indexing component requires inferences as to the semantics of documents. Instead, an approach to retrieval based on probabilistic language modeling will be presented. Models are estimated for each document individually. The approach to modeling is non-parametric and integrates the entire retrieval process into a single model. One advantage of this approach is that collection statistics, which are used heuristically for the assignment of concept probabilities in other probabilistic models, are used directly in the estimation of language model probabilities in this approach. The language modeling approach has been implemented and tested empirically and performs very well on standard test collections and query sets. In order to improve retrieval effectiveness, IR systems use additional techniques such as relevance feedback, unsupervised query expansion and structured queries. These and other techniques are discussed in terms of the language modeling approach and empirical results are given for several of the techniques developed. These results provide further proof of concept for the use of language models for retrieval tasks.

...read moreread less

2,736 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

A statistical interpretation of term specificity and its application in retrieval

[...]

Karen Sparck Jones¹•Institutions (1)

University of Cambridge¹

01 Jan 1972-Journal of Documentation

TL;DR: It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.

...read moreread less

Abstract: The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently‐occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.

...read moreread less

3,559 citations

Book•

analysis of binary data

[...]

David Cox, E. J. Snell

01 Jan 1970

TL;DR: Binary response variables special logistical analyses some complications some related approaches more complex responses.

...read moreread less

Abstract: The first edition of this book (1970) set out a systematic basis for the analysis of binary data and in particular for the study of how the probability of 'success' depends on explanatory variables. The first edition has been widely used and the general level and style have been preserved in the second edition, which contains a substantial amount of new material. This amplifies matters dealt with only cryptically in the first edition and includes many more recent developments. In addition the whole material has been reorganized, in particular to put more emphasis on m.aximum likelihood methods.There are nearly 60 further results and exercises. The main points are illustrated by practical examples, many of them not in the first edition, and some general essential background material is set out in new Appendices.

...read moreread less

2,855 citations

Journal Article•DOI•

On selecting a measure of retrieval effectiveness

[...]

William S. Cooper¹•Institutions (1)

University of California, Berkeley¹

01 Mar 1973-Journal of the Association for Information Science and Technology

TL;DR: It is argued that a user's subjective evaluation of the personal utility of a retrieval system's output to him, if it could be properly quantified, would be a near-ideal measure of retrieval effectiveness.

...read moreread less

Abstract: It is argued that a user's subjective evaluation of the personal utility of a retrieval system's output to him, if it could be properly quantified, would be a near-ideal measure of retrieval effectiveness. A hypothetical methodology is presented for measuring this utility by means of an elicitation procedure. Because the hypothetical methodology is impractical, compromise methods are outlined and their underlying simplifying assumptions are discussed. The more plausible the simplifying assumptions on which a performance measure is based, the better the measure. This, along with evidence gleaned from ‘validation experiments’ of a certain kind, is suggsted as a criterion for selecting or deriving the best measure of effectiveness to use under given test conditions.

...read moreread less

253 citations

Journal Article•DOI•

Index term weighting

[...]

Karen Sparck Jones¹•Institutions (1)

University of Cambridge¹

01 Nov 1973-Information Storage and Retrieval

TL;DR: The results show that one type of weighting leads to material performance improvements in quite different collection environments.

...read moreread less

198 citations

Journal Article•DOI•

Precision Weighting—An Effective Automatic Indexing Method

[...]

C. T. Yu¹, Gerard Salton²•Institutions (2)

University of Alberta¹, Cornell University²

01 Jan 1976-Journal of the ACM

TL;DR: The precision weighting procedure described in the present study uses relevance criteria to weight the terms occurring in user queries as a function of the balance between relevant and nonrelevant documents in which these terms occur; this approximates a semantic know-how of term importance.

...read moreread less

Abstract: A great many automatic indexing methods have been implemented and evaluated over the last few years, and automatic procedures comparable in effectiveness to conventional manual ones are now easy to generate Two drawbacks of the available automatic indexing methods are the absence of reliable linguistic inputs during the indexing process and the lack of formal, analytical proofs concerning the effectiveness of the proposed methods The precision weighting procedure described in the present study uses relevance criteria to weight the terms occurring in user queries as a function of the balance between relevant and nonrelevant documents in which these terms occur; this approximates a semantic know-how of term importance Formal mathematical proofs are given under well-defined conditions of the effectiveness of the method

...read moreread less

156 citations