Term Weighting Approaches in Automatic Text Retrieval
Citations
11,357 citations
Cites methods from "Term Weighting Approaches in Automatic Text Retrieval"
...using the “term frequency inverse document frequency” (TF-IDF) weighting scheme developed in the information retrieval area [Salton and Buckley, 1988]....
[...]
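As a concrete reading of the TF-IDF scheme the excerpt above references: a term's weight rises with its frequency inside a document and falls with the number of documents containing it. Below is a minimal sketch, assuming Python, of one classic variant (raw term frequency times log inverse document frequency with cosine length normalization); the tiny tokenized corpus and the function name are illustrative, not from the cited paper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weights for a list of tokenized documents:
    raw term frequency x log inverse document frequency,
    followed by cosine (unit-length) normalization."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Rare terms (low df) get boosted; ubiquitous terms go to zero.
        w = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

# Toy corpus: "weighting" appears in every document, so its idf is log(1) = 0.
docs = [["term", "weighting", "weighting"],
        ["retrieval", "weighting"],
        ["term", "retrieval", "weighting"]]
for vec in tfidf_vectors(docs):
    print(vec)
```

The cosine normalization keeps long documents from dominating similarity scores, a normalization choice discussed in Salton and Buckley (1988).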
8,658 citations
7,539 citations
Cites background or result from "Term Weighting Approaches in Automatic Text Retrieval"
...[Fuhr 1989; Salton and Buckley 1988]) accounting for the “importance” of t_k for d_j play a key role in IR....
[...]
...[Salton and Buckley 1988]....
[...]
...As such, this would seem to contradict a well-known law of IR, according to which the terms with low-to-medium document frequency are the most informative ones [Salton and Buckley 1988]....
[...]
...1998; Lewis 1992a] it has been found that representations more sophisticated than this do not yield significantly better effectiveness, thereby confirming similar results from IR [Salton and Buckley 1988]....
[...]
7,244 citations
6,984 citations
Additional excerpts
...The informativeness of rare features has led practitioners to craft domain-specific feature weightings, such as TF-IDF (Salton and Buckley, 1988), which pre-emphasize infrequently occurring features. We use this old idea as a motivation for applying modern learning-theoretic techniques to the problem of online and stochastic learning, focusing specifically on (sub)gradient methods. Standard stochastic subgradient methods largely follow a predetermined procedural scheme that is oblivious to the characteristics of the data being observed. In contrast, our algorithms dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. The adaptation facilitates finding and identifying very predictive but comparatively rare features.

The following toy example of classification with the hinge loss highlights a problem with standard subgradient methods. We receive a sequence of 3-dimensional vectors {z_t} and labels y_t ∈ {−1, +1}. Let u_t be independent ±1-valued random variables, each with probability 1/2. Each instance z_t is associated with a label y_t = −1 and is equal to (u_t, −u_t, 0) 99% of the time, while z_t = (u_t, −u_t, 1) in the remaining 1% of the cases, with label y_t = 1. We would like to find a vector x with ‖x‖_∞ ≤ 1 such that max{0, −y_t⟨x, z_t⟩} = 0 for all t. Clearly, any solution vector takes the form x = (a, a, 1) with |a| ≤ 1. Standard subgradient methods, such as the one proposed by Zinkevich (2003), iterate as follows: x_{t+1} = Π_{x : ‖x‖_∞ ≤ 1}(x_t − η_t y_t z_t), where Π_X denotes Euclidean projection onto the set X and η_t = 1/√t....
[...]
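The toy example in this excerpt is easy to reproduce. Below is a minimal sketch, assuming Python, of Zinkevich-style projected subgradient descent on the stream described above; the subgradient chosen at the hinge's kink, the random seed, and the horizon are illustrative assumptions, not details from the citing paper:

```python
import random

def hinge_subgradient(x, z, y):
    """A subgradient of f(x) = max(0, -y * <x, z>).
    At the kink (loss exactly 0) both 0 and -y*z are valid
    subgradients; we pick -y*z so the iterate can move."""
    if -y * sum(xi * zi for xi, zi in zip(x, z)) >= 0:
        return [-y * zi for zi in z]
    return [0.0] * len(x)

def project_linf(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_inf <= radius}
    is coordinate-wise clipping."""
    return [max(-radius, min(radius, xi)) for xi in x]

random.seed(0)
x = [0.0, 0.0, 0.0]
for t in range(1, 100_001):
    u = random.choice([-1, 1])
    if random.random() < 0.99:          # frequent, uninformative rounds
        z, y = (u, -u, 0), -1
    else:                               # rare rounds exposing feature 3
        z, y = (u, -u, 1), +1
    eta = 1.0 / t ** 0.5                # global step size eta_t = 1/sqrt(t)
    g = hinge_subgradient(x, z, y)
    x = project_linf([xi - eta * gi for xi, gi in zip(x, g)])

# Solutions have the form (a, a, 1), but the rare third coordinate
# receives only a few, ever-shrinking updates and stays far below 1
# (the exact value depends on the seed).
print(x)
```

Because η_t = 1/√t decays for every coordinate at once, the third feature, which appears in only 1% of the rounds, receives few and ever-smaller updates; this is precisely the failure mode that the adaptive, geometry-aware methods motivated in the excerpt are meant to fix.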