Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6821 publications have been published within this topic, receiving 214383 citations.

Papers published on a yearly basis (chart; years 1969 to 2023)

Papers

Book Chapter•DOI•

Probabilistic Retrieval of OCR Degraded Text Using N-Grams

Stephen M. Harding¹, W. Bruce Croft¹, C. Weir•Institutions (1)

University of Massachusetts Amherst¹

01 Sep 1997

TL;DR: N-gram retrieval of OCR-degraded text within a probabilistic retrieval system is examined. Direct n-gram queries outperformed standard word-based queries at 10 percent degradation or worse, n-gram based query expansion helped to a lesser degree, and a web based application that retrieves OCR text and displays the source document image with query term highlighting is described.

Abstract: The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A second method of using n-grams to identify appropriate matching and near matching terms for query expansion which also performed better than using standard queries is also described. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described.

78 citations
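
The abstract above mentions using n-grams to find matching and near-matching terms for query expansion. The following is a minimal sketch of that general idea, not the paper's probabilistic system: it assumes character 2- and 3-grams and a Dice coefficient as the similarity measure, and the function names and the threshold of 0.5 are hypothetical choices made for illustration.

```python
def char_ngrams(term, sizes=(2, 3)):
    """Return the set of character n-grams of `term` for each size in `sizes`."""
    term = term.lower()
    return {term[i:i + n] for n in sizes for i in range(len(term) - n + 1)}


def dice_similarity(a, b, sizes=(2, 3)):
    """Dice coefficient between the n-gram sets of two terms (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, sizes), char_ngrams(b, sizes)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def near_matches(query_term, vocabulary, threshold=0.5):
    """Vocabulary terms whose similarity to `query_term` meets `threshold`,
    ordered from most to least similar (illustrative threshold, not from the paper)."""
    scored = [(dice_similarity(query_term, v), v) for v in vocabulary]
    return [v for score, v in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical OCR-corrupted vocabulary: "retrieva1" and "rctrieval" are misreads.
vocabulary = ["retrieval", "retrieva1", "rctrieval", "document"]
print(near_matches("retrieval", vocabulary))
```

Under these assumptions, OCR variants of a query term share most of their n-grams with the clean term and so can be added to the query, while unrelated terms share none.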

Proceedings Article•DOI•

Unified utility maximization framework for resource selection

Luo Si¹, Jamie Callan¹•Institutions (1)

Carnegie Mellon University¹

13 Nov 2004

TL;DR: This new framework shows an efficient and effective way to infer the probabilities of relevance of all the documents across the text databases and provides a more solid framework for distributed information retrieval.

Abstract: This paper presents a unified utility framework for resource selection of distributed text information retrieval. This new framework shows an efficient and effective way to infer the probabilities of relevance of all the documents across the text databases. With the estimated relevance information, resource selection can be made by explicitly optimizing the goals of different applications. Specifically, when used for database recommendation, the selection is optimized for the goal of high-recall (include as many relevant documents as possible in the selected databases); when used for distributed document retrieval, the selection targets the high-precision goal (high precision in the final merged list of documents). This new model provides a more solid framework for distributed information retrieval. Empirical studies show that it is at least as effective as other state-of-the-art algorithms.

77 citations
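
The abstract above distinguishes a high-recall goal (database recommendation) from a high-precision goal (distributed document retrieval), both driven by estimated probabilities of relevance. The sketch below is one plausible reading of that distinction, not the paper's actual estimation or optimization procedure: it assumes the per-document relevance probabilities are already given, and the function names, the fixed `docs_per_database` cutoff, and the sample numbers are all hypothetical.

```python
def select_for_recall(estimates, num_databases):
    """Pick the databases with the largest expected number of relevant documents."""
    expected = {db: sum(probs) for db, probs in estimates.items()}
    return sorted(expected, key=expected.get, reverse=True)[:num_databases]


def select_for_precision(estimates, num_databases, docs_per_database):
    """Pick the databases whose top `docs_per_database` documents carry the most
    estimated relevance, since only those documents reach the merged list."""
    def top_mass(probs):
        return sum(sorted(probs, reverse=True)[:docs_per_database])

    ranked = sorted(estimates, key=lambda db: top_mass(estimates[db]), reverse=True)
    return ranked[:num_databases]


# Hypothetical estimated relevance probabilities for the documents in each database.
estimates = {
    "A": [0.9, 0.8, 0.1],
    "B": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "C": [0.3, 0.2],
}
print(select_for_recall(estimates, num_databases=1))
print(select_for_precision(estimates, num_databases=1, docs_per_database=2))
```

In this toy data the two goals pick different databases: B holds the most relevant documents in expectation, while A has the strongest top two documents.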

Journal Article•DOI•

A model of knowledge based information retrieval with hierarchical concept

Young Whan Kim¹, Jin H. Kim•Institutions (1)

KAIST¹

01 May 1990-Journal of Documentation

TL;DR: The proposed model computes the conceptual distance between a query and an object, both indexed with weighted terms from a hierarchical thesaurus, and extends the model of Rada et al. by allowing the index terms and the edges of the hierarchical-concept graph (HCG) to be weighted.

Abstract: This paper discusses a knowledge based information retrieval model with hierarchical thesaurus. The model computes the conceptual distance between a query and an object and both are indexed with weighted terms from a hierarchical thesaurus. The hierarchical thesaurus is represented by a hierarchical‐concept graph (HCG) in which nodes represent concepts and directed edges represent generalisation relationships. Rada et al. have developed a similar model. However, their model considered only a binary indexing scheme and revealed some counter‐intuitive results. Our proposed model extends theirs by allowing the index term and the edge of the HCG to be weighted. A new concept mapping method is devised to overcome Rada's counter‐intuitive results. In addition, a scheme for allowing Boolean operators in user queries is provided with a formula for computing conceptual distance from negated index terms. Experimental results have shown that our model simulates human performance more closely than Rada's model.

77 citations
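
The abstract above describes conceptual distance over a hierarchical-concept graph with weighted edges. The following sketch illustrates the general notion only and is not the paper's formula: it assumes distance between two concepts is the weighted shortest path with generalisation edges traversed in either direction, and that the distance between a query and an object is the average over all term pairs. Index-term weights, the paper's concept mapping method, and Boolean operators are omitted, and the toy thesaurus is hypothetical.

```python
import heapq
from itertools import product


def build_graph(edges):
    """Build an adjacency map from (child, parent, weight) generalisation edges,
    traversable in both directions."""
    graph = {}
    for child, parent, weight in edges:
        graph.setdefault(child, {})[parent] = weight
        graph.setdefault(parent, {})[child] = weight
    return graph


def shortest_distance(graph, source, target):
    """Dijkstra's algorithm: weighted shortest-path length from source to target."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")


def conceptual_distance(graph, query_terms, object_terms):
    """Average pairwise distance between two non-empty sets of concepts
    (an averaging rule assumed here for illustration)."""
    pairs = list(product(query_terms, object_terms))
    return sum(shortest_distance(graph, q, o) for q, o in pairs) / len(pairs)


# Hypothetical thesaurus fragment: (child, parent, edge weight).
hcg = build_graph([
    ("dog", "mammal", 1.0),
    ("cat", "mammal", 1.0),
    ("mammal", "animal", 2.0),
    ("bird", "animal", 1.5),
])
print(conceptual_distance(hcg, ["dog"], ["cat", "bird"]))
```

A smaller value means the object is conceptually closer to the query, so objects would be ranked in ascending order of distance.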

Proceedings Article•DOI•

Integrating information retrieval and domain specific approaches for browsing and retrieval in object-oriented class libraries

Richard Helm, Yoelle Maarek

01 Nov 1991

77 citations

Proceedings Article•DOI•

A machine learning approach for improved BM25 retrieval

Krysta M. Svore¹, Christopher J. C. Burges¹•Institutions (1)

Microsoft¹

02 Nov 2009

TL;DR: A machine learning approach to BM25-style retrieval is developed that learns, using LambdaRank, from the input attributes of BM25, and significantly improves retrieval effectiveness over BM25 and BM25F.

Abstract: Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness over BM25 and BM25F. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.

77 citations
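
For context on the baseline the abstract above compares against, here is a minimal sketch of standard single-field BM25 scoring. It is not the paper's learned model: the parameter defaults `k1=1.2` and `b=0.75`, the whitespace tokenizer, the smoothed IDF variant (with `+ 1` inside the logarithm to keep it positive), and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document in a non-empty list against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical three-document collection.
docs = [
    "anchor text retrieval",
    "query click data",
    "retrieval of retrieval models",
]
print(bm25_scores("retrieval", docs))
```

The parameters `k1` and `b` are the kind of hand-tuned values that the abstract says a data-driven approach can avoid.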

Network Information

Performance

Metrics

Papers: 6,866
Citations: 224,605

No. of papers in the topic in previous years
Year	Papers
2023	9
2022	39
2021	107
2020	130
2019	144
2018	111