
Relevant Document Distribution Estimation Method for
Resource Selection
Luo Si and Jamie Callan
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
lsi@cs.cmu.edu, callan@cs.cmu.edu
ABSTRACT
Prior research under a variety of conditions has shown the CORI
algorithm to be one of the most effective resource selection
algorithms, but the range of database sizes studied was not large.
This paper shows that the CORI algorithm does not do well in
environments with a mix of "small" and "very large" databases.
A new resource selection algorithm is proposed that uses
information about database sizes as well as database contents.
We also show how to acquire database size estimates in
uncooperative environments as an extension of the query-based
sampling used to acquire resource descriptions. Experiments
demonstrate that the database size estimates are more accurate
for large databases than estimates produced by a competing
method; the new resource ranking algorithm is always at least as
effective as the CORI algorithm; and the new algorithm results
in better document rankings than the CORI algorithm.
Categories & Subject Descriptors:
H.3.3 [Information Search and Retrieval]:
General Terms: Algorithms
Keywords: Resource Selection
1. INTRODUCTION
Distributed information retrieval, also known as federated
search, is ad-hoc search in environments containing multiple,
possibly many, text databases [1]. Distributed information
retrieval includes three sub-problems: i) acquiring information
about the contents of each database (resource representation)
[1,6], ii) ranking the resources and selecting a small number of
them for a given query (resource ranking) [1,3,5,7,8,12], and iii)
merging the results returned from the selected databases into a
single ranked list before presenting it to the end user (result-
merging) [1,4,13]. Early distributed IR research focused on
cooperative environments in which search engines could be
relied upon to provide corpus vocabulary, corpus statistics, and
search engine characteristics when requested to do so. Recent
research also addresses uncooperative environments in which
search engines only run queries and return documents.
Most resource ranking algorithms rank by how well database
contents appear to match a query, so resource descriptions have
tended to emphasize content [1]. Prior research suggests that it is
important to compensate for database size when assessing
similarity [12,7], but it has been unclear how to estimate
database sizes accurately in uncooperative environments.
This paper presents ReDDE, a new resource-ranking algorithm
that explicitly tries to estimate the distribution of relevant
documents across the set of available databases. The ReDDE
algorithm considers both content similarity and database size
when making its estimates. A new algorithm for estimating
database sizes in uncooperative environments is also presented.
Previous research showed that improved resource selection is
correlated with improved document rankings; this paper shows
that better resource selection does not always produce better
document rankings. An analysis of this contradiction leads to a
more robust version of the ReDDE algorithm.
The next section discusses prior research. Section 3 describes a
new method of estimating database sizes. Section 4 explains the
new ReDDE resource selection algorithm. Section 5 discusses
the subtle relationship between resource selection accuracy and
document retrieval accuracy, and proposes the modified version
of the ReDDE algorithm. Section 6 describes experimental
methodology. Sections 7, 8 and 9 present experiment results for
database size estimation, resource selection and document
retrieval. Section 10 concludes.
2. PREVIOUS WORK
Our research interest is uncooperative environments, such as the
Internet, in which resource providers provide basic services
(e.g., running queries, retrieving documents), but don’t provide
detailed information about their resources. In uncooperative
environments perhaps the best method of acquiring resource
descriptions is query-based sampling [1], in which a resource
description is constructed by sampling database contents via the
normal process of running queries and retrieving documents.
Query-based sampling has been shown to acquire accurate
unigram resource descriptions using a small number of queries
(e.g., 75) and a small number of documents (e.g., 300).
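To make the sampling procedure concrete, the following is a minimal Python sketch of a query-based sampling loop. It is an illustration under stated assumptions, not the authors' implementation: run_query, fetch_document, and seed_terms are placeholders for whatever interface a particular search engine exposes, and the 4-documents-per-query and 300-document settings follow the experimental description in Section 7.

import random

def query_based_sampling(run_query, fetch_document, seed_terms,
                         docs_per_query=4, target_docs=300):
    """Minimal sketch of query-based sampling [1] (not the authors' exact code).

    run_query(term)        -> list of document ids   (assumed interface)
    fetch_document(doc_id) -> document text          (assumed interface)
    """
    sampled_docs = {}                 # doc_id -> text
    vocabulary = set(seed_terms)      # terms available for future sample queries
    while len(sampled_docs) < target_docs and vocabulary:
        term = random.choice(sorted(vocabulary))          # pick a random query term
        for doc_id in run_query(term)[:docs_per_query]:   # keep a few top-ranked docs
            if doc_id not in sampled_docs:
                text = fetch_document(doc_id)
                sampled_docs[doc_id] = text
                vocabulary.update(text.lower().split())   # grow the term pool
    return sampled_docs

The sampled documents and their term statistics form the resource description used by the selection algorithms discussed below.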
Database size is an important component of a resource
description, but there has been limited prior research on how to
estimate it in uncooperative environments. Liu and Yu proposed
using a basic capture-recapture methodology to estimate the size
of a database [11]. Capture-recapture assumes that there are two
(or more) independent samples from a population. Let N be the
population size, A the event that an item is included in the first
sample, which is of size n1, B the event that an item is included
in the second sample, which is of size n2, and m2 the number of
items that appeared in both samples. The probabilities of events
A and B, and the relationship between them, are shown below.

$P(A) = \frac{n_1}{N}$    (1)

$P(B) = \frac{n_2}{N}$    (2)

$P(A \mid B) = \frac{m_2}{n_2}$    (3)
The two samples are assumed to be independent, so:
$P(A \mid B) = P(A)$    (4)
Thus, the population size is estimated as:
$\hat{N} = \frac{n_1 n_2}{m_2}$    (5)
Liu and Yu used it to estimate database sizes by randomly
sending queries to a database and sampling from the document
ids that were returned. They reported a rather accurate ability to
estimate database sizes [11], but the experimental methodology
might be considered unrealistic. For example, when estimating
the size of a database containing 300,000 documents, the
sampling procedure used 2,000 queries and expected to receive a
ranked list of 1,000 document ids per query, for a total of
2,000,000 (non-unique) document ids examined. This cost might
be considered excessive. Our goal is a procedure that estimates
database sizes at a far lower cost.
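For reference, Equation 5 can be computed directly from two samples of document ids. The sketch below is a minimal illustration with hypothetical sample contents, not Liu and Yu's procedure.

def capture_recapture_estimate(sample1, sample2):
    """Capture-recapture estimate (Equation 5): N_hat = n1 * n2 / m2."""
    n1, n2 = len(sample1), len(sample2)
    m2 = len(set(sample1) & set(sample2))    # ids seen in both samples
    if m2 == 0:
        return float('inf')                  # no overlap: size cannot be bounded
    return n1 * n2 / m2

# Hypothetical example: two samples of document ids drawn from the same database.
print(capture_recapture_estimate([1, 4, 7, 9, 12], [4, 9, 15, 20, 31]))  # 12.5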
There is a large body of prior research on resource selection
(e.g., [1,3,5,7,8,12]). Space limitations preclude discussing it all,
so we restrict our attention to a few that have been studied often
or recently in prior research.
gGlOSS [6] is based on the vector space model. It represents a
database by a vector that is composed of the document
frequencies of different words in the databases. Ipeirotis’s
Hierarchical Database Sampling and Selection algorithm [8]
used information from the search engine to get document
frequencies for some words and estimated the document
frequencies of other words by using Mandelbrot’s law. The
document frequency information was used as a part of the
database description to build a hierarchical structure for
databases. They showed that the hierarchical structure with extra
information had a better retrieval performance than the CORI
resource selection algorithm, but did not explicitly evaluate
resource selection performance. D’Souza and Thom’s n-term
indexing method [5] represents each resource by a set of
document surrogates, with one surrogate per original document.
They assume that they can access the content of each document
in every database, which is not the case in our work.
The CORI resource selection algorithm [1,2] represents each
database using the words it contains, their frequencies, and a
small number of corpus statistics. Prior research indicates that
the CORI algorithm is one of the most stable and effective
resource selection algorithms (e.g., [7,12]), and it was used as a baseline in
the research reported here. Resource ranking is done using a
Bayesian inference network and an adaptation of the Okapi term
frequency normalization [15]. Details can be found in [1,2].
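For readers unfamiliar with CORI, the sketch below shows the per-term belief computation as it is commonly presented in [1,2]; the constants (b = 0.4 and the 50/150 term-frequency normalization) follow those descriptions, but [1,2] remain the authoritative sources and the data layout here is our own.

import math

def cori_belief(df, cf, cw, avg_cw, num_dbs, b=0.4):
    """CORI belief p(r_k | C_i) for one query term, as commonly described in [1,2].

    df      : number of documents in database C_i containing the term
    cf      : number of databases containing the term
    cw      : number of term occurrences in C_i
    avg_cw  : average cw over all databases
    num_dbs : number of databases being ranked
    """
    T = df / (df + 50 + 150 * cw / avg_cw)                        # tf component
    I = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)  # idf-like component
    return b + (1 - b) * T * I

def cori_score(query_terms, df_i, cw_i, cf, avg_cw, num_dbs):
    """Score database C_i for a query by averaging per-term beliefs."""
    beliefs = [cori_belief(df_i.get(t, 0), cf[t], cw_i, avg_cw, num_dbs)
               for t in query_terms if cf.get(t, 0) > 0]
    return sum(beliefs) / len(beliefs) if beliefs else 0.0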
The last step of distributed information retrieval is merging
ranked lists produced by different search engines. It is usually
treated as a problem of transforming database-specific document
scores into database-independent document scores. The CORI
merge algorithm [1] is a heuristic linear combination of the
database score and the document score. Calvé et al. [4] used
logistic regression and relevance information to learn merging
models. The Semi-Supervised Learning algorithm [13] uses the
documents acquired by query-based sampling as training data
and linear regression to learn merging models.
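To illustrate the general score-normalization idea behind these merging methods (this is not the exact SSL algorithm of [13]), the sketch below fits a per-database linear mapping from database-specific scores to centralized scores, using documents that appear both in the database's result list and in the centralized sample database.

def fit_linear_merge_model(pairs):
    """Least-squares fit of centralized_score ~ a * db_score + b.

    pairs: list of (db_score, centralized_score) for overlap documents,
           i.e. sampled documents that the selected database also returned.
           Assumes at least two pairs with distinct db_score values.
    """
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda db_score: a * db_score + b   # maps new scores into the common scale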
3. DATABASE SIZE ESTIMATION
Database size estimates are desirable because people often ask
about size when presented with a new database, and because
resource selection errors are sometimes due to weaknesses in
normalizing the term frequency statistics obtained from
databases of varying sizes. Database size can be expressed in
different ways, such as the size of the vocabulary, the number of
word occurrences, and the number of documents. In this paper
we define database size to be the number of documents. This
metric can be easily converted to the other metrics if needed.
3.1 The Sample-Resample Method
The capture-recapture method of estimating database size [11]
requires many interactions with a database. Our goal is an
algorithm that requires few interactions, preferably one that can
reuse information that is already acquired for other reasons.
We start by assuming that resource descriptions are created by
query-based sampling, as in [13]. We assume that each resource
description lists the number of documents sampled, the terms
contained in sampled documents, and the number of sampled
documents containing each term.
We also assume that the search engine indicates the number of
documents that match a query. The number of matching
documents is a common part of search engine user interfaces,
even in uncooperative environments. Even if the search engine
only approximates the number of documents that match the
query, that number is an important clue to database size.
The sample-resample method of estimating database size is as
follows. A term from the database’s resource description is
picked randomly and submitted to the database as a single-term
query (resampling); the database returns the number of
documents that match this one-term query (the document
frequency of the term) and the ids of a few top-ranked
documents, which are discarded. Let $C_j$ be the database, and $C_{j,samp}$ be the set of documents sampled from the database when the resource description was created. Let $N_{C_j}$ be the (unknown) size of $C_j$, and $N_{C_{j,samp}}$ be the size of $C_{j,samp}$. Let $q_i$ be the query term selected from the resource description for $C_j$. Let $df_{q_i,C_j}$ be the number of documents in $C_j$ that contain $q_i$ (as returned by the search engine) and $df_{q_i,C_{j,samp}}$ be the number of documents in $C_{j,samp}$ that contain $q_i$.
The event that a document sampled from the database contains term $q_i$ is denoted as A. The event that a document from the database contains $q_i$ is denoted as B. The probabilities of these events can be calculated as shown below.
$P(A) = \frac{df_{q_i, C_{j,samp}}}{N_{C_{j,samp}}}$    (6)

$P(B) = \frac{df_{q_i, C_j}}{N_{C_j}}$    (7)

If we assume that the documents sampled from the database are
a good representation of the whole database, then $P(A) \approx P(B)$,
and $N_{C_j}$ can be estimated as shown in Equation 8.
$\hat{N}_{C_j} = \frac{df_{q_i, C_j} \cdot N_{C_{j,samp}}}{df_{q_i, C_{j,samp}}}$    (8)
Additional estimates of the database size are acquired by
sending additional one-term queries to the database. An estimate
based on the mean of the individual estimates reduces variance.
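The estimation step of Equation 8, averaged over several resample queries, can be sketched as follows. The resample_df function, which asks the live search engine how many documents match a one-term query, is an assumed interface.

import random

def sample_resample_estimate(sampled_docs, resample_df, num_queries=5):
    """Estimate database size via sample-resample (Equation 8).

    sampled_docs      : list of sampled document texts (the resource description sample)
    resample_df(term) : number of matching documents reported by the search engine
    """
    n_samp = len(sampled_docs)
    # document frequency of each term within the sample
    df_samp = {}
    for text in sampled_docs:
        for term in set(text.lower().split()):
            df_samp[term] = df_samp.get(term, 0) + 1

    estimates = []
    for term in random.sample(sorted(df_samp), min(num_queries, len(df_samp))):
        df_full = resample_df(term)                          # df reported by the database
        estimates.append(df_full * n_samp / df_samp[term])   # Equation 8
    return sum(estimates) / len(estimates)                   # mean reduces variance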
3.2 Database Size Evaluation Metrics
Liu and Yu [11] evaluated the accuracy of database size
estimates using percentage error, which we call absolute error
ratio (AER) in this paper. Let N* denote the actual database size
and $\hat{N}$ the estimate. AER is calculated as shown below.
$AER = \frac{|\hat{N} - N^*|}{N^*}$    (9)
When we evaluate a set of estimates for a set of databases, we
calculate the mean absolute error ratio (MAER).
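Both metrics are straightforward to compute; a minimal sketch:

def absolute_error_ratio(estimated, actual):
    """AER (Equation 9): |N_hat - N*| / N*."""
    return abs(estimated - actual) / actual

def mean_absolute_error_ratio(estimates, actuals):
    """MAER: mean AER over a set of databases."""
    return sum(absolute_error_ratio(e, a)
               for e, a in zip(estimates, actuals)) / len(actuals)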
3.3 Database Size Estimation Costs
A fair comparison of algorithms must consider their costs. The
significant cost for the capture-recapture and sample-resample
methods is the number of interactions with the search engine.
Liu and Yu assumed that uncooperative databases would return
ranked lists of up to 1,000 document ids [11]. This assumption is
not true in some environments. For example, AltaVista and
Google initially return only the top 10 or 20 document ids. If we
assume the search engine returns document ids in pages of 20,
then 50 interactions are required to obtain a sample of 1,000 ids.
A slight adjustment to Liu and Yu’s original capture-recapture
method reduces its cost. Only one document id from each
sample is used by the capture-recapture method. If we assume
that the method decides ahead of time which rank to take a
sample from, the number of interactions can be reduced. If the
search engine allows a particular range of the search results to
be selected then only 1 interaction per sample is required. If the
search engine requires that the list be scanned sequentially from
the beginning, in pages containing 20 document ids each, then
25 interactions are required, on average, to obtain one sample.
The primary cost of the new sample-resample algorithm is the
queries used to resample the document frequencies of a few
terms. Each resample query requires one database interaction.
The experiments reported in this paper used 5 resample queries
per database. The sample-resample method also requires access
to the resource descriptions that support resource selection.
4. RESOURCE SELECTION
The goal of resource selection is to select a small set of
resources that contain a lot of relevant documents. If the
distribution of relevant documents across the different databases
were known, databases could be ranked by the number of
relevant documents they contain; such a ranking is called a
relevance based ranking (RBR) [1,7]. Typically, and in the work
reported here, resource ranking algorithms are evaluated by how
closely they approximate a relevance-based ranking.
The number of documents relevant to query q in database Cj is
estimated as:
$\widehat{Rel_q}(j) = \sum_{d_i \in C_j} P(rel \mid d_i) \cdot P(d_i \mid C_j) \cdot N_{C_j}$    (10)
where $N_{C_j}$ is the number of documents in $C_j$; we can substitute the estimated database size $\hat{N}_{C_j}$ for $N_{C_j}$. For the probabilities $P(d_i \mid C_j)$: if we have downloaded all the documents from $C_j$ and built a complete resource description for it, these probabilities are $1/N_{C_j}$. For a sampled resource description, as long as the sample is representative, the number of relevant documents can be estimated as follows:
$\widehat{Rel_q}(j) = \sum_{d_i \in C_{j,samp}} P(rel \mid d_i) \cdot \frac{1}{N_{C_{j,samp}}} \cdot \hat{N}_{C_j}$    (11)
where $N_{C_{j,samp}}$ is the number of documents sampled from $C_j$.
The only item left to estimate is P (rel | di), the probability of
relevance given a specific document. Calculating this probability
is the goal of most probabilistic retrieval models, and is
generally viewed as a difficult problem; we do not solve it in
this research.
Instead, we define as a reference the centralized complete
database, which is the union of all of the individual databases
available in the distributed IR environment. We define
P (rel | di), the probability of relevance given a document, as the
probability of relevance given the document rank when the
centralized complete database is searched by an effective
retrieval method. This probability distribution is modeled by a
step function, which means that for the documents at the top of
the ranked list the probabilities are a positive constant, and for
all other documents they are 0. Although this approximation of
the relevance probability is rather rough, it is similar to the
modeling of relevance by most automatic relevance feedback
methods. Probability of relevance is modeled formally as:
$P(rel \mid d_i) = \begin{cases} C_q & \text{if } Rank\_central(d_i) < ratio \cdot \hat{N}_{all} \\ 0 & \text{otherwise} \end{cases}$    (12)
where $Rank\_central(d_i)$ is the rank of document $d_i$ in the
centralized complete database, and ratio is a threshold. This
threshold indicates how the algorithm focuses attention on
different parts of the centralized complete DB ranking. In our
experiments, the ratio was set to 0.003, which is equivalent to
considering the top 3,000 documents in a database containing
1,000,000 documents.
$\hat{N}_{all}$ is the estimated total number of
documents in the centralized complete database. Cq is a query-
dependent constant. Although no single setting of the ratio
parameter is optimal for every testbed, experiments (not
reported here) show that the ReDDE algorithm is effective
across a wide range of ratio settings (e.g., 0.002 to 0.005).
A centralized complete database is not available. However, a
centralized
sample database is easily available; it contains the
documents obtained by query-based sampling when database
resource descriptions were constructed. The centralized sample
database is a representative subset of the centralized complete
database. Prior research showed that a centralized sample
database is useful when normalizing document scores during
result merging [13]. We use it here for resource ranking.
The query is submitted to the centralized sample database; a
document ranking is returned. Given representative resource
descriptions and database size estimates we can estimate how
documents from the different databases would construct a
ranked list if the centralized complete database existed and were
searched. In particular, the rank is calculated as follows:
$\widehat{Rank\_central}(d_i) = \sum_{d_j:\ Rank\_samp(d_j) < Rank\_samp(d_i)} \frac{\hat{N}_{C(d_j)}}{N_{C(d_j),samp}}$    (13)
Plugging Equations 12 and 13 into Equation 11, the values of $\widehat{Rel_q}(j)$ can be calculated. These values still contain a query-dependent constant $C_q$, which comes from Equation 12. The
useful statistic is the distribution of relevant documents in
different databases. That information is sufficient to rank the
databases. The estimated distribution can be calculated by
normalizing these values from Equation 11, as shown below.
$\widehat{Dist\_Rel_q}(j) = \frac{\widehat{Rel_q}(j)}{\sum_i \widehat{Rel_q}(i)}$    (14)
Equation 14 provides the computable distribution, without any
constants. The databases can now be ranked by the estimated
percentage of relevant documents they contain.
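Putting Equations 11-14 together, the ReDDE scoring loop over the centralized sample database ranking can be sketched as follows. This is an illustrative reading of the equations, not the authors' implementation; db_of, the size estimates n_hat, and the per-database sample counts n_samp are assumed inputs, and the constant $C_q$ is dropped because it cancels in Equation 14.

def redde_scores(ranked_sample_docs, db_of, n_hat, n_samp, ratio=0.003):
    """Sketch of ReDDE resource scoring (Equations 11-14).

    ranked_sample_docs : doc ids from the centralized sample database, in the
                         order an effective retrieval method ranked them
    db_of[d]   : database that document d was sampled from
    n_hat[j]   : estimated size of database j (e.g., sample-resample estimate)
    n_samp[j]  : number of documents sampled from database j
    ratio      : threshold of Equation 12
    """
    n_all_hat = sum(n_hat.values())          # estimated size of the complete DB
    rel = {j: 0.0 for j in n_hat}
    central_rank = 0.0                       # Equation 13, computed cumulatively
    for d in ranked_sample_docs:
        j = db_of[d]
        if central_rank < ratio * n_all_hat:       # step function of Equation 12
            rel[j] += n_hat[j] / n_samp[j]         # Equation 11 contribution (C_q dropped)
        central_rank += n_hat[j] / n_samp[j]       # each sampled doc stands for
                                                   # n_hat/n_samp complete-DB docs
    total = sum(rel.values()) or 1.0
    return {j: rel[j] / total for j in rel}        # Equation 14

# Databases are then ranked by decreasing Dist_Rel value.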
The experiments that test the effectiveness of this method are
described in Section 8.
5. RETRIEVAL PERFORMANCE
Generally the effectiveness of a distributed information retrieval
system is not evaluated by the Precision at Recall points metric
(e.g., “11-point Recall Precision”). Only a subset of the
databases is selected, so it is usually impossible to retrieve all of
the relevant documents. Precision at specified document ranks is
often used, particularly for interactive retrieval where someone
may only look at the first several screens of results.
Suppose the goal of a distributed information retrieval system is
to maximize Precision within the top-ranked 100 documents.
The goal of the resource selection algorithm is to select a small
number of databases that contain a large number of relevant
documents. Improved resource selection usually produces
improved retrieval accuracy [12], but not always. This was also
observed in [3].
If a centralized complete database were accessible, a search
algorithm would return the top 100 documents in the ranked list;
this result set is only a tiny percentage of all the documents. The
ReDDE algorithm evaluates a much larger percentage of the
centralized complete database. For example, if the ratio
(Equation 12) is 0.003 and the total testbed size is 1,000,000
documents, although the goal is to retrieve 100 relevant
documents, the resource selection algorithm essentially attempts
to maximize the number of relevant documents that would rank
in the top 3,000 in the centralized complete database.
Optimizing the number of relevant documents in the top 100 and
top 3,000 retrieved documents is correlated, but not identical.
One could decrease the ratio used by the ReDDE resource
selection algorithm. However, decisions are based on only a
small number of sampled documents; using a small ratio would
cause very few databases to have nonzero estimates. A better
solution is to use two ratios. We call this the modified ReDDE algorithm. Databases that have large enough estimation values with the smaller ratio are sorted by these values. All other databases are sorted by estimation values created from a larger ratio. Thus for every database there are two estimation values (DistRel_r1j, DistRel_r2j), which are calculated by Equation 14 using two different ratios r1 and r2. In our experiments, r1 was empirically set to 0.0005 and r2 was set to 0.003. The procedure can be formalized as follows (a small sketch follows the list):
1. First rank all the databases that have DistRel_r1j >= backoff_Thres.
2. Rank all the other databases by the values DistRel_r2j.
where backoff_Thres is the backoff threshold; it was set to 0.1 in our experiments.
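Under the same assumptions, and reusing the redde_scores sketch above, the two-ratio backoff can be expressed as:

def modified_redde_ranking(ranked_sample_docs, db_of, n_hat, n_samp,
                           r1=0.0005, r2=0.003, backoff_thres=0.1):
    """Sketch of the modified ReDDE ranking (two ratios with backoff)."""
    dist_r1 = redde_scores(ranked_sample_docs, db_of, n_hat, n_samp, ratio=r1)
    dist_r2 = redde_scores(ranked_sample_docs, db_of, n_hat, n_samp, ratio=r2)
    # Databases with a large enough small-ratio estimate are ranked first by it;
    # all remaining databases follow, ranked by the larger-ratio estimate.
    confident = sorted((j for j in dist_r1 if dist_r1[j] >= backoff_thres),
                       key=lambda j: dist_r1[j], reverse=True)
    rest = sorted((j for j in dist_r1 if dist_r1[j] < backoff_thres),
                  key=lambda j: dist_r2[j], reverse=True)
    return confident + rest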
6. EXPERIMENT DATA
The database size estimation and ReDDE resource selection
algorithms were tested on a variety of testbeds (Table 1).
Trec123-100col-bysource: 100 databases created from TREC
CDs 1, 2 and 3. They are organized by source and publication
date [1,13], and are somewhat heterogeneous. The sizes of the
databases are not skewed.
Trec4-kmeans: 100 databases created from TREC 4 data. A k-means clustering algorithm was used to organize the databases by topic [14], so the databases are homogeneous and the word
distributions are very skewed. The sizes of the databases are
moderately skewed.
In order to show the effects of database size on database size
estimation, resource selection and document retrieval
performance in environments containing many "small" DBs and
a few "very large" DBs, an additional set of testbeds was built
from the trec123-100col-bysource collection. Each
new testbed contained 2 databases that are about an order of
magnitude larger than the other databases in the testbed. Different
testbeds were designed to test effectiveness with different types
of "very large" databases: "representative" databases (Trec123-
2ldb-60col), "relevant" databases (Trec123-AP-WSJ-60col) and
"nonrelevant" databases (Trec123-FR-DOE-81col).
Trec123-2ldb-60col (“representative”): The databases in
Trec123-100col-bysource were sorted alphabetically. Every fifth
database, starting with the first, was collapsed into one large
“representative” database called LDB1. Every fifth database, starting with the second, was collapsed into a large database called LDB2. The other 60 databases were left unchanged.
LDB1 and LDB2 are about 20 times larger than the other
databases but have about the same density of relevant
documents as the other databases (Table 2).
Trec123-AP-WSJ-60col (“relevant”): The 24 Associated Press
collections in the trec123-100col-bysource testbed were
collapsed into a single large APall database. The 16 Wall Street
Journal collections were collapsed into a single large WSJall
collection (Table 2). The other 60 collections were unchanged.
The APall and WSJall collections are much larger than the other
databases, and they also have a higher density of documents
relevant to TREC queries than the other 60 collections. Most
relevant documents are contained in these two large databases.
Trec123-FR-DOE-81col (“nonrelevant”): The 13 Federal
Register collections in the trec123-100col-bysource testbed were
collapsed into a single large FRall collection. The 6 Department
of Energy collections were collapsed into a single large DOEall
collection (Table 2). The other 81 collections were unchanged.
The FRall and DOEall collections are much larger than the other databases,
but have a much lower density of relevant documents.
Trec123-10col: This testbed was created for testing the
effectiveness of database size estimation algorithms on large
databases (Table 1). The databases in trec123-100col-bysource
were sorted alphabetically. Every tenth database, starting with
the first, was combined to create the first new database. Every
tenth database, starting with the second, was combined to create
the second new database. And so on. Altogether there are ten
collections.
50 queries were created from the title fields of TREC topics 51-
100 for the trec123 testbed and another 50 queries were created
from the description fields of TREC topics 201-250 for the
trec4-kmeans testbed (Table 3).
7. EXPERIMENT RESULTS:
DATABASE SIZE ESTIMATION
The sample-resample database size estimation algorithm was
evaluated on two testbeds: trec123-100col and trec123-10col.
The first testbed contains 100 small databases and the second
testbed contains 10 large databases (Table 1). The capture-
recapture algorithm was used as a baseline.
The cost of capture-recapture and sample-resample is measured
in the number of remote search engine interactions required to
make an estimate. Both capture-recapture and sample-resample
send queries to a search engine and get some information
returned; for capture-recapture it is a page of 20 document ids;
for sample-resample it is the number of matching documents.
We consider these costs equivalent, i.e., one search engine
interaction. Sample-resample also assumes access to a database
resource description, which capture-recapture does not.
Ordinarily this cost (about 80 queries and 300 document
downloads [1]) would be allocated to resource selection, but to
be completely fair, for this comparison we allocate it to the
sample-resample algorithm. The total cost of the sample-
resample algorithm is thus 385 search engine interactions: 5
sample-resample queries plus 380 interactions to build a
resource description. Both algorithms are allotted 385 search
engine interactions in the experiments reported below.
The cost of a capture-recapture algorithm is affected strongly by
the type of ranked list access supported by the search engine
(Section 3.3). The algorithm is most efficient when the search
engine allows direct access to a specified section of the ranked
list; we call this the “Direct” variant. If the search engine
requires that ranked-list results be accessed sequentially in
blocks of 20 ids, the capture-recapture algorithm would only be
able to obtain about 15 samples, which is too few for an accurate
estimate. In this case we use a variant that makes its choice from
the first block of 20 ids; we call this the “Top” variant.
The basic capture-recapture algorithm [11] considered just one
document id per sample. However, it acquires 20 document ids,
so we examined the effects on accuracy of using just 1 or all 20
of the document ids returned by the search engine.
Each of the capture-recapture variants was allowed to send
about 385 queries to the database; document ids gotten in the
first half of the queries were the first sample; document ids
gotten in the second half of the queries were the second sample.
For the sample-resample experiments, a resource description
was created using query-based sampling in which randomly
selected one-term queries were submitted to the search engine
and the top 4 documents per query were downloaded until 300
documents had been downloaded. This process required about
80 queries. Only 5 resampling queries were used.
The experimental results are summarized in Table 4. Sample-
resample was more accurate than all of the capture-recapture
methods on the trec123-10col testbed, which contains larger
databases (lower values are desired for the MAER metric).
Sample-resample was more accurate than all but the “Direct
All” variant of the capture-recapture methods on the trec123-100col testbed.
Table 1: Summary statistics for three distributed IR testbeds.

                               Num of documents (x 1000)      Size (MB)
Testbed            Size (GB)    Min      Avg      Max       Min   Avg   Max
Trec123-100col       3.2        0.7     10.8     39.7        28    32    42
Trec4-kmeans         2.0        0.3      5.7     82.7         4    20   249
Trec123-10col        3.2       17.6    107.8    263.2       300   320   378
Table 2: Summary statistics for the “very large” databases.

Collection   Num of documents (x 1000)   Size (MB)
LDB1                 231.3                  665
LDB2                 199.7                  667
APall                242.9                  764
WSJall               173.3                  533
FRall                 45.8                  492
DOEall               226.1                  194
Table 3: Query set statistics.

Collections   TREC Topic Set   TREC Topic Field   Average Length (Words)
Trec123           51-100            Title                   3
Trec4            201-250         Description                7
