
Relevant Document Distribution Estimation Method for
Resource Selection
Luo Si and Jamie Callan
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
lsi@cs.cmu.edu, callan@cs.cmu.edu
ABSTRACT
Prior research under a variety of conditions has shown the CORI
algorithm to be one of the most effective resource selection
algorithms, but the range of database sizes studied was not large.
This paper shows that the CORI algorithm does not do well in
environments with a mix of "small" and "very large" databases.
A new resource selection algorithm is proposed that uses
information about database sizes as well as database contents.
We also show how to acquire database size estimates in
uncooperative environments as an extension of the query-based
sampling used to acquire resource descriptions. Experiments
demonstrate that the database size estimates are more accurate
for large databases than estimates produced by a competing
method; the new resource ranking algorithm is always at least as
effective as the CORI algorithm; and the new algorithm results
in better document rankings than the CORI algorithm.
Categories & Subject Descriptors:
H.3.3 [Information Search and Retrieval]:
General Terms: Algorithms
Keywords: Resource Selection
1. INTRODUCTION
Distributed information retrieval, also known as federated
search, is ad-hoc search in environments containing multiple,
possibly many, text databases [1]. Distributed information
retrieval includes three sub-problems: i) acquiring information
about the contents of each database (resource representation)
[1,6], ii) ranking the resources and selecting a small number of
them for a given query (resource ranking) [1,3,5,7,8,12], and iii)
merging the results returned from the selected databases into a
single ranked list before presenting it to the end user (result-
merging) [1,4,13]. Early distributed IR research focused on
cooperative environments in which search engines could be
relied upon to provide corpus vocabulary, corpus statistics, and
search engine characteristics when requested to do so. Recent
research also addresses uncooperative environments in which
search engines only run queries and return documents.
Most resource ranking algorithms rank by how well database
contents appear to match a query, so resource descriptions have
tended to emphasize content [1]. Prior research suggests that it is
important to compensate for database size when assessing
similarity [12,7], but it has been unclear how to estimate
database sizes accurately in uncooperative environments.
This paper presents ReDDE, a new resource-ranking algorithm
that explicitly tries to estimate the distribution of relevant
documents across the set of available databases. The ReDDE
algorithm considers both content similarity and database size
when making its estimates. A new algorithm for estimating
database sizes in uncooperative environments is also presented.
Previous research showed that improved resource selection is
correlated with improved document rankings; this paper shows
that better resource selection does not always produce better
document rankings. An analysis of this contradiction leads to a
more robust version of the ReDDE algorithm.
The next section discusses prior research. Section 3 describes a
new method of estimating database sizes. Section 4 explains the
new ReDDE resource selection algorithm. Section 5 discusses
the subtle relationship between resource selection accuracy and
document retrieval accuracy, and proposes the modified version
of the ReDDE algorithm. Section 6 describes experimental
methodology. Sections 7, 8 and 9 present experiment results for
database size estimation, resource selection and document
retrieval. Section 10 concludes.
2. PREVIOUS WORK
Our research interest is uncooperative environments, such as the
Internet, in which resource providers provide basic services
(e.g., running queries, retrieving documents), but don’t provide
detailed information about their resources. In uncooperative
environments perhaps the best method of acquiring resource
descriptions is query-based sampling [1], in which a resource
description is constructed by sampling database contents via the
normal process of running queries and retrieving documents.
Query-based sampling has been shown to acquire accurate
unigram resource descriptions using a small number of queries
(e.g., 75) and a small number of documents (e.g., 300).
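To make the sampling procedure concrete, the following is a minimal Python sketch of a query-based sampling loop. It is an illustration under stated assumptions, not the authors' implementation: run_query, fetch_document, and seed_terms are placeholders for whatever interface a particular search engine exposes, and the 4-documents-per-query and 300-document settings follow the experimental description in Section 7.

import random

def query_based_sampling(run_query, fetch_document, seed_terms,
                         docs_per_query=4, target_docs=300):
    """Minimal sketch of query-based sampling [1] (not the authors' exact code).

    run_query(term)        -> list of document ids   (assumed interface)
    fetch_document(doc_id) -> document text          (assumed interface)
    """
    sampled_docs = {}                 # doc_id -> text
    vocabulary = set(seed_terms)      # terms available for future sample queries
    while len(sampled_docs) < target_docs and vocabulary:
        term = random.choice(sorted(vocabulary))          # pick a random query term
        for doc_id in run_query(term)[:docs_per_query]:   # keep a few top-ranked docs
            if doc_id not in sampled_docs:
                text = fetch_document(doc_id)
                sampled_docs[doc_id] = text
                vocabulary.update(text.lower().split())   # grow the term pool
    return sampled_docs

The sampled documents and their term statistics form the resource description used by the selection algorithms discussed below.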
Database size is an important component of a resource
description, but there has been limited prior research on how to
estimate it in uncooperative environments. Liu and Yu proposed
using a basic capture-recapture methodology to estimate the size
of a database [11]. Capture-recapture assumes that there are two
(or more) independent samples from a population. Let N be the
population size, A the event that an item is included in the first
sample, which is of size n1, B the event that an item is included
in the second sample, which is of size n2, and m2 the number of
items that appeared in both samples. The probabilities of events
A and B, and the relationship between them, are shown below.

$P(A) = \frac{n_1}{N}$    (1)

$P(B) = \frac{n_2}{N}$    (2)

$P(A \mid B) = \frac{m_2}{n_2}$    (3)
The two samples are assumed to be independent, so:
$P(A \mid B) = P(A)$    (4)
Thus, the population size is estimated as:
$\hat{N} = \frac{n_1 n_2}{m_2}$    (5)
Liu and Yu used it to estimate database sizes by randomly
sending queries to a database and sampling from the document
ids that were returned. They reported a rather accurate ability to
estimate database sizes [11], but the experimental methodology
might be considered unrealistic. For example, when estimating
the size of a database containing 300,000 documents, the
sampling procedure used 2,000 queries and expected to receive a
ranked list of 1,000 document ids per query, for a total of
2,000,000 (non-unique) document ids examined. This cost might
be considered excessive. Our goal is a procedure that estimates
database sizes at a far lower cost.
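For reference, Equation 5 can be computed directly from two samples of document ids. The sketch below is a minimal illustration with hypothetical sample contents, not Liu and Yu's procedure.

def capture_recapture_estimate(sample1, sample2):
    """Capture-recapture estimate (Equation 5): N_hat = n1 * n2 / m2."""
    n1, n2 = len(sample1), len(sample2)
    m2 = len(set(sample1) & set(sample2))    # ids seen in both samples
    if m2 == 0:
        return float('inf')                  # no overlap: size cannot be bounded
    return n1 * n2 / m2

# Hypothetical example: two samples of document ids drawn from the same database.
print(capture_recapture_estimate([1, 4, 7, 9, 12], [4, 9, 15, 20, 31]))  # 12.5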
There is a large body of prior research on resource selection
(e.g., [1,3,5,7,8,12]). Space limitations preclude discussing it all,
so we restrict our attention to a few that have been studied often
or recently in prior research.
gGlOSS [6] is based on the vector space model. It represents a
database by a vector that is composed of the document
frequencies of different words in the databases. Ipeirotis’s
Hierarchical Database Sampling and Selection algorithm [8]
used information from the search engine to get document
frequencies for some words and estimated the document
frequencies of other words by using Mandelbrot’s law. The
document frequency information was used as a part of the
database description to build a hierarchical structure for
databases. They showed that the hierarchical structure with extra
information had a better retrieval performance than the CORI
resource selection algorithm, but did not explicitly evaluate
resource selection performance. D’Souza and Thom’s n-term
indexing method [5] represents each resource by a set of
document surrogates, with one surrogate per original document.
They assume that they can access the content of each document
in every database, which is not the case in our work.
The CORI resource selection algorithm [1,2] represents each
database using the words it contains, their frequencies, and a
small number of corpus statistics. Prior research indicates that
the CORI algorithm is one of the most stable and effective
resource selection algorithms (e.g., [7,12]), and it was used as a baseline in
the research reported here. Resource ranking is done using a
Bayesian inference network and an adaptation of the Okapi term
frequency normalization [15]. Details can be found in [1,2].
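For readers unfamiliar with CORI, the sketch below shows the per-term belief computation as it is commonly presented in [1,2]; the constants (b = 0.4 and the 50/150 term-frequency normalization) follow those descriptions, but [1,2] remain the authoritative sources and the data layout here is our own.

import math

def cori_belief(df, cf, cw, avg_cw, num_dbs, b=0.4):
    """CORI belief p(r_k | C_i) for one query term, as commonly described in [1,2].

    df      : number of documents in database C_i containing the term
    cf      : number of databases containing the term
    cw      : number of term occurrences in C_i
    avg_cw  : average cw over all databases
    num_dbs : number of databases being ranked
    """
    T = df / (df + 50 + 150 * cw / avg_cw)                        # tf component
    I = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)  # idf-like component
    return b + (1 - b) * T * I

def cori_score(query_terms, df_i, cw_i, cf, avg_cw, num_dbs):
    """Score database C_i for a query by averaging per-term beliefs."""
    beliefs = [cori_belief(df_i.get(t, 0), cf[t], cw_i, avg_cw, num_dbs)
               for t in query_terms if cf.get(t, 0) > 0]
    return sum(beliefs) / len(beliefs) if beliefs else 0.0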
The last step of distributed information retrieval is merging
ranked lists produced by different search engines. It is usually
treated as a problem of transforming database-specific document
scores into database-independent document scores. The CORI
merge algorithm [1] is a heuristic linear combination of the
database score and the document score. Calvé et al. [4] used
logistic regression and relevance information to learn merging
models. The Semi-Supervised Learning algorithm [13] uses the
documents acquired by query-based sampling as training data
and linear regression to learn merging models.
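To illustrate the general score-normalization idea behind these merging methods (this is not the exact SSL algorithm of [13]), the sketch below fits a per-database linear mapping from database-specific scores to centralized scores, using documents that appear both in the database's result list and in the centralized sample database.

def fit_linear_merge_model(pairs):
    """Least-squares fit of centralized_score ~ a * db_score + b.

    pairs: list of (db_score, centralized_score) for overlap documents,
           i.e. sampled documents that the selected database also returned.
           Assumes at least two pairs with distinct db_score values.
    """
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda db_score: a * db_score + b   # maps new scores into the common scale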
3. DATABASE SIZE ESTIMATION
Database size estimates are desirable because people often ask
about size when presented with a new database, and because
resource selection errors are sometimes due to weaknesses in
normalizing the term frequency statistics obtained from
databases of varying sizes. Database size can be expressed in
different ways, such as the size of the vocabulary, the number of
word occurrences, and the number of documents. In this paper
we define database size to be the number of documents. This
metric can be easily converted to the other metrics if needed.
3.1 The Sample-Resample Method
The capture-recapture method of estimating database size [11]
requires many interactions with a database. Our goal is an
algorithm that requires few interactions, preferably one that can
reuse information that is already acquired for other reasons.
We start by assuming that resource descriptions are created by
query-based sampling, as in [13]. We assume that each resource
description lists the number of documents sampled, the terms
contained in sampled documents, and the number of sampled
documents containing each term.
We also assume that the search engine indicates the number of
documents that match a query. The number of matching
documents is a common part of search engine user interfaces,
even in uncooperative environments. Even if the search engine
only approximates the number of documents that match the
query, that number is an important clue to database size.
The sample-resample method of estimating database size is as
follows. A term from the database’s resource description is
picked randomly and submitted to the database as a single-term
query (resampling); the database returns the number of
documents that match this one-term query (the document
frequency of the term) and the ids of a few top-ranked
documents, which are discarded. Let $C_j$ be the database, and $C_{j,samp}$ be the set of documents sampled from the database when the resource description was created. Let $N_{C_j}$ be the (unknown) size of $C_j$, and $N_{C_{j,samp}}$ be the size of $C_{j,samp}$. Let $q_i$ be the query term selected from the resource description for $C_j$. Let $df_{q_i,C_j}$ be the number of documents in $C_j$ that contain $q_i$ (as returned by the search engine) and $df_{q_i,C_{j,samp}}$ be the number of documents in $C_{j,samp}$ that contain $q_i$.
The event that a document sampled from the database contains term $q_i$ is denoted as A. The event that a document from the database contains $q_i$ is denoted as B. The probabilities of these events can be calculated as shown below.
$P(A) = \frac{df_{q_i, C_{j,samp}}}{N_{C_{j,samp}}}$    (6)

$P(B) = \frac{df_{q_i, C_j}}{N_{C_j}}$    (7)

If we assume that the documents sampled from the database are
a good representation of the whole database, then $P(A) \approx P(B)$,
and $N_{C_j}$ can be estimated as shown in Equation 8.
$\hat{N}_{C_j} = \frac{df_{q_i, C_j} \cdot N_{C_{j,samp}}}{df_{q_i, C_{j,samp}}}$    (8)
Additional estimates of the database size are acquired by
sending additional one-term queries to the database. An estimate
based on the mean of the individual estimates reduces variance.
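The estimation step of Equation 8, averaged over several resample queries, can be sketched as follows. The resample_df function, which asks the live search engine how many documents match a one-term query, is an assumed interface.

import random

def sample_resample_estimate(sampled_docs, resample_df, num_queries=5):
    """Estimate database size via sample-resample (Equation 8).

    sampled_docs      : list of sampled document texts (the resource description sample)
    resample_df(term) : number of matching documents reported by the search engine
    """
    n_samp = len(sampled_docs)
    # document frequency of each term within the sample
    df_samp = {}
    for text in sampled_docs:
        for term in set(text.lower().split()):
            df_samp[term] = df_samp.get(term, 0) + 1

    estimates = []
    for term in random.sample(sorted(df_samp), min(num_queries, len(df_samp))):
        df_full = resample_df(term)                          # df reported by the database
        estimates.append(df_full * n_samp / df_samp[term])   # Equation 8
    return sum(estimates) / len(estimates)                   # mean reduces variance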
3.2 Database Size Evaluation Metrics
Liu and Yu [11] evaluated the accuracy of database size
estimates using percentage error, which we call absolute error
ratio (AER) in this paper. Let N* denote the actual database size
and $\hat{N}$ the estimate. AER is calculated as shown below.
$AER = \frac{|\hat{N} - N^*|}{N^*}$    (9)
When we evaluate a set of estimates for a set of databases, we
calculate the mean absolute error ratio (MAER).
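Both metrics are straightforward to compute; a minimal sketch:

def absolute_error_ratio(estimated, actual):
    """AER (Equation 9): |N_hat - N*| / N*."""
    return abs(estimated - actual) / actual

def mean_absolute_error_ratio(estimates, actuals):
    """MAER: mean AER over a set of databases."""
    return sum(absolute_error_ratio(e, a)
               for e, a in zip(estimates, actuals)) / len(actuals)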
3.3 Database Size Estimation Costs
A fair comparison of algorithms must consider their costs. The
significant cost for the capture-recapture and sample-resample
methods is the number of interactions with the search engine.
Liu and Yu assumed that uncooperative databases would return
ranked lists of up to 1,000 document ids [11]. This assumption is
not true in some environments. For example, AltaVista and
Google initially return only the top 10 or 20 document ids. If we
assume the search engine returns document ids in pages of 20,
then 50 interactions are required to obtain a sample of 1,000 ids.
A slight adjustment to Liu and Yu’s original capture-recapture
method reduces its cost. Only one document id from each
sample is used by the capture-recapture method. If we assume
that the method decides ahead of time which rank to take a
sample from, the number of interactions can be reduced. If the
search engine allows a particular range of the search results to
be selected then only 1 interaction per sample is required. If the
search engine requires that the list be scanned sequentially from
the beginning, in pages containing 20 document ids each, then
25 interactions are required, on average, to obtain one sample.
The primary cost of the new sample-resample algorithm is the
queries used to resample the document frequencies of a few
terms. Each resample query requires one database interaction.
The experiments reported in this paper used 5 resample queries
per database. The sample-resample method also requires access
to the resource descriptions that support resource selection.
4. RESOURCE SELECTION
The goal of resource selection is to select a small set of
resources that contain a lot of relevant documents. If the
distribution of relevant documents across the different databases
were known, databases could be ranked by the number of
relevant documents they contain; such a ranking is called a
relevance based ranking (RBR) [1,7]. Typically, and in the work
reported here, resource ranking algorithms are evaluated by how
closely they approximate a relevance-based ranking.
The number of documents relevant to query q in database Cj is
estimated as:
$\widehat{Rel_q}(j) = \sum_{d_i \in C_j} P(rel \mid d_i) \cdot P(d_i \mid C_j) \cdot N_{C_j}$    (10)
where $N_{C_j}$ is the number of documents in $C_j$; we can substitute the estimated database size $\hat{N}_{C_j}$ for $N_{C_j}$. For the probabilities $P(d_i \mid C_j)$: if we have downloaded all the documents from $C_j$ and built a complete resource description for it, these probabilities are $1/N_{C_j}$. For a sampled resource description, as long as the sample is representative, the number of relevant documents can be estimated as follows:
$\widehat{Rel_q}(j) = \sum_{d_i \in C_{j,samp}} P(rel \mid d_i) \cdot \frac{1}{N_{C_{j,samp}}} \cdot \hat{N}_{C_j}$    (11)
where $N_{C_{j,samp}}$ is the number of documents sampled from $C_j$.
The only item left to estimate is P (rel | di), the probability of
relevance given a specific document. Calculating this probability
is the goal of most probabilistic retrieval models, and is
generally viewed as a difficult problem; we do not solve it in
this research.
Instead, we define as a reference the centralized complete
database, which is the union of all of the individual databases
available in the distributed IR environment. We define
P (rel | di), the probability of relevance given a document, as the
probability of relevance given the document rank when the
centralized complete database is searched by an effective
retrieval method. This probability distribution is modeled by a
step function, which means that for the documents at the top of
the ranked list the probabilities are a positive constant, and for
all other documents they are 0. Although this approximation of
the relevance probability is rather rough, it is similar to the
modeling of relevance by most automatic relevance feedback
methods. Probability of relevance is modeled formally as:
$P(rel \mid d_i) = \begin{cases} C_q & \text{if } Rank\_central(d_i) < ratio \cdot \hat{N}_{all} \\ 0 & \text{otherwise} \end{cases}$    (12)
where $Rank\_central(d_i)$ is the rank of document $d_i$ in the
centralized complete database, and ratio is a threshold. This
threshold indicates how the algorithm focuses attention on
different parts of the centralized complete DB ranking. In our
experiments, the ratio was set to 0.003, which is equivalent to
considering the top 3,000 documents in a database containing
1,000,000 documents.
$\hat{N}_{all}$ is the estimated total number of
documents in the centralized complete database. Cq is a query-
dependent constant. Although no single setting of the ratio
parameter is optimal for every testbed, experiments (not
reported here) show that the ReDDE algorithm is effective
across a wide range of ratio settings (e.g., 0.002 to 0.005).
A centralized complete database is not available. However, a
centralized
sample database is easily available; it contains the
documents obtained by query-based sampling when database
resource descriptions were constructed. The centralized sample
database is a representative subset of the centralized complete
database. Prior research showed that a centralized sample
database is useful when normalizing document scores during
result merging [13]. We use it here for resource ranking.
The query is submitted to the centralized sample database; a
document ranking is returned. Given representative resource
descriptions and database size estimates we can estimate how
documents from the different databases would construct a
ranked list if the centralized complete database existed and were
searched. In particular, the rank is calculated as follows:
$\widehat{Rank\_central}(d_i) = \sum_{d_j:\ Rank\_samp(d_j) < Rank\_samp(d_i)} \frac{\hat{N}_{C(d_j)}}{N_{C(d_j),samp}}$    (13)
Plugging Equations 12 and 13 into Equation 11, the values of $\widehat{Rel_q}(j)$ can be calculated. These values still contain a query-dependent constant $C_q$, which comes from Equation 12. The
useful statistic is the distribution of relevant documents in
different databases. That information is sufficient to rank the
databases. The estimated distribution can be calculated by
normalizing these values from Equation 11, as shown below.
$\widehat{Dist\_Rel_q}(j) = \frac{\widehat{Rel_q}(j)}{\sum_i \widehat{Rel_q}(i)}$    (14)
Equation 14 provides the computable distribution, without any
constants. The databases can now be ranked by the estimated
percentage of relevant documents they contain.
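Putting Equations 11-14 together, the ReDDE scoring loop over the centralized sample database ranking can be sketched as follows. This is an illustrative reading of the equations, not the authors' implementation; db_of, the size estimates n_hat, and the per-database sample counts n_samp are assumed inputs, and the constant $C_q$ is dropped because it cancels in Equation 14.

def redde_scores(ranked_sample_docs, db_of, n_hat, n_samp, ratio=0.003):
    """Sketch of ReDDE resource scoring (Equations 11-14).

    ranked_sample_docs : doc ids from the centralized sample database, in the
                         order an effective retrieval method ranked them
    db_of[d]   : database that document d was sampled from
    n_hat[j]   : estimated size of database j (e.g., sample-resample estimate)
    n_samp[j]  : number of documents sampled from database j
    ratio      : threshold of Equation 12
    """
    n_all_hat = sum(n_hat.values())          # estimated size of the complete DB
    rel = {j: 0.0 for j in n_hat}
    central_rank = 0.0                       # Equation 13, computed cumulatively
    for d in ranked_sample_docs:
        j = db_of[d]
        if central_rank < ratio * n_all_hat:       # step function of Equation 12
            rel[j] += n_hat[j] / n_samp[j]         # Equation 11 contribution (C_q dropped)
        central_rank += n_hat[j] / n_samp[j]       # each sampled doc stands for
                                                   # n_hat/n_samp complete-DB docs
    total = sum(rel.values()) or 1.0
    return {j: rel[j] / total for j in rel}        # Equation 14

# Databases are then ranked by decreasing Dist_Rel value.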
The experiments that test the effectiveness of this method are
described in Section 8.
5. RETRIEVAL PERFORMANCE
Generally the effectiveness of a distributed information retrieval
system is not evaluated by the Precision at Recall points metric
(e.g., “11-point Recall Precision”). Only a subset of the
databases is selected, so it is usually impossible to retrieve all of
the relevant documents. Precision at specified document ranks is
often used, particularly for interactive retrieval where someone
may only look at the first several screens of results.
Suppose the goal of a distributed information retrieval system is
to maximize Precision within the top-ranked 100 documents.
The goal of the resource selection algorithm is to select a small
number of databases that contain a large number of relevant
documents. Improved resource selection usually produces
improved retrieval accuracy [12], but not always. This was also
observed in [3].
If a centralized complete database were accessible, a search
algorithm would return the top 100 documents in the ranked list;
this result set is only a tiny percentage of all the documents. The
ReDDE algorithm evaluates a much larger percentage of the
centralized complete database. For example, if the ratio
(Equation 12) is 0.003 and the total testbed size is 1,000,000
documents, although the goal is to retrieve 100 relevant
documents, the resource selection algorithm essentially attempts
to maximize the number of relevant documents that would rank
in the top 3,000 in the centralized complete database.
Optimizing the number of relevant documents in the top 100 and
top 3,000 retrieved documents is correlated, but not identical.
One could decrease the ratio used by the ReDDE resource
selection algorithm. However, decisions are based on only a
small number of sampled documents; using a small ratio would
cause very few databases to have nonzero estimates. A better
solution is to use two ratios. We call this the modified ReDDE algorithm. Databases that have large enough estimation values with the smaller ratio are sorted by these values. All other databases are sorted by estimation values created from a larger ratio. Thus for every database there are two estimation values (DistRel_r1j, DistRel_r2j), which are calculated by Equation 14 using two different ratios r1 and r2. In our experiments, r1 was empirically set to 0.0005 and r2 was set to 0.003. The procedure can be formalized as follows (a small sketch follows the list):
1. First rank all the databases that have DistRel_r1j >= backoff_Thres.
2. Rank all the other databases by the values DistRel_r2j.
where backoff_Thres is the backoff threshold; it was set to 0.1 in our experiments.
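Under the same assumptions, and reusing the redde_scores sketch above, the two-ratio backoff can be expressed as:

def modified_redde_ranking(ranked_sample_docs, db_of, n_hat, n_samp,
                           r1=0.0005, r2=0.003, backoff_thres=0.1):
    """Sketch of the modified ReDDE ranking (two ratios with backoff)."""
    dist_r1 = redde_scores(ranked_sample_docs, db_of, n_hat, n_samp, ratio=r1)
    dist_r2 = redde_scores(ranked_sample_docs, db_of, n_hat, n_samp, ratio=r2)
    # Databases with a large enough small-ratio estimate are ranked first by it;
    # all remaining databases follow, ranked by the larger-ratio estimate.
    confident = sorted((j for j in dist_r1 if dist_r1[j] >= backoff_thres),
                       key=lambda j: dist_r1[j], reverse=True)
    rest = sorted((j for j in dist_r1 if dist_r1[j] < backoff_thres),
                  key=lambda j: dist_r2[j], reverse=True)
    return confident + rest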
6. EXPERIMENT DATA
The database size estimation and ReDDE resource selection
algorithms were tested on a variety of testbeds (Table 1).
Trec123-100col-bysource: 100 databases created from TREC
CDs 1, 2 and 3. They are organized by source and publication
date [1,13], and are somewhat heterogeneous. The sizes of the
databases are not skewed.
Trec4-kmeans: 100 databases created from TREC 4 data. A k-means clustering algorithm was used to organize the databases by topic [14], so the databases are homogeneous and the word
distributions are very skewed. The sizes of the databases are
moderately skewed.
In order to show the effects of database size on database size
estimation, resource selection and document retrieval
performance in environments containing many "small" DBs and
a few "very large" DBs, an additional set of testbeds was built
from the trec123-100col-bysource collection. Each
new testbed contained 2 databases that are about an order of
magnitude larger than the other databases in the testbed. Different
testbeds were designed to test effectiveness with different types
of "very large" databases: "representative" databases (Trec123-
2ldb-60col), "relevant" databases (Trec123-AP-WSJ-60col) and
"nonrelevant" databases (Trec123-FR-DOE-81col).
Trec123-2ldb-60col (“representative”): The databases in
Trec123-100col-bysource were sorted alphabetically. Every fifth
database, starting with the first, was collapsed into one large
“representative” database called LDB1. Every fifth database, starting with the second, was collapsed into a large database called LDB2. The other 60 databases were left unchanged.
LDB1 and LDB2 are about 20 times larger than the other
databases but have about the same density of relevant
documents as the other databases (Table 2).
Trec123-AP-WSJ-60col (“relevant”): The 24 Associated Press
collections in the trec123-100col-bysource testbed were
collapsed into a single large APall database. The 16 Wall Street
Journal collections were collapsed into a single large WSJall
collection (Table 2). The other 60 collections were unchanged.
The APall and WSJall collections are much larger than the other
databases, and they also have a higher density of documents
relevant to TREC queries than the other 60 collections. Most
relevant documents are contained in these two large databases.
Trec123-FR-DOE-81col (“nonrelevant”): The 13 Federal
Register collections in the trec123-100col-bysource testbed were
collapsed into a single large FRall collection. The 6 Department
of Energy collections were collapsed into a single large DOEall
collection (Table 2). The other 81 collections were unchanged.
The FRall and DOEall collections are much larger than the other databases,
but have a much lower density of relevant documents.
Trec123-10col: This testbed was created for testing the
effectiveness of database size estimation algorithms on large
databases (Table 1). The databases in trec123-100col-bysource
were sorted alphabetically. Every tenth database, starting with
the first, was combined to create the first new database. Every
tenth database, starting with the second, was combined to create
the second new database. And so on. Altogether there are ten
collections.
50 queries were created from the title fields of TREC topics 51-
100 for the trec123 testbed and another 50 queries were created
from the description fields of TREC topics 201-250 for the
trec4-kmeans testbed (Table 3).
7. EXPERIMENT RESULTS:
DATABASE SIZE ESTIMATION
The sample-resample database size estimation algorithm was
evaluated on two testbeds: trec123-100col and trec123-10col.
The first testbed contains 100 small databases and the second
testbed contains 10 large databases (Table 1). The capture-
recapture algorithm was used as a baseline.
The cost of capture-recapture and sample-resample is measured
in the number of remote search engine interactions required to
make an estimate. Both capture-recapture and sample-resample
send queries to a search engine and get some information
returned; for capture-recapture it is a page of 20 document ids;
for sample-resample it is the number of matching documents.
We consider these costs equivalent, i.e., one search engine
interaction. Sample-resample also assumes access to a database
resource description, which capture-recapture does not.
Ordinarily this cost (about 80 queries and 300 document
downloads [1]) would be allocated to resource selection, but to
be completely fair, for this comparison we allocate it to the
sample-resample algorithm. The total cost of the sample-
resample algorithm is thus 385 search engine interactions: 5
sample-resample queries plus 380 interactions to build a
resource description. Both algorithms are allotted 385 search
engine interactions in the experiments reported below.
The cost of a capture-recapture algorithm is affected strongly by
the type of ranked list access supported by the search engine
(Section 3.3). The algorithm is most efficient when the search
engine allows direct access to a specified section of the ranked
list; we call this the “Direct” variant. If the search engine
requires that ranked-list results be accessed sequentially in
blocks of 20 ids, the capture-recapture algorithm would only be
able to obtain about 15 samples, which is too few for an accurate
estimate. In this case we use a variant that makes its choice from
the first block of 20 ids; we call this the “Top” variant.
The basic capture-recapture algorithm [11] considered just one
document id per sample. However, it acquires 20 document ids,
so we examined the effects on accuracy of using just 1 or all 20
of the document ids returned by the search engine.
Each of the capture-recapture variants was allowed to send
about 385 queries to the database; document ids gotten in the
first half of the queries were the first sample; document ids
gotten in the second half of the queries were the second sample.
For the sample-resample experiments, a resource description
was created using query-based sampling in which randomly
selected one-term queries were submitted to the search engine
and the top 4 documents per query were downloaded until 300
documents had been downloaded. This process required about
80 queries. Only 5 resampling queries were used.
The experimental results are summarized in Table 4. Sample-
resample was more accurate than all of the capture-recapture
methods on the trec123-10col testbed, which contains larger
databases (lower values are desired for the MAER metric).
Sample-resample was more accurate than all but the “Direct
All” variant of the capture-recapture methods on the trec123-100col testbed.
Table 1: Summary statistics for three distributed IR testbeds.

                               Num of documents (x 1000)      Size (MB)
Testbed            Size (GB)    Min      Avg      Max       Min   Avg   Max
Trec123-100col       3.2        0.7     10.8     39.7        28    32    42
Trec4-kmeans         2.0        0.3      5.7     82.7         4    20   249
Trec123-10col        3.2       17.6    107.8    263.2       300   320   378
Table 2: Summary statistics for the “very large” databases.

Collection   Num of documents (x 1000)   Size (MB)
LDB1                 231.3                  665
LDB2                 199.7                  667
APall                242.9                  764
WSJall               173.3                  533
FRall                 45.8                  492
DOEall               226.1                  194
Table 3: Query set statistics.

Collections   TREC Topic Set   TREC Topic Field   Average Length (Words)
Trec123           51-100            Title                   3
Trec4            201-250         Description                7
