How many real-user queries were used to test bGlOSS?

For test queries, the authors used a trace of 8,392 real-user queries issued at Stanford University to the INSPEC database from 4/12 to 4/25 in 1993.

What is the advantage of the hGlOSS server?

The hGlOSS server is very small in size and easily replicated, thus eliminating the potential bottleneck of the centralized GlOSS architecture.

How many entries are eliminated as threshold increases?

Adding all the indexes, the number of entries in the INSPEC frequency information kept by bGlOSS decreases very fast as threshold increases: for threshold = 1, for instance, 508,978 entries, or 46.71% of the original number of entries, are eliminated.

What is the weight of the words in the two documents that contain computer?

According to Assumption 2, each of the two documents that contain the word computer will do so with weight 0.45 / 2 = 0.225, each of the 9 documents that contain the word science will do so with weight 0.2 / 9 = 0.022, and so on.

What is the way to organize information in the Internet?

The Prospero File System is another example: Neuman [1992] lets users organize information in the Internet through the definition (and sharing) of customized views of the different objects and services.

What can be done to reduce the goodness of a query?

GlOSS can then use the negated terms to adjust the initial estimates, so that a database containing a negated term many times might see its goodness estimate for the query decreased.

(Open Access) GlOSS: text-source discovery over the Internet (1999) | Luis Gravano

Q: What have the authors contributed in "Gloss: text-source discovery over the internet" ?

The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which the authors call the text-source discovery problem. Their approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. The authors also present hGlOSS, which provides a decentralized version of the system. The authors extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.

GlOSS: Text-Source Discovery over the Internet

LUIS GRAVANO

Columbia University

HÉCTOR GARCÍA-MOLINA

Stanford University

and

ANTHONY TOMASIC

INRIA Rocquencourt

The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.

Categories and Subject Descriptors: H.3 [Information Systems]: Information Storage and Retrieval

General Terms: Performance, Measurement

Additional Key Words and Phrases: Internet search and retrieval, digital libraries, text databases, distributed information retrieval

Authors’ addresses: L. Gravano, Computer Science Department, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027; email: gravano@cs.columbia.edu; H. García-Molina, Computer Science Department, Stanford University; email: hector@cs.stanford.edu; A. Tomasic, INRIA Rocquencourt, France; email: anthony.tomasic@inria.fr.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

ACM Transactions on Database Systems, Vol. 24, No. 2, June 1999, Pages 229–264.

1. INTRODUCTION

The Internet has grown dramatically over the past few years. Document sources are available everywhere, both within the internal networks of organizations and on the Internet. This growth represents an incredible wealth of information. Our goal is to help an end user find documents of interest across potential sources on the Internet.

There are a number of options for searching over a large and distributed collection of documents, each with its own strengths and weaknesses. Solutions fall into two broad categories: single versus distributed search engines. A single search engine builds a full index of the entire collection, by scanning all documents. Some systems (e.g., Web search engines) discard the documents and only retain the index with pointers to the original documents; other systems warehouse the documents themselves, providing users with access to both the index and the documents (e.g., Dialog, Mead Data). The index may be partitioned by topic or subcollection, but is managed by a single search engine.

The second option is to index documents through multiple engines, each run by the organization owning each source of documents. A global search is managed by a metasearcher that interacts with the individual source engines. One alternative for metasearching is to send a user query to all engines and collect the results (e.g., MetaCrawler [Selberg and Etzioni 1995]). The user can then be directed to sites that have matching documents or to particular documents at those sites.

Another option for the multiple source scenario, one we explore in depth in this paper, is to obtain from the engines in advance metadata that can guide queries to sources that have many matching documents. This requires the cooperation of the engines, i.e., they must export metadata describing their collection. When the metasearcher receives a user query, it consults its collected metadata and suggests to the user sources to try. This solution may not be as accurate as submitting the query to all sources, since the suggestions are only based on collection metadata. However, the query overhead is much less, since queries are not executed everywhere. We call the problem of identifying document sources based on exported metadata the text-source discovery problem.

In this paper we focus on the multiple-engine scenario, and study solutions to the text-source discovery problem. We call our family of solutions GlOSS, for Glossary-of-Servers Server. In particular, GlOSS metasearchers use statistical metadata, e.g., how many times each term occurs at each source. As we show, these “summaries” are small relative to the collection and, because they only contain statistics, will be much easier for a source to export. Statistical summaries can be obtained mechanically, and hence are superior to manually produced summaries that are often out of date. Similarly, since they summarize the entire collection, they are better than summaries based on a single field (such as titles). As we will see, GlOSS works best with a large collection of heterogeneous data sources. That is, the subject areas covered by the different data sources are very distinct from each other. In this case, the statistical summaries used by GlOSS strongly distinguish each source from the others.
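The kind of statistical summary a source might export can be illustrated with a minimal Python sketch. The function name `summarize_source`, the whitespace tokenizer, and the exact statistics kept (total term occurrences and per-term document counts) are assumptions made for illustration; this excerpt does not specify the summary format that GlOSS actually uses.

```python
from collections import Counter


def summarize_source(documents):
    """Build a small statistical summary of one text source.

    Hypothetical format: the number of documents, how many times each
    term occurs overall, and how many documents contain each term.
    """
    occurrences = Counter()  # total occurrences of each term at the source
    doc_freq = Counter()     # number of documents containing each term
    for text in documents:
        tokens = text.lower().split()  # naive tokenizer (assumption)
        occurrences.update(tokens)
        doc_freq.update(set(tokens))
    return {
        "num_docs": len(documents),
        "occurrences": occurrences,
        "doc_freq": doc_freq,
    }


summary = summarize_source(["computer science", "computer computer graphics"])
```

The summary holds one counter entry per distinct term rather than one entry per term occurrence in every document, which gives a rough sense of why such statistics are small relative to the collection.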

It is important to note that in this paper we do not compare the single and multiple engine scenarios. First, in many cases one is not given a choice. For example, the documents may be owned by competing organizations that do not wish to export their full collections. On the Web, for instance, growing numbers of documents are only available through search interfaces, and hence unavailable to the crawlers that feed search engines. Second, if we do have a choice, the factors to consider are very diverse: cost and scalability (storage, operations) of maintaining a single index, the frequency at which new documents are indexed, and the accuracy of the results obtained. Instead, we only consider a multiple-engine scenario, and study GlOSS solutions to the text-discovery problem. We compare the “accuracy” of these solutions to what could be obtained by sending a query to all underlying search engines.

Also note that in this paper we do not study how a user submits queries to the individual sources. That is, once GlOSS suggests sources, the user must submit the query there. The user or some translation service must express the query using the particular syntax and operators used by a source. Similarly, the user may wish to combine and rank the results obtained at different sources. These are hard problems that are addressed in other papers [Chang et al. 1996; Gravano et al. 1997; Gravano and García-Molina 1997].

In summary, the contributions of this paper are as follows:

—We present a version of GlOSS (vGlOSS) that works with vector-space search engines [Salton 1989; Salton and McGill 1983]. (These engines treat both the documents and the queries themselves as weight vectors.)

—We describe a text-source discovery service for Boolean engines, bGlOSS. These engines, while not as sophisticated, are still widely used.

—We define metrics for evaluating text-source discovery services.

—We experimentally evaluate vGlOSS and bGlOSS, using real document databases. We note that even though discovery schemes for Internet sources have been proposed and implemented by others, it is rare to find an experimental evaluation like ours that carefully compares the various options.

—We analyze the GlOSS storage requirements, showing that a GlOSS index is significantly smaller than a full conventional index. We also discuss ways to further reduce storage needs.

—We briefly describe how GlOSS services can form a hierarchy. In such a case, services that only index a fraction of the sources can be accessed by a higher level GlOSS service.


We start in Sections 2 and 3 by presenting and evaluating our vGlOSS and bGlOSS services. In Section 4 we discuss storage requirements, hierarchical discovery schemes, and other issues. Finally, in Section 5 we briefly survey related techniques, some of which could work in conjunction with GlOSS.

2. CHOOSING VECTOR-SPACE DATABASES

In this section we present vGlOSS, a text-source discovery service that deals with vector-space databases and queries [Gravano and García-Molina 1995a].

2.1 Overview of the Vector-Space Retrieval Model

Under the vector-space model, documents and queries are conceptually represented as vectors [Salton 1989]. If $m$ distinct words are available for content identification, a document $d$ is represented as a normalized $m$-dimensional vector, $D = \langle w_1, \ldots, w_m \rangle$, where $w_j$ is the “weight” assigned to the $j$-th word $t_j$. If $t_j$ is not present in $d$, then $w_j$ is 0. For example, the document with vector $D = \langle 0.5, 0, 0.3, \ldots \rangle$ contains the first word in the vocabulary (say, by alphabetical order) with weight 0.5, does not contain the second word, and so on.

The weight for a document word indicates how statistically important it is. One common way to compute $D$ is to first obtain an unnormalized vector $D' = \langle w'_1, \ldots, w'_m \rangle$, where each $w'_i$ is the product of a word frequency (tf) factor and an inverse document frequency (idf) factor. The tf factor is equal (or proportional) to the frequency of the $i$-th word within the document. The idf factor corresponds to the content discriminating power of the $i$-th word: a word that appears rarely in documents has a high idf, while a word that occurs in a large number of documents has a low idf. Typically, idf is computed by $\log(n/d_i)$, where $n$ is the total number of documents in the collection, and $d_i$ is the number of documents with the $i$-th word. (If a word appears in every document, its discriminating power is 0. If a word appears in a single document, its discriminating power is as large as possible.) Once $D'$ is computed, the normalized vector $D$ is typically obtained by dividing each $w'_i$ term by $\sqrt{\sum_{i=1}^{m} (w'_i)^2}$.
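The weighting scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tfidf_vectors`, the whitespace tokenizer, the use of raw term counts as the tf factor, and the natural logarithm are choices made here, not details confirmed by the text.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, normalized tf-idf vectors), one vector per document."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n = len(tokenized)
    # d_i: number of documents that contain the i-th word
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # w'_i = tf * idf, with idf = log(n / d_i)
        raw = [tf[t] * math.log(n / doc_freq[t]) for t in vocabulary]
        # Divide each w'_i by the Euclidean length of the raw vector D'
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm > 0 else 0.0 for w in raw])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(
    ["computer science", "computer graphics", "computer science science"]
)
```

In this toy collection the word "computer" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any vector, matching the remark that such a word has no discriminating power.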

Queries in the vector-space model are also represented as normalized vectors over the word space, $Q = \langle q_1, \ldots, q_m \rangle$, where each entry indicates the importance of the word in the search. Often queries are written by a user in natural language. In this case, $q_j$ is typically a function of the number of times word $t_j$ appears in the query string times the idf factor for the word. The similarity between a query $q$ and a document $d$, $sim(q, d)$, is defined as the inner product of the query vector $Q$ and the document vector $D$. That is,

$$sim(q, d) = Q \cdot D = \sum_{j=1}^{m} q_j \cdot w_j$$

Notice that similarity values range between zero and one, inclusive, because $Q$ and $D$ are normalized.
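As a small illustration of the inner-product similarity, the following Python sketch normalizes two weight vectors and computes their similarity. The helper names `normalize` and `sim` and the sample weights are assumptions chosen for the example; the weights are taken to be non-negative, which is what keeps the result between zero and one.

```python
import math


def normalize(vector):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector))
    return [w / norm for w in vector] if norm > 0 else list(vector)


def sim(query_vector, doc_vector):
    """Inner product of a query vector Q and a document vector D."""
    return sum(q * w for q, w in zip(query_vector, doc_vector))


Q = normalize([1.0, 0.0, 1.0])  # hypothetical query weights
D = normalize([0.5, 0.0, 0.3])  # hypothetical document weights
score = sim(Q, D)
```

A vector compared with itself yields a similarity of one (up to floating-point rounding), the upper end of the range.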

Ideally, a user would like to find documents with the highest similarity to some query. It is important to notice that similarity is always relative to some collection. That is, the same document may be given different vectors by two different search engines, due to the different idf factors used. Thus, one engine may judge the document relevant to a query, while the second one may not.

2.2 Evaluating Databases

Given a query, we would like to rank the available vector-space databases according to their “usefulness,” or goodness for the query. In this section we present one possible definition of goodness, with its associated ideal database rank. (The next section explores how vGlOSS tries to rank the databases as closely as possible to this ideal rank.) The goodness of a database depends on the number of documents in the database that are reasonably similar to the given query and on their actual similarity to the query. The best databases are those with many documents that are highly similar to the query in hand. However, a database might also have a high goodness value if it holds a few documents with very high similarity, or many documents with intermediate similarity to the query.

Our goodness definition is based solely on the answers (i.e., the document ranks and their scores) that each database produces when presented with the query in question. This definition does not use the relevance of the documents to the end user who submitted the query. (The effectiveness of information retrieval searching is based on subjective relevance assessments [Salton and McGill 1983].) Using relevance would be appropriate for evaluating the search engines at each database; instead, we are evaluating how well vGlOSS can predict the answers that the databases return. In Section 2.6 we discuss our choice further, and analyze some of the possible alternatives that we could have used.

To define the ideal database rank for a query $q$, we need to determine how good each database $db$ is for $q$. In this section we assume that all databases use the same algorithms to compute weights and similarities. We consider that the only documents in $db$ that are useful for $q$ are those with a similarity to $q$ greater than a user-provided threshold $l$. Documents with lower similarity are unlikely to be useful, and therefore we ignore them. Thus, we define:

$$Goodness(l, q, db) = \sum_{d \in Rank(l, q, db)} sim(q, d) \qquad (1)$$

where $sim(q, d)$ is the similarity between query $q$ and document $d$, and $Rank(l, q, db) = \{d \in db \mid sim(q, d) > l\}$. The ideal rank of databases
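Equation (1) can be sketched in Python as follows. The representation of a database as a list of document vectors, the function names, and the sample data are assumptions made for illustration. Sorting databases by decreasing Goodness is one natural reading of the ideal rank; its full definition lies outside this excerpt.

```python
def sim(query_vector, doc_vector):
    """Inner product of a query vector and a document vector."""
    return sum(q * w for q, w in zip(query_vector, doc_vector))


def rank(l, query_vector, database):
    """Rank(l, q, db): documents in db whose similarity to q exceeds l."""
    return [d for d in database if sim(query_vector, d) > l]


def goodness(l, query_vector, database):
    """Goodness(l, q, db): sum of sim(q, d) over the documents in Rank(l, q, db)."""
    return sum(sim(query_vector, d) for d in rank(l, query_vector, database))


def ideal_rank(l, query_vector, databases):
    """Order database names by decreasing Goodness (assumed reading of the ideal rank)."""
    return sorted(
        databases,
        key=lambda name: goodness(l, query_vector, databases[name]),
        reverse=True,
    )


q = [0.6, 0.8]  # hypothetical normalized query vector
databases = {
    "db1": [[1.0, 0.0], [0.0, 1.0]],  # similarities 0.6 and 0.8
    "db2": [[0.6, 0.8]],              # similarity 1.0
}
order_strict = ideal_rank(0.7, q, databases)  # db2 first: 1.0 versus 0.8
order_loose = ideal_rank(0.5, q, databases)   # db1 first: 1.4 versus 1.0
```

The two calls show how the threshold changes the outcome: with a high threshold, the database holding one highly similar document comes first, while a lower threshold lets the database with several moderately similar documents overtake it.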

