GlOSS: Text-Source Discovery over the
Internet
LUIS GRAVANO
Columbia University
HÉCTOR GARCÍA-MOLINA
Stanford University
and
ANTHONY TOMASIC
INRIA Rocquencourt
The dramatic growth of the Internet has created a new problem for users: location of the
relevant sources of documents. This article presents a framework for (and experimentally
analyzes a solution to) this problem, which we call the text-source discovery problem. Our
approach consists of two phases. First, each text source exports its contents to a centralized
service. Second, users present queries to the service, which returns an ordered list of
promising text sources. This article describes GlOSS, Glossary of Servers Server, with two
versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which
provides a vector-space retrieval model. We also present hGlOSS, which provides a decentral-
ized version of the system. We extensively describe the methodology for measuring the
retrieval effectiveness of these systems and provide experimental evidence, based on actual
data, that all three systems are highly effective in determining promising text sources for a
given query.
Categories and Subject Descriptors: H.3 [Information Systems]: Information Storage and
Retrieval
General Terms: Performance, Measurement
Additional Key Words and Phrases: Internet search and retrieval, digital libraries, text
databases, distributed information retrieval
Authors’ addresses: L. Gravano, Computer Science Department, Columbia University, 1214
Amsterdam Avenue, New York, NY 10027; email: gravano@cs.columbia.edu; H. García-
Molina, Computer Science Department, Stanford University; email: hector@cs.stanford.edu; A.
Tomasic, INRIA Rocquencourt, France; email: anthony.tomasic@inria.fr.
Permission to make digital/hard copy of part or all of this work for personal or classroom use
is granted without fee provided that the copies are not made or distributed for profit or
commercial advantage, the copyright notice, the title of the publication, and its date appear,
and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires prior specific permission
and/or a fee.
© 1999 ACM 0362-5915/99/0600–0229 $5.00
ACM Transactions on Database Systems, Vol. 24, No. 2, June 1999, Pages 229–264.

1. INTRODUCTION
The Internet has grown dramatically over the past few years. Document
sources are available everywhere, both within the internal networks of
organizations and on the Internet. This growth represents an incredible
wealth of information. Our goal is to help an end user find documents of
interest across potential sources on the Internet.
There are a number of options for searching over a large and distributed
collection of documents, each with its own strengths and weaknesses.
Solutions fall into two broad categories: single versus distributed search
engines. A single search engine builds a full index of the entire collection,
by scanning all documents. Some systems (e.g., Web search engines)
discard the documents and only retain the index with pointers to the
original documents; other systems warehouse the documents themselves,
providing users with access to both the index and the documents (e.g.,
Dialog, Mead Data). The index may be partitioned by topic or subcollection,
but is managed by a single search engine.
The second option is to index documents through multiple engines, each
run by the organization owning each source of documents. A global search
is managed by a metasearcher that interacts with the individual source
engines. One alternative for metasearching is to send a user query to all
engines and collect the results (e.g., MetaCrawler [Selberg and Etzioni
1995]). The user can then be directed to sites that have matching docu-
ments or to particular documents at those sites.
Another option for the multiple source scenario, one we explore in depth
in this paper, is to obtain from the engines in advance metadata that can
guide queries to sources that have many matching documents. This re-
quires the cooperation of the engines, i.e., they must export metadata
describing their collection. When the metasearcher receives a user query, it
consults its collected metadata and suggests to the user sources to try. This
solution may not be as accurate as submitting the query to all sources,
since the suggestions are only based on collection metadata. However, the
query overhead is much less, since queries are not executed everywhere.
We call the problem of identifying document sources based on exported
metadata the text-source discovery problem.
In this paper we focus on the multiple-engine scenario, and study
solutions to the text-source discovery problem. We call our family of
solutions GlOSS, for Glossary-of-Servers Server. In particular GlOSS meta-
searchers use statistical metadata, e.g., how many times each term occurs
at each source. As we show, these “summaries” are small relative to the
collection and, because they contain only statistics, are much easier for a
source to export. Statistical summaries can be obtained mechanically, and
hence are superior to manually produced summaries that are often out of
date. Similarly, since they summarize the entire collection, they are better
than summaries based on a single field (such as titles). As we will see,
GlOSS works best with a large collection of heterogeneous data sources.
That is, the subject areas covered by the different data sources are very
distinct from each other. In this case, the statistical summaries used by
GlOSS strongly distinguish each source from the others.
It is important to note that in this paper we do not compare the single
and multiple engine scenarios. First, in many cases one is not given a
choice. For example, the documents may be owned by competing organiza-
tions that do not wish to export their full collections. On the Web, for
instance, growing numbers of documents are only available through search
interfaces, and hence unavailable to the crawlers that feed search engines.
Second, if we do have a choice, the factors to consider are very diverse:
copyright issues regarding the indexing or warehousing of documents, the
cost and scalability (storage, operations) of maintaining a single index, the
frequency at which new documents are indexed, and the accuracy of the
results obtained. Instead, we only consider a multiple-engine scenario, and
study GlOSS solutions to the text-source discovery problem. We compare the
“accuracy” of these solutions to what could be obtained by sending a query
to all underlying search engines.
Also note that in this paper we do not study how a user submits queries
to the individual sources. That is, once GlOSS suggests sources, the user
must submit the query there. The user or some translation service must
express the query using the particular syntax and operators used by a
source. Similarly, the user may wish to combine and rank the results
obtained at different sources. These are hard problems that are addressed
in other papers [Chang et al. 1996; Gravano et al. 1997; Gravano and
García-Molina 1997].
In summary, the contributions of this paper are as follows:
—We present a version of GlOSS (vGlOSS) that works with vector-space
search engines [Salton 1989; Salton and McGill 1983]. (These engines
treat both the documents and the queries themselves as weight vectors.)
—We describe a text-source discovery service for Boolean engines, bGlOSS.
These engines, while not as sophisticated, are still widely used.
—We define metrics for evaluating text-source discovery services.
—We experimentally evaluate vGlOSS and bGlOSS, using real document
databases. We note that even though discovery schemes for Internet
sources have been proposed and implemented by others, it is rare to find
an experimental evaluation like ours that carefully compares the various
options.
—We analyze the GlOSS storage requirements, showing that a GlOSS
index is significantly smaller than a full conventional index. We also
discuss ways to further reduce storage needs.
—We briefly describe how GlOSS services can form a hierarchy. In such a
case, services that only index a fraction of the sources can be accessed by
a higher level GlOSS service.
We start in Sections 2 and 3 by presenting and evaluating our vGlOSS
and bGlOSS services. In Section 4 we discuss storage requirements,
hierarchical discovery schemes, and other issues. Finally, in Section 5 we
briefly survey related techniques, some of which could work in conjunction
with GlOSS.
2. CHOOSING VECTOR-SPACE DATABASES
In this section we present vGlOSS, a text-source discovery service that
deals with vector-space databases and queries [Gravano and García-Molina
1995a].
2.1 Overview of the Vector-Space Retrieval Model
Under the vector-space model, documents and queries are conceptually
represented as vectors [Salton 1989]. If m distinct words are available for
content identification, a document d is represented as a normalized
m-dimensional vector, $D = \langle w_1, \ldots, w_m \rangle$, where $w_j$ is the “weight”
assigned to the $j$th word $t_j$. If $t_j$ is not present in d, then $w_j$ is 0. For example,
the document with vector $D_1 = \langle 0.5, 0, 0.3, \ldots \rangle$ contains the first word
in the vocabulary (say, by alphabetical order) with weight 0.5, does not
contain the second word, and so on.
The weight for a document word indicates how statistically important it
is. One common way to compute D is to first obtain an unnormalized vector
$D' = \langle w'_1, \ldots, w'_m \rangle$, where each $w'_i$ is the product of a word frequency (tf)
factor and an inverse document frequency (idf) factor. The tf factor is equal
(or proportional) to the frequency of the $i$th word within the document. The
idf factor corresponds to the content discriminating power of the $i$th word:
a word that appears rarely in documents has a high idf, while a word that
occurs in a large number of documents has a low idf. Typically, idf is
computed by
$\log(n/d_i)$, where n is the total number of documents in the
collection, and $d_i$ is the number of documents with the $i$th word. (If a word
appears in every document, its discriminating power is 0. If a word appears
in a single document, its discriminating power is as large as possible.) Once
$D'$ is computed, the normalized vector D is typically obtained by dividing
each $w'_i$ term by $\sqrt{\sum_{i=1}^{m} (w'_i)^2}$.
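As an illustration only, the weighting scheme described above can be sketched in Python. The function name `tfidf_vectors`, the use of raw term counts as the tf factor, the natural logarithm, and the toy documents are assumptions made for this sketch; the text does not fix any of them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one normalized tf-idf weight dict per tokenized document."""
    n = len(docs)
    # d_i: number of documents that contain word i.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # assumption: tf is the raw count of the word
        # Unnormalized weight w'_i = tf_i * log(n / d_i).
        raw = {w: tf[w] * math.log(n / doc_freq[w]) for w in tf}
        norm = math.sqrt(sum(x * x for x in raw.values()))
        # Guard against an all-zero vector (every word occurs in every document).
        vectors.append({w: (x / norm if norm else 0.0) for w, x in raw.items()})
    return vectors

# Hypothetical three-document collection.
docs = [
    ["database", "systems", "database"],
    ["vector", "space", "retrieval"],
    ["database", "retrieval"],
]
vectors = tfidf_vectors(docs)
```

Words absent from a document are simply left out of its dict, which corresponds to a weight of 0 in the m-dimensional vector.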
Queries in the vector-space model are also represented as normalized
vectors over the word space, $Q = \langle q_1, \ldots, q_m \rangle$, where each entry indicates
the importance of the word in the search. Often queries are written by
a user in natural language. In this case, $q_j$ is typically a function of the
number of times word $t_j$ appears in the query string times the idf factor for
the word. The similarity between a query q and a document d, $\mathit{sim}(q, d)$, is
defined as the inner product of the query vector Q and the document vector
D. That is,
$$\mathit{sim}(q, d) = Q \cdot D = \sum_{j=1}^{m} q_j \cdot w_j.$$
Notice that similarity values range between zero and one, inclusive, because
Q and D are normalized and their weights are nonnegative.
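The inner product can be sketched in Python as follows. This is a minimal illustration that assumes each vector is stored as a sparse dict mapping a word to its weight; the function name `sim` mirrors the notation above, and the example weights are made up.

```python
def sim(query_vec, doc_vec):
    """Inner product of two sparse weight vectors stored as dicts."""
    # Iterate over the smaller dict; words missing from a vector have weight 0.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec.get(word, 0.0) for word, w in query_vec.items())

# Hypothetical normalized vectors (each has Euclidean length 1).
q = {"database": 0.6, "retrieval": 0.8}
d = {"database": 0.8, "systems": 0.6}
score = sim(q, d)  # only "database" is shared, so the score is 0.6 * 0.8
```

Because only shared words contribute, the cost of one comparison is proportional to the size of the smaller vector rather than to the vocabulary size m.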
Ideally, a user would like to find documents with the highest similarity to
some query. It is important to notice that similarity is always relative to
some collection. That is, the same document may be given different vectors
by two different search engines, due to the different idf factors used. Thus,
one engine may judge the document relevant to a query, while the second
one may not.
2.2 Evaluating Databases
Given a query, we would like to rank the available vector-space databases
according to their “usefulness,” or goodness for the query. In this section we
present one possible definition of goodness, with its associated ideal data-
base rank. (The next section explores how vGlOSS tries to rank the
databases as closely as possible to this ideal rank.) The goodness of a
database depends on the number of documents in the database that are
reasonably similar to the given query and on their actual similarity to the
query. The best databases are those with many documents that are highly
similar to the query in hand. However, a database might also have a high
goodness value if it holds a few documents with very high similarity, or
many documents with intermediate similarity to the query.
Our goodness definition is based solely on the answers (i.e., the document
ranks and their scores) that each database produces when presented with
the query in question. This definition does not use the relevance of the
documents to the end user who submitted the query. (The effectiveness of
information retrieval searching is based on subjective relevance assess-
ments [Salton and McGill 1983].) Using relevance would be appropriate for
evaluating the search engines at each database; instead, we are evaluating
how well vGlOSS can predict the answers that the databases return. In
Section 2.6 we discuss our choice further, and analyze some of the possible
alternatives that we could have used.
To define the ideal database rank for a query q, we need to determine
how good each database
db is for q. In this section we assume that all
databases use the same algorithms to compute weights and similarities. We
consider that the only documents in db that are useful for q are those with
a similarity to q greater than a user-provided threshold l. Documents with
lower similarity are unlikely to be useful, and therefore we ignore them.
Thus, we define:
$$\mathit{Goodness}(l, q, db) = \sum_{d \in \mathit{Rank}(l, q, db)} \mathit{sim}(q, d) \qquad (1)$$
where $\mathit{sim}(q, d)$ is the similarity between query q and document d, and
$\mathit{Rank}(l, q, db) = \{d \in db \mid \mathit{sim}(q, d) > l\}$. The ideal rank of databases
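Equation (1) can be illustrated with a short Python sketch. It assumes a database is a list of sparse document vectors (dicts from word to weight) and that every database computes similarity the same way, as the text stipulates; the function name `goodness` and the toy databases are hypothetical.

```python
def goodness(threshold, query_vec, db):
    """Sum sim(q, d) over the documents in db whose similarity exceeds threshold."""
    def sim(q, d):
        # Inner product of two sparse weight vectors.
        return sum(w * d.get(word, 0.0) for word, w in q.items())

    scores = (sim(query_vec, doc_vec) for doc_vec in db)
    # Rank(l, q, db) keeps only documents with similarity strictly above l.
    return sum(s for s in scores if s > threshold)

# Hypothetical query and databases.
q = {"database": 0.6, "retrieval": 0.8}
db1 = [{"database": 1.0}, {"retrieval": 1.0}, {"systems": 1.0}]
db2 = [{"database": 0.8, "systems": 0.6}]

g1 = goodness(0.5, q, db1)  # similarities 0.6 and 0.8 pass the threshold
g2 = goodness(0.5, q, db2)  # the single similarity, 0.48, does not
```

Raising the threshold changes which documents count: with a threshold of 0.7, only the 0.8 document in `db1` would contribute.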
References
Book: Introduction to Modern Information Retrieval
Journal article: Searching distributed collections with inference networks
Proceedings article: SIFT: a tool for wide-area information dissemination
Proceedings article: Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies