Automatic resource compilation by analyzing hyperlink structure and associated text
Summary
Introduction
- The authors describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic.
- The fundamental difference from directory services such as Yahoo! and Infoseek is that those services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically.
- The authors describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users.
- This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic.
- Keywords: search, taxonomies, link analysis, anchor text, information retrieval.
1. Overview
- The subject of this paper is the design and evaluation of an automatic resource compiler.
- An automatic resource compiler is a system which, given a topic that is broad and well-represented on the web, will seek out and return a list of web resources that it considers the most authoritative for that topic.
- To their knowledge, this is one of the first systematic user-studies comparing the quality of multiple web resource lists compiled using different methods.
- The role of a topic taxonomy, such as the one Yahoo! provides, is to offer, for any broad topic, a resource list with high-quality resources on that topic.
- As their studies with human users show, the loss in quality from fully automatic compilation is not significant compared to manually or semi-manually compiled lists.
2. Algorithm
- The authors now describe their algorithm, and the experiments that they use to set values for the small number of parameters in the algorithm.
- Note that the basic iterative hub/authority computation ignores the text describing the topic; the authors remedy this by weighting the sums that update the hub and authority scores, in a fashion described below, so as to maintain focus on the topic (a minimal sketch of the weighted iteration follows this list).
- The second step in each iteration reflects the notion that good hub pages point to good authority pages and describe them as being relevant to the topic text.
- One reason only a few iterations are needed is that the authors have empirically observed that convergence is quite rapid for the matrices they are dealing with.
- Postulating that the string <a href="http://www.yahoo.com"> would typically co-occur in close proximity with the text Yahoo, the authors studied, on a test set of over 5000 web pages, the distance to the nearest occurrence of Yahoo around all hrefs to http://www.yahoo.com in these pages.
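The following is a minimal sketch, in Python, of how such a text-weighted hub/authority iteration could look. It is not the authors' implementation: the function names, the data layout (integer page ids and (source, target, anchor-window-text) link triples), and the exact weighting formula (1 plus the number of topic-term matches near the href) are illustrative assumptions based on the description above.

```python
# Sketch of an ARC-style text-weighted hub/authority iteration (assumptions noted above).
import numpy as np

def build_weight_matrix(n_pages, links, topic_terms):
    """W[p, q] grows with the number of topic terms found near the href from p to q."""
    W = np.zeros((n_pages, n_pages))
    for src, dst, window_text in links:
        matches = sum(window_text.lower().count(t.lower()) for t in topic_terms)
        W[src, dst] = 1.0 + matches  # every link counts; topically annotated links count more
    return W

def hubs_and_authorities(W, num_iters=5):
    """A small, fixed number of iterations; scores are renormalized each round."""
    n = W.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(num_iters):
        a = W.T @ h          # good authorities are pointed to by good hubs
        h = W @ a            # good hubs point to good authorities
        a /= a.sum() or 1.0  # only relative values matter, so keep the entries small
        h /= h.sum() or 1.0
    return h, a

# Hypothetical usage on a toy three-page graph for the topic "bicycling".
links = [(0, 1, "a great resource on bicycling tours"),
         (0, 2, "bicycling clubs and races"),
         (1, 2, "see also this page")]
W = build_weight_matrix(3, links, ["bicycling"])
hubs, authorities = hubs_and_authorities(W)
print(hubs, authorities)
```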
2.2. Implementation
- An 80GB disk-based web cache hosted on a PC enables the authors to store the augmented sets for various topics locally, allowing them to repeat the text and link analysis for various parameter settings.
- The emphasis in their current implementation has not been on heavy-duty performance (the authors do not envision their system fielding thousands of queries per second and producing answers in real time); instead, they focused on the quality of their resource lists.
- The iterative computation at the core of the analysis takes about a second for a single resource list, on a variety of modern platforms.
- The authors expect that, in full-fledged taxonomy generation, the principal bottleneck will be the time cost of crawling the web and extracting all the root and augmented sets.
3. Experiments
- In this section the authors describe the setup by which a panel of human users evaluated their resource lists in comparison with Yahoo! and Infoseek.
- The parameters for this experiment are: (1) the choice of topics; (2) well-known sources to compare with the output of ARC; (3) the metrics for evaluation; and (4) the choice of volunteers to test the output of their system.
3.1. Topics and baselines for comparison
- The authors' test topics had to be chosen so that, with some reasonable browsing effort, their volunteers could make judgments about the quality of the output, even without being experts on the topic.
- One way the volunteers could do this relatively easily is through comparison with similar resource pages in well-known Web directories.
- Several Web directories, such as Yahoo!, Infoseek, etc. are regarded as "super-hubs"; therefore it is natural to pick such directories for comparison.
- Most topics were picked so that there were representative "resource" pages in both Yahoo! and Infoseek.
- The authors tried to cover topics in arts, sciences, health, entertainment, and social issues.
3.2. Volunteers and test setup
- The participants ranged in age from their early 20s to their 50s, and were spread across North America and Asia.
- Each participant was assigned one or two topics to rate, and could optionally rate any others they wished; there were several basic reasons for this assignment scheme.
- After 15-30 minutes of searching to form their own view of the given topic, participants were asked to rate how broadly each resource list covered the topic, and to what extent it failed to cover certain broad aspects of it.
- On a scale of 1-10, participants rated each resource list on its "comprehensiveness" with respect to the given topic.
- 27 topics were chosen by at least one volunteer. Overall, 54 records were received, each having nine numerical scores, corresponding to the three sources (ARC, Yahoo!, and Infoseek) and the three measures (accuracy, coverage, and overall). There was thus an average of two reviews per topic.
- The authors first study the perceived overall quality of ARC relative to Infoseek and Yahoo.
- Figure 1 shows the ratio of ARC’s score to that of Infoseek and Yahoo.
- The y-axis value for "indifference" is equal to one; values exceeding one mean ARC is judged better, and values below one mean ARC is judged worse (a worked form of the ratio appears after this list).
- Some of the topics for which ARC scores relatively poorly are affirmative action and mutual funds, topics with large web presence and plenty of carefully hand-crafted resource pages.
- The authors also show in Figure 4 a scatter plot of relative accuracy to relative coverage over all volunteers and all topics.
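A worked form of the ratio plotted in Figure 1, with symbols introduced here only for illustration (they are not the paper's notation):

```latex
r_{\mathrm{baseline}} \;=\; \frac{\mathrm{score}_{\mathrm{ARC}}}{\mathrm{score}_{\mathrm{baseline}}},
\qquad \mathrm{baseline} \in \{\text{Yahoo!},\ \text{Infoseek}\}
```

For example, if a volunteer (hypothetically) scored ARC at 8 and Yahoo! at 6 on the same topic, the ratio would be 8/6 ≈ 1.33 and the point would fall above the indifference line.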
4.2. Summary of comments by evaluators
- In addition to the scores, a number of evaluators provided comments supporting their evaluations and offering suggestions for improving the authors' resource compiler.
- On the whole, respondents liked the explicit distinction between hubs and authorities, although some respondents found pages among the authorities that they clearly regarded as hubs.
- There were cases where evaluators sought resources at levels more narrow or broad than those offered by their system.
- By far the most common suggestion for improving the resource lists concerned their presentation.
- Annotating each listed page, rather than presenting a flat list, gives users a quick visual cue of where to continue their search for further information.
5. Conclusions
- Searching for authoritative web resources on broad topics is a common task for users of the WWW.
- The authors began from this underlying observation, and presented a method for the automated compilation of resource lists, based on a combination of text and link analysis for distilling authoritative Web resources.
- To their knowledge, this is the first time link and text analysis have been combined for this purpose.
- The authors described the selection of the algorithm parameters, and a user study on selected topics comparing their lists to those of Yahoo! and Infoseek.
- The user study showed that (1) their automatically compiled resource lists were almost competitive with, and occasionally better than, the (semi-)manually compiled lists; and (2) for many of the users in the study, the "flat list" presentation of the results put ARC at a disadvantage; the most common requests were for minor additional annotation that is often easy to automate.
Frequently Asked Questions
Q2. How do the authors normalize the entries of h and a?
Since many entries of W are larger than one, the entries of h and a may grow as the iteration proceeds; however, since only their relative values are needed, the authors normalize h and a after each iteration to keep the entries small.
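As a sketch using the W, h, and a notation from this answer, one iteration followed by renormalization can be written as below; the choice of the L1 norm is an assumption, since the answer only states that the entries are rescaled after each iteration.

```latex
a \leftarrow W^{\top} h, \qquad
h \leftarrow W a, \qquad
a \leftarrow \frac{a}{\lVert a \rVert_{1}}, \qquad
h \leftarrow \frac{h}{\lVert h \rVert_{1}}
```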
Q3. How long does it take to compute a resource list?
The iterative computation at the core of the analysis takes about a second for a single resource list, on a variety of modern platforms.
Q4. What are the usual concerns that Web users express while searching for resources?
The usual concerns Web users express while searching for resources relate to the notions of recall and precision from the Information Retrieval literature.
Q5. What is the emphasis of the experiment?
The emphasis in their current implementation has not been on heavy-duty performance (the authors do not envision their system fielding thousands of queries per second and producing answers in real time); instead, they focused on the quality of their resource lists.
Q6. Why is it sufficient to fix k to a small value?
In their case, a very small value of k is sufficient (and hence the computation can be performed extremely efficiently) for two reasons.
Q7. What is the purpose of the algorithm?
Their work is oriented in a different direction: to use links as a means of harnessing the latent human annotation in hyperlinks, so as to broaden a user's search and focus it on a type of 'high-quality' page.