Automatic resource compilation by analyzing hyperlink structure and associated text
Summary
Introduction
- The authors describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic.
- The fundamental difference from directory services such as Yahoo! and Infoseek is that those services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically.
- The authors describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users.
- This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic.
- Keywords: search, taxonomies, link analysis, anchor text, information retrieval.
1. Overview
- The subject of this paper is the design and evaluation of an automatic resource compiler.
- An automatic resource compiler is a system which, given a topic that is broad and well-represented on the web, will seek out and return a list of web resources that it considers the most authoritative for that topic.
- To their knowledge, this is one of the first systematic user-studies comparing the quality of multiple web resource lists compiled using different methods.
- The role of a topic taxonomy, such as the one Yahoo! provides, is to offer, for any broad topic, a resource list with high-quality resources on that topic.
- As their studies with human users show, the loss in quality from fully automatic compilation is not significant compared to manually or semi-manually compiled lists.
2. Algorithm
- The authors now describe their algorithm, and the experiments that they use to set values for the small number of parameters in the algorithm.
- Note that the basic iterative hub/authority computation ignores the text describing the topic; the authors remedy this by weighting the sums that update the hub and authority scores, in a fashion described below, so as to maintain focus on the topic (a minimal sketch of the weighted iteration follows this list).
- The second step in each iteration reflects the notion that good hub pages point to good authority pages and describe them as being relevant to the topic text.
- One reason only a few iterations are needed is that the authors have empirically observed that convergence is quite rapid for the matrices they are dealing with.
- Postulating that the string <a href="http://www.yahoo.com"> would typically co-occur in close proximity with the text Yahoo, the authors studied, on a test set of over 5000 web pages, the distance to the nearest occurrence of Yahoo around all hrefs to http://www.yahoo.com in these pages.
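The following is a minimal sketch, in Python, of how such a text-weighted hub/authority iteration could look. It is not the authors' implementation: the function names, the data layout (integer page ids and (source, target, anchor-window-text) link triples), and the exact weighting formula (1 plus the number of topic-term matches near the href) are illustrative assumptions based on the description above.

```python
# Sketch of an ARC-style text-weighted hub/authority iteration (assumptions noted above).
import numpy as np

def build_weight_matrix(n_pages, links, topic_terms):
    """W[p, q] grows with the number of topic terms found near the href from p to q."""
    W = np.zeros((n_pages, n_pages))
    for src, dst, window_text in links:
        matches = sum(window_text.lower().count(t.lower()) for t in topic_terms)
        W[src, dst] = 1.0 + matches  # every link counts; topically annotated links count more
    return W

def hubs_and_authorities(W, num_iters=5):
    """A small, fixed number of iterations; scores are renormalized each round."""
    n = W.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(num_iters):
        a = W.T @ h          # good authorities are pointed to by good hubs
        h = W @ a            # good hubs point to good authorities
        a /= a.sum() or 1.0  # only relative values matter, so keep the entries small
        h /= h.sum() or 1.0
    return h, a

# Hypothetical usage on a toy three-page graph for the topic "bicycling".
links = [(0, 1, "a great resource on bicycling tours"),
         (0, 2, "bicycling clubs and races"),
         (1, 2, "see also this page")]
W = build_weight_matrix(3, links, ["bicycling"])
hubs, authorities = hubs_and_authorities(W)
print(hubs, authorities)
```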
2.2. Implementation
- An 80GB disk-based web cache hosted on a PC enables the authors to store the augmented sets for various topics locally, allowing them to repeat the text and link analysis for various parameter settings.
- The emphasis in their current implementation has not been on heavy-duty performance (the authors do not envision their system fielding thousands of queries per second and producing answers in real time); instead, they focused on the quality of their resource lists.
- The iterative computation at the core of the analysis takes about a second for a single resource list, on a variety of modern platforms.
- The authors expect that, in full-fledged taxonomy generation, the principal bottleneck will be the time cost of crawling the web and extracting all the root and augmented sets.
3. Experiments
- In this section the authors describe the setup by which a panel of human users evaluated their resource lists in comparison with Yahoo! and Infoseek.
- The parameters for this experiment are: (1) the choice of topics; (2) well-known sources to compare with the output of ARC; (3) the metrics for evaluation; and (4) the choice of volunteers to test the output of their system.
3.1. Topics and baselines for comparison
- The authors' test topics had to be chosen so that, with some reasonable browsing effort, their volunteers could make judgments about the quality of the output, even without being experts on the topic.
- One way the volunteers could do this relatively easily is through comparison with similar resource pages in well-known Web directories.
- Several Web directories, such as Yahoo!, Infoseek, etc. are regarded as "super-hubs"; therefore it is natural to pick such directories for comparison.
- Most topics were picked so that there were representative "resource" pages in both Yahoo! and Infoseek.
- The authors tried to cover topics in arts, sciences, health, entertainment, and social issues.
3.2. Volunteers and test setup
- The participants ranged in age from their early 20s to their 50s, and were spread across North America and Asia.
- Each participant was assigned one or two topics to rate, and could optionally rate any others they wished; there were several basic reasons for this assignment scheme.
- After 15-30 minutes of searching to form their own view of the given topic, participants were asked to rate how broadly each resource list covered the topic, and to what extent it failed to cover certain broad aspects of it.
- On a scale of 1-10, participants rated each resource list on its "comprehensiveness" with respect to the given topic.
- 27 topics were chosen by at least one volunteer. Overall, 54 records were received, each having nine numerical scores, corresponding to the three sources (ARC, Yahoo!, and Infoseek) and the three measures (accuracy, coverage, and overall). There was thus an average of two reviews per topic.
- The authors first study the perceived overall quality of ARC relative to Infoseek and Yahoo.
- Figure 1 shows the ratio of ARC’s score to that of Infoseek and Yahoo.
- The y-axis value for "indifference" is equal to one; values exceeding one mean ARC is judged better, and values below one mean ARC is judged worse (a worked form of the ratio appears after this list).
- Some of the topics for which ARC scores relatively poorly are affirmative action and mutual funds, topics with large web presence and plenty of carefully hand-crafted resource pages.
- The authors also show in Figure 4 a scatter plot of relative accuracy to relative coverage over all volunteers and all topics.
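A worked form of the ratio plotted in Figure 1, with symbols introduced here only for illustration (they are not the paper's notation):

```latex
r_{\mathrm{baseline}} \;=\; \frac{\mathrm{score}_{\mathrm{ARC}}}{\mathrm{score}_{\mathrm{baseline}}},
\qquad \mathrm{baseline} \in \{\text{Yahoo!},\ \text{Infoseek}\}
```

For example, if a volunteer (hypothetically) scored ARC at 8 and Yahoo! at 6 on the same topic, the ratio would be 8/6 ≈ 1.33 and the point would fall above the indifference line.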
4.2. Summary of comments by evaluators
- In addition to the scores, a number of evaluators provided comments supporting their evaluations and offering suggestions for improving the authors' resource compiler.
- On the whole, respondents liked the explicit distinction between hubs and authorities, although some respondents found pages among the authorities that they clearly regarded as hubs.
- There were cases where evaluators sought resources at levels more narrow or broad than those offered by their system.
- By far the most common suggestion for improving the resource lists concerned their presentation.
- Annotating each listed page, rather than presenting a flat list, gives users a quick visual cue of where to continue their search for further information.
5. Conclusions
- Searching for authoritative web resources on broad topics is a common task for users of the WWW.
- The authors began from this underlying observation, and presented a method for the automated compilation of resource lists, based on a combination of text and link analysis for distilling authoritative Web resources.
- To their knowledge, this is the first time link and text analysis have been combined for this purpose.
- The authors described the selection of the algorithm parameters, and a user study on selected topics comparing their lists to those of Yahoo! and Infoseek.
- The user study showed that (1) their automatically compiled resource lists were almost competitive with, and occasionally better than, the (semi-)manually compiled lists; and (2) for many of the users in the study, the "flat list" presentation of the results put ARC at a disadvantage; the most common requests were for minor additional annotation that is often easy to automate.
Frequently Asked Questions
Q2. How do the authors normalize the entries of h and a?
Since many entries of W are larger than one, the entries of h and a may grow as the iteration proceeds; however, since only their relative values are needed, the authors normalize h and a after each iteration to keep the entries small.
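As a sketch using the W, h, and a notation from this answer, one iteration followed by renormalization can be written as below; the choice of the L1 norm is an assumption, since the answer only states that the entries are rescaled after each iteration.

```latex
a \leftarrow W^{\top} h, \qquad
h \leftarrow W a, \qquad
a \leftarrow \frac{a}{\lVert a \rVert_{1}}, \qquad
h \leftarrow \frac{h}{\lVert h \rVert_{1}}
```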
Q3. How long does it take to compute a resource list?
The iterative computation at the core of the analysis takes about a second for a single resource list, on a variety of modern platforms.
Q4. What are the usual concerns that Web users express while searching for resources?
The usual concerns Web users express while searching for resources relate to the notions of recall and precision from the Information Retrieval literature.
Q5. What is the emphasis of the experiment?
The emphasis in their current implementation has not been on heavy-duty performance (the authors do not envision their system fielding thousands of queries per second and producing answers in real time); instead, they focused on the quality of their resource lists.
Q6. Why is it sufficient to fix k to a small value?
In their case, a very small value of k is sufficient (and hence the computation can be performed extremely efficiently) for two reasons.
Q7. What is the purpose of the algorithm?
Their work is oriented in a different direction: to use links as a means of harnessing the latent human annotation in hyperlinks, so as to broaden a user's search and focus it on a type of 'high-quality' page.