The stochastic approach for link-structure analysis (SALSA) and the TKC effect ¹

R. Lempel *, S. Moran

Department of Computer Science, The Technion, Haifa 32000, Israel

* Corresponding author. E-mail: {rlempel, moran}@cs.technion.ac.il
¹ Abridged version.
Abstract
Today, when searching for information on the World Wide Web, one usually performs a query through a term-based
search engine. These engines return, as the query’s result, a list of Web sites whose contents match the query. For broad
topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user.
However, much information is contained in the link-structure of the World Wide Web. Information such as which pages
are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two
distinct types of Web sites: hubs and authorities. Kleinberg argued that hubs and authorities exhibit a mutually reinforcing
relationship: a good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light
of this, he devised an algorithm aimed at finding authoritative sites. We present SALSA, a new stochastic approach for
link structure analysis, which examines random walks on graphs derived from the link structure. We show that both
SALSA and Kleinberg’s mutual reinforcement approach employ the same meta-algorithm. We then prove that SALSA is
equivalent to a weighted in-degree analysis of the link-structure of World Wide Web subgraphs, making it computationally
more efficient than the mutual reinforcement approach. We compare the results of applying SALSA to the results derived
through Kleinberg’s approach. These comparisons reveal a topological phenomenon called the TKC effect (Tightly Knit
Community) which, in certain cases, prevents the mutual reinforcement approach from identifying meaningful authorities.
© 2000 Published by Elsevier Science B.V. All rights reserved.
Keywords: Information retrieval; Link structure analysis; Hubs and authorities; Random walks; SALSA
1. Introduction
Searching the World Wide Web: the challenge. The
World Wide Web is a rapidly expanding hyper-
linked collection of unstructured information. The
lack of structure and the enormous volume of the
World Wide Web pose tremendous challenges to
World Wide Web information retrieval systems
called search engines. These search engines are pre-
sented with queries, and return a list of Web sites
which are deemed (by the engine) to pertain to the
query.
When considering the difficulties which World
Wide Web search engines face, we distinguish be-
tween narrow-topic queries and broad-topic queries.
This distinction pertains to the presence which the
query’s topic has on the Web. Narrow topic queries
are queries for which very few resources exist on the
Web, and which present a ‘needle in the haystack’
challenge for search engines. An example of such a

query is an attempt to locate the lyrics of a specific
song, by quoting a line from it (‘We all live in a yel-
low submarine’). Search engines encounter a recall
challenge when handling such queries: finding the
few resources which pertain to the query.
On the other hand, broad-topic queries pertain to
topics for which there is an abundance of informa-
tion on the Web, sometimes as many as millions
of relevant resources (with varying degrees of rele-
vance). The vast majority of users are not interested
in retrieving the entire huge set of resources. Most
users will be quite satisfied with a few authoritative
results: Web sites which are highly relevant to the
topic of the query, significantly more than most other
sites. The challenge which search engines face here
is one of precision: retrieving only the most relevant
resources to the query.
This work focuses on finding authoritative re-
sources which pertain to broad-topic queries.
Term-based search engines. Term-based search en-
gines face both classical problems in informa-
tion retrieval, as well as problems specific to the
World Wide Web setting, when handling broad-topic
queries. The classic problems include the following
issues [4,20].
• Synonymy: retrieving documents containing
the term ‘car’ when given the query ‘automobile’.
• Polysemy/ambiguity: when given the query
‘Jordan’, should the engine retrieve pages pertain-
ing to the Hashemite Kingdom of Jordan, or pages
pertaining to basketball legend Michael Jordan?
• Authorship styles: this is a generalization of
the synonymy issue. Two documents, which per-
tain to the same topic, can sometimes use very
different vocabularies and figures of speech when
written by different authors (as an example, the
styles of two documents, one written in British
English and the other in American English, might
differ considerably).
In addition to the classical issues in informa-
tion retrieval, there is a Web-specific obstacle which
search engines must overcome, called search engine
persuasion [19]. There may be millions of sites per-
taining in some manner to broad-topic queries, but
most users will only browse through the first ten
results returned by their favorite search facility. With
the growing economic impact of the World Wide
Web, and the growth of e-commerce, it is crucial
for businesses to have their sites ranked high by
the major search engines. There are quite a few
companies who sell this kind of expertise. They
design Web sites which are tailored to rank high
with specific queries on the major search engines.
These companies research the ranking algorithms
and heuristics of term-based engines, and know how
many keywords to place (and where) in a Web
page so as to improve the page’s ranking (which
directly impacts the page’s visibility). A less so-
phisticated technique, used by some site creators, is
called keyword spamming [4]. Here, the authors re-
peat certain terms (some of which are only remotely
connected to their site’s context), in order to ‘lure’
search engines into ranking them highly for many
queries.
Informative link structure: the answer? The World
Wide Web is a hyperlinked collection. In addition to
the textual content of the individual pages, the link
structure of such collections contains information
which can, and should, be tapped when searching
for authoritative sources. Consider the significance
of a link p → q: with such a link p suggests,
or even recommends, that surfers visiting p follow
the link and visit q. This may reflect the fact that
pages p and q share a common topic of interest, and
that the author of p thinks highly of q’s contents.
Such a link, called an informative link, is p’s way
to confer authority on q [16]. Note that informative
links provide a positive critical assessment of q’s
contents which originates from outside the control
of the author of q (as opposed to assessments based
on q’s textual content, which is under complete
control of q’s author). This makes the information
extracted from informative links less vulnerable to
manipulative techniques such as spamming.
Unfortunately, not all links are informative. There
are many kinds of links which confer little or no au-
thority [4], such as intra-domain (inner) links (whose
purpose is to provide navigational aid in a complex
Web site of some organization), commercial/sponsor
links, and links which result from link-exchange
agreements. A crucial task which should be com-
pleted prior to analyzing the link structure of a given
collection, is to filter out as many of the non-infor-
mative links as possible.
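As one illustration of such filtering (a sketch only; the paper does not prescribe a specific procedure at this point), intra-domain links can be discarded by comparing host names:

from urllib.parse import urlsplit

def is_intra_domain(src_url: str, dst_url: str) -> bool:
    """Treat a link as navigational (non-informative) when source and destination share a host."""
    return urlsplit(src_url).hostname == urlsplit(dst_url).hostname

def filter_links(links):
    """Keep only cross-domain links; `links` is an iterable of (source, destination) URL pairs."""
    return [(s, d) for s, d in links if not is_intra_domain(s, d)]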

Related work on link structures. Prior to the World
Wide Web age, link structures were studied in the
area of bibliometrics, which studies the citation
structure of written documents [15,23]. Many works
in this area were aimed at finding high-impact papers
published in scientific journals [10], and at clustering
related documents [1].
Some works have studied the Web’s link structure,
in addition to the textual content of the pages, as
a means to visualize areas thought to contain good
resources [3]. Other works used link structures for
categorizing pages and clustering them [21,24].
Marchiori [19] uses the link-structure of the Web
to enhance search results of term-based search en-
gines. This is done by considering the potential
hyper-information contained in each Web page: the
information that can be found when following hyper-
links which originate in the page.
This work is motivated by the approach introduced
by Jon Kleinberg [16]. In an attempt to impose some
structure on the chaotic World Wide Web, Kleinberg
distinguished between two types of Web sites which
pertain to a certain topic. The first are authoritative
pages in the sense described previously. The second
type of sites are hub pages. Hubs are resource lists.
They do not directly contain information pertaining to
the topic, but rather point to many authoritative sites.
According to this model, hubs and authorities exhibit
a mutually reinforcing relationship: good hubs point
to many good authorities, and good authorities are
pointed at by many good hubs.
In light of the mutually reinforcing relation-
ship, hubs and authorities should form communities,
which can be pictured as dense bipartite portions
of the Web, where the hubs link densely to the
authorities. The most prominent community in a
World Wide Web subgraph is called the principal
community of the collection. Kleinberg suggested an
algorithm to identify these communities, which is
described in detail in Section 2.
Researchers from IBM’s Almaden Research Cen-
ter have implemented Kleinberg’s algorithm in vari-
ous projects. The first was HITS, which is described
in [11], and offers some enlightening practical re-
marks. The ARC system, described in [7], augments
Kleinberg’s link-structure analysis by considering
also the anchor text, the text which surrounds the
hyperlink in the pointing page. The reasoning behind
this is that many times, the pointing page describes
the destination page’s contents around the hyper-
link, and thus the authority conferred by the links
can be better assessed. These projects were extended
by the CLEVER project [14]. Researchers from out-
side IBM, such as Henzinger and Bharat, have also
studied Kleinberg’s approach and have proposed im-
provements to it [13].
Anchor text has also been used by Brin and Page
in [2]. Another major feature of their work on the
Google search engine [12] is a link-structure based
site ranking approach called PageRank, which can be
interpreted as a stochastic analysis of some random-
walk behavior through the entire World Wide Web.
In [18], the authors use the links surrounding a
small set of same-topic sites to assemble a larger col-
lection of neighboring pages which should contain
many authoritative resources on the initial topic. The
textual content of the collection is then analyzed in
ranking the relevancy of its individual pages.
This work. While preserving the theme that Web
sites pertaining to a given topic should be split into
hubs and authorities, we replace Kleinberg’s mutual
reinforcement approach [16] by a new stochastic
approach (SALSA), in which the coupling between
hubs and authorities is less tight. The intuition be-
hind our approach is the following. Consider a bi-
partite graph G, whose two parts correspond to hubs
and authorities, where an edge between hub r and
authority s means that there is an informative link
from r to s. Then, authorities and hubs pertaining to
the dominant topic of the sites in G should be highly
visible (reachable) from many sites in G. Thus, we
will attempt to identify these sites by examining cer-
tain random walks in G, under the proviso that such
random walks will tend to visit these highly visi-
ble sites more frequently than other, less connected
sites. We show that in finding the principal com-
munities of hubs and authorities, both Kleinberg’s
mutual reinforcement approach and our stochastic
approach employ the same meta-algorithm on dif-
ferent representations of the input graph. We then
compare the results of applying SALSA to the re-
sults derived by Kleinberg’s approach. Through these
comparisons, we isolate a particular topological phe-
nomenon which we call the Tightly Knit Community
(TKC) effect. In certain scenarios, this effect hampers

the ability of the mutual reinforcement approach to
identify meaningful authorities. We demonstrate that
SALSA is less vulnerable to the TKC effect, and can
find meaningful authorities in collections where the
mutual reinforcement approach fails to do so.
After demonstrating some results achieved by
means of SALSA, we prove that the ranking of sites
in the stochastic approach may be calculated by ex-
amining the weighted in/out degrees of the sites in
G. This result yields that SALSA is computationally
lighter than the mutual reinforcement approach. We
also discuss the reason for our success with analyz-
ing weighted in/out degrees of sites, which previous
work has claimed to be unsatisfactory for identifying
authoritative sites.
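As a rough illustration of how light this computation is, the following Python sketch ranks sites by normalized in/out degree counts. It is a sketch only, covering the simple unweighted, connected case; the exact weighting scheme for the general case is derived in Section 6.

import numpy as np

def degree_ranking(W: np.ndarray):
    """Return (authority, hub) scores for adjacency matrix W (W[i, j] = 1 iff site i links to site j)."""
    in_deg = W.sum(axis=0).astype(float)    # links pointing into each site
    out_deg = W.sum(axis=1).astype(float)   # links leaving each site
    return in_deg / in_deg.sum(), out_deg / out_deg.sum()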
The rest of the paper is organized as follows.
Section 2 recounts Kleinberg’s mutual reinforcement
approach. In Section 3 we view Kleinberg’s approach
from a higher level, and define a meta-algorithm
for link structure analysis. Section 4 presents our
new approach, SALSA. In Section 5 we compare
the two approaches by considering their outputs on
the World Wide Web and on artificial topologies.
Then, in Section 6 we prove the connection between
SALSA and weighted in/out degree rankings of
sites. Our conclusions and ideas for future work are
presented in Section 7. The paper uses basic results
from the theory of stochastic processes, which are
presented in the full version. The main contribution of
the paper can be grasped without following the full
mathematical analysis.
2. Kleinberg’s mutual reinforcement approach
The mutual reinforcement approach [16] starts
by assembling a collection C of Web sites, which
should contain communities of hubs and authorities
pertaining to a given topic t. It then analyzes the link
structure induced by that collection, in order to find
the authoritative sites on topic t.
Denote by q a term-based search query to which
sites in our topic of interest t are deemed to be relevant.
The collection C is assembled in the following manner
(a sketch of this process follows the list).
• A root set S of sites is obtained by applying a
term-based search engine, such as AltaVista [8],
to the query q. This is the only step in which the
lexical content of the Web sites is examined.
• From S we derive a base set C which consists of
(a) sites in the root set S, (b) sites which point to
a site in S, and (c) sites which are pointed to by
a site in S. In order to obtain (b), we must again
use a search engine. Many search engines store
linkage information, and support queries such as
‘which sites point to [a given URL]’.
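A minimal Python sketch of this assembly follows. It is a sketch under stated assumptions: search, out_links, and in_links are hypothetical stand-ins for a term-based search engine and a connectivity (link-index) service, not APIs used in the paper, and the size limits are illustrative.

from typing import Iterable, Set

def search(query: str, limit: int) -> Iterable[str]:
    """Stand-in for a term-based search engine (e.g. AltaVista in the paper)."""
    raise NotImplementedError

def out_links(url: str) -> Iterable[str]:
    """Stand-in: URLs that the page at `url` points to."""
    raise NotImplementedError

def in_links(url: str, limit: int) -> Iterable[str]:
    """Stand-in: URLs pointing to `url` (requires a link index / connectivity server)."""
    raise NotImplementedError

def assemble_collection(query: str, root_size: int = 200, in_limit: int = 50) -> Set[str]:
    root_set = set(search(query, limit=root_size))        # root set S: the only lexical step
    base_set = set(root_set)
    for page in root_set:
        base_set.update(out_links(page))                  # (c) sites pointed to by a site in S
        base_set.update(in_links(page, limit=in_limit))   # (b) sites pointing to a site in S
    return base_set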
The collection C and its link structure induce the
following directed graph G: G’s nodes are the sites
in C, and for all i, j ∈ C the directed edge i → j
appears in G if and only if site i contains a hyperlink
to site j. Let W denote the |C| × |C| adjacency matrix
of G.
Each site s ∈ C is now assigned a pair of weights,
a hub weight h(s) and an authority weight a(s),
based on the following two principles:
• The quality of a hub is determined by the quality
of the authorities it points at. Specifically, a site’s
hub weight should be proportional to the sum of
the authority weights of the sites it points at.
• Authority lies in the eyes of the beholder(s):
a site is authoritative only if good hubs deem
it as such. Hence, a site’s authority weight is
proportional to the sum of the hub weights of the
sites pointing at it.
The top ranking sites, according to both kinds
of weights, form the mutually reinforcing commu-
nities of hubs and authorities. In order to assign
such weights, Kleinberg uses the following iterative
algorithm:
(1) Initialize a(s) ← 1, h(s) ← 1 for all sites s ∈ C.
(2) Repeat the following three operations until convergence:
• Update the authority weight of each site s (the I operation):
  a(s) ← Σ_{x : x points to s} h(x)
• Update the hub weight of each site s (the O operation):
  h(s) ← Σ_{x : s points to x} a(x)
• Normalize the authority weights and the hub weights.
Note that applying the I operation is equivalent to
assigning authority weights according to the result
of multiplying the vector of all hub weights by the
matrix W^T; similarly, the O operation multiplies the
vector of all authority weights by W.
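For concreteness, the following is a minimal NumPy sketch of this iteration, assuming the adjacency matrix W defined above is available as a dense array; the L2 normalization and the convergence test used here are illustrative choices, not details prescribed by the paper.

import numpy as np

def hits(W: np.ndarray, tol: float = 1e-8, max_iter: int = 1000):
    """Iterate Kleinberg's I and O operations on adjacency matrix W (W[i, j] = 1 iff site i links to site j)."""
    n = W.shape[0]
    a = np.ones(n)                        # authority weights a(s)
    h = np.ones(n)                        # hub weights h(s)
    for _ in range(max_iter):
        a_new = W.T @ h                   # I operation: a(s) <- sum of h(x) over x pointing to s
        h_new = W @ a_new                 # O operation: h(s) <- sum of a(x) over x pointed to by s
        a_new /= np.linalg.norm(a_new)    # normalize both weight vectors
        h_new /= np.linalg.norm(h_new)
        if np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol):
            a, h = a_new, h_new
            break
        a, h = a_new, h_new
    return a, h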
