The stochastic approach for link-structure analysis (SALSA) and the TKC effect ¹

R. Lempel *, S. Moran

Department of Computer Science, The Technion, Haifa 32000, Israel

* Corresponding author. E-mail: {rlempel, moran}@cs.technion.ac.il
¹ Abridged version.
Abstract
Today, when searching for information on the World Wide Web, one usually performs a query through a term-based
search engine. These engines return, as the query’s result, a list of Web sites whose contents match the query. For broad
topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user.
However, much information is contained in the link-structure of the World Wide Web. Information such as which pages
are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two
distinct types of Web sites: hubs and authorities. Kleinberg argued that hubs and authorities exhibit a mutually reinforcing
relationship: a good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light
of this, he devised an algorithm aimed at finding authoritative sites. We present SALSA, a new stochastic approach for
link structure analysis, which examines random walks on graphs derived from the link structure. We show that both
SALSA and Kleinberg’s mutual reinforcement approach employ the same meta-algorithm. We then prove that SALSA is
equivalent to a weighted in-degree analysis of the link-structure of World Wide Web subgraphs, making it computationally
more efficient than the mutual reinforcement approach. We compare the results of applying SALSA to the results derived
through Kleinberg’s approach. These comparisons reveal a topological phenomenon called the TKC effect (Tightly Knit
Community) which, in certain cases, prevents the mutual reinforcement approach from identifying meaningful authorities.
© 2000 Published by Elsevier Science B.V. All rights reserved.
Keywords: Information retrieval; Link structure analysis; Hubs and authorities; Random walks; SALSA
1. Introduction
Searching the World Wide Web: the challenge. The
World Wide Web is a rapidly expanding hyper-
linked collection of unstructured information. The
lack of structure and the enormous volume of the
World Wide Web pose tremendous challenges to
World Wide Web information retrieval systems
called search engines. These search engines are pre-
sented with queries, and return a list of Web sites
which are deemed (by the engine) to pertain to the
query.
When considering the difficulties which World
Wide Web search engines face, we distinguish be-
tween narrow-topic queries and broad-topic queries.
This distinction pertains to the presence which the
query’s topic has on the Web. Narrow topic queries
are queries for which very few resources exist on the
Web, and which present a ‘needle in the haystack’
challenge for search engines. An example of such a

query is an attempt to locate the lyrics of a specific
song, by quoting a line from it (‘We all live in a yel-
low submarine’). Search engines encounter a recall
challenge when handling such queries: finding the
few resources which pertain to the query.
On the other hand, broad-topic queries pertain to
topics for which there is an abundance of informa-
tion on the Web, sometimes as many as millions
of relevant resources (with varying degrees of rele-
vance). The vast majority of users are not interested
in retrieving the entire huge set of resources. Most
users will be quite satisfied with a few authoritative
results: Web sites which are highly relevant to the
topic of the query, significantly more than most other
sites. The challenge which search engines face here
is one of precision: retrieving only the most relevant
resources to the query.
This work focuses on finding authoritative re-
sources which pertain to broad-topic queries.
Term-based search engines. Term-based search en-
gines face both classical problems in informa-
tion retrieval, as well as problems specific to the
World Wide Web setting, when handling broad-topic
queries. The classic problems include the following
issues [4,20].
• Synonymy: retrieving documents containing
the term ‘car’ when given the query ‘automobile’.
• Polysemy/ambiguity: when given the query
‘Jordan’, should the engine retrieve pages pertain-
ing to the Hashemite Kingdom of Jordan, or pages
pertaining to basketball legend Michael Jordan?
• Authorship styles: this is a generalization of
the synonymy issue. Two documents, which per-
tain to the same topic, can sometimes use very
different vocabularies and figures of speech when
written by different authors (as an example, the
styles of two documents, one written in British
English and the other in American English, might
differ considerably).
In addition to the classical issues in informa-
tion retrieval, there is a Web-specific obstacle which
search engines must overcome, called search engine
persuasion [19]. There may be millions of sites per-
taining in some manner to broad-topic queries, but
most users will only browse through the first ten
results returned by their favorite search facility. With
the growing economic impact of the World Wide
Web, and the growth of e-commerce, it is crucial
for businesses to have their sites ranked high by
the major search engines. There are quite a few
companies who sell this kind of expertise. They
design Web sites which are tailored to rank high
with specific queries on the major search engines.
These companies research the ranking algorithms
and heuristics of term-based engines, and know how
many keywords to place (and where) in a Web
page so as to improve the page’s ranking (which
directly impacts the page’s visibility). A less so-
phisticated technique, used by some site creators, is
called keyword spamming [4]. Here, the authors re-
peat certain terms (some of which are only remotely
connected to their site’s context), in order to ‘lure’
search engines into ranking them highly for many
queries.
Informative link structure: the answer? The World
Wide Web is a hyperlinked collection. In addition to
the textual content of the individual pages, the link
structure of such collections contains information
which can, and should, be tapped when searching
for authoritative sources. Consider the significance
of a link p → q: with such a link p suggests,
or even recommends, that surfers visiting p follow
the link and visit q. This may reflect the fact that
pages p and q share a common topic of interest, and
that the author of p thinks highly of q’s contents.
Such a link, called an informative link, is p’s way
to confer authority on q [16]. Note that informative
links provide a positive critical assessment of q’s
contents which originates from outside the control
of the author of q (as opposed to assessments based
on q’s textual content, which is under complete
control of q’s author). This makes the information
extracted from informative links less vulnerable to
manipulative techniques such as spamming.
Unfortunately, not all links are informative. There
are many kinds of links which confer little or no au-
thority [4], such as intra-domain (inner) links (whose
purpose is to provide navigational aid in a complex
Web site of some organization), commercial/sponsor
links, and links which result from link-exchange
agreements. A crucial task which should be com-
pleted prior to analyzing the link structure of a given
collection, is to filter out as many of the non-infor-
mative links as possible.
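As one illustration of such filtering (a sketch only; the paper does not prescribe a specific procedure at this point), intra-domain links can be discarded by comparing host names:

from urllib.parse import urlsplit

def is_intra_domain(src_url: str, dst_url: str) -> bool:
    """Treat a link as navigational (non-informative) when source and destination share a host."""
    return urlsplit(src_url).hostname == urlsplit(dst_url).hostname

def filter_links(links):
    """Keep only cross-domain links; `links` is an iterable of (source, destination) URL pairs."""
    return [(s, d) for s, d in links if not is_intra_domain(s, d)]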

Related work on link structures. Prior to the World
Wide Web age, link structures were studied in the
area of bibliometrics, which studies the citation
structure of written documents [15,23]. Many works
in this area were aimed at finding high-impact papers
published in scientific journals [10], and at clustering
related documents [1].
Some works have studied the Web’s link structure,
in addition to the textual content of the pages, as
a means to visualize areas thought to contain good
resources [3]. Other works used link structures for
categorizing pages and clustering them [21,24].
Marchiori [19] uses the link-structure of the Web
to enhance search results of term-based search en-
gines. This is done by considering the potential
hyper-information contained in each Web page: the
information that can be found when following hyper-
links which originate in the page.
This work is motivated by the approach introduced
by Jon Kleinberg [16]. In an attempt to impose some
structure on the chaotic World Wide Web, Kleinberg
distinguished between two types of Web sites which
pertain to a certain topic. The first are authoritative
pages in the sense described previously. The second
type of sites are hub pages. Hubs are resource lists.
They do not directly contain information pertaining to
the topic, but rather point to many authoritative sites.
According to this model, hubs and authorities exhibit
a mutually reinforcing relationship: good hubs point
to many good authorities, and good authorities are
pointed at by many good hubs.
In light of the mutually reinforcing relation-
ship, hubs and authorities should form communities,
which can be pictured as dense bipartite portions
of the Web, where the hubs link densely to the
authorities. The most prominent community in a
World Wide Web subgraph is called the principal
community of the collection. Kleinberg suggested an
algorithm to identify these communities, which is
described in detail in Section 2.
Researchers from IBM’s Almaden Research Cen-
ter have implemented Kleinberg’s algorithm in vari-
ous projects. The first was HITS, which is described
in [11], and offers some enlightening practical re-
marks. The ARC system, described in [7], augments
Kleinberg’s link-structure analysis by considering
also the anchor text, the text which surrounds the
hyperlink in the pointing page. The reasoning behind
this is that many times, the pointing page describes
the destination page’s contents around the hyper-
link, and thus the authority conferred by the links
can be better assessed. These projects were extended
by the CLEVER project [14]. Researchers from out-
side IBM, such as Henzinger and Bharat, have also
studied Kleinberg’s approach and have proposed im-
provements to it [13].
Anchor text has also been used by Brin and Page
in [2]. Another major feature of their work on the
Google search engine [12] is a link-structure based
site ranking approach called PageRank, which can be
interpreted as a stochastic analysis of some random-
walk behavior through the entire World Wide Web.
In [18], the authors use the links surrounding a
small set of same-topic sites to assemble a larger col-
lection of neighboring pages which should contain
many authoritative resources on the initial topic. The
textual content of the collection is then analyzed in
ranking the relevancy of its individual pages.
This work. While preserving the theme that Web
sites pertaining to a given topic should be split into
hubs and authorities, we replace Kleinberg’s mutual
reinforcement approach [16] by a new stochastic
approach (SALSA), in which the coupling between
hubs and authorities is less tight. The intuition be-
hind our approach is the following. Consider a bi-
partite graph G, whose two parts correspond to hubs
and authorities, where an edge between hub r and
authority s means that there is an informative link
from r to s. Then, authorities and hubs pertaining to
the dominant topic of the sites in G should be highly
visible (reachable) from many sites in G. Thus, we
will attempt to identify these sites by examining cer-
tain random walks in G, under the proviso that such
random walks will tend to visit these highly visi-
ble sites more frequently than other, less connected
sites. We show that in finding the principal com-
munities of hubs and authorities, both Kleinberg’s
mutual reinforcement approach and our stochastic
approach employ the same meta-algorithm on dif-
ferent representations of the input graph. We then
compare the results of applying SALSA to the re-
sults derived by Kleinberg’s approach. Through these
comparisons, we isolate a particular topological phe-
nomenon which we call the Tightly Knit Community
(TKC) effect. In certain scenarios, this effect hampers

the ability of the mutual reinforcement approach to
identify meaningful authorities. We demonstrate that
SALSA is less vulnerable to the TKC effect, and can
find meaningful authorities in collections where the
mutual reinforcement approach fails to do so.
After demonstrating some results achieved by
means of SALSA, we prove that the ranking of sites
in the stochastic approach may be calculated by ex-
amining the weighted in/out degrees of the sites in
G. This result yields that SALSA is computationally
lighter than the mutual reinforcement approach. We
also discuss the reason for our success with analyz-
ing weighted in/out degrees of sites, which previous
work has claimed to be unsatisfactory for identifying
authoritative sites.
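As a rough illustration of how light this computation is, the following Python sketch ranks sites by normalized in/out degree counts. It is a sketch only, covering the simple unweighted, connected case; the exact weighting scheme for the general case is derived in Section 6.

import numpy as np

def degree_ranking(W: np.ndarray):
    """Return (authority, hub) scores for adjacency matrix W (W[i, j] = 1 iff site i links to site j)."""
    in_deg = W.sum(axis=0).astype(float)    # links pointing into each site
    out_deg = W.sum(axis=1).astype(float)   # links leaving each site
    return in_deg / in_deg.sum(), out_deg / out_deg.sum()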
The rest of the paper is organized as follows.
Section 2 recounts Kleinberg’s mutual reinforcement
approach. In Section 3 we view Kleinberg’s approach
from a higher level, and define a meta-algorithm
for link structure analysis. Section 4 presents our
new approach, SALSA. In Section 5 we compare
the two approaches by considering their outputs on
the World Wide Web and on artificial topologies.
Then, in Section 6 we prove the connection between
SALSA and weighted in/out degree rankings of
sites. Our conclusions and ideas for future work are
presented in Section 7. The paper uses basic results
from the theory of stochastic processes, which are
presented in the full version. The main contribution of
the paper can be grasped without following the full
mathematical analysis.
2. Kleinberg’s mutual reinforcement approach
The mutual reinforcement approach [16] starts
by assembling a collection C of Web sites, which
should contain communities of hubs and authorities
pertaining to a given topic t. It then analyzes the link
structure induced by that collection, in order to find
the authoritative sites on topic t.
Denote by q a term-based search query to which
sites in our topic of interest t are deemed to be relevant.
The collection C is assembled in the following manner
(a sketch of this process follows the list).
• A root set S of sites is obtained by applying a
term-based search engine, such as AltaVista [8],
to the query q. This is the only step in which the
lexical content of the Web sites is examined.
• From S we derive a base set C which consists of
(a) sites in the root set S, (b) sites which point to
a site in S, and (c) sites which are pointed to by
a site in S. In order to obtain (b), we must again
use a search engine. Many search engines store
linkage information, and support queries such as
‘which sites point to [a given URL]’.
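A minimal Python sketch of this assembly follows. It is a sketch under stated assumptions: search, out_links, and in_links are hypothetical stand-ins for a term-based search engine and a connectivity (link-index) service, not APIs used in the paper, and the size limits are illustrative.

from typing import Iterable, Set

def search(query: str, limit: int) -> Iterable[str]:
    """Stand-in for a term-based search engine (e.g. AltaVista in the paper)."""
    raise NotImplementedError

def out_links(url: str) -> Iterable[str]:
    """Stand-in: URLs that the page at `url` points to."""
    raise NotImplementedError

def in_links(url: str, limit: int) -> Iterable[str]:
    """Stand-in: URLs pointing to `url` (requires a link index / connectivity server)."""
    raise NotImplementedError

def assemble_collection(query: str, root_size: int = 200, in_limit: int = 50) -> Set[str]:
    root_set = set(search(query, limit=root_size))        # root set S: the only lexical step
    base_set = set(root_set)
    for page in root_set:
        base_set.update(out_links(page))                  # (c) sites pointed to by a site in S
        base_set.update(in_links(page, limit=in_limit))   # (b) sites pointing to a site in S
    return base_set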
The collection C and its link structure induce the
following directed graph G: G’s nodes are the sites
in C, and for all i, j ∈ C the directed edge i → j
appears in G if and only if site i contains a hyperlink
to site j. Let W denote the |C| × |C| adjacency matrix
of G.
Each site s ∈ C is now assigned a pair of weights,
a hub weight h(s) and an authority weight a(s),
based on the following two principles:
• The quality of a hub is determined by the quality
of the authorities it points at. Specifically, a site’s
hub weight should be proportional to the sum of
the authority weights of the sites it points at.
• Authority lies in the eyes of the beholder(s):
a site is authoritative only if good hubs deem
it as such. Hence, a site’s authority weight is
proportional to the sum of the hub weights of the
sites pointing at it.
The top ranking sites, according to both kinds
of weights, form the mutually reinforcing commu-
nities of hubs and authorities. In order to assign
such weights, Kleinberg uses the following iterative
algorithm:
(1) Initialize a(s) ← 1, h(s) ← 1 for all sites s ∈ C.
(2) Repeat the following three operations until convergence:
• Update the authority weight of each site s (the I operation):
  a(s) ← Σ_{x : x points to s} h(x)
• Update the hub weight of each site s (the O operation):
  h(s) ← Σ_{x : s points to x} a(x)
• Normalize the authority weights and the hub weights.
Note that applying the I operation is equivalent to
assigning authority weights according to the result
of multiplying the vector of all hub weights by the
matrix W^T; similarly, the O operation multiplies the
vector of all authority weights by W.
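For concreteness, the following is a minimal NumPy sketch of this iteration, assuming the adjacency matrix W defined above is available as a dense array; the L2 normalization and the convergence test used here are illustrative choices, not details prescribed by the paper.

import numpy as np

def hits(W: np.ndarray, tol: float = 1e-8, max_iter: int = 1000):
    """Iterate Kleinberg's I and O operations on adjacency matrix W (W[i, j] = 1 iff site i links to site j)."""
    n = W.shape[0]
    a = np.ones(n)                        # authority weights a(s)
    h = np.ones(n)                        # hub weights h(s)
    for _ in range(max_iter):
        a_new = W.T @ h                   # I operation: a(s) <- sum of h(x) over x pointing to s
        h_new = W @ a_new                 # O operation: h(s) <- sum of a(x) over x pointed to by s
        a_new /= np.linalg.norm(a_new)    # normalize both weight vectors
        h_new /= np.linalg.norm(h_new)
        if np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol):
            a, h = a_new, h_new
            break
        a, h = a_new, h_new
    return a, h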
