
Strong Regularities in World Wide Web Surfing

Bernardo A. Huberman, Peter L. T. Pirolli, James E. Pitkow, Rajan M. Lukose
Science, 03 Apr 1998, Vol. 280, Iss. 5360, pp. 95-97

Strong Regularities in World Wide Web Surfing
Bernardo A. Huberman, Peter L. T. Pirolli, James E. Pitkow and Rajan M. Lukose
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304
Abstract
One of the most common modes of accessing information in the World Wide Web
(WWW) is surfing from one document to another along hyperlinks. Several large
empirical studies have revealed common patterns of surfing behavior. A model which
assumes that users make a sequence of decisions to proceed to another page, continuing
as long as the value of the current page exceeds some threshold, yields the probability
distribution for the number of pages, or depth, that a user visits within a Web site. This
model was verified by comparing its predictions with detailed measurements of surfing
patterns. It also explains the observed Zipf-like distributions in page hits observed at
WWW sites.

The exponential growth of the World Wide Web (WWW) is making it the standard
information system for an increasing segment of the world's population. From electronic
commerce and information resources to entertainment, the Web allows inexpensive and
fast access to unique and novel services provided by individuals and institutions scattered
throughout the world (1).
In spite of the advantages of this new medium, there are a number of ways in which the
Internet still fails to serve the needs of the user community. Surveys of WWW users find
that slow access and inability to find relevant information are the two most frequently
reported problems (2). The slow access has to do at least partly with congestion problems
(3) whereas the difficulty in finding useful information is related to the balkanization of
the Web structure (4). Since it is hard to solve this fragmentation problem by designing
an effective and efficient classification scheme, an alternative approach is to seek
regularities in user patterns that can then be used to develop technologies for increasing
the density of relevant data for users.
A common way of finding information on the WWW is through query-based search
engines, which allow for quick access to information that is often not the most relevant.
This lack of relevance is partly due to the impossibility of cataloguing an exponentially
growing amount of information in ways that anticipate users’ needs. But since the WWW
is structured as a hypermedia system, in which documents are linked to one another by
authors, it also supports an alternative and effective mode of use, one in which users surf
from one document to another along hypermedia links that appear relevant to their
interests.
In what follows we describe several strong regularities of WWW user surfing patterns
discovered through extensive empirical studies using different user communities. These
regularities can be described by a law of surfing, derived below, which determines the
probability distribution of the number of pages a user visits within a Web site. In
conjunction with a spreading activation algorithm, the law can be used to simulate the
surfing patterns of users on a given Web topology. This leads to accurate predictions of
page hits. Moreover, it explains the observed Zipf-like distributions of page hits to
WWW sites (5).
We start by deriving the probability P(L) of the number of links L that a user follows in a
Web site. This can be done by considering that there is value in each page a user visits,
and that clicking on the next page assumes that it will be valuable as well. Since the
value of the next page is not certain, one can assume that it is stochastically related to the
previous one. In other words, the value of the current page is the value of the previous
one plus or minus a random term. Thus, the page values can be written as
$$V_{L} \;=\; V_{L-1} \;+\; \varepsilon_{L} \qquad (1)$$
where the values ε_L are independent and identically distributed Gaussian random
variables. Notice that a particular sequence of page valuations is a realization of a random
process and so is different for each user. Within this formulation, an individual will
continue to surf until the expected cost of continuing is perceived to be larger than the
discounted expected value of the information to be found in the future. This can be
thought of as a real option in financial economics, for which it is well known that there is
a threshold value for exercising the option to continue (6,7). Note that even if the value of
the current page is negative, it may be worthwhile to proceed, since a collection of high
value pages may still be found. If the value is sufficiently negative, however, then it is no
longer worth the risk to continue. That is, when V_L falls below some threshold value, it is optimal to stop.
The number of links a user follows before the page value first reaches the stopping threshold is a random variable L. For the random walk of Eq. 1, the probability distribution of first passage times to a threshold is given asymptotically by the two-parameter inverse Gaussian distribution (8)

$$P(L) \;=\; \sqrt{\frac{\lambda}{2\pi L^{3}}}\;\exp\!\left[-\frac{\lambda\,(L-\mu)^{2}}{2\mu^{2}L}\right] \qquad (2)$$

with mean E[L] = µ and variance Var[L] = µ³/λ.
This distribution has two characteristics worth stressing in the context of user surfing
patterns. First, it has a very long tail, which extends much further than that of a normal
distribution with comparable mean and variance. This implies a finite probability for
events that would be unlikely if described by a normal distribution. Consequently, large
deviations from the average number of user clicks computed at a site will be observed.
Second, because of the asymmetry of the distribution function, the typical behavior of
users will not be the same as their average behavior. Thus, since the mode is lower than
the mean, care has to be exercised with available data on the average number of clicks, as
it overestimates the typical depth being surfed.
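To make the second point concrete, the mode of the inverse Gaussian has a standard closed form (a known property of the distribution, stated here for illustration; it is not given in the paper):

$$\mathrm{mode}[L] \;=\; \mu\left[\left(1 + \frac{9\mu^{2}}{4\lambda^{2}}\right)^{1/2} - \frac{3\mu}{2\lambda}\right] \;<\; \mu \;=\; E[L],$$

which is always below the mean, and for a fixed mean the gap widens as the variance µ³/λ grows, consistent with the typical surfing depth lying below the average depth.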
In order to test the validity of Eq. 2, we performed an analysis of data collected from a
representative sample of America Online (AOL) WWW users. For each day of
November 29, 30, and December 1, 3, and 5, 1997, the entire activity of one of AOL's
caching-proxies was instrumented to record an anonymous but unique user identifier, the
time of each URL request, and the requested URL. For each day in the AOL sample,
there were between 3,247,054 and 9,120,199 requests for Web pages. To compare with
the predicted distribution, a user that starts surfing at a particular site, such as
http://www.sciencemag.org/, is said to have stopped surfing after L links as soon as she requests a page from a different Web site. For this analysis, if the user later returned to that site, a new length count L was started. Requests for embedded media such as images were not counted.
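As a concrete sketch of this measurement procedure (the log format and field names here are hypothetical, not the actual AOL proxy format), per-site click lengths can be computed by scanning each user's time-ordered requests and closing a count whenever the requested host changes:

```python
from urllib.parse import urlparse
from collections import defaultdict

def click_lengths(requests):
    """requests: iterable of (user_id, timestamp, url) tuples, assumed already
    filtered to page requests (embedded media such as images excluded).
    Returns a list of surfing depths L, one per visit to a site."""
    # Group requests by user, preserving time order.
    by_user = defaultdict(list)
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user[user].append(urlparse(url).netloc)

    lengths = []
    for sites in by_user.values():
        current_site, count = None, 0
        for site in sites:
            if site == current_site:
                count += 1
            else:
                if count:
                    lengths.append(count)   # user left the previous site after `count` clicks
                current_site, count = site, 1
        if count:
            lengths.append(count)           # close the final visit
    return lengths

# Toy log (hypothetical data).
log = [
    ("u1", 1, "http://www.sciencemag.org/"),
    ("u1", 2, "http://www.sciencemag.org/content"),
    ("u1", 3, "http://www.xerox.com/"),
    ("u1", 4, "http://www.sciencemag.org/"),   # a return starts a new count
]
print(click_lengths(log))   # -> [2, 1, 1]
```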
On December 5, 1997, the 23,692 AOL users in our sample made 3,247,054 page
requests from 1,090,168 Web sites. Fig. 1 shows the measured Cumulative Distribution
Function (CDF) of the click length L for that day. Superimposed is the predicted one from the inverse Gaussian distribution fitted by the method of moments (8). To test the quality of the fit, a Quantile-Quantile plot was analyzed against the fitted distribution. Both techniques, along with a study of the regression residuals, confirmed the strong fit of the empirical data to the theoretical distribution. The fit was significant at the p < 0.001 level and accounted for 99% of the variance. Notice that while the average number of pages
surfed at a site is almost three, users typically request only one page. Other AOL data from different dates showed the same strength of fit to the inverse Gaussian with nearly the same parameters.
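The method-of-moments fit itself is straightforward: since E[L] = µ and Var[L] = µ³/λ, the estimates come directly from the sample mean and variance. A minimal sketch follows, reusing the reported December 5 parameters only to illustrate the mode-versus-mean gap; the function names and toy data are my own.

```python
import numpy as np

def fit_inverse_gaussian(lengths):
    """Method-of-moments estimates for Eq. 2: mu = sample mean, lambda = mu^3 / sample variance."""
    x = np.asarray(lengths, dtype=float)
    mu = x.mean()
    return mu, mu**3 / x.var()

def inverse_gaussian_mode(mu, lam):
    """Closed-form mode of the inverse Gaussian (see the expression given earlier)."""
    return mu * (np.sqrt(1 + 9 * mu**2 / (4 * lam**2)) - 3 * mu / (2 * lam))

print(fit_inverse_gaussian([1, 1, 2, 1, 3, 9, 1, 2, 5, 1]))   # toy data, illustrative only

# With the parameters reported for the December 5, 1997 sample (mu = 2.98, lambda = 6.24),
# the modal depth of the fitted distribution is about 1.5 pages, versus a mean of about 3.
print(inverse_gaussian_mode(2.98, 6.24))
```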
For further confirmation of the model, we considered the simplest alternative hypothesis,
in which a user at each page simply conducts an independent Bernoulli trial to make a
stopping decision. This leads to a geometric distribution of click lengths, which was
found to be a poor fit to the data.
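The geometric alternative has a single parameter p, the per-page stopping probability, with maximum-likelihood estimate p = 1/mean(L). One way to compare the two models on a set of click lengths is by total log-likelihood; this is a simpler criterion than the regression diagnostics used in the paper, offered only as a sketch.

```python
import numpy as np

def geometric_loglik(lengths):
    """Log-likelihood of a geometric stopping model, P(L = k) = (1 - p)^(k - 1) * p,
    with the maximum-likelihood estimate p = 1 / mean(L)."""
    x = np.asarray(lengths, dtype=float)
    p = 1.0 / x.mean()
    return np.sum((x - 1) * np.log(1 - p) + np.log(p))

def inverse_gaussian_loglik(lengths):
    """Log-likelihood under Eq. 2 with method-of-moments parameters
    (a continuous density evaluated at integer depths, adequate for a rough comparison)."""
    x = np.asarray(lengths, dtype=float)
    mu = x.mean()
    lam = mu**3 / x.var()
    return np.sum(0.5 * np.log(lam / (2 * np.pi * x**3))
                  - lam * (x - mu)**2 / (2 * mu**2 * x))

# On depth data with the long tail described above, the inverse Gaussian typically
# attains the higher total log-likelihood, which is the sense in which the
# geometric model fits poorly.
```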
We also examined the navigational patterns of the Web user population at Georgia
Institute of Technology for a period of three weeks starting August 3, 1994. The data
were collected from an instrumented version of NCSA's Xmosaic that was deployed
across the student, faculty, and staff of the College of Computing (9). One hundred and seven users (67% of those invited) chose to participate in the experiment. The instrumentation of Xmosaic recorded all user interface events. Seventy-three percent of all collected events were navigational, resulting in 31,134 page requests. As with the
previous experiment, the surfing depth of users was calculated across all visits to each
site for the duration of the study. The combined data has a mean number of clicks of 8.32
and a variance of 2.77. Comparison of the Quantile-Quantile and CDF plots and a regression analysis of the observed data against an inverse Gaussian distribution of the same mean and variance confirmed the ability of the law of surfing to fit the data (R² of 0.95, p < 0.001).
It is important to note that the model is able to fit surfing behavior using data sets from diverse user communities, collected at dramatically different time periods, with different browsers and connection speeds.
Figure 1. The Cumulative Distribution Function of the number of clicks surfed by AOL users, showing the experimental CDF and the fitted inverse Gaussian. The observed data were collected on December 5, 1997 from a representative sample of 23,692 AOL users who made 3,247,054 clicks. The fitted inverse Gaussian distribution has a mean of µ = 2.98 and λ = 6.24.

An interesting implication of the law of surfing can be obtained by taking logarithms on both sides of Eq. 2. One obtains
$$\log P(L) \;=\; -\frac{3}{2}\log L \;-\; \frac{\lambda\,(L-\mu)^{2}}{2\mu^{2}L} \;+\; \frac{1}{2}\log\frac{\lambda}{2\pi} \qquad (3)$$
That is, on a log-log plot one observes a straight line whose slope approximates −3/2 for small values of L and large values of the variance. As L gets larger, the second term provides a downward correction. Thus Eq. 3 implies that, up to a constant given by the third term, the probability of finding a group surfing at a given level scales inversely in proportion to its depth, P(L) ∝ L^(-3/2). This Pareto scaling relation was verified by plotting
the available data on a logarithmic scale. Fig. 2 shows that for a range of click lengths the
inverse proportionality holds well.
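One quick way to check the Pareto prediction on a set of click lengths is to regress log frequency on log depth over small depths, where the correction term in Eq. 3 is still negligible. This is a rough sketch, not the procedure used for Fig. 2.

```python
import numpy as np

def loglog_slope(lengths, max_depth=10):
    """Least-squares slope of log(frequency) versus log(depth) for depths 1..max_depth.
    Eq. 3 predicts a slope near -3/2 over the range where the correction term is small."""
    x = np.asarray(lengths)
    depths = np.arange(1, max_depth + 1)
    freq = np.array([(x == d).sum() for d in depths], dtype=float)
    keep = freq > 0                           # drop empty bins before taking logs
    slope, _intercept = np.polyfit(np.log(depths[keep]), np.log(freq[keep]), 1)
    return slope
```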
The previous data validated the law of surfing for a population of users who had no
constraints on the Web sites they visited. We also considered the case of surfing within a
single large Web site, which is important from the point of view of site design. The site
used was the Xerox Corporation’s external WWW site (http://www.xerox.com). During
the period of August 23 through August 30, 1997, the Xerox site consisted of 8,432
HTML documents and received an average of 165,922 requests per day. The paths of
individual users were reconstructed by a set of heuristics that used unique identifiers, i.e.,
cookies, when present, or otherwise used the topology of the site along with other
information to disambiguate users behind proxies. Automatic programs that request the
entire contents of the site, a.k.a. spiders, were removed from the analysis. Additionally, a
stack-based history mechanism was used to infer pages cached either by the client or by
intermediary caches. This resulted in a data set consisting of the full paths of users and the number of clicks performed at the Xerox Web site.
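A stack-based history heuristic of this general kind can be sketched as follows. This is a simplified illustration that assumes the site topology is available as a dictionary of outgoing links; the actual Xerox heuristics also used cookies and other information.

```python
def reconstruct_path(requests, links):
    """Infer a user's full navigation path, including 'back' steps served from a
    client or proxy cache and therefore never logged by the server.
    requests: time-ordered list of logged page URLs for one user.
    links: dict mapping each page to the set of pages it links to (site topology)."""
    path, stack = [], []
    for page in requests:
        # Pop the history stack until the page on top links to the new request;
        # each pop corresponds to an inferred (cached) revisit of an earlier page.
        while stack and page not in links.get(stack[-1], set()):
            stack.pop()
            if stack:
                path.append(stack[-1])   # cached revisit inferred from the topology
        path.append(page)                # the logged request itself
        stack.append(page)
    return path

# Toy example (hypothetical pages): the return to "/" is never logged but is inferred.
links = {"/": {"/products", "/about"}, "/products": {"/products/copiers"}}
print(reconstruct_path(["/", "/products", "/about"], links))
# -> ['/', '/products', '/', '/about']
```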
Fig. 3 shows the Cumulative Distribution Function plot of the Xerox WWW site for
August 26, 1997 against the fitted inverse Gaussian defined by Eq. 2. The mean number
Figure 2. The frequency distribution of surfing clicks on log-log scales. Data collected from the Georgia Institute of Technology, August 1994.

References

A. K. Dixit and R. S. Pindyck, Investment Under Uncertainty (book).
L. D. Catledge and J. E. Pitkow, Characterizing browsing strategies in the World-Wide Web (journal article).
P. Pirolli, J. Pitkow, and R. Rao, Silk from a sow's ear: extracting usable structures from the Web (conference proceedings).
V. Seshadri, The Inverse Gaussian Distribution: Statistical Theory and Applications (book).