
Self-similarity in World Wide Web traffic: evidence and possible causes

Mark E. Crovella and Azer Bestavros
01 Dec 1997, IEEE/ACM Transactions on Networking, Vol. 5, Iss. 6, pp. 835-846
Abstract
The notion of self-similarity has been shown to apply to wide-area and local-area network traffic. We show evidence that the subset of network traffic that is due to World Wide Web (WWW) transfers can show characteristics that are consistent with self-similarity, and we present a hypothesized explanation for that self-similarity. Using a set of traces of actual user executions of NCSA Mosaic, we examine the dependence structure of WWW traffic. First, we show evidence that WWW traffic exhibits behavior that is consistent with self-similar traffic models. Then we show that the self-similarity in such traffic can be explained based on the underlying distributions of WWW document sizes, the effects of caching and user preference in file transfer, the effect of user "think time", and the superimposition of many such transfers in a local-area network. To do this, we rely on empirically measured distributions both from client traces and from data independently collected at WWW servers.



In Proc. of the 1996 ACM SIGMETRICS Intl. Conference on Measurement and Modeling of Computer Systems, Philadelphia, PA, May 1996
Self-Similarity in World Wide Web Traffic:
Evidence and Possible Causes
Mark E. Crovella and Azer Bestavros
Computer Science Department
Boston University
Boston, MA 02215
{crovella,best}@cs.bu.edu
Abstract
Recently the notion of self-similarity has been shown to apply to wide-area and local-area network traffic. In this paper we examine the mechanisms that give rise to the self-similarity of network traffic. We present a hypothesized explanation for the possible self-similarity of traffic by using a particular subset of wide area traffic: traffic due to the World Wide Web (WWW). Using an extensive set of traces of actual user executions of NCSA Mosaic, reflecting over half a million requests for WWW documents, we examine the dependence structure of WWW traffic. While our measurements are not conclusive, we show evidence that WWW traffic exhibits behavior that is consistent with self-similar traffic models. Then we show that the self-similarity in such traffic can be explained based on the underlying distributions of WWW document sizes, the effects of caching and user preference in file transfer, the effect of user "think time", and the superimposition of many such transfers in a local area network. To do this we rely on empirically measured distributions both from our traces and from data independently collected at over thirty WWW sites.
1 Introduction
Understanding the nature of network traffic is critical in order to properly design and implement computer networks and network services like the World Wide Web. Recent examinations of LAN traffic [16] and wide area network traffic [20] have challenged the commonly assumed models for network traffic, e.g., the Poisson distribution. Were traffic to follow a Poisson or Markovian arrival process, it would have a characteristic burst length which would tend to be smoothed by averaging over a long enough time scale. Rather, measurements of real traffic indicate that significant traffic variance (burstiness) is present on a wide range of time scales.
Trac that is burstyonman
y or all time scales can be de-
scribed statistically using the notion of
self-similarity
, whichis
This work was supp orted in part by NSF grants CCR-9501822 and
CCR-9308344.
Permission to make digital or hard copies of part or all of this work for
personal or classro om use is granted without fee provided that copies
are not made or distributed for prot or commercial advantage and that
copies bear this notice and the full citation on the rst page. Copy-
rights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to
republish, to p ost on servers or to redistribute to lists, requires prior
specic p ermission and/or a fee.
SIGMETRICS 96-5/96 Philadelphia, PA, USA
c
1996 ACM
a propertywe asso ciate with fractals|ob jects whose app ear-
ance is unchanged regardless of the scale at which they are
viewed. In the case of sto chastic ob jects like timeseries, self-
similarity is used in the distributional sense: when viewed at
varying scales, the ob ject's distribution remains unchanged.
Since a self-similar pro cess has observable bursts on all time
scales, it exhibits
long-range dependence
values at any instant
are typically correlated with all future values. Surprisingly
(given the counterintuitive aspects of long-range dep endence)
the self-similarity of Ethernet network trac has b een rigor-
ously established 16 ].
The imp ortance of long-range dep en-
dence in network trac is beginning to be observed in studies
such as 15], whichshow that packet loss and delay b ehavior is
radically dierentinsimulations using real trac data rather
than traditional network models.
However, the reasons behind network traffic self-similarity have not been clearly identified. In this paper we show that in some cases, network traffic self-similarity can be explained in terms of file system characteristics and user behavior. In the process, we trace the genesis of network traffic self-similarity back from the traffic itself, through the actions of file transmission, caching systems, and user choice, to the distributions of file sizes and user event interarrivals.
To bridge the gap between studying network traffic on one hand and high-level system characteristics on the other, we make use of two essential tools. First, to explain self-similar network traffic in terms of individual transmission lengths, we employ the mechanism introduced in [17] and described in [16]. Those papers point out that self-similar traffic can be constructed by multiplexing a large number of ON/OFF sources that have ON and OFF period lengths that are heavy-tailed, as defined in Section 2.2. Such a mechanism could correspond to a network of workstations, each of which is either silent or transferring data at a constant rate.
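To make the ON/OFF construction concrete, here is a minimal sketch (Python; the function and parameter names are ours, and this is an illustration rather than the construction of [17]) that multiplexes many sources whose ON and OFF period lengths are Pareto-distributed with $1 < \alpha < 2$:

```python
import numpy as np

def pareto(rng, alpha, k, size):
    # Pareto samples via the inverse CDF: x = k * u**(-1/alpha), u in (0, 1]
    return k * (1.0 - rng.random(size)) ** (-1.0 / alpha)

def on_off_aggregate(n_sources=100, horizon=10_000, alpha=1.2, rate=1.0, seed=0):
    """Sum of ON/OFF sources whose ON and OFF period lengths are
    heavy-tailed (Pareto, 1 < alpha < 2). Returns load per unit time slot."""
    rng = np.random.default_rng(seed)
    load = np.zeros(horizon)
    for _ in range(n_sources):
        t, on = 0.0, bool(rng.random() < 0.5)      # random initial state
        while t < horizon:
            length = pareto(rng, alpha, k=1.0, size=1)[0]
            if on:                                 # emit `rate` per slot while ON
                lo = int(t)
                hi = min(int(t + length) + 1, horizon)
                load[lo:hi] += rate
            t += length
            on = not on
    return load

if __name__ == "__main__":
    series = on_off_aggregate()
    print("mean load:", series.mean(), "peak load:", series.max())
```

Aggregated at increasing bin sizes, the resulting load series should stay visibly bursty, unlike a superposition of sources with exponentially distributed period lengths.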
Our second tool in bridging the gap between transmission times and high-level system characteristics is our use of the World Wide Web (WWW or Web) as an object of study. The Web provides a special opportunity for studying network traffic because it is a "closed" system: all traffic arises as the result of file transfers from an easily studied set, and user activity is easily monitored.
To study the trac patterns of the WWW we collected
reference data reecting actual WWW use at our site. Wein-
strumented NCSA Mosaic 9] to capture user access patterns
to the Web. Since at the time of our data collection, Mo-
saic was by far the dominant WWW browser at our site, we
were able to capture a fairly complete picture of Web trac
on our local network our dataset consists of more than half
a million user requests for do cument transfers,
and includes

detailed timing of requests and transfer lengths. In addition
we surveyed a number of WWW servers to capture do cument
size information that we used to validate assumptions made in
our analysis.
The paper takes two parts. First, we consider the possibility of self-similarity of Web traffic for the busiest hours we measured. To do so we use analyses very similar to those performed in [16]. These analyses support the notion that Web traffic may show self-similar characteristics, at least when demand is high enough. This result in itself has implications for designers of systems that attempt to improve performance characteristics of the WWW.
Second, using our WWW traffic, user preference, and file size data, we comment on reasons why the transmission times and quiet times for any particular Web session are heavy-tailed, which is an essential characteristic of the proposed mechanism for self-similarity of traffic. In particular, we argue that many characteristics of WWW use can be modelled using heavy-tailed distributions, including the distribution of transfer times, the distribution of user requests for documents, and the underlying distribution of document sizes available in the Web. In addition, using our measurements of user inter-request times, we explore reasons for the heavy-tailed distribution of quiet times needed for self-similarity.
2 Background
2.1 Denition of Self-Similarity
For detailed discussion of self-similarity in timeseries data and the accompanying statistical tests, see [2, 27]. Our discussion in this subsection and the next closely follows those sources.

A self-similar time series has the property that when aggregated (leading to a shorter time series in which each point is the sum of multiple original points) the new series has the same autocorrelation function as the original. That is, given a stationary timeseries $X = (X_t; t = 0, 1, 2, \ldots)$, we define the $m$-aggregated series $X^{(m)} = (X_k^{(m)}; k = 1, 2, 3, \ldots)$ by summing the original series $X$ over nonoverlapping blocks of size $m$. Then if $X$ is self-similar, it has the same autocorrelation function $r(k) = E[(X_t - \mu)(X_{t+k} - \mu)]$ as the series $X^{(m)}$ for all $m$. Note that this means that the series is distributionally self-similar: the distribution of the aggregated series is the same (except for changes in scale) as that of the original.
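As an illustration of this definition, the following sketch (helper names are ours) computes the $m$-aggregated series and a sample autocorrelation function so the two can be compared directly:

```python
import numpy as np

def m_aggregate(x, m):
    """X^(m): sum the series over nonoverlapping blocks of size m."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // m) * m
    return x[:n].reshape(-1, m).sum(axis=1)

def autocorr(x, max_lag):
    """Sample autocorrelation r(k) = E[(X_t - mu)(X_{t+k} - mu)] / Var(X)."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    return np.array([np.mean((x[:len(x) - k] - mu) * (x[k:] - mu)) / var
                     for k in range(1, max_lag + 1)])

# For an (approximately) self-similar series, autocorr(x, L) and
# autocorr(m_aggregate(x, m), L) should look alike over a range of m.
```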
As a result, self-similar processes show long-range dependence. A process with long-range dependence has an autocorrelation function $r(k) \sim k^{-\beta}$ as $k \to \infty$, where $0 < \beta < 1$. Thus the autocorrelation function of such a process decays hyperbolically (as compared to the exponential decay exhibited by traditional traffic models). Hyperbolic decay is much slower than exponential decay, and since $\beta < 1$, the sum of the autocorrelation values of such a series approaches infinity.

This has a number of implications. First, the variance of the mean of $n$ samples from such a series does not decrease proportionally to $1/n$ (as basic statistics predicts for uncorrelated datasets), but rather proportionally to $n^{-\beta}$. Second, the power spectrum of such a series is hyperbolic, rising to infinity at frequency zero, reflecting the "infinite" influence of long-range dependence in the data.
One of the attractive features of using self-similar models for time series, when appropriate, is that the degree of self-similarity of a series is expressed using only a single parameter. The parameter expresses the speed of decay of the series' autocorrelation function. For historical reasons, the parameter used is the Hurst parameter $H = 1 - \beta/2$. Thus, for self-similar series, $1/2 < H < 1$. As $H \to 1$, the degree of self-similarity increases. Thus the fundamental test for self-similarity of a series reduces to the question of whether $H$ is significantly different from $1/2$.
In this paper we use four methods to test for self-similarity. These methods are described fully in [2] and are the same methods described and used in [16]. A summary of the relative accuracy of these methods on synthetic datasets is presented in [24].
The first method, the variance-time plot, relies on the slowly decaying variance of a self-similar series. The variance of $X^{(m)}$ is plotted against $m$ on a log-log plot; a straight line with slope $(-\beta)$ greater than $-1$ is indicative of self-similarity, and the parameter $H$ is given by $H = 1 - \beta/2$. The second method, the $R/S$ plot, uses the fact that for a self-similar dataset, the rescaled range or $R/S$ statistic grows according to a power law with exponent $H$ as a function of the number of points included ($n$). Thus the plot of $R/S$ against $n$ on a log-log plot has a slope which is an estimate of $H$. The third approach, the periodogram method, uses the slope of the power spectrum of the series as frequency approaches zero. On a log-log plot, the periodogram slope is a straight line with slope $\beta - 1 = 1 - 2H$ close to the origin.
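As a concrete example of the first of these estimators, the sketch below computes a variance-time estimate of $H$. It uses block means for the aggregated series (a common normalization) and a least-squares fit to the log-log points; it is an illustration of the idea, not the exact procedure of [2]:

```python
import numpy as np

def variance_time_H(x, m_values=None):
    """Variance-time estimate of H: regress log Var of the block-mean
    series on log m; the slope is -beta and H = 1 - beta/2.
    Assumes a reasonably long series (thousands of points)."""
    x = np.asarray(x, dtype=float)
    if m_values is None:
        m_values = np.unique(
            np.logspace(0, np.log10(len(x) // 10), 20).astype(int))
    variances = []
    for m in m_values:
        n = (len(x) // m) * m
        block_means = x[:n].reshape(-1, m).mean(axis=1)
        variances.append(block_means.var())
    slope = np.polyfit(np.log10(m_values), np.log10(variances), 1)[0]
    beta = -slope
    return 1.0 - beta / 2.0
```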
While the preceding three graphical methods are useful for exposing faulty assumptions (such as non-stationarity in the dataset) they do not provide confidence intervals. The fourth method, called the Whittle estimator, does provide a confidence interval, but has the drawback that the form of the underlying stochastic process must be supplied. The two forms that are most commonly used are fractional Gaussian noise (FGN) with parameter $1/2 < H < 1$, and Fractional ARIMA $(p, d, q)$ with $0 < d < 1/2$ (for details see [2, 4]). These two models differ in their assumptions about the short-range dependences in the datasets; FGN assumes no short-range dependence while Fractional ARIMA can assume a fixed degree of short-range dependence.
Since we are concerned only with the long-range dependence of our datasets, we employ the Whittle estimator as follows. Each hourly dataset is aggregated at increasing levels $m$, and the Whittle estimator is applied to each $m$-aggregated dataset using the FGN model. The resulting estimates of $H$ and confidence intervals are plotted as a function of $m$. This approach exploits the property that any long-range dependent process approaches FGN when aggregated to a sufficient level. As $m$ increases, short-range dependences are averaged out of the dataset; if the value of $H$ remains relatively constant, we can be confident that it measures a true underlying level of self-similarity.
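The FGN Whittle estimator itself requires the FGN spectral density, which is awkward to reproduce briefly. As a hedged stand-in, the sketch below uses a local Whittle (Robinson-style) estimator, which fits only the lowest periodogram frequencies and estimates $H$ semiparametrically; it is a related technique, not the estimator used in this paper:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def local_whittle_H(x, frac=0.1):
    """Local Whittle estimate of H from the lowest Fourier frequencies
    of the periodogram (a semiparametric relative of the FGN Whittle
    estimator, not the same procedure)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = min(max(int(frac * n), 10), n // 2)        # number of low frequencies used
    freqs = 2.0 * np.pi * np.arange(1, m + 1) / n
    dft = np.fft.fft(x - x.mean())[1:m + 1]
    periodogram = np.abs(dft) ** 2 / (2.0 * np.pi * n)

    def objective(H):
        # R(H) = log(mean(freq^(2H-1) * I)) - (2H-1) * mean(log freq)
        g = np.mean(freqs ** (2.0 * H - 1.0) * periodogram)
        return np.log(g) - (2.0 * H - 1.0) * np.mean(np.log(freqs))

    # search only the long-range dependent region 1/2 < H < 1
    return minimize_scalar(objective, bounds=(0.5, 0.999), method="bounded").x
```

Applying such an estimator to successive $m$-aggregations of an hourly series, and checking that the estimate stays roughly constant, mirrors the stability check described above.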
2.2 Heavy-Tailed Distributions
The distributions we use in this paper have the property of being heavy-tailed. A distribution is heavy-tailed if
$$P[X > x] \sim x^{-\alpha} \quad \text{as } x \to \infty, \qquad 0 < \alpha < 2.$$
That is, regardless of the behavior of the distribution for small values of the random variable, if the asymptotic shape of the distribution is hyperbolic, it is heavy-tailed.
The simplest heavy-tailed distribution is the Pareto distribution. The Pareto distribution is hyperbolic over its entire range; its probability mass function is
$$p(x) = \alpha k^{\alpha} x^{-\alpha - 1}, \qquad \alpha, k > 0, \; x \ge k,$$
and its cumulative distribution function is given by
$$F(x) = P[X \le x] = 1 - (k/x)^{\alpha}.$$
The parameter $k$ represents the smallest possible value of the random variable.
Our results are based on estimating the values of $\alpha$ for a number of empirically measured distributions, such as the lengths of World Wide Web file transmission times. To do so, we employ log-log complementary distribution (LLCD) plots. These are plots of the complementary cumulative distribution $\bar{F}(x) = 1 - F(x) = P[X > x]$ on log-log axes. Plotted in this way, heavy-tailed distributions have the property that
$$\frac{d \log \bar{F}(x)}{d \log x} = -\alpha, \qquad x > \theta$$
for some $\theta$. In practice we obtain an estimate for $\alpha$ by plotting the LLCD plot of the dataset and selecting a value for $\theta$ above which the plot appears to be linear. Then we select equally-spaced points from among the LLCD points larger than $\theta$ and estimate the slope using least-squares regression. Equally-spaced points are used because the point density varies over the range used, and the preponderance of data points near the median would otherwise unduly influence the least-squares regression.
In all our $\alpha$ estimates for file sizes we use $\theta = 1000$, meaning that we consider tails to be the portions of the distributions for files of 1,000 bytes or greater.
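A small sketch of this LLCD procedure (helper names are ours) builds the empirical complementary distribution, keeps the points above the cutoff $\theta$, samples them at roughly equal spacing in $\log x$, and reads $\alpha$ off the regression slope:

```python
import numpy as np

def llcd(x):
    """Points of the log-log complementary distribution plot."""
    xs = np.sort(np.asarray(x, dtype=float))
    ccdf = 1.0 - np.arange(1, len(xs) + 1) / len(xs)   # empirical P[X > x]
    keep = ccdf > 0                                    # drop the final zero
    return np.log10(xs[keep]), np.log10(ccdf[keep])

def estimate_alpha(x, theta=1000.0, n_points=50):
    """Fit a least-squares line to LLCD points above theta, sampled at
    roughly equal spacing in log x; the tail index alpha is -slope."""
    lx, ly = llcd(x)
    tail = lx >= np.log10(theta)
    lx, ly = lx[tail], ly[tail]
    targets = np.linspace(lx[0], lx[-1], min(n_points, len(lx)))
    idx = np.unique(np.clip(np.searchsorted(lx, targets), 0, len(lx) - 1))
    return -np.polyfit(lx[idx], ly[idx], 1)[0]

# Sanity check on synthetic Pareto data (alpha = 1.0, k = 1000):
# rng = np.random.default_rng(1)
# sizes = 1000.0 / (1.0 - rng.random(100_000))
# print(estimate_alpha(sizes))          # should come out close to 1.0
```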
An alternative approach to estimating tail weight, used in [28], is the Hill estimator [11]. The Hill estimator does not give a single estimate of $\alpha$, but can be used to gauge the general range of $\alpha$s that are reasonable. We used the Hill estimator to check that the estimates of $\alpha$ obtained using the LLCD method were within range; in all cases they were.
2.2.1 Testing for Infinite Variance
There is evidence that, over their entire range, many of the distributions we study may be well characterized using lognormal distributions [19]. However, lognormal distributions do not have infinite variance, and hence are not heavy-tailed. In our work, we are not concerned with distributions over their entire range, only their tails. As a result we don't use goodness-of-fit tests to determine whether Pareto or lognormal distributions are better at describing our data. However, it is important to verify that our datasets exhibit the infinite variance characteristic of heavy tails. To do so we use a simple test based on the Central Limit Theorem (CLT), which states that the sum of a large number of i.i.d. samples from any distribution with finite variance will tend to be normally distributed. To test for infinite variance we proceed as follows. First, we form the $m$-aggregated dataset from the original dataset for large values of $m$ (typically in the range 10 to 1000). Next, we inspect the tail behavior of the aggregated datasets using the LLCD plot. For datasets with finite variance, the slope will increasingly decline as $m$ increases, reflecting the underlying distribution's approximation of a normal distribution. For datasets with infinite variance, the slope will remain roughly constant with increasing $m$.
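A simplified version of this CLT test is sketched below. It compares the least-squares LLCD tail slope of the raw data with the slopes of its $m$-aggregated versions; shifting the cutoff with $m$ is a heuristic of ours, not part of the procedure described above:

```python
import numpy as np

def llcd_tail_slope(x, theta):
    """Least-squares slope of the LLCD plot above the cutoff theta."""
    xs = np.sort(np.asarray(x, dtype=float))
    ccdf = 1.0 - np.arange(1, len(xs) + 1) / len(xs)
    keep = (ccdf > 0) & (xs >= theta)
    return np.polyfit(np.log10(xs[keep]), np.log10(ccdf[keep]), 1)[0]

def clt_test_slopes(x, m_levels=(10, 100, 500), theta=1000.0):
    """CLT-based check: roughly constant (parallel) tail slopes across
    aggregation levels suggest infinite variance, as for a Pareto tail;
    steepening slopes suggest finite variance, as for a lognormal."""
    x = np.asarray(x, dtype=float)
    slopes = {1: llcd_tail_slope(x, theta)}
    for m in m_levels:
        n = (len(x) // m) * m
        agg = x[:n].reshape(-1, m).sum(axis=1)
        slopes[m] = llcd_tail_slope(agg, theta * m)   # heuristic: scale cutoff with m
    return slopes
```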
An example is shown in Figure 1. The figure shows the CLT test for aggregation levels of 10, 100, and 500 as applied to two synthetic datasets. On the left the dataset consists of 10,000 samples from a Pareto distribution with $\alpha = 1.0$. On the right the dataset consists of 10,000 samples from a lognormal distribution with $\mu = 2.0$, $\sigma = 2.0$. These parameters were chosen so as to make the Pareto and lognormal distributions appear approximately similar for $\log_{10}(x)$ in the range 0 to 4. In each plot the original LLCD plot for the dataset is the lowermost line; the upper lines are the LLCD plots of the aggregated datasets. Increasing aggregation level increases the average value of the points in the dataset (since the sums are not normalized by the new mean) so greater aggregation levels show up as higher lines in the plot. The figure clearly shows the qualitative difference between finite and infinite variance datasets. The Pareto dataset is characterized by parallel lines, while the lognormal dataset is characterized by lines that seem roughly convergent.
3 Related Work
The rst step in understanding WWW trac is the collec-
tion of trace data. Previous measurement studies of the Web
have fo cused on reference patterns established based on logs
of proxies 10 , 23], or servers 21]. The authors in 5 ] captured
client traces, but they concentrated on events at the user in-
terface level in order to study browser and page design. In
contrast, our goal in data collection was to acquire a complete
picture of the reference b ehavior and timing of user accesses
to the WWW. As a result, we collected a large dataset of
client-based traces. A full description of our traces (whichare
available for anonymous FTP) is given in 8].
Previous wide-area traffic studies have studied FTP, TELNET, NNTP, and SMTP traffic [19, 20]. Our data complements those studies by providing a view of WWW (HTTP) traffic at a "stub" network. In addition, our measurements of Web file sizes are in general agreement with those reported in [1]. Since WWW traffic accounts for more than 25% of the traffic on the Internet and is currently growing more rapidly than any other traffic type [12], understanding the nature of WWW traffic is important and is expected to increase in importance.
The benchmark study of self-similarity in network traffic is [14, 16], and our study uses many of the same methods used therein. However, the goal of that study was to demonstrate the self-similarity of network traffic; to do that, many large datasets taken from a multi-year span were used. Our focus is not on establishing self-similarity of network traffic (although we do so for the interesting subset of network traffic that is Web-related); instead we concentrate on examining the reasons behind that self-similarity. As a result of this different focus, we do not analyze traffic datasets for low, normal, and busy hours. Instead we focus on the four busiest hours in our logs. While these four hours are self-similar, many less-busy hours in our logs do not show self-similar characteristics. We feel that this is only the result of the traffic demand present in our logs, which is much lower than that used in [14, 16]; this belief is supported by the findings in that study, which showed that the intensity of self-similarity increases as the aggregate traffic level increases.
Our work is most similar in intent to [28]. That paper looked at network traffic at the packet level, identified the flows between individual source/destination pairs, and showed that transmission and idle times for those flows were heavy-tailed. In contrast, our paper is based on data collected at the application level rather than the network level. As a result we are able to examine the relationship between transmission times and file sizes, and are able to assess the effects of caching and user preference on these distributions. These observations allow us to build on the conclusions presented in [28] by showing that the heavy-tailed nature of transmission and idle times is not primarily a result of network protocols or user preference, but rather stems from more basic properties of information storage and processing: both file sizes and user "think times" are themselves strongly heavy-tailed.

[Figure 1: Comparison of CLT Test for Pareto (left) and Lognormal (right) Distributions. Each panel is an LLCD plot, log10(P[X>x]) versus log10(x), showing the original dataset ("All Points") and its 10-, 100-, and 500-aggregated versions.]
4 Examining Web Traffic Self-Similarity
In this section we show evidence that WWW traffic can be self-similar. To do so, we first describe how we measured WWW traffic; then we apply the statistical methods described in Section 2 to assess self-similarity.
4.1 Data Collection
In order to relate traffic patterns to higher-level effects, we needed to capture aspects of user behavior as well as network demand. The approach we took to capturing both types of data simultaneously was to modify a WWW browser so as to log all user accesses to the Web. The browser we used was Mosaic, since its source was publicly available and permission has been granted for using and modifying the code for research purposes. A complete description of our data collection methods and the format of the log files is given in [8]; here we only give a high-level summary.
We mo died Mosaic to record the Uniform Resource Lo-
cator (URL) 3] of
each le accessed bytheMosaic user, as
well as the time the le was accessed and the time required to
transfer the le from its server (if necessary). For complete-
ness, we record all URLs accessed whether they were served
from Mosaic's cache or via a le transfer however the traf-
c timeseries we analyze in this section consist only of actual
network transfers.
At the time of our study (January and February 1995) Mosaic was the WWW browser preferred by nearly all users at our site. Hence our data consists of nearly all of the WWW traffic at our site. Since the time of our study, the preferred browser has become Netscape [6], which is not available in source form. As a result, capturing an equivalent set of WWW user traces at the current time would be significantly more difficult.
The data captured consists of the sequence of WWW file requests performed during each session. Each file request is identified by its URL, and session, user, and workstation ID; associated with the request is the time stamp when the request was made, the size of the document (including the overhead of the protocol) and the object retrieval time. Timestamps were accurate to 10 ms. Thus, to provide 3 significant digits in our results, we limited our analysis to time intervals greater than or equal to 1 sec. To convert our logs to traffic time series, it was necessary to allocate the bytes transferred in each request equally into bins spanning the transfer duration. Although this process smooths out short-term variations in the traffic flow of each transfer, our restriction to time series with granularity of 1 second or more, combined with the fact that most file transfers are short, means that such smoothing has little effect on our results.

Sessions                   4,700
Users                        591
URLs Requested           575,775
Files Transferred        130,140
Unique Files Requested    46,830
Bytes Requested          2713 MB
Bytes Transferred        1849 MB
Unique Bytes Requested   1088 MB

Table 1: Summary Statistics for Trace Data Used in This Study
To collect our data we installed our instrumented version of Mosaic in the general computing environment at Boston University's Computer Science Department. This environment consists principally of 37 SparcStation-2 workstations connected in a local network. Each workstation has its own local disk; logs were written to the local disk and subsequently transferred to a central repository. Although we collected data from 21 November 1994 through 8 May 1995, the data used in this paper is only from the period 17 January 1995 to 28 February 1995. This period was chosen because departmental WWW usage was distinctly lower (and the pool of users less diverse) before the start of classes in early January and because by early March 1995, Mosaic had ceased to be the dominant browser at our site. A statistical summary of the trace data used in this study is shown in Table 1.
4.2 Self-Similarity of WWW Traffic
Using the WWW trac data we obtained as describ ed in the
previous section, weshow evidence that WWW trac might
be self-similar. First, weshow that WWW trac contains traf-
c bursts observable over four orders of magnitude. Second,
we show that for four busy hours from our trac logs, the
Hurst parameter
H
for our datasets is signicantly dierent
from 1/2, consistent with a conclusion of self-similarity.
4.2.1 Burstiness at Varying Time Scales
One of the most important aspects of self-similar traffic is that there is no characteristic size of a traffic burst; as a result, the aggregation or superimposition of many such sources does not result in a smoother traffic pattern. One way to assess this effect is by visually inspecting time series plots of traffic demand.

[Figure 2: Traffic Bursts over Four Orders of Magnitude. Upper Left: 1000, Upper Right: 100, Lower Left: 10, and Lower Right: 1 Second Aggregations. (Actual Transfers) Each panel plots bytes per slot against chronological time at the given slot size.]
In Figure 2 we show four time series plots of the WWW traffic induced by our reference traces. The plots are produced by aggregating byte traffic into discrete bins of 1, 10, 100, or 1000 seconds.

The upper left plot is a complete presentation of the entire traffic time series using 1000 second (16.6 minute) bins. The diurnal cycle of network demand is clearly evident, and day to day activity shows noticeable bursts. However, even within the active portion of a single day there is significant burstiness; this is shown in the upper right plot, which uses a 100 second timescale and is taken from a typical day in the middle of the dataset. Finally, the lower left plot shows a portion of the 100 second plot, expanded to 10 second detail, and the lower right plot shows a portion of the lower left expanded to 1 second detail. These plots show significant bursts occurring at the second-to-second level.
4.2.2 Statistical Analysis
We used the four methods for assessing self-similarity described in Section 2: the variance-time plot, the rescaled range (or R/S) plot, the periodogram plot, and the Whittle estimator. We concentrated on individual hours from our traffic series, so as to provide as nearly a stationary dataset as possible.
To provide an example of these approaches, analysis of a single hour (4pm to 5pm, Thursday 5 Feb 1995) is shown in Figure 3. The figure shows plots for the three graphical methods: variance-time (upper left), rescaled range (upper right), and periodogram (lower center). The variance-time plot is linear and shows a slope that is distinctly different from -1 (which is shown for comparison); the slope is estimated using regression as -0.48, yielding an estimate for $H$ of 0.76. The R/S plot shows an asymptotic slope that is different from 0.5 and from 1.0 (shown for comparison); it is estimated using regression as 0.75, which is also the corresponding estimate of $H$. The periodogram plot shows a slope of -0.66 (the regression line is shown), yielding an estimate of $H$ as 0.83. Finally, the Whittle estimator for this dataset (not a graphical method) yields an estimate of $H = 0.82$ with a 95% confidence interval of (0.77, 0.87).
As discussed in Section 2.1, the Whittle estimator is the only method that yields confidence intervals on $H$, but short-range dependence in the timeseries can introduce inaccuracies in its results. These inaccuracies are minimized by $m$-aggregating the timeseries for successively larger values of $m$, and looking for a value of $H$ around which the Whittle estimator stabilizes.
The results of this method for four busy hours are shown in Figure 4. Each hour is shown in one plot, from the busiest hour in the upper left to the least busy hour in the lower right. In these figures the solid line is the value of the Whittle estimate of $H$ as a function of the aggregation level $m$ of the dataset. The upper and lower dotted lines are the limits of the 95% confidence interval on $H$. The three level lines represent the estimate of $H$ for the unaggregated dataset as given by the variance-time, R/S, and periodogram methods.

The figure shows that for each dataset, the estimate of $H$ stays relatively consistent as the aggregation level is increased, and that the estimates given by the three graphical methods

References

- The Fractal Geometry of Nature (book). TL;DR: This book is a blend of erudition, popularization, and exposition, and the illustrations include many superb examples of computer graphics that are works of art in their own right.
- On the self-similar nature of Ethernet traffic (extended version) (journal article). TL;DR: It is demonstrated that Ethernet LAN traffic is statistically self-similar, that none of the commonly used traffic models is able to capture this fractal-like behavior, and that such behavior has serious implications for the design, control, and analysis of high-speed, cell-based networks.
- Time Series: Theory and Methods (book). TL;DR: The mean and autocovariance functions of ARIMA models are estimated for multivariate time series and state-space models, and the spectral representation of a stationary process is presented.
- Wide area traffic: the failure of Poisson modeling (journal article). TL;DR: It is found that user-initiated TCP session arrivals, such as remote-login and file-transfer, are well-modeled as Poisson processes with fixed hourly rates, but that other connection arrivals deviate considerably from Poisson.
- A Simple General Approach to Inference About the Tail of a Distribution, Bruce M. Hill, 01 Sep 1975 (journal article). TL;DR: A simple general approach to inference about the tail behavior of a distribution is proposed, which does not require assuming any global form for the distribution function, but merely the form of behavior in the tail where it is desired to draw inference.