In Proc. of the 1996 ACM SIGMETRICS Intl. Conference on Measurement and Modeling of Computer Systems, Philadelphia, PA, May 1996
Self-Similarity in World Wide Web Traffic:
Evidence and Possible Causes

Mark E. Crovella and Azer Bestavros
Computer Science Department
Boston University
Boston, MA 02215
{crovella,best}@cs.bu.edu
Abstract

Recently the notion of self-similarity has been shown to apply to wide-area and local-area network traffic. In this paper we examine the mechanisms that give rise to the self-similarity of network traffic. We present a hypothesized explanation for the possible self-similarity of traffic by using a particular subset of wide-area traffic: traffic due to the World Wide Web (WWW). Using an extensive set of traces of actual user executions of NCSA Mosaic, reflecting over half a million requests for WWW documents, we examine the dependence structure of WWW traffic. While our measurements are not conclusive, we show evidence that WWW traffic exhibits behavior that is consistent with self-similar traffic models. Then we show that the self-similarity in such traffic can be explained based on the underlying distributions of WWW document sizes, the effects of caching and user preference in file transfer, the effect of user "think time", and the superimposition of many such transfers in a local-area network. To do this we rely on empirically measured distributions both from our traces and from data independently collected at over thirty WWW sites.
1 Introduction

Understanding the nature of network traffic is critical in order to properly design and implement computer networks and network services like the World Wide Web. Recent examinations of LAN traffic [16] and wide-area network traffic [20] have challenged the commonly assumed models for network traffic, e.g., the Poisson distribution. Were traffic to follow a Poisson or Markovian arrival process, it would have a characteristic burst length which would tend to be smoothed by averaging over a long enough time scale. Rather, measurements of real traffic indicate that significant traffic variance (burstiness) is present on a wide range of time scales.

Traffic that is bursty on many or all time scales can be described statistically using the notion of self-similarity.
(This work was supported in part by NSF grants CCR-9501822 and CCR-9308344.)
Self-similarity is a property we associate with fractals, which are objects whose appearance is unchanged regardless of the scale at which they are viewed. In the case of stochastic objects like time series, self-similarity is used in the distributional sense: when viewed at varying scales, the object's distribution remains unchanged. Since a self-similar process has observable bursts on all time scales, it exhibits long-range dependence; values at any instant are typically correlated with all future values. Surprisingly (given the counterintuitive aspects of long-range dependence) the self-similarity of Ethernet network traffic has been rigorously established [16]. The importance of long-range dependence in network traffic is beginning to be observed in studies such as [15], which show that packet loss and delay behavior is radically different in simulations using real traffic data rather than traditional network models.

However, the reasons behind network traffic self-similarity have not been clearly identified. In this paper we show that in some cases, network traffic self-similarity can be explained in terms of file system characteristics and user behavior. In the process, we trace the genesis of network traffic self-similarity back from the traffic itself, through the actions of file transmission, caching systems, and user choice, to the distributions of file sizes and user event interarrivals.
To bridge the gap between studying network traffic on one hand and high-level system characteristics on the other, we make use of two essential tools. First, to explain self-similar network traffic in terms of individual transmission lengths, we employ the mechanism introduced in [17] and described in [16]. Those papers point out that self-similar traffic can be constructed by multiplexing a large number of ON/OFF sources that have ON and OFF period lengths that are heavy-tailed, as defined in Section 2.2. Such a mechanism could correspond to a network of workstations, each of which is either silent or transferring data at a constant rate.

Our second tool in bridging the gap between transmission times and high-level system characteristics is our use of the World Wide Web (WWW or Web) as an object of study. The Web provides a special opportunity for studying network traffic because it is a "closed" system: all traffic arises as the result of file transfers from an easily studied set, and user activity is easily monitored.
To study the traffic patterns of the WWW we collected reference data reflecting actual WWW use at our site. We instrumented NCSA Mosaic [9] to capture user access patterns to the Web. Since at the time of our data collection, Mosaic was by far the dominant WWW browser at our site, we were able to capture a fairly complete picture of Web traffic on our local network; our dataset consists of more than half a million user requests for document transfers, and includes detailed timing of requests and transfer lengths. In addition we surveyed a number of WWW servers to capture document size information that we used to validate assumptions made in our analysis.
The paper takes two parts. First, we consider the possibility of self-similarity of Web traffic for the busiest hours we measured. To do so we use analyses very similar to those performed in [16]. These analyses support the notion that Web traffic may show self-similar characteristics, at least when demand is high enough. This result in itself has implications for designers of systems that attempt to improve performance characteristics of the WWW.

Second, using our WWW traffic, user preference, and file size data, we comment on reasons why the transmission times and quiet times for any particular Web session are heavy-tailed, which is an essential characteristic of the proposed mechanism for self-similarity of traffic. In particular, we argue that many characteristics of WWW use can be modelled using heavy-tailed distributions, including the distribution of transfer times, the distribution of user requests for documents, and the underlying distribution of document sizes available in the Web. In addition, using our measurements of user interrequest times, we explore reasons for the heavy-tailed distribution of quiet times needed for self-similarity.
2 Background

2.1 Definition of Self-Similarity

For detailed discussion of self-similarity in time series data and the accompanying statistical tests, see [2, 27]. Our discussion in this subsection and the next closely follows those sources.
A self-similar time series has the property that when aggregated (leading to a shorter time series in which each point is the sum of multiple original points) the new series has the same autocorrelation function as the original. That is, given a stationary time series X = (X_t : t = 0, 1, 2, ...), we define the m-aggregated series X^(m) = (X_k^(m) : k = 1, 2, 3, ...) by summing the original series X over nonoverlapping blocks of size m. Then if X is self-similar, it has the same autocorrelation function r(k) = E[(X_t - mu)(X_{t+k} - mu)] as the series X^(m) for all m. Note that this means that the series is distributionally self-similar: the distribution of the aggregated series is the same (except for changes in scale) as that of the original.
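The aggregation step is simple to state in code. The following is a minimal illustrative sketch (the function name is ours, not from the paper): each point of the m-aggregated series is the sum of one nonoverlapping block of m consecutive points.

```python
def aggregate(series, m):
    """m-aggregate a time series: each output point is the sum of one
    nonoverlapping block of m consecutive input points (any incomplete
    trailing block is dropped)."""
    n = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) for i in range(n)]
```

For example, aggregating a 12-point series at level m = 3 yields a 4-point series of block sums.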
As a result, self-similar processes show long-range dependence. A process with long-range dependence has an autocorrelation function r(k) ~ k^(-beta) as k -> infinity, where 0 < beta < 1. Thus the autocorrelation function of such a process decays hyperbolically (as compared to the exponential decay exhibited by traditional traffic models). Hyperbolic decay is much slower than exponential decay, and since beta < 1, the sum of the autocorrelation values of such a series approaches infinity.

This has a number of implications. First, the variance of n samples from such a series does not decrease as a function of n (as predicted by basic statistics for uncorrelated datasets) but rather by the value n^(-beta). Second, the power spectrum of such a series is hyperbolic, rising to infinity at frequency zero, reflecting the "infinite" influence of long-range dependence in the data.
One of the attractive features of using self-similar models for time series, when appropriate, is that the degree of self-similarity of a series is expressed using only a single parameter. The parameter expresses the speed of decay of the series' autocorrelation function. For historical reasons, the parameter used is the Hurst parameter H = 1 - beta/2. Thus, for self-similar series, 1/2 < H < 1. As H -> 1, the degree of self-similarity increases. Thus the fundamental test for self-similarity of a series reduces to the question of whether H is significantly different from 1/2.
In this paper we use four methods to test for self-similarity. These methods are described fully in [2] and are the same methods described and used in [16]. A summary of the relative accuracy of these methods on synthetic datasets is presented in [24].
The first method, the variance-time plot, relies on the slowly decaying variance of a self-similar series. The variance of X^(m) is plotted against m on a log-log plot; a straight line with slope (-beta) greater than -1 is indicative of self-similarity, and the parameter H is given by H = 1 - beta/2. The second method, the R/S plot, uses the fact that for a self-similar dataset, the rescaled range or R/S statistic grows according to a power law with exponent H as a function of the number of points included (n). Thus the plot of R/S against n on a log-log plot has a slope which is an estimate of H. The third approach, the periodogram method, uses the slope of the power spectrum of the series as frequency approaches zero. On a log-log plot, the periodogram slope is a straight line with slope beta - 1 = 1 - 2H close to the origin.
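The variance-time method in particular is easy to sketch. The code below is our own illustration (not from the paper): it computes block means at several aggregation levels, regresses log variance against log m, and converts the slope (-beta) to H via H = 1 - beta/2. For uncorrelated data the slope is -1 and the estimate is near 1/2.

```python
import math
import random

def block_means(series, m):
    """Means over nonoverlapping blocks of size m (X^(m)/m)."""
    n = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(n)]

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def hurst_variance_time(series, ms=(1, 2, 4, 8, 16, 32)):
    """Variance-time estimate of H: regress log Var(X^(m)) on log m;
    the slope is -beta and H = 1 - beta/2, i.e. H = 1 + slope/2."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(variance(block_means(series, m))) for m in ms]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return 1 + slope / 2
```

Applied to white noise this returns a value near 0.5; values significantly above 0.5 are the signature of self-similarity discussed in the text.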
While the preceding three graphical methods are useful for exposing faulty assumptions (such as non-stationarity in the dataset) they do not provide confidence intervals. The fourth method, called the Whittle estimator, does provide a confidence interval, but has the drawback that the form of the underlying stochastic process must be supplied. The two forms that are most commonly used are fractional Gaussian noise (FGN) with parameter 1/2 < H < 1, and Fractional ARIMA (p, d, q) with 0 < d < 1/2 (for details see [2, 4]). These two models differ in their assumptions about the short-range dependences in the datasets; FGN assumes no short-range dependence while Fractional ARIMA can assume a fixed degree of short-range dependence.
Since we are concerned only with the long-range dependence of our datasets, we employ the Whittle estimator as follows. Each hourly dataset is aggregated at increasing levels m, and the Whittle estimator is applied to each m-aggregated dataset using the FGN model. The resulting estimates of H and confidence intervals are plotted as a function of m. This approach exploits the property that any long-range dependent process approaches FGN when aggregated to a sufficient level. As m increases, short-range dependences are averaged out of the dataset; if the value of H remains relatively constant we can be confident that it measures a true underlying level of self-similarity.
2.2 Heavy-Tailed Distributions

The distributions we use in this paper have the property of being heavy-tailed. A distribution is heavy-tailed if

    P[X > x] ~ x^(-alpha)  as x -> infinity,  0 < alpha < 2.

That is, regardless of the behavior of the distribution for small values of the random variable, if the asymptotic shape of the distribution is hyperbolic, it is heavy-tailed.
The simplest heavy-tailed distribution is the Pareto distribution. The Pareto distribution is hyperbolic over its entire range; its probability mass function is

    p(x) = alpha k^alpha x^(-alpha-1),    alpha, k > 0,  x >= k,

and its cumulative distribution function is given by

    F(x) = P[X <= x] = 1 - (k/x)^alpha.

The parameter k represents the smallest possible value of the random variable.
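To make the Pareto form concrete, here is a small sketch of ours (not from the paper) of the complementary distribution and inverse-transform sampling from F:

```python
import random

def pareto_ccdf(x, alpha, k):
    """P[X > x] = (k/x)^alpha for x >= k."""
    return (k / x) ** alpha

def pareto_sample(alpha, k, rng):
    """Inverse-transform sampling: solving u = F(x) = 1 - (k/x)^alpha
    for x gives x = k * (1 - u)^(-1/alpha), with u uniform on [0, 1)."""
    u = rng.random()
    return k * (1.0 - u) ** (-1.0 / alpha)
```

For alpha = 1 and k = 1, half of all samples should exceed 2, since P[X > 2] = 1/2.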
Our results are based on estimating the values of alpha for a number of empirically measured distributions, such as the lengths of World Wide Web file transmission times. To do so, we employ log-log complementary distribution (LLCD) plots. These are plots of the complementary cumulative distribution Fbar(x) = 1 - F(x) = P[X > x] on log-log axes. Plotted in this way, heavy-tailed distributions have the property that

    d log Fbar(x) / d log x = -alpha,    x > theta,

for some theta. In practice we obtain an estimate for alpha by plotting the LLCD plot of the dataset and selecting a value for theta above which the plot appears to be linear. Then we select equally-spaced points from among the LLCD points larger than theta and estimate the slope using least-squares regression. Equally-spaced points are used because the point density varies over the range used, and the preponderance of data points near the median would otherwise unduly influence the least-squares regression.

In all our alpha estimates for file sizes we use theta = 1000, meaning that we consider tails to be the portions of the distributions for files of 1,000 bytes or greater.
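The LLCD slope estimate just described can be sketched as follows. This is our own illustrative code (the cutoff and point count are parameters we chose): it evaluates the empirical complementary distribution at points equally spaced in log x above the cutoff, then fits a least-squares line.

```python
import bisect
import math

def llcd_alpha(data, theta, n_points=20):
    """Estimate alpha as minus the least-squares slope of the empirical
    LLCD plot, using points equally spaced in log x above the cutoff
    theta (mirroring the equal-spacing rationale in the text)."""
    data_sorted = sorted(data)
    n = len(data_sorted)
    lo, hi = math.log(theta), math.log(data_sorted[-1])
    pts = []
    for j in range(n_points):
        x = math.exp(lo + (hi - lo) * j / (n_points - 1))
        surv = (n - bisect.bisect_right(data_sorted, x)) / n  # P[X > x]
        if surv > 0:
            pts.append((math.log(x), math.log(surv)))
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    slope = sum((a - mx) * (b - my) for a, b in pts) \
            / sum((a - mx) ** 2 for a, _ in pts)
    return -slope
```

On data whose tail follows an exact power law with alpha = 1, the estimate comes back close to 1.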
An alternative approach to estimating tail weight, used in [28], is the Hill estimator [11]. The Hill estimator does not give a single estimate of alpha, but can be used to gauge the general range of alphas that are reasonable. We used the Hill estimator to check that the estimates of alpha obtained using the LLCD method were within range; in all cases they were.
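For reference, the classical Hill estimator over the k largest observations can be written as below (an illustrative sketch of ours; see [11] for the estimator itself):

```python
import math

def hill_alpha(data, k):
    """Hill estimate of the tail index alpha: 1/alpha_hat is the mean of
    log(x_(i) / x_(k+1)) over the k largest order statistics
    x_(1) >= x_(2) >= ... >= x_(k)."""
    xs = sorted(data, reverse=True)
    mean_log = sum(math.log(xs[i] / xs[k]) for i in range(k)) / k
    return 1.0 / mean_log
```

In practice the estimate is examined over a range of k; here a single k is used for simplicity.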
2.2.1 Testing for Infinite Variance

There is evidence that, over their entire range, many of the distributions we study may be well characterized using lognormal distributions [19]. However, lognormal distributions do not have infinite variance, and hence are not heavy-tailed. In our work, we are not concerned with distributions over their entire range, only their tails. As a result we don't use goodness-of-fit tests to determine whether Pareto or lognormal distributions are better at describing our data. However, it is important to verify that our datasets exhibit the infinite variance characteristic of heavy tails. To do so we use a simple test based on the Central Limit Theorem (CLT), which states that the sum of a large number of i.i.d. samples from any distribution with finite variance will tend to be normally distributed. To test for infinite variance we proceed as follows. First, we form the m-aggregated dataset from the original dataset for large values of m (typically in the range 10 to 1000). Next, we inspect the tail behavior of the aggregated datasets using the LLCD plot. For datasets with finite variance, the slope will increasingly decline as m increases, reflecting the underlying distribution's approximation of a normal distribution. For datasets with infinite variance, the slope will remain roughly constant with increasing m.
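The CLT test just described can be sketched in code. The following is our own illustration (the tail-fitting range and aggregation levels are choices of ours): it m-aggregates the data and reports the LLCD tail slope at each level, so that roughly constant slopes suggest infinite variance while increasingly steep slopes suggest finite variance.

```python
import bisect
import math
import random

def tail_slope(data, n_points=15):
    """Least-squares slope of the empirical LLCD over points equally
    spaced in log x between the sample median and the maximum."""
    data_sorted = sorted(data)
    n = len(data_sorted)
    lo, hi = math.log(data_sorted[n // 2]), math.log(data_sorted[-1])
    pts = []
    for j in range(n_points):
        x = math.exp(lo + (hi - lo) * j / (n_points - 1))
        surv = (n - bisect.bisect_right(data_sorted, x)) / n  # P[X > x]
        if surv > 0:
            pts.append((math.log(x), math.log(surv)))
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    return sum((a - mx) * (b - my) for a, b in pts) \
           / sum((a - mx) ** 2 for a, _ in pts)

def clt_test_slopes(data, ms=(1, 10, 100)):
    """LLCD tail slope of the m-aggregated dataset for each level m."""
    out = []
    for m in ms:
        agg = [sum(data[i * m:(i + 1) * m]) for i in range(len(data) // m)]
        out.append(tail_slope(agg))
    return out
```

For Pareto samples with alpha = 1 the slopes stay in the vicinity of -1 across aggregation levels, which is the "parallel lines" behavior described for Figure 1.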
An example is shown in Figure 1. The figure shows the CLT test for aggregation levels of 10, 100, and 500 as applied to two synthetic datasets. On the left the dataset consists of 10,000 samples from a Pareto distribution with alpha = 1.0. On the right the dataset consists of 10,000 samples from a lognormal distribution with mu = 2.0, sigma = 2.0. These parameters were chosen so as to make the Pareto and lognormal distributions appear approximately similar for log10(x) in the range 0 to 4. In each plot the original LLCD plot for the dataset is the lowermost line; the upper lines are the LLCD plots of the aggregated datasets. Increasing aggregation level increases the average value of the points in the dataset (since the sums are not normalized by the new mean) so greater aggregation levels show up as higher lines in the plot. The figure clearly shows the qualitative difference between finite and infinite variance datasets. The Pareto dataset is characterized by parallel lines, while the lognormal dataset is characterized by lines that seem roughly convergent.
3 Related Work

The first step in understanding WWW traffic is the collection of trace data. Previous measurement studies of the Web have focused on reference patterns established based on logs of proxies [10, 23], or servers [21]. The authors in [5] captured client traces, but they concentrated on events at the user interface level in order to study browser and page design. In contrast, our goal in data collection was to acquire a complete picture of the reference behavior and timing of user accesses to the WWW. As a result, we collected a large dataset of client-based traces. A full description of our traces (which are available for anonymous FTP) is given in [8].

Previous wide-area traffic studies have studied FTP, TELNET, NNTP, and SMTP traffic [19, 20]. Our data complements those studies by providing a view of WWW (HTTP) traffic at a "stub" network. In addition, our measurements of Web file sizes are in general agreement with those reported in [1]. Since WWW traffic accounts for more than 25% of the traffic on the Internet and is currently growing more rapidly than any other traffic type [12], understanding the nature of WWW traffic is important and is expected to increase in importance.
The benchmark study of self-similarity in network traffic is [14, 16], and our study uses many of the same methods used therein. However, the goal of that study was to demonstrate the self-similarity of network traffic; to do that, many large datasets taken from a multi-year span were used. Our focus is not on establishing self-similarity of network traffic (although we do so for the interesting subset of network traffic that is Web-related); instead we concentrate on examining the reasons behind that self-similarity. As a result of this different focus, we do not analyze traffic datasets for low, normal, and busy hours. Instead we focus on the four busiest hours in our logs. While these four hours are self-similar, many less-busy hours in our logs do not show self-similar characteristics. We feel that this is only the result of the traffic demand present in our logs, which is much lower than that used in [14, 16]; this belief is supported by the findings in that study, which showed that the intensity of self-similarity increases as the aggregate traffic level increases.
Our work is most similar in intent to [28]. That paper looked at network traffic at the packet level, identified the flows between individual source/destination pairs, and showed that transmission and idle times for those flows were heavy-tailed. In contrast, our paper is based on data collected at the application level rather than the network level. As a result we are able to examine the relationship between transmission times and file sizes, and are able to assess the effects of caching and user preference on these distributions. These observations allow us to build on the conclusions presented in [28] by showing that the heavy-tailed nature of transmission and idle times is not primarily a result of network protocols or user preference, but rather stems from more basic properties of information storage and processing: both file sizes and user "think times" are themselves strongly heavy-tailed.
Figure 1: Comparison of CLT Test for Pareto (left) and Lognormal (right) Distributions. (LLCD plots: log10(P[X>x]) versus log10(x), showing the unaggregated data and the 10-, 100-, and 500-aggregated datasets.)
4 Examining Web Traffic Self-Similarity

In this section we show evidence that WWW traffic can be self-similar. To do so, we first describe how we measured WWW traffic; then we apply the statistical methods described in Section 2 to assess self-similarity.
4.1 Data Collection

In order to relate traffic patterns to higher-level effects, we needed to capture aspects of user behavior as well as network demand. The approach we took to capturing both types of data simultaneously was to modify a WWW browser so as to log all user accesses to the Web. The browser we used was Mosaic, since its source was publicly available and permission has been granted for using and modifying the code for research purposes. A complete description of our data collection methods and the format of the log files is given in [8]; here we only give a high-level summary.

We modified Mosaic to record the Uniform Resource Locator (URL) [3] of each file accessed by the Mosaic user, as well as the time the file was accessed and the time required to transfer the file from its server (if necessary). For completeness, we record all URLs accessed whether they were served from Mosaic's cache or via a file transfer; however the traffic time series we analyze in this section consist only of actual network transfers.

At the time of our study (January and February 1995) Mosaic was the WWW browser preferred by nearly all users at our site. Hence our data consists of nearly all of the WWW traffic at our site. Since the time of our study, the preferred browser has become Netscape [6], which is not available in source form. As a result, capturing an equivalent set of WWW user traces at the current time would be significantly more difficult.
The data captured consists of the sequence of WWW file requests performed during each session. Each file request is identified by its URL, and session, user, and workstation ID; associated with the request is the time stamp when the request was made, the size of the document (including the overhead of the protocol) and the object retrieval time. Timestamps were accurate to 10 ms. Thus, to provide 3 significant digits in our results, we limited our analysis to time intervals greater than or equal to 1 sec. To convert our logs to traffic time series, it was necessary to allocate the bytes transferred in each request equally into bins spanning the transfer duration. Although this process smooths out short-term variations in the traffic flow of each transfer, our restriction to time series with granularity of 1 second or more, combined with the fact that most file transfers are short, means that such smoothing has little effect on our results.

Sessions                    4,700
Users                         591
URLs Requested            575,775
Files Transferred         130,140
Unique Files Requested     46,830
Bytes Requested           2713 MB
Bytes Transferred         1849 MB
Unique Bytes Requested    1088 MB

Table 1: Summary Statistics for Trace Data Used in This Study
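The conversion step described above can be sketched as follows (our own illustrative code, assuming each logged transfer is available as a (start, duration, bytes) triple; the paper's log format differs in detail):

```python
def transfers_to_series(transfers, n_bins, bin_size=1.0):
    """Build a byte-count time series by spreading each transfer's bytes
    uniformly over the bins its duration spans.
    transfers: iterable of (start_time, duration, bytes), duration > 0."""
    series = [0.0] * n_bins
    for start, duration, nbytes in transfers:
        rate = nbytes / duration          # bytes per second, assumed constant
        t, end = start, start + duration
        while t < end:
            b = int(t // bin_size)        # index of the bin containing t
            nxt = min(end, (b + 1) * bin_size)
            if 0 <= b < n_bins:
                series[b] += rate * (nxt - t)
            t = nxt
    return series
```

A transfer of 100 bytes lasting from t = 0.5 to t = 1.5 contributes 50 bytes to each of the first two 1-second bins; total bytes are conserved.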
To collect our data we installed our instrumented version of Mosaic in the general computing environment at Boston University's Computer Science Department. This environment consists principally of 37 SparcStation-2 workstations connected in a local network. Each workstation has its own local disk; logs were written to the local disk and subsequently transferred to a central repository. Although we collected data from 21 November 1994 through 8 May 1995, the data used in this paper is only from the period 17 January 1995 to 28 February 1995. This period was chosen because departmental WWW usage was distinctly lower (and the pool of users less diverse) before the start of classes in early January, and because by early March 1995, Mosaic had ceased to be the dominant browser at our site. A statistical summary of the trace data used in this study is shown in Table 1.
4.2 Self-Similarity of WWW Traffic

Using the WWW traffic data we obtained as described in the previous section, we show evidence that WWW traffic might be self-similar. First, we show that WWW traffic contains traffic bursts observable over four orders of magnitude. Second, we show that for four busy hours from our traffic logs, the Hurst parameter H for our datasets is significantly different from 1/2, consistent with a conclusion of self-similarity.
4.2.1 Burstiness at Varying Time Scales

One of the most important aspects of self-similar traffic is that there is no characteristic size of a traffic burst; as a result, the aggregation or superimposition of many such sources does not result in a smoother traffic pattern. One way to assess this effect is by visually inspecting time series plots of traffic demand.

Figure 2: Traffic Bursts over Four Orders of Magnitude. Upper Left: 1000, Upper Right: 100, Lower Left: 10, and Lower Right: 1 Second Aggregations. (Actual Transfers)
In Figure 2 we show four time series plots of the WWW traffic induced by our reference traces. The plots are produced by aggregating byte traffic into discrete bins of 1, 10, 100, or 1000 seconds.

The upper left plot is a complete presentation of the entire traffic time series using 1000 second (16.6 minute) bins. The diurnal cycle of network demand is clearly evident, and day to day activity shows noticeable bursts. However, even within the active portion of a single day there is significant burstiness; this is shown in the upper right plot, which uses a 100 second timescale and is taken from a typical day in the middle of the dataset. Finally, the lower left plot shows a portion of the 100 second plot, expanded to 10 second detail; and the lower right plot shows a portion of the lower left expanded to 1 second detail. These plots show significant bursts occurring at the second-to-second level.
4.2.2 Statistical Analysis

We used the four methods for assessing self-similarity described in Section 2: the variance-time plot, the rescaled range (or R/S) plot, the periodogram plot, and the Whittle estimator. We concentrated on individual hours from our traffic series, so as to provide as nearly a stationary dataset as possible.

To provide an example of these approaches, analysis of a single hour (4pm to 5pm, Thursday 5 Feb 1995) is shown in Figure 3. The figure shows plots for the three graphical methods: variance-time (upper left), rescaled range (upper right), and periodogram (lower center). The variance-time plot is linear and shows a slope that is distinctly different from -1 (which is shown for comparison); the slope is estimated using regression as -0.48, yielding an estimate for H of 0.76. The R/S plot shows an asymptotic slope that is different from 0.5 and from 1.0 (shown for comparison); it is estimated using regression as 0.75, which is also the corresponding estimate of H. The periodogram plot shows a slope of -0.66 (the regression line is shown), yielding an estimate of H as 0.83. Finally, the Whittle estimator for this dataset (not a graphical method) yields an estimate of H = 0.82 with a 95% confidence interval of (0.77, 0.87).
As discussed in Section 2.1, the Whittle estimator is the only method that yields confidence intervals on H, but short-range dependence in the time series can introduce inaccuracies in its results. These inaccuracies are minimized by m-aggregating the time series for successively larger values of m, and looking for a value of H around which the Whittle estimator stabilizes.
The results of this method for four busy hours are shown in Figure 4. Each hour is shown in one plot, from the busiest hour in the upper left to the least busy hour in the lower right. In these figures the solid line is the value of the Whittle estimate of H as a function of the aggregation level m of the dataset. The upper and lower dotted lines are the limits of the 95% confidence interval on H. The three level lines represent the estimate of H for the unaggregated dataset as given by the variance-time, R/S, and periodogram methods.

The figure shows that for each dataset, the estimate of H stays relatively consistent as the aggregation level is increased, and that the estimates given by the three graphical methods