
Locality-aware request distribution in cluster-based network servers

01 Oct 1998-Vol. 33, Iss: 11, pp 205-216
TL;DR: A simple, practical strategy for locality-aware request distribution (LARD), in which the front-end distributes incoming requests in a manner that achieves high locality in the back-ends' main memory caches as well as load balancing.
Abstract: We consider cluster-based network servers in which a front-end directs incoming requests to one of a number of back-ends. Specifically, we consider content-based request distribution: the front-end uses the content requested, in addition to information about the load on the back-end nodes, to choose which back-end will handle this request. Content-based request distribution can improve locality in the back-ends' main memory caches, increase secondary storage scalability by partitioning the server's database, and provide the ability to employ back-end nodes that are specialized for certain types of requests.As a specific policy for content-based request distribution, we introduce a simple, practical strategy for locality-aware request distribution (LARD). With LARD, the front-end distributes incoming requests in a manner that achieves high locality in the back-ends' main memory caches as well as load balancing. Locality is increased by dynamically subdividing the server's working set over the back-ends. Trace-based simulation results and measurements on a prototype implementation demonstrate substantial performance improvements over state-of-the-art approaches that use only load information to distribute requests. On workloads with working sets that do not fit in a single server node's main memory cache, the achieved throughput exceeds that of the state-of-the-art approach by a factor of two to four.With content-based distribution, incoming requests must be handed off to a back-end in a manner transparent to the client, after the front-end has inspected the content of the request. To this end, we introduce an efficient TCP handoff protocol that can hand off an established TCP connection in a client-transparent manner.

Summary (3 min read)

2.2 Aiming for Balanced Load

  • This strategy produces good load balancing among the back-ends.
  • If this working set exceeds the size of main memory available for caching documents, frequent cache misses will occur.

2.3 Aiming for Locality

  • A good hashing function partitions both the name space and the working set more or less evenly among the back-ends.
  • If this is the case, the cache in each back-end should achieve a much higher hit rate, since it is only trying to cache its subset of the working set, rather than the entire working set, as with load balancing based approaches.
  • What is a good partitioning for locality may, however, easily prove a poor choice of partitioning for load balancing.
  • If a small set of targets in the working set account for a large fraction of the incoming requests, the back-ends serving those targets will be far more loaded than others.

2.4 Basic Locality-Aware Request Distribution

  • Simulations to test the sensitivity of their strategy to these parameter settings show that the maximal delay difference increases approximately linearly with T_high - T_low.
  • The throughput increases mildly and eventually flattens as T_high - T_low increases.
  • T_high should be set to the largest possible value that still satisfies the desired bound on the delay difference between back-end nodes.
  • The setting of T_low can be conservatively high with no adverse impact on throughput and only a mild increase in the average delay.
  • Furthermore, if desired, the setting of T_low can be easily automated by requesting explicit load information from the back-end nodes during a "training phase".

2.5 LARD with Replication

  • A potential problem with the basic LARD strategy is that a given target is served by only a single node at any given time.
  • If a single target causes a back-end to go into an overload situation, the desirable action is to assign several back-end nodes to serve that document, and to distribute requests for that target among the serving nodes.
  • The front-end maintains a mapping from targets to a set of nodes that serve the target.
  • Requests for a target are assigned to the least loaded node in the target's server set.
  • If a target's server set has not changed for some time, the front-end removes a node from it; this ensures that the degree of replication for a target does not remain unnecessarily high once it is requested less often.

2.6 Discussion

  • The front-end's target-to-node mappings can grow to the number of targets in the server's database, which can be of concern in servers with very large databases.
  • The mappings can be maintained in an LRU cache, where assignments for targets that have not been accessed recently are discarded.
  • Discarding mappings for such targets is of little consequence, as these targets have most likely been evicted from the back-end nodes' caches anyway.

3.1 Simulation Model

  • The cache replacement policy the authors chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads [5].
  • The authors have also performed simulations with LRU, where files with a size of more than 500 KB are never cached.
  • The relative performance of the various distribution strategies remained largely unaffected.

3.3 Simulation Outputs

  • This value was determined by inspection of the simulator's disk and CPU activity statistics as a point below which a node's disk and CPU both had some idle time in virtually all cases.
  • The cache hit rate gives an indication of how well locality is being maintained, and the node underutilization times indicate how well load balancing is maintained.

4 Simulation Results

  • The throughput achieved with LARD/R slightly exceeds that of LARD for seven or more nodes, while achieving a lower cache miss ratio and lower idle time.
  • While WRR/GMS achieves a substantial performance advantage over WRR, its throughput remains below 50% of LARD and LARD/R's throughput for all cluster sizes.
  • Figure 10 shows the throughput results obtained for the various strategies on the IBM trace (www.ibm.com).
  • The average file size is smaller than in the Rice trace, resulting in much larger throughput numbers for all strategies.
  • Thus, LARD and LARD/R achieve superlinear speedup only up to 4 nodes in this trace, resulting in a throughput that is slightly more than twice that of WRR for 4 nodes and above.

4.2 Other Workloads

  • The authors also ran simulations on a trace from the IBM web server that hosted the Deep Blue vs. Kasparov chess match in May 1997.
  • The working set of this trace is very small and achieves a low miss ratio with a main memory cache of a single node (32 MB).
  • This trace presents a best-case scenario for WRR and a worst-case scenario for LARD, as there is nothing to be gained from an aggregation of cache size, but there is the potential to lose performance due to imperfect load balancing.
  • The authors results show that both LARD and LARD/R closely match the performance of WRR on this trace.
  • This is reassuring, as it demonstrates that their strategy can match the performance of WRR even under conditions that are favorable to WRR.

4.4 Delay

  • Connection establishment, handoff, and forwarding are independent for different connections, and can be easily parallelized [24].
  • The dispatcher, on the other hand, requires shared state and thus synchronization among the CPUs.
  • With a simple policy such as LARD/R, the time spent in the dispatcher amounts to only a small fraction of the handoff overhead (10-20%).
  • Therefore, the authors fully expect that the front-end performance can be scaled to larger clusters effectively using an inexpensive SMP platform equipped with multiple network interfaces.

6.3 Cluster Performance Results

  • IBM's Lava project [18] uses the concept of a "hit server".
  • The hit server is a specially configured server node responsible for serving cached content.
  • Its specialized OS and client-server protocols give it superior performance for handling HTTP requests of cached documents, but limit it to private intranets.
  • Requests for uncached documents and dynamic content are delegated to a separate, conventional HTTP server node.
  • The authors' work shares some of the same goals, but maintains standard client-server protocols, maintains support for dynamic content generation, and focuses on cluster servers.

8 Conclusion

  • Caching can also be effective for dynamically generated content [15].
  • Moreover, resources required for dynamic content generation like server processes, executables, and primary data files are also cacheable.
  • While further research is required, the authors expect that increased locality can benefit dynamic content serving, and that therefore the advantages of LARD also apply to dynamic content.


Locality-Aware Request Distribution in Cluster-based Network Servers

Vivek S. Pai‡, Mohit Aron†, Gaurav Banga†, Michael Svendsen†, Peter Druschel†, Willy Zwaenepoel†, Erich Nahum¶

‡ Department of Electrical and Computer Engineering, Rice University
† Department of Computer Science, Rice University
¶ IBM T.J. Watson Research Center

Abstract

We consider cluster-based network servers in which a front-end directs incoming requests to one of a number of back-ends. Specifically, we consider content-based request distribution: the front-end uses the content requested, in addition to information about the load on the back-end nodes, to choose which back-end will handle this request. Content-based request distribution can improve locality in the back-ends' main memory caches, increase secondary storage scalability by partitioning the server's database, and provide the ability to employ back-end nodes that are specialized for certain types of requests.

As a specific policy for content-based request distribution, we introduce a simple, practical strategy for locality-aware request distribution (LARD). With LARD, the front-end distributes incoming requests in a manner that achieves high locality in the back-ends' main memory caches as well as load balancing. Locality is increased by dynamically subdividing the server's working set over the back-ends. Trace-based simulation results and measurements on a prototype implementation demonstrate substantial performance improvements over state-of-the-art approaches that use only load information to distribute requests. On workloads with working sets that do not fit in a single server node's main memory cache, the achieved throughput exceeds that of the state-of-the-art approach by a factor of two to four.

With content-based distribution, incoming requests must be handed off to a back-end in a manner transparent to the client, after the front-end has inspected the content of the request. To this end, we introduce an efficient TCP handoff protocol that can hand off an established TCP connection in a client-transparent manner.

To appear in the Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), San Jose, CA, Oct 1998.

1 Introduction

Network servers based on clusters of commodity workstations or PCs connected by high-speed LANs combine cutting-edge performance and low cost. A cluster-based network server consists of a front-end, responsible for request distribution, and a number of back-end nodes, responsible for request processing. The use of a front-end makes the distributed nature of the server transparent to the clients. In most current cluster servers the front-end distributes requests to back-end nodes without regard to the type of service or the content requested. That is, all back-end nodes are considered equally capable of serving a given request and the only factor guiding the request distribution is the current load of the back-end nodes.

With content-based request distribution, the front-end takes into account both the service/content requested and the current load on the back-end nodes when deciding which back-end node should serve a given request. The potential advantages of content-based request distribution are: (1) increased performance due to improved hit rates in the back-end's main memory caches, (2) increased secondary storage scalability due to the ability to partition the server's database over the different back-end nodes, and (3) the ability to employ back-end nodes that are specialized for certain types of requests (e.g., audio and video).

The locality-aware request distribution (LARD) strategy presented in this paper is a form of content-based request distribution, focusing on obtaining the first of the advantages cited above, namely improved cache hit rates in the back-ends. Secondary storage scalability and special-purpose back-end nodes are not discussed any further in this paper.

Figure 1 illustrates the principle of LARD in a simple server with two back-ends and three targets¹ (A, B, C) in the incoming request stream. The front-end directs all requests for A to back-end 1, and all requests for B and C to back-end 2. By doing so, there is an increased likelihood that the request finds the requested target in the cache at the back-end. In contrast, with a round-robin distribution of incoming requests, requests of all three targets will arrive at both back-ends. This increases the likelihood of a cache miss, if the sum of the sizes of the three targets, or, more generally, if the size of the working set exceeds the size of the main memory cache at an individual back-end node.

Figure 1: Locality-Aware Request Distribution

¹ In the following discussion, the term target is being used to refer to a specific object requested from a server. For an HTTP server, for instance, a target is specified by a URL and any applicable arguments to the HTTP GET command.

Of course, by naively distributing incoming requests in a content-based manner as suggested in Figure 1, the load between different back-ends might become unbalanced, resulting in worse performance. The first major challenge in building a LARD cluster is therefore to design a practical and efficient strategy that simultaneously achieves load balancing and high cache hit rates on the back-ends. The second challenge stems from the need for a protocol that allows the front-end to hand off an established client connection to a back-end node, in a manner that is transparent to clients and is efficient enough not to render the front-end a bottleneck. This requirement results from the front-end's need to inspect the target content of a request prior to assigning the request to a back-end node. This paper demonstrates that these challenges can be met, and that LARD produces substantially higher throughput than the state-of-the-art approaches where request distribution is solely based on load balancing, for workloads whose working set exceeds the size of the individual node caches.

Increasing a server's cache effectiveness is an important step towards meeting the demands placed on current and future network servers. Being able to cache the working set is critical to achieving high throughput, as a state-of-the-art disk device can deliver no more than 120 block requests/sec, while high-end network servers will be expected to serve thousands of document requests per second. Moreover, typical working set sizes of web servers can be expected to grow over time, for two reasons. First, the amount of content made available by a single organization is typically growing over time. Second, there is a trend towards centralization of web servers within organizations. Issues such as cost and ease of administration, availability, security, and high-capacity backbone network access cause organizations to move towards large, centralized network servers that handle all of the organization's web presence. Such servers have to handle the combined working sets of all the servers they supersede.

With round-robin distribution, a cluster does not scale well to larger working sets, as each node's main memory cache has to fit the entire working set. With LARD, the effective cache size approaches the sum of the node cache sizes. Thus, adding nodes to a cluster can accommodate both increased traffic (due to additional CPU power) and larger working sets (due to the increased effective cache size).

This paper presents the following contributions:
1. a practical and efficient LARD strategy that achieves high cache hit rates and good load balancing,
2. a trace-driven simulation that demonstrates the performance potential of locality-aware request distribution,
3. an efficient TCP handoff protocol that enables content-based request distribution by providing client-transparent connection handoff for TCP-based network services, and
4. a performance evaluation of a prototype LARD server cluster, incorporating the TCP handoff protocol and the LARD strategy.

The outline of the rest of this paper is as follows: In Section 2 we develop our strategy for locality-aware request distribution. In Section 3 we describe the model used to simulate the performance of LARD in comparison to other request distribution strategies. In Section 4 we present the results of the simulation. In Section 5 we move on to the practical implementation of LARD, particularly the TCP handoff protocol. We describe the experimental environment in which our LARD server is implemented and its measured performance in Section 6. We describe related work in Section 7 and we conclude in Section 8.

2 Strategies for Request Distribution

2.1 Assumptions

The following assumptions hold for all request distribution strategies considered in this paper:

The front-end is responsible for handing off new connections and passing incoming data from the client to the back-end nodes. As a result, it must keep track of open and closed connections, and it can use this information in making load balancing decisions. The front-end is not involved in handling outgoing data, which is sent directly from the back-ends to the clients.

The front-end limits the number of outstanding requests at the back-ends. This approach allows the front-end more flexibility in responding to changing load on the back-ends, since waiting requests can be directed to back-ends as capacity becomes available. In contrast, if we queued requests only on the back-end nodes, a slow node could cause many requests to be delayed even though other nodes might have free capacity.

Any back-end node is capable of serving any target, although in certain request distribution strategies, the front-end may direct a request only to a subset of the back-ends.

2.2 Aiming for Balanced Load

In state-of-the-art cluster servers, the front-end uses weighted round-robin request distribution [7, 14]. The incoming requests are distributed in round-robin fashion, weighted by some measure of the load on the different back-ends. For instance, the CPU and disk utilization, or the number of open connections in each back-end may be used as an estimate of the load.

This strategy produces good load balancing among the back-ends. However, since it does not consider the type of service or requested document in choosing a back-end node, each back-end node is equally likely to receive a given type of request. Therefore, each back-end node receives an approximately identical working set of requests, and caches an approximately identical set of documents. If this working set exceeds the size of main memory available for caching documents, frequent cache misses will occur.

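For concreteness, the following Python sketch shows one possible weighted round-robin dispatcher. It is illustrative only: the load metric (active connections), the credit-based weighting, and the class names are assumptions of this example, not details taken from the systems cited in [7, 14].

    # Illustrative sketch of a weighted round-robin (WRR) front-end policy.
    # Assumes each back-end object exposes a numeric 'load' attribute
    # (e.g., its current number of active connections).
    class WeightedRoundRobin:
        def __init__(self, backends):
            self.backends = backends
            self.credits = {b: 0.0 for b in backends}

        def weight(self, backend):
            # Lightly loaded nodes get larger weights (assumed load metric).
            return 1.0 / (1 + backend.load)

        def dispatch(self, _target):
            # Accumulate credit in proportion to each node's weight, then
            # serve the node holding the most credit (a smooth WRR variant).
            for b in self.backends:
                self.credits[b] += self.weight(b)
            node = max(self.backends, key=lambda b: self.credits[b])
            self.credits[node] -= 1.0
            node.load += 1
            return node

With equal loads this degenerates to plain round-robin; as a node's load grows, its weight, and hence its share of new requests, shrinks. Note that the requested target plays no role in the decision.
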
2.3 Aiming for Locality

In order to improve locality in the back-end's cache, a simple front-end strategy consists of partitioning the name space of the database in some way, and assigning requests for all targets in a particular partition to a particular back-end. For instance, a hash function can be used to perform the partitioning. We will call this strategy locality-based [LB].

A good hashing function partitions both the name space and the working set more or less evenly among the back-ends. If this is the case, the cache in each back-end should achieve a much higher hit rate, since it is only trying to cache its subset of the working set, rather than the entire working set, as with load balancing based approaches. What is a good partitioning for locality may, however, easily prove a poor choice of partitioning for load balancing. For example, if a small set of targets in the working set account for a large fraction of the incoming requests, the back-ends serving those targets will be far more loaded than others.

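A hash-partitioned (LB) front-end can be sketched in a few lines; the use of a stable MD5-based hash of the target name is an implementation choice of this example, not something prescribed by the paper.

    import hashlib

    def lb_dispatch(backends, target):
        # Locality-based (LB) policy sketch: statically partition the target
        # name space across back-ends with a stable hash, ignoring load.
        # (Python's built-in hash() is randomized per process, so a stable
        # digest is used instead.)
        digest = hashlib.md5(target.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(backends)
        return backends[index]

Because the assignment depends only on the target name, every request for a given target always reaches the same back-end, which is exactly what yields both the locality and the load imbalance discussed above.
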
2.4 Basic Locality-Aware Request Distribution

The goal of LARD is to combine good load balancing and high locality. We develop our strategy in two steps. The basic strategy, described in this subsection, always assigns a single back-end node to serve a given target, thus making the idealized assumption that a single target cannot by itself exceed the capacity of one node. This restriction is removed in the next subsection, where we present the complete strategy.

Figure 2 presents pseudo-code for the basic LARD. The front-end maintains a one-to-one mapping of targets to back-end nodes in the server array. When the first request arrives for a given target, it is assigned a back-end node by choosing a lightly loaded back-end node. Subsequent requests are directed to a target's assigned back-end node, unless that node is overloaded. In the latter case, the target is assigned a new back-end node from the current set of lightly loaded nodes.

A node's load is measured as the number of active connections, i.e., connections that have been handed off to the node, have not yet completed, and are showing request activity. Observe that an overloaded node will fall behind and the resulting queuing of requests will cause its number of active connections to increase, while the number of active connections at an underloaded node will tend to zero. Monitoring the relative number of active connections allows the front-end to estimate the amount of "outstanding work" and thus the relative load on a back-end without requiring explicit communication with the back-end node.

    while (true)
        fetch next request r
        if server[r.target] = null then
            n, server[r.target] ← {least loaded node}
        else
            n ← server[r.target]
            if (n.load > T_high && ∃ node with load < T_low) || n.load ≥ 2·T_high then
                n, server[r.target] ← {least loaded node}
        send r to n

    Figure 2: The Basic LARD Strategy

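The following Python sketch mirrors the pseudo-code of Figure 2. It is an illustrative reimplementation, not the authors' front-end code: the Backend class and the load bookkeeping are assumptions of the example, while T_low = 25 and T_high = 65 are the settings reported later in this section.

    # Illustrative sketch of the basic LARD policy of Figure 2 (not the paper's code).
    T_LOW = 25     # load below which a node likely has idle resources
    T_HIGH = 65    # load above which a node likely causes substantial delay

    class Backend:
        def __init__(self, name):
            self.name = name
            self.load = 0      # active (handed-off, not yet completed) connections

    class BasicLARD:
        def __init__(self, backends):
            self.backends = backends
            self.server = {}   # target -> back-end node currently assigned to it

        def least_loaded(self):
            return min(self.backends, key=lambda b: b.load)

        def dispatch(self, target):
            node = self.server.get(target)
            if node is None:
                node = self.least_loaded()
                self.server[target] = node
            else:
                imbalance = (node.load > T_HIGH and
                             any(b.load < T_LOW for b in self.backends))
                if imbalance or node.load >= 2 * T_HIGH:
                    node = self.least_loaded()
                    self.server[target] = node
            node.load += 1     # the connection is now active on this node
            return node

        def complete(self, node):
            node.load -= 1     # call when a handed-off connection finishes

A real front-end would also enforce the admission limit S introduced below, and would decrement a node's load when the corresponding connection completes.
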
The intuition for the basic LARD strategy is as follows: The distribution of targets when they are first requested leads to a partitioning of the name space of the database, and indirectly to a partitioning of the working set, much in the same way as with the strategy purely aiming for locality. It also derives similar locality gains from doing so. Only when there is a significant load imbalance do we diverge from this strategy and re-assign targets. The definition of a "significant load imbalance" tries to reconcile two competing goals. On one hand, we do not want greatly diverging load values on different back-ends. On the other hand, given the cache misses and disk activity resulting from re-assignment, we do not want to re-assign targets to smooth out only minor or temporary load imbalances. It suffices to make sure that no node has idle resources while another back-end is dropping behind.

We define T_low as the load (in number of active connections) below which a back-end is likely to have idle resources. We define T_high as the load above which a node is likely to cause substantial delay in serving requests. If a situation is detected where a node has a load larger than T_high while another node has a load less than T_low, a target is moved from the high-load to the low-load back-end. In addition, to limit the delay variance among different nodes, once a node reaches a load of 2·T_high, a target is moved to a less loaded node, even if no node has a load of less than T_low.

If the front-end did not limit the total number of active connections admitted into the cluster, the load on all nodes could rise to 2·T_high, and LARD would then behave like WRR. To prevent this, the front-end limits the sum total of connections handed to all back-end nodes to the value S = (n - 1)·T_high + T_low - 1, where n is the number of back-end nodes. Setting S to this value ensures that at most n - 2 nodes can have a load ≥ T_high while no node has load < T_low. At the same time, enough connections are admitted to ensure all n nodes can have a load above T_low (i.e., be fully utilized) and still leave room for a limited amount of load imbalance between the nodes (to prevent unnecessary target reassignments in the interest of locality).

The two conditions for deciding when to move a target attempt to ensure that the cost of moving is incurred only when the load difference is substantial enough to warrant doing so. Whenever a target gets reassigned, our two tests combined with the definition of S ensure that the load difference between the old and new targets is at least T_high - T_low. To see this, note that the definition of S implies that there must always exist a node with a load < T_high. The maximal load imbalance that can arise is 2·T_high - T_low.

The appropriate setting for T_low depends on the speed of the back-end nodes. In practice, T_low should be chosen high enough to avoid idle resources on back-end nodes, which could cause throughput loss. Given T_low, choosing T_high involves a tradeoff. T_high - T_low should be low enough to limit the delay variance among the back-ends to acceptable levels, but high enough to tolerate limited load imbalance and short-term load fluctuations without destroying locality.

Simulations to test the sensitivity of our strategy to these parameter settings show that the maximal delay difference increases approximately linearly with T_high - T_low. The throughput increases mildly and eventually flattens as T_high - T_low increases. Therefore, T_high should be set to the largest possible value that still satisfies the desired bound on the delay difference between back-end nodes. Given a desired maximal delay difference of D secs and an average request service time of R secs, T_high should be set to (T_low + D/R)/2, subject to the obvious constraint that T_high > T_low. The setting of T_low can be conservatively high with no adverse impact on throughput and only a mild increase in the average delay. Furthermore, if desired, the setting of T_low can be easily automated by requesting explicit load information from the back-end nodes during a "training phase". In our simulations and in the prototype, we have found settings of T_low = 25 and T_high = 65 active connections to give good performance across all workloads we tested.

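To make the parameter relations concrete, the short calculation below plugs illustrative numbers into the formulas above; D and R are assumed values, chosen only so that the result lands near the reported setting of T_high = 65.

    # Worked example of the LARD parameter formulas; D and R are assumptions.
    T_low = 25       # active connections below which a node likely has idle resources
    D = 8.0          # assumed desired maximal delay difference, in seconds
    R = 0.075        # assumed average request service time, in seconds

    # T_high = (T_low + D/R) / 2, subject to T_high > T_low
    T_high = (T_low + D / R) / 2          # ~65.8, close to the setting of 65 used above
    assert T_high > T_low

    # Total connections admitted into a cluster of n back-end nodes:
    # S = (n - 1) * T_high + T_low - 1
    n = 8
    S = (n - 1) * T_high + T_low - 1

    # Maximal load imbalance that can arise between two nodes: 2*T_high - T_low
    max_imbalance = 2 * T_high - T_low
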
2.5 LARD with Replication

A potential problem with the basic LARD strategy is that a given target is served by only a single node at any given time. However, if a single target causes a back-end to go into an overload situation, the desirable action is to assign several back-end nodes to serve that document, and to distribute requests for that target among the serving nodes. This leads us to the second version of our strategy, which allows replication.

Pseudo-code for this strategy is shown in Figure 3. It differs from the original one as follows: The front-end maintains a mapping from targets to a set of nodes that serve the target. Requests for a target are assigned to the least loaded node in the target's server set. If a load imbalance occurs, the front-end checks if the requested document's server set has changed recently (within K seconds). If so, it picks a lightly loaded node and adds that node to the server set for the target. On the other hand, if a request target has multiple servers and has not moved or had a server node added for some time (K seconds), the front-end removes one node from the target's server set. This ensures that the degree of replication for a target does not remain unnecessarily high once it is requested less often. In our experiments, we used values of K = 20 secs.

    while (true)
        fetch next request r
        if serverSet[r.target] = ∅ then
            n, serverSet[r.target] ← {least loaded node}
        else
            n ← {least loaded node in serverSet[r.target]}
            m ← {most loaded node in serverSet[r.target]}
            if (n.load > T_high && ∃ node with load < T_low) || n.load ≥ 2·T_high then
                p ← {least loaded node}
                add p to serverSet[r.target]
                n ← p
            if |serverSet[r.target]| > 1 && time() - serverSet[r.target].lastMod > K then
                remove m from serverSet[r.target]
        send r to n
        if serverSet[r.target] changed in this iteration then
            serverSet[r.target].lastMod ← time()

    Figure 3: LARD with Replication

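A Python rendering of the replicated policy may again be helpful. This is an illustrative sketch of the pseudo-code in Figure 3 above, not the authors' implementation; it reuses the Backend class and the T_LOW/T_HIGH constants from the earlier sketch, and the guard before shrinking a server set is a small safeguard added for this example.

    import time

    K = 20.0   # seconds without change before a server set may shrink (value used in the paper)

    class ReplicatedLARD:
        def __init__(self, backends):
            self.backends = backends
            self.server_set = {}   # target -> list of back-end nodes serving it
            self.last_mod = {}     # target -> time the server set last changed

        def least_loaded(self, nodes=None):
            pool = nodes if nodes is not None else self.backends
            return min(pool, key=lambda b: b.load)

        def dispatch(self, target):
            now = time.monotonic()
            nodes = self.server_set.get(target)
            changed = False
            if not nodes:
                node = self.least_loaded()
                self.server_set[target] = [node]
                changed = True
            else:
                node = self.least_loaded(nodes)
                overloaded = (node.load > T_HIGH and
                              any(b.load < T_LOW for b in self.backends))
                if overloaded or node.load >= 2 * T_HIGH:
                    extra = self.least_loaded()
                    if extra not in nodes:
                        nodes.append(extra)
                        changed = True
                    node = extra
                if len(nodes) > 1 and now - self.last_mod.get(target, now) > K:
                    most = max(nodes, key=lambda b: b.load)
                    if most is not node:          # guard added for this sketch
                        nodes.remove(most)
                        changed = True
            if changed:
                self.last_mod[target] = now
            node.load += 1
            return node
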
2.6 Discussion

As will be seen in Sections 4 and 6, the LARD strategies result in a good combination of load balancing and locality. In addition, the strategies outlined above have several desirable features. First, they do not require any extra communication between the front-end and the back-ends. Second, the front-end need not keep track of any frequency of access information or try to model the contents of the caches of the back-ends. In particular, the strategy is independent of the local replacement policy used by the back-ends. Third, the absence of elaborate state in the front-end makes it rather straightforward to recover from a back-end node failure. The front-end simply re-assigns targets assigned to the failed back-end as if they had not been assigned before. For all these reasons, we argue that the proposed strategy can be implemented without undue complexity.

In a simple implementation of the two strategies, the size of the server or serverSet arrays, respectively, can grow to the number of targets in the server's database. Despite the low storage overhead per target, this can be of concern in servers with very large databases. In this case, the mappings can be maintained in an LRU cache, where assignments for targets that have not been accessed recently are discarded. Discarding mappings for such targets is of little consequence, as these targets have most likely been evicted from the back-end nodes' caches anyway.

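If the mapping table must be bounded, an LRU dictionary is one simple realization of the idea described above; collections.OrderedDict is used here purely for illustration, since the paper specifies an LRU cache but not its implementation.

    from collections import OrderedDict

    class LRUMapping:
        # Bounded target -> assignment map that evicts the least recently used entry.
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def get(self, target):
            if target not in self.entries:
                return None
            self.entries.move_to_end(target)       # mark as recently used
            return self.entries[target]

        def put(self, target, assignment):
            self.entries[target] = assignment
            self.entries.move_to_end(target)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # discard the least recently used target

Evicting an entry simply forgets where a cold target was assigned; as noted above, the next request for that target is then treated as a first request and assigned to a lightly loaded node.
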
3 Simulation

To study various request distribution policies for a range of cluster sizes under different assumptions for CPU speed, amount of memory, number of disks and other parameters, we developed a configurable web server cluster simulator. We also implemented a prototype of a LARD-based cluster, which is described in Section 6.

3.1 Simulation Model

The simulation model is depicted in Figure 4. Each back-end node consists of a CPU and locally-attached disk(s), with separate queues for each. In addition, each node maintains its own main memory cache of configurable size and replacement policy. For simplicity, caching is performed on a whole-file basis.

Figure 4: Cluster Simulation Model

Processing a request requires the following steps: connection establishment, disk reads (if needed), target data transmission, and connection teardown. The assumption is that front-end and networks are fast enough not to limit the cluster's performance, thus fully exposing the throughput limits of the back-ends. Therefore, the front-end is assumed to have no overhead and all networks have infinite capacity in the simulations.

The individual processing steps for a given request must be performed in sequence, but the CPU and disk times for differing requests can be overlapped. Also, large file reads are blocked, such that the data transmission immediately follows the disk read for each block. Multiple requests waiting on the same file from disk can be satisfied with only one disk read, since all the requests can access the data once it is cached in memory.

The costs for the basic request processing steps used in our simulations were derived by performing measurements on a 300 MHz Pentium II machine running FreeBSD 2.2.5 and an aggressive experimental web server. Connection establishment and teardown costs are set at 145 µs of CPU time each, while transmit processing incurs 40 µs per 512 bytes. Using these numbers, an 8 KByte document can be served from the main memory cache at a rate of approximately 1075 requests/sec.

If disk access is required, reading a file from disk has a latency of 28 ms (2 seeks + rotational latency). The disk transfer time is 410 µs per 4 KByte (resulting in approximately 10 MBytes/sec peak transfer rate). For files larger than 44 KBytes, an additional 14 ms (seek plus rotational latency) is charged for every 44 KBytes of file length in excess of 44 KBytes. 44 KBytes was measured as the average disk transfer size between seeks in our experimental server. Unless otherwise stated, each back-end node has one disk.

The cache replacement policy we chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads [5]. We have also performed simulations with LRU, where files with a size of more than 500 KB are never cached. The relative performance of the various distribution strategies remained largely unaffected. However, the absolute throughput results were up to 30% lower with LRU than with GDS.

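The CPU and disk cost model above can be captured in a few lines. The sketch below is an illustration of the published constants, not the authors' simulator; in particular, the handling of the per-44-KByte seek charge is this example's reading of the rule stated above.

    # Illustrative request cost model from Section 3.1 (not the simulator itself).
    CONN_SETUP_US = 145           # connection establishment, microseconds of CPU time
    CONN_TEARDOWN_US = 145        # connection teardown
    TRANSMIT_US_PER_512B = 40     # transmit processing per 512 bytes

    DISK_LATENCY_US = 28_000      # 2 seeks + rotational latency for the first access
    DISK_US_PER_4KB = 410         # ~10 MBytes/sec peak transfer rate
    EXTRA_SEEK_US = 14_000        # charged per additional 44 KBytes beyond the first

    def cpu_time_us(size_bytes):
        # CPU time to serve one request whose data is already cached.
        blocks = -(-size_bytes // 512)        # ceiling division
        return CONN_SETUP_US + CONN_TEARDOWN_US + blocks * TRANSMIT_US_PER_512B

    def disk_time_us(size_bytes):
        # Disk time charged when the file must be read from disk.
        chunks_4k = -(-size_bytes // 4096)
        extra_seeks = max(0, -(-size_bytes // (44 * 1024)) - 1)
        return DISK_LATENCY_US + chunks_4k * DISK_US_PER_4KB + extra_seeks * EXTRA_SEEK_US

    # An 8 KByte document served from the main memory cache:
    print(round(1e6 / cpu_time_us(8 * 1024)))   # ~1075 requests/sec, matching the text
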
3.2 Simulation Inputs

The input to the simulator is a stream of tokenized target requests, where each token represents a unique target being served. Associated with each token is a target size in bytes. This tokenized stream can be synthetically created, or it can be generated by processing logs from existing web servers.

One of the traces we use was generated by combining logs from multiple departmental web servers at Rice University. This trace spans a two-month period. Another trace comes from IBM Corporation's main web server (www.ibm.com) and represents server logs for a period of 3.5 days starting at midnight, June 1, 1998.

Figures 5 and 6 show the cumulative distributions of request frequency and size for the Rice University trace and the IBM trace, respectively. Shown on the x-axis is the set of target files in the trace, sorted in decreasing order of request frequency. The y-axis shows the cumulative fraction of requests and target sizes, normalized to the total number of requests and total data set size, respectively. The data set for the Rice University trace consists of 37703 targets covering 1418 MB of space, whereas the IBM trace consists of 38527 targets and 1029 MB of space. While the data sets in both traces are of a comparable size, it is evident from the graphs that the Rice trace has much less locality than the IBM trace. In the Rice trace, 560/705/927 MB of memory is needed to cover 97/98/99% of all requests, respectively, while only 51/80/182 MB are needed to cover the same fractions of requests in the IBM trace.

This difference is likely to be caused in part by the different time spans that each trace covers. Also, the IBM trace is from a single high-traffic server, where the content designers have likely spent effort to minimize the sizes of high frequency documents in the interest of performance. The Rice trace, on the other hand, was merged from the logs of several departmental servers.

As with all caching studies, interesting effects can only be observed if the size of the working set exceeds that of the cache. Since even our larger trace has a relatively small data set (and thus a small working set), and also to anticipate future trends in working set sizes, we chose to set the default node cache size in our simulations to 32 MB. Since in reality, the cache has to share main memory with OS kernel and server applications, this typically requires at least 64 MB of memory in an actual server node.

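To give a flavor of how such a tokenized stream can be produced from server logs, the sketch below parses Common Log Format lines into (target, size) pairs; the log format and field positions are assumptions of this example, since the paper does not describe its log-processing tools.

    def tokenize_log(lines):
        # Yield (target, size_in_bytes) request tokens from Common Log Format
        # lines such as:
        #   1.2.3.4 - - [01/Jun/1998:00:00:00 -0500] "GET /a.html HTTP/1.0" 200 2048
        for line in lines:
            try:
                request = line.split('"')[1]           # 'GET /a.html HTTP/1.0'
                target = request.split()[1]            # the requested URL is the "target"
                size_field = line.rsplit(None, 1)[-1]  # CLF response size, '-' if unknown
            except IndexError:
                continue                               # skip malformed lines
            size = int(size_field) if size_field.isdigit() else 0
            yield target, size
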
3.3 Simulation Outputs

The simulator calculates overall throughput, hit rate, and underutilization time. Throughput is the number of requests served per second.

Citations
Proceedings ArticleDOI
21 Oct 2001
TL;DR: Experimental results from a prototype confirm that the system adapts to offered load and resource availability, and can reduce server energy usage by 29% or more for a typical Web workload.
Abstract: Internet hosting centers serve multiple service sites from a common hardware base. This paper presents the design and implementation of an architecture for resource management in a hosting center operating system, with an emphasis on energy as a driving resource management issue for large server clusters. The goals are to provision server resources for co-hosted services in a way that automatically adapts to offered load, improve the energy efficiency of server clusters by dynamically resizing the active server set, and respond to power supply disruptions or thermal events by degrading service in accordance with negotiated Service Level Agreements (SLAs).Our system is based on an economic approach to managing shared server resources, in which services "bid" for resources as a function of delivered performance. The system continuously monitors load and plans resource allotments by estimating the value of their effects on service performance. A greedy resource allocation algorithm adjusts resource prices to balance supply and demand, allocating resources to their most efficient use. A reconfigurable server switching infrastructure directs request traffic to the servers assigned to each service. Experimental results from a prototype confirm that the system adapts to offered load and resource availability, and can reduce server energy usage by 29% or more for a typical Web workload.

1,492 citations

Journal ArticleDOI
TL;DR: A novel dynamic provisioning technique for multi-tier Internet applications that employs a flexible queuing model to determine how much of the resources to allocate to each tier of the application, and a combination of predictive and reactive methods that determine when to provision these resources, both at large and small time scales is proposed.
Abstract: Dynamic capacity provisioning is a useful technique for handling the multi-time-scale variations seen in Internet workloads. In this article, we propose a novel dynamic provisioning technique for multi-tier Internet applications that employs (1) a flexible queuing model to determine how much of the resources to allocate to each tier of the application, and (2) a combination of predictive and reactive methods that determine when to provision these resources, both at large and small time scales. We propose a novel data center architecture based on virtual machine monitors to reduce provisioning overheads. Our experiments on a forty-machine Xen/Linux-based hosting platform demonstrate the responsiveness of our technique in handling dynamic workloads. In one scenario where a flash crowd caused the workload of a three-tier application to double, our technique was able to double the application capacity within five minutes, thus maintaining response-time targets. Our technique also reduced the overhead of switching servers across applications from several minutes to less than a second, while meeting the performance targets of residual sessions.

554 citations

Journal ArticleDOI
TL;DR: This article classifies and describes main mechanisms to split the traffic load among the server nodes, discussing both the alternative architectures and the load sharing policies.
Abstract: The overall increase in traffic on the World Wide Web is augmenting user-perceived response times from popular Web sites, especially in conjunction with special events. System platforms that do not replicate information content cannot provide the needed scalability to handle large traffic volumes and to match rapid and dramatic changes in the number of clients. The need to improve the performance of Web-based services has produced a variety of novel content delivery architectures. This article will focus on Web system architectures that consist of multiple server nodes distributed on a local area, with one or more mechanisms to spread client requests among the nodes. After years of continual proposals of new system solutions, routing mechanisms, and policies (the first dated back to 1994 when the NCSA Web site had to face the first million of requests per day), many problems concerning multiple server architectures for Web sites have been solved. Other issues remain to be addressed, especially at the network application layer, but the main techniques and methodologies for building scalable Web content delivery architectures placed in a single location are settled now. This article classifies and describes main mechanisms to split the traffic load among the server nodes, discussing both the alternative architectures and the load sharing policies. To this purpose, it focuses on architectures, internal routing mechanisms, and dispatching request algorithms for designing and implementing scalable Web-server systems under the control of one content provider. It identifies also some of the open research issues associated with the use of distributed systems for highly accessed Web sites.

525 citations


Cites background or methods from "Locality-aware request distribution..."

  • ...Once the Web switch has established the TCP connection with the client and selected the target server, it hands off its endpoint of the TCP connection to the server, which can communicate directly with the client [Pai et al. 1998]....


  • ...The Locality-Aware Request Distribution (LARD) policy is a content-aware request distribution that considers both locality and load balancing [Aron et al. 1999; Pai et al. 1998]....


  • ...2002] L5 [Apostolopoulos et al. 2000a] [Yang and Luo 2000] Array 500 [Array Networks 2002] Network Dispatcher kernel-level CBR [IBM 2002] ScalaServer [Aron et al. 1999] [Pai et al. 1998] [Tang et al. 2001] ClubWeb [Andreolini et al. 2001] Central Dispatch [Resonate 2002] hardware box with a modified BSDi-Unix kernel, while Web Switch [Lucent Tech....


  • ...Once the Web switch has established the TCP connection with the client and selected the target server, it hands off its endpoint of the TCP connection to the server [79]....


  • ...Two-way One-way TCP gateway TCP splicing TCP handoff TCP connection hop IBM Network [34] ScalaServer [8, 79] Resonate's Dispatcher CBR [61] Central Dispatch [86] CAP [27] Nortel Networks' Web OS SLB [76] HACC [101] Foundry Networks' ServerIron [51] Cisco's CSS [33] F5 Networks' BIG/ip [48] Radware's WSD Pro+ [85] HydraWEB's Hydra2500 [60] Zeus's Load Balancer [100] [98]...


Proceedings Article
06 Jun 1999
TL;DR: This paper presents the design of a new Web server architecture called the asymmetric multi-process event-driven (AMPED) architecture, and evaluates the performance of an implementation of this architecture, the Flash Web server.
Abstract: This paper presents the design of a new Web server architecture called the asymmetric multi-process event-driven (AMPED) architecture, and evaluates the performance of an implementation of this architecture, the Flash Web server. The Flash Web server combines the high performance of single-process event-driven servers on cached workloads with the performance of multiprocess and multi-threaded servers on disk-bound workloads. Furthermore, the Flash Web server is easily portable since it achieves these results using facilities available in all modern operating systems. The performance of different Web server architectures is evaluated in the context of a single implementation in order to quantify the impact of a server's concurrency architecture on its performance. Furthermore, the performance of Flash is compared with two widely-used Web servers, Apache and Zeus. Results indicate that Flash can match or exceed the performance of existing Web servers by up to 50% across a wide range of real workloads. We also present results that show the contribution of various optimizations embedded in Flash.

396 citations


Cites methods from "Locality-aware request distribution..."

  • ...In this area, some work has focused on using multiple server nodes in parallel [6, 10, 13, 16, 19, 28], or sharing memory across machines [12, 15, 21]....


Proceedings ArticleDOI
13 Jun 2005
TL;DR: A novel dynamic provisioning technique for multitier Internet applications that employs a flexible queuing model to determine how much resources to allocate to each tier of the application, and a combination of predictive and reactive methods that determine when to provision these resources, both at large and small time scales is proposed.
Abstract: Dynamic capacity provisioning is a useful technique for handling the multi-time-scale variations seen in Internet workloads. In this paper, we propose a novel dynamic provisioning technique for multitier Internet applications that employs (i) a flexible queuing model to determine how much resources to allocate to each tier of the application, and (ii) a combination of predictive and reactive methods that determine when to provision these resources, both at large and small time scales. Our experiments on a forty-machine Linux-based hosting platform demonstrate the responsiveness of our technique in handling dynamic workloads. In one scenario where a flash crowd caused the workload of a three-tier application to double, our technique was able to double the application capacity within five minutes, thus maintaining response time targets

364 citations


Cites background from "Locality-aware request distribution..."

  • ...The workload predictor outlined in the previous section is not perfect—it may incur prediction errors if the workload on a given day deviates from its behavior on previous days....


References
Proceedings Article
08 Dec 1997
TL;DR: GreedyDual-Size as discussed by the authors incorporates locality with cost and size concerns in a simple and nonparameterized fashion for high performance, which can potentially improve the performance of main-memory caching of Web documents.
Abstract: Web caches can not only reduce network traffic and downloading latency, but can also affect the distribution of web traffic over the network through cost-aware caching. This paper introduces GreedyDual-Size, which incorporates locality with cost and size concerns in a simple and non-parameterized fashion for high performance. Trace-driven simulations show that with the appropriate cost definition, GreedyDual-Size outperforms existing web cache replacement algorithms in many aspects, including hit ratios, latency reduction and network cost reduction. In addition, GreedyDual-Size can potentially improve the performance of main-memory caching of Web documents.

1,048 citations


"Locality-aware request distribution..." refers methods in this paper

  • ...The cache replacement policy we chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads [5]....


ReportDOI
22 Jan 1996
TL;DR: The design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better are discussed, and performance measurements indicate that hierarchy does not measurably increase access latency.
Abstract: This paper discusses the design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better. The design was motivated by our earlier trace-driven simulation study of Internet traffic. We challenge the conventional wisdom that the benefits of hierarchical file caching do not merit the costs, and believe the issue merits reconsideration in the Internet environment. The cache implementation supports a highly concurrent stream of requests. We present performance measurements that show that our cache outperforms other popular Internet cache implementations by an order of magnitude under concurrent load. These measurements indicate that hierarchy does not measurably increase access latency. Our software can also be configured as a Web-server accelerator; we present data that our httpd-accelerator is ten times faster than Netscape's Netsite and NCSA 1.4 servers. Finally, we relate our experience fitting the cache into the increasingly complex and operational world of Internet information systems, including issues related to security, transparency to cache-unaware clients, and the role of file systems in support of ubiquitous wide-area information systems.

853 citations

Book
01 Jan 1989
TL;DR: This book describes the design and implementation of the BSD operating system--previously known as the Berkeley version of UNIX, and is widely used for Internet services and firewalls, timesharing, and multiprocessing systems.
Abstract: The first authoritative description of Berkeley UNIX, its design and implementation. Book covers the internal structure of the 4.3 BSD systems and the concepts, data structures and algorithms used in implementing the system facilities. Chapter on TCP/IP. Annotation copyright Book News, Inc., Portland, OR.

702 citations

Proceedings ArticleDOI
01 Oct 1997
TL;DR: A general, layered architecture for building cluster-based scalable network services that encapsulates the above requirements for reuse, and a service-programming model based on composable workers that perform transformation, aggregation, caching, and customization (TACC) of Internet content is proposed.
Abstract: We identify three fundamental requirements for scalable network services: incremental scalability and overflow growth provisioning, 24x7 availability through fault masking, and cost-effectiveness. We argue that clusters of commodity workstations interconnected by a high-speed SAN are exceptionally well-suited to meeting these challenges for Internet-server workloads, provided the software infrastructure for managing partial failures and administering a large cluster does not have to be reinvented for each new service. To this end, we propose a general, layered architecture for building cluster-based scalable network services that encapsulates the above requirements for reuse, and a service-programming model based on composable workers that perform transformation, aggregation, caching, and customization (TACC) of Internet content. For both performance and implementation simplicity, the architecture and TACC programming model exploit BASE, a weaker-than-ACID data semantics that results from trading consistency for availability and relying on soft state for robustness in failure management. Our architecture can be used as an off the shelf infrastructural platform for creating new network services, allowing authors to focus on the content of the service (by composing TACC building blocks) rather than its implementation. We discuss two real implementations of services based on this architecture: TranSend, a Web distillation proxy deployed to the UC Berkeley dialup IP population, and HotBot, the commercial implementation of the Inktomi search engine. We present detailed measurements of TranSend's performance based on substantial client traces, as well as anecdotal evidence from the TranSend and HotBot experience, to support the claims made for the architecture.

666 citations

Book
01 Mar 1991
TL;DR: The Berkeley version of UNIX (BSD) as discussed by the authors is a popular operating system for Internet services and firewalls, timesharing, and multiprocessing systems.
Abstract: This book describes the design and implementation of the BSD operating system--previously known as the Berkeley version of UNIX. Today, BSD is found in nearly every variant of UNIX, and is widely used for Internet services and firewalls, timesharing, and multiprocessing systems. Readers involved in technical and sales support can learn the capabilities and limitations of the system; applications developers can learn effectively and efficiently how to interface to the system; systems programmers can learn how to maintain, tune, and extend the system. Written from the unique perspective of the system's architects, this book delivers the most comprehensive, up-to-date, and authoritative technical information on the internal structure of the latest BSD system.As in the previous book on 4.3BSD (with Samuel Leffler), the authors first update the history and goals of the BSD system. Next they provide a coherent overview of its design and implementation. Then, while explaining key design decisions, they detail the concepts, data structures, and algorithms used in implementing the system's facilities. As an in-depth study of a contemporary, portable operating system, or as a practical reference, readers will appreciate the wealth of insight and guidance contained in this book.Highlights of the book: Details major changes in process and memory management Describes the new extensible and stackable filesystem interface Includes an invaluable chapter on the new network filesystem Updates information on networking and interprocess communication

639 citations
