Locality-aware request distribution in cluster-based network servers
Summary (3 min read)
2.2 Aiming for Balanced Load
- This strategy produces good load balancing among the back-ends.
- Each back-end, however, sees requests drawn from the entire working set; if this working set exceeds the size of main memory available for caching documents, frequent cache misses will occur.
2.3 Aiming for Locality
- A good hashing function partitions both the name space and the working set more or less evenly among the back-ends (see the sketch after this list).
- If this is the case, the cache in each back-end should achieve a much higher hit rate, since it is only trying to cache its subset of the working set, rather than the entire working set, as with load-balancing-based approaches.
- What is a good partitioning for locality may, however, easily prove a poor choice of partitioning for load balancing.
- If a small set of targets in the working set accounts for a large fraction of the incoming requests, the back-ends serving those targets will be far more loaded than others.
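The paper does not prescribe a particular hash; as one illustrative possibility (the function name and the choice of MD5 are assumptions, not from the paper), a minimal Python sketch of static hash partitioning:

```python
import hashlib

def assign_node(target: str, num_nodes: int) -> int:
    """Map a target name to a back-end node index.

    A stable hash sends every request for the same target to the same
    node, so each node only caches its slice of the working set.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes
```

As the bullets note, such a static partitioning ignores load: if a few hot targets hash to the same node, that node is overloaded while the others sit idle.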
2.4 Basic Locality-Aware Request Distribution
- Simulations to test the sensitivity of their strategy to these parameter settings show that the maximal delay difference increases approximately linearly with T_high - T_low.
- The throughput increases mildly and eventually flattens as T_high - T_low increases.
- T_high should be set to the largest possible value that still satisfies the desired bound on the delay difference between back-end nodes (the sketch after this list shows how the thresholds drive reassignment).
- The setting of T_low can be conservatively high with no adverse impact on throughput and only a mild increase in the average delay.
- Furthermore, if desired, the setting of T_low can be easily automated by requesting explicit load information from the back-end nodes during a "training phase".
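The basic LARD loop that uses these thresholds assigns each target to a single back-end and reassigns it only when the thresholds signal imbalance. A minimal Python sketch follows; the Node class, the dispatch wrapper, and the concrete threshold values are illustrative assumptions, while the reassignment condition mirrors the pseudocode in the paper:

```python
from dataclasses import dataclass

T_LOW, T_HIGH = 25, 65   # illustrative load thresholds (active connections)

@dataclass
class Node:
    name: str
    load: int = 0        # active connections currently on this back-end

class LardFrontEnd:
    def __init__(self, nodes):
        self.nodes = nodes
        self.server = {}  # target -> currently assigned back-end node

    def dispatch(self, target):
        node = self.server.get(target)
        if node is None:
            # First request for this target: assign the least loaded node.
            node = self.server[target] = min(self.nodes, key=lambda n: n.load)
        elif ((node.load > T_HIGH and any(n.load < T_LOW for n in self.nodes))
              or node.load >= 2 * T_HIGH):
            # Assigned node is loaded while some node is lightly loaded,
            # or it is severely overloaded: move the target elsewhere.
            node = self.server[target] = min(self.nodes, key=lambda n: n.load)
        node.load += 1    # caller decrements when the connection completes
        return node
```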
2.5 LARD with Replication
- A potential problem with the basic LARD strategy is that a given target is served by only a single node at any given time.
- If a single target causes a back-end to go into an overload situation, the desirable action is to assign several back-end nodes to serve that document, and to distribute requests for that target among the serving nodes.
- The front-end maintains a mapping from targets to a set of nodes that serve the target.
- Requests for a target are assigned to the least loaded node in the target's server set.
- If a target's server set has not been modified for some time, the front-end removes the most loaded node from the set.
- This ensures that the degree of replication for a target does not remain unnecessarily high once it is requested less often (a sketch of the full policy follows this list).
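A sketch of LARD/R under the same assumptions as the basic sketch above (it reuses Node, T_LOW, and T_HIGH); the shrink interval K_SECONDS and the timing bookkeeping are illustrative assumptions, while the grow and shrink conditions follow the policy described in the paper:

```python
import time

K_SECONDS = 20.0  # illustrative idle period before a server set may shrink

class LardRFrontEnd(LardFrontEnd):
    def __init__(self, nodes):
        super().__init__(nodes)
        self.server_set = {}  # target -> list of nodes serving it
        self.last_mod = {}    # target -> time the server set last changed

    def dispatch(self, target):
        sset = self.server_set.setdefault(target, [])
        now, changed = time.time(), False
        if not sset:
            sset.append(min(self.nodes, key=lambda n: n.load))
            changed = True
        # Serve from the least loaded member of the target's server set.
        node = min(sset, key=lambda n: n.load)
        if ((node.load > T_HIGH and any(n.load < T_LOW for n in self.nodes))
                or node.load >= 2 * T_HIGH):
            # Overload: grow the set with the globally least loaded node.
            extra = min(self.nodes, key=lambda n: n.load)
            if extra not in sset:
                sset.append(extra)
                changed = True
            node = extra
        elif len(sset) > 1 and now - self.last_mod.get(target, now) > K_SECONDS:
            # Set has been stable for a while: drop the most loaded replica.
            sset.remove(max(sset, key=lambda n: n.load))
            changed = True
        if changed:
            self.last_mod[target] = now
        node.load += 1
        return node
```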
2.6 Discussion
- The front-end must maintain a mapping for every target ever requested, which can be of concern in servers with very large databases.
- The mappings can be maintained in an LRU cache, where assignments for targets that have not been accessed recently are discarded (see the sketch below).
- Discarding mappings for such targets is of little consequence, as these targets have most likely been evicted from the back-end nodes' caches anyway.
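A minimal sketch of such a bounded mapping table, assuming Python's OrderedDict for the LRU ordering; the class name and capacity are illustrative:

```python
from collections import OrderedDict

class MappingCache:
    """target -> node mapping with LRU eviction of stale assignments."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, target):
        node = self.entries.get(target)
        if node is not None:
            self.entries.move_to_end(target)  # mark as recently used
        return node

    def put(self, target, node):
        self.entries[target] = node
        self.entries.move_to_end(target)
        if len(self.entries) > self.capacity:
            # Drop the least recently used mapping; that target has most
            # likely fallen out of its back-end's cache anyway.
            self.entries.popitem(last=False)
```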
3.1 Simulation Model
- The cache replacement policy the authors chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads [5] (a sketch follows this list).
- The authors have also performed simulations with LRU, where files larger than 500 KB are never cached.
- The relative performance of the various distribution strategies remained largely unaffected.
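For concreteness, a sketch of Greedy-Dual-Size with uniform cost (the variant that favors hit rate); the lazy heap cleanup and all names are implementation assumptions, not details from the paper:

```python
import heapq

class GreedyDualSizeCache:
    """GDS with cost = 1: each document's value is L + 1/size."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.inflation = 0.0  # the global value L in GDS
        self.value = {}       # doc -> current value H(doc)
        self.size = {}        # doc -> size in bytes (assumed > 0)
        self.heap = []        # (H, doc) min-heap; may hold stale entries

    def _set_value(self, doc):
        # On a hit or an insertion: H(doc) = L + cost/size, with cost = 1.
        h = self.inflation + 1.0 / self.size[doc]
        self.value[doc] = h
        heapq.heappush(self.heap, (h, doc))

    def access(self, doc, size):
        """Returns True on a cache hit, False on a miss."""
        if doc in self.value:
            self._set_value(doc)
            return True
        # Miss: evict the lowest-valued documents until the new one fits.
        while self.used + size > self.capacity and self.value:
            h, victim = heapq.heappop(self.heap)
            if self.value.get(victim) != h:
                continue          # stale heap entry, skip it
            self.inflation = h    # raise L to the evicted value
            self.used -= self.size.pop(victim)
            del self.value[victim]
        if size <= self.capacity:
            self.size[doc] = size
            self.used += size
            self._set_value(doc)
        return False
```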
3.3 Simulation Outputs
- This value was determined by inspection of the simulator's disk and CPU activity statistics as a point below which a node's disk and CPU both had some idle time in virtually all cases.
- The cache hit rate gives an indication of how well locality is being maintained, and the node underutilization times indicate how well load balancing is maintained.
4 Simulation Results
- The throughput achieved with LARD/R slightly exceeds that of LARD for seven or more nodes, while achieving a lower cache miss ratio and lower idle time.
- While WRR/GMS achieves a substantial performance advantage over WRR, its throughput remains below 50% of LARD and LARD/R's throughput for all cluster sizes.
- Figure 10 shows the throughput results obtained for the various strategies on the IBM trace (www.ibm.com).
- The average file size is smaller than in the Rice trace, resulting in much larger throughput numbers for all strategies.
- Thus, LARD and LARD/R achieve superlinear speedup only up to 4 nodes in this trace, resulting in a throughput that is slightly more than twice that of WRR for 4 nodes and above.
4.2 Other Workloads
- The authors also ran simulations on a trace from the IBM web server hosting the Deep Blue/Kasparov chess match in May 1997.
- The working set of this trace is very small, so even the main memory cache of a single node (32 MB) achieves a low miss ratio.
- This trace presents a best-case scenario for WRR and a worst-case scenario for LARD, as there is nothing to be gained from an aggregation of cache size, but there is the potential to lose performance due to imperfect load balancing.
- The authors' results show that both LARD and LARD/R closely match the performance of WRR on this trace.
- This is reassuring, as it demonstrates that their strategy can match the performance of WRR even under conditions that are favorable to WRR.
4.4 Delay
- Connection establishment, handoff, and forwarding are independent for different connections, and can be easily parallelized [24].
- The dispatcher, on the other hand, requires shared state and thus synchronization among the CPUs.
- With a simple policy such as LARD/R, the time spent in the dispatcher amounts to only a small fraction of the handoff overhead (10-20%).
- Therefore, the authors fully expect that the front-end performance can be scaled to larger clusters effectively using an inexpensive SMP platform equipped with multiple network interfaces.
6.3 Cluster Performance Results
- IBM's Lava project [18] uses the concept of a "hit server".
- The hit server is a specially configured server node responsible for serving cached content.
- Its specialized OS and client-server protocols give it superior performance for handling HTTP requests of cached documents, but limit it to private intranets.
- Requests for uncached documents and dynamic content are delegated to a separate, conventional HTTP server node.
- The authors' work shares some of the same goals, but maintains standard client-server protocols, maintains support for dynamic content generation, and focuses on cluster servers.
8 Conclusion
- Caching can also be effective for dynamically generated content [15].
- Moreover, resources required for dynamic content generation, like server processes, executables, and primary data files, are also cacheable.
- While further research is required, the authors expect that increased locality can benefit dynamic content serving, and that therefore the advantages of LARD also apply to dynamic content.
Frequently Asked Questions (13)
Q2. what is the best known cache replacement policy for Web workloads?
The cache replacement policy the authors chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads.
Q3. how many nodes can hold the working set?
The Rice trace requires the combined cache size of eight to ten nodes to hold the working set. Since WRR cannot aggregate the cache size, the server remains disk bound for all cluster sizes. LARD and LARD/R, on the other hand, cause the system to become increasingly CPU bound for eight or more nodes, resulting in superlinear speedup in that node region, with linear but steeper speedup for more than ten nodes.
Q4. what is the protocol module operating above the network interface?
The module operates directly above the network interface and executes in the context of the network interface interrupt handler. A simple hash table lookup is required to determine whether a packet should be forwarded.
Q5. how much time is spent in the dispatcher?
The dispatcher, on the other hand, requires shared state and thus synchronization among the CPUs. However, with a simple policy such as LARD/R, the time spent in the dispatcher amounts to only a small fraction of the handoff overhead.
Q6. how many disk reads are needed when multiple requests wait on the same file?
Also, large file reads are blocked such that the data transmission immediately follows the disk read for each block. Multiple requests waiting on the same file from disk can be satisfied with only one disk read, since all the requests can access the data once it is cached in memory.
Q7. how many MB of memory is needed to cover all requests in the Rice trace?
In the Rice trace, … MB of memory is needed to cover … of all requests, respectively, while only … MB are needed to cover the same fractions of requests in the IBM trace.
Q8. what is the effect of multiple disks on throughput?
This can be expected, as the increased cache effectiveness of LARD/R causes a reduced dependence on disk speed. WRR, on the other hand, greatly benefits from multiple disks, as its throughput is mainly bound by the performance of the disk subsystem.
Q9. what is the effect of adding disks on the performance of the system?
In their final set of simulations, the authors explore the impact of using multiple disks in each back-end node on the relative performance of LARD/R versus WRR. Separate figures show the throughput results for WRR and LARD/R on the combined Rice University trace with different numbers of disks per back-end node. With LARD/R, a second disk per node yields a mild throughput gain, but additional disks do not achieve any further benefit.
Q10. what CPU speed settings were simulated for the Rice trace?
The authors performed simulations on the Rice trace with the default CPU speed setting explained in Section …, and with twice, three, and four times the default speed setting.
Q11. how can the front end be scaled to larger clusters?
The authors have not been able to measure such high throughput directly due to lack of network resources, but the measured remaining CPU idle time in the front-end at lower throughput is consistent with this figure. Further measurements indicate that, with the Rice University trace as the workload, the handoff throughput and forwarding throughput are sufficient to support … back-end nodes of the same CPU speed as the front-end. Moreover, the front-end can be relatively easily scaled to larger clusters, either by upgrading to a faster CPU or by employing an SMP machine. Connection establishment, handoff, and forwarding are independent for different connections and can be easily parallelized.
Q12. why are cluster-based network servers relevant?
Their proposal addresses the complementary issue of providing support for cost-effective, scalable network servers. Network servers based on clusters of workstations are starting to be widely used. Several products are available or have been announced for use as front-end nodes in such cluster servers.
Q13. what is the difference between LB and WRR?
This can be clearly seen in the flat cache miss ratio curve for WRR. As expected, both LB schemes achieve a decrease in cache miss ratio as the number of nodes increases.