Locality-aware request distribution in cluster-based network servers
Summary (3 min read)
2.2 Aiming for Balanced Load
- This strategy produces good load balancing among the back-ends.
- Each back-end, however, sees requests drawn from the entire working set; if this working set exceeds the size of main memory available for caching documents, frequent cache misses will occur.
2.3 Aiming for Locality
- A good hashing function partitions both the name space and the working set more or less evenly among the back-ends (see the sketch after this list).
- If this is the case, the cache in each back-end should achieve a much higher hit rate, since it is only trying to cache its subset of the working set, rather than the entire working set, as with load-balancing-based approaches.
- What is a good partitioning for locality may, however, easily prove a poor choice of partitioning for load balancing.
- If a small set of targets in the working set accounts for a large fraction of the incoming requests, the back-ends serving those targets will be far more loaded than others.
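The paper does not prescribe a particular hash; as one illustrative possibility (the function name and the choice of MD5 are assumptions, not from the paper), a minimal Python sketch of static hash partitioning:

```python
import hashlib

def assign_node(target: str, num_nodes: int) -> int:
    """Map a target name to a back-end node index.

    A stable hash sends every request for the same target to the same
    node, so each node only caches its slice of the working set.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes
```

As the bullets note, such a static partitioning ignores load: if a few hot targets hash to the same node, that node is overloaded while the others sit idle.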
2.4 Basic Locality-Aware Request Distribution
- Simulations to test the sensitivity of their strategy to these parameter settings show that the maximal delay difference increases approximately linearly with T_high - T_low.
- The throughput increases mildly and eventually flattens as T_high - T_low increases.
- T_high should be set to the largest possible value that still satisfies the desired bound on the delay difference between back-end nodes (the sketch after this list shows how the thresholds drive reassignment).
- The setting of T_low can be conservatively high with no adverse impact on throughput and only a mild increase in the average delay.
- Furthermore, if desired, the setting of T_low can be easily automated by requesting explicit load information from the back-end nodes during a "training phase".
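The basic LARD loop that uses these thresholds assigns each target to a single back-end and reassigns it only when the thresholds signal imbalance. A minimal Python sketch follows; the Node class, the dispatch wrapper, and the concrete threshold values are illustrative assumptions, while the reassignment condition mirrors the pseudocode in the paper:

```python
from dataclasses import dataclass

T_LOW, T_HIGH = 25, 65   # illustrative load thresholds (active connections)

@dataclass
class Node:
    name: str
    load: int = 0        # active connections currently on this back-end

class LardFrontEnd:
    def __init__(self, nodes):
        self.nodes = nodes
        self.server = {}  # target -> currently assigned back-end node

    def dispatch(self, target):
        node = self.server.get(target)
        if node is None:
            # First request for this target: assign the least loaded node.
            node = self.server[target] = min(self.nodes, key=lambda n: n.load)
        elif ((node.load > T_HIGH and any(n.load < T_LOW for n in self.nodes))
              or node.load >= 2 * T_HIGH):
            # Assigned node is loaded while some node is lightly loaded,
            # or it is severely overloaded: move the target elsewhere.
            node = self.server[target] = min(self.nodes, key=lambda n: n.load)
        node.load += 1    # caller decrements when the connection completes
        return node
```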
2.5 LARD with Replication
- A potential problem with the basic LARD strategy is that a given target is served by only a single node at any given time.
- If a single target causes a back-end to go into an overload situation, the desirable action is to assign several back-end nodes to serve that document, and to distribute requests for that target among the serving nodes.
- The front-end maintains a mapping from targets to a set of nodes that serve the target.
- Requests for a target are assigned to the least loaded node in the target's server set.
- If a target's server set has not been modified for some time, the front-end removes the most loaded node from the set.
- This ensures that the degree of replication for a target does not remain unnecessarily high once it is requested less often (a sketch of the full policy follows this list).
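A sketch of LARD/R under the same assumptions as the basic sketch above (it reuses Node, T_LOW, and T_HIGH); the shrink interval K_SECONDS and the timing bookkeeping are illustrative assumptions, while the grow and shrink conditions follow the policy described in the paper:

```python
import time

K_SECONDS = 20.0  # illustrative idle period before a server set may shrink

class LardRFrontEnd(LardFrontEnd):
    def __init__(self, nodes):
        super().__init__(nodes)
        self.server_set = {}  # target -> list of nodes serving it
        self.last_mod = {}    # target -> time the server set last changed

    def dispatch(self, target):
        sset = self.server_set.setdefault(target, [])
        now, changed = time.time(), False
        if not sset:
            sset.append(min(self.nodes, key=lambda n: n.load))
            changed = True
        # Serve from the least loaded member of the target's server set.
        node = min(sset, key=lambda n: n.load)
        if ((node.load > T_HIGH and any(n.load < T_LOW for n in self.nodes))
                or node.load >= 2 * T_HIGH):
            # Overload: grow the set with the globally least loaded node.
            extra = min(self.nodes, key=lambda n: n.load)
            if extra not in sset:
                sset.append(extra)
                changed = True
            node = extra
        elif len(sset) > 1 and now - self.last_mod.get(target, now) > K_SECONDS:
            # Set has been stable for a while: drop the most loaded replica.
            sset.remove(max(sset, key=lambda n: n.load))
            changed = True
        if changed:
            self.last_mod[target] = now
        node.load += 1
        return node
```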
2.6 Discussion
- The front-end must maintain a mapping for every target ever requested, which can be of concern in servers with very large databases.
- The mappings can be maintained in an LRU cache, where assignments for targets that have not been accessed recently are discarded (see the sketch below).
- Discarding mappings for such targets is of little consequence, as these targets have most likely been evicted from the back-end nodes' caches anyway.
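A minimal sketch of such a bounded mapping table, assuming Python's OrderedDict for the LRU ordering; the class name and capacity are illustrative:

```python
from collections import OrderedDict

class MappingCache:
    """target -> node mapping with LRU eviction of stale assignments."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, target):
        node = self.entries.get(target)
        if node is not None:
            self.entries.move_to_end(target)  # mark as recently used
        return node

    def put(self, target, node):
        self.entries[target] = node
        self.entries.move_to_end(target)
        if len(self.entries) > self.capacity:
            # Drop the least recently used mapping; that target has most
            # likely fallen out of its back-end's cache anyway.
            self.entries.popitem(last=False)
```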
3.1 Simulation Model
- The cache replacement policy the authors chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads [5] (a sketch follows this list).
- The authors have also performed simulations with LRU, where files larger than 500 KB are never cached.
- The relative performance of the various distribution strategies remained largely unaffected.
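For concreteness, a sketch of Greedy-Dual-Size with uniform cost (the variant that favors hit rate); the lazy heap cleanup and all names are implementation assumptions, not details from the paper:

```python
import heapq

class GreedyDualSizeCache:
    """GDS with cost = 1: each document's value is L + 1/size."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.inflation = 0.0  # the global value L in GDS
        self.value = {}       # doc -> current value H(doc)
        self.size = {}        # doc -> size in bytes (assumed > 0)
        self.heap = []        # (H, doc) min-heap; may hold stale entries

    def _set_value(self, doc):
        # On a hit or an insertion: H(doc) = L + cost/size, with cost = 1.
        h = self.inflation + 1.0 / self.size[doc]
        self.value[doc] = h
        heapq.heappush(self.heap, (h, doc))

    def access(self, doc, size):
        """Returns True on a cache hit, False on a miss."""
        if doc in self.value:
            self._set_value(doc)
            return True
        # Miss: evict the lowest-valued documents until the new one fits.
        while self.used + size > self.capacity and self.value:
            h, victim = heapq.heappop(self.heap)
            if self.value.get(victim) != h:
                continue          # stale heap entry, skip it
            self.inflation = h    # raise L to the evicted value
            self.used -= self.size.pop(victim)
            del self.value[victim]
        if size <= self.capacity:
            self.size[doc] = size
            self.used += size
            self._set_value(doc)
        return False
```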
3.3 Simulation Outputs
- This value was determined by inspection of the simulator's disk and CPU activity statistics as a point below which a node's disk and CPU both had some idle time in virtually all cases.
- The cache hit rate gives an indication of how well locality is being maintained, and the node underutilization times indicate how well load balancing is maintained.
4 Simulation Results
- The throughput achieved with LARD/R slightly exceeds that of LARD for seven or more nodes, while achieving a lower cache miss ratio and lower idle time.
- While WRR/GMS achieves a substantial performance advantage over WRR, its throughput remains below 50% of LARD and LARD/R's throughput for all cluster sizes.
- Figure 10 shows the throughput results obtained for the various strategies on the IBM trace (www.ibm.com).
- The average file size is smaller than in the Rice trace, resulting in much larger throughput numbers for all strategies.
- Thus, LARD and LARD/R achieve superlinear speedup only up to 4 nodes in this trace, resulting in a throughput that is slightly more than twice that of WRR for 4 nodes and above.
4.2 Other Workloads
- The authors also ran simulations on a trace from the IBM web server hosting the Deep Blue/Kasparov chess match in May 1997.
- The working set of this trace is very small, so even the main memory cache of a single node (32 MB) achieves a low miss ratio.
- This trace presents a best-case scenario for WRR and a worst-case scenario for LARD, as there is nothing to be gained from an aggregation of cache size, but there is the potential to lose performance due to imperfect load balancing.
- The authors' results show that both LARD and LARD/R closely match the performance of WRR on this trace.
- This is reassuring, as it demonstrates that their strategy can match the performance of WRR even under conditions that are favorable to WRR.
4.4 Delay
- Connection establishment, handoff, and forwarding are independent for different connections, and can be easily parallelized [24].
- The dispatcher, on the other hand, requires shared state and thus synchronization among the CPUs.
- With a simple policy such as LARD/R, the time spent in the dispatcher amounts to only a small fraction of the handoff overhead (10-20%).
- Therefore, the authors fully expect that the front-end performance can be scaled to larger clusters effectively using an inexpensive SMP platform equipped with multiple network interfaces.
6.3 Cluster Performance Results
- IBM's Lava project [18] uses the concept of a "hit server".
- The hit server is a specially configured server node responsible for serving cached content.
- Its specialized OS and client-server protocols give it superior performance for handling HTTP requests of cached documents, but limit it to private intranets.
- Requests for uncached documents and dynamic content are delegated to a separate, conventional HTTP server node.
- The authors' work shares some of the same goals, but maintains standard client-server protocols, maintains support for dynamic content generation, and focuses on cluster servers.
8 Conclusion
- Caching can also be effective for dynamically generated content [15].
- Moreover, resources required for dynamic content generation, like server processes, executables, and primary data files, are also cacheable.
- While further research is required, the authors expect that increased locality can benefit dynamic content serving, and that therefore the advantages of LARD also apply to dynamic content.
Frequently Asked Questions (13)
Q2. what is the best known cache replacement policy for Web workloads?
The cache replacement policy the authors chose for all simulations is Greedy-Dual-Size (GDS), as it appears to be the best known policy for Web workloads.
Q3. how many nodes can hold the working set?
The Rice trace requires the combined cache size of eight to ten nodes to hold the working set. Since WRR cannot aggregate the cache size, the server remains disk bound for all cluster sizes. LARD and LARD/R, on the other hand, cause the system to become increasingly CPU bound for eight or more nodes, resulting in superlinear speedup in that node region, with linear but steeper speedup for more than ten nodes.
Q4. what is the protocol module operating above the network interface?
The module operates directly above the network interface and executes in the context of the network interface interrupt handler. A simple hash table lookup is required to determine whether a packet should be forwarded.
Q5. how much time is spent in the dispatcher?
The dispatcher, on the other hand, requires shared state and thus synchronization among the CPUs. However, with a simple policy such as LARD/R, the time spent in the dispatcher amounts to only a small fraction of the handoff overhead.
Q6. how many disk reads are needed when multiple requests wait on the same file?
Also, large file reads are blocked such that the data transmission immediately follows the disk read for each block. Multiple requests waiting on the same file from disk can be satisfied with only one disk read, since all the requests can access the data once it is cached in memory.
Q7. how many MB of memory is needed to cover all requests in the Rice trace?
In the Rice trace, … MB of memory is needed to cover … of all requests, respectively, while only … MB are needed to cover the same fractions of requests in the IBM trace.
Q8. what is the effect of multiple disks on throughput?
This can be expected, as the increased cache effectiveness of LARD/R causes a reduced dependence on disk speed. WRR, on the other hand, greatly benefits from multiple disks, as its throughput is mainly bound by the performance of the disk subsystem.
Q9. what is the effect of adding disks on the performance of the system?
In their final set of simulations, the authors explore the impact of using multiple disks in each back-end node on the relative performance of LARD/R versus WRR. Separate figures show the throughput results for WRR and LARD/R on the combined Rice University trace with different numbers of disks per back-end node. With LARD/R, a second disk per node yields a mild throughput gain, but additional disks do not achieve any further benefit.
Q10. what CPU speed settings were simulated for the Rice trace?
The authors performed simulations on the Rice trace with the default CPU speed setting explained in Section …, and with twice, three, and four times the default speed setting.
Q11. how can the front end be scaled to larger clusters?
The authors have not been able to measure such high throughput directly due to lack of network resources, but the measured remaining CPU idle time in the front-end at lower throughput is consistent with this figure. Further measurements indicate that, with the Rice University trace as the workload, the handoff throughput and forwarding throughput are sufficient to support … back-end nodes of the same CPU speed as the front-end. Moreover, the front-end can be relatively easily scaled to larger clusters, either by upgrading to a faster CPU or by employing an SMP machine. Connection establishment, handoff, and forwarding are independent for different connections and can be easily parallelized.
Q12. why are cluster-based network servers relevant?
Their proposal addresses the complementary issue of providing support for cost-effective, scalable network servers. Network servers based on clusters of workstations are starting to be widely used. Several products are available or have been announced for use as front-end nodes in such cluster servers.
Q13. what is the difference between LB and WRR?
This can be clearly seen in the flat cache miss ratio curve for WRR. As expected, both LB schemes achieve a decrease in cache miss ratio as the number of nodes increases.