File popularity characterisation
Summary
1. Introduction.
- WWW caching has proved a valuable technique for scaling up the internet [ABR95, BAE97].
- This would be useful because previous observations of Zipf’s law have been largely culture-independent, and if some culture-independent cache metrics could be established, cache models would not need to take account of cultural effects.
- It is not at all clear that cache logs reflect human choices, since not all of a user’s web requests reach the network cache.
- The authors offer explanations, derived from the literature and their own imagination, and propose tests of the explanations.
2. Theories.
- One possible hypothesis (derived from a related proposal by Zipf [ZIP49]) is that caches at different levels of the hierarchy have different exponents for best-fit power laws, and caches higher up the hierarchy would have smaller exponents.
- This is because requests for more popular files are reduced more than requests for less popular files, since only the first request for a file from a low level cache reaches a high level cache.
- This can be tested by accurately determining the exponent for a range of caches, at the same position in the hierarchy, and finding a correlation between exponent and size.
- If the behaviour of individuals is strongly correlated (e.g. by information waves) on a range of timescales with an infinite variance, then the popularity curve exponent will exhibit variation regardless of sample size or timescale.
- From consideration of the work of Zipf on word use in different cultures, it seems likely that cultural differences will often be expressed through differences in the K factor in the power curve rather than the exponent.
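The filtering hypothesis above can be illustrated with a small simulation. This is a minimal sketch, not the authors' method: it assumes each low-level cache is infinite and never expires files, so only the first request for a file is forwarded upstream, and it draws client requests from a power law with a chosen exponent. The function names and all parameter values are hypothetical.

```python
import random
from collections import Counter


def zipf_requests(n_files, n_requests, exponent, rng):
    """Draw file ids so that P(rank r) is proportional to r**(-exponent)."""
    weights = [rank ** -exponent for rank in range(1, n_files + 1)]
    return rng.choices(range(n_files), weights=weights, k=n_requests)


def filter_through_cache(requests):
    """Assumed infinite low-level cache: forward only the first request per file."""
    seen = set()
    forwarded = []
    for file_id in requests:
        if file_id not in seen:
            seen.add(file_id)
            forwarded.append(file_id)
    return forwarded


def simulate(n_low_caches=20, n_files=1000, n_requests=5000,
             exponent=1.0, seed=0):
    """Return (client_counts, upstream_counts) summed over all low-level caches."""
    rng = random.Random(seed)
    client_counts = Counter()
    upstream_counts = Counter()
    for _ in range(n_low_caches):
        requests = zipf_requests(n_files, n_requests, exponent, rng)
        client_counts.update(requests)
        upstream_counts.update(filter_through_cache(requests))
    return client_counts, upstream_counts
```

In this toy model a file can reach the high-level cache at most once per low-level cache, so the counts for popular files are cut far more than those for rare files, and the upstream popularity curve is flatter than the client curve.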
3. Techniques.
- To analyse file popularity, cache logs are usually needed, the only alternative being the correctly processed output from such a cache log.
- The authors are indebted to several sources for making their logs available, and hope this is fully shown in the acknowledgements.
- At the moment cache logs do not contain the means to discriminate between the physical request made by the client and files that are requested by linkage (linked image files, redirections, etc.) to the requested files.
- Another analysis irregularity is that some researchers look at the popularity of generic hosts and not files.
- The quality of the fit was checked using the standard R² test.
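The fitting step can be sketched as an ordinary least-squares line through the points (log rank, log request count), with R² as the quality measure. This is a minimal sketch under stated assumptions: files with equal counts are ranked in arbitrary sort order here, which may differ from the ranking method the authors used, and the function names are hypothetical.

```python
import math
from collections import Counter


def popularity_curve(requests):
    """Return (rank, count) pairs, most requested file first."""
    counts = sorted(Counter(requests).values(), reverse=True)
    return list(enumerate(counts, start=1))


def fit_power_law(curve):
    """Least-squares fit of log(count) against log(rank).

    Returns (exponent, k, r_squared) for the model count ~ k * rank**exponent.
    Needs at least two ranks.
    """
    xs = [math.log(rank) for rank, _ in curve]
    ys = [math.log(count) for _, count in curve]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, R^2 equals the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, math.exp(intercept), r_squared
```

Calling `fit_power_law(popularity_curve(urls))` on a list of requested file names gives the fitted exponent, the K factor and R² in one pass.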
4. Variability of Locality.
- In order to compare data from different caches reliably it is necessary to ensure that differences are real and not due to insufficiently large samples.
- This cache receives about 10000 requests per day from a research community.
- The least squares procedure can then be used to find the slope of the line with best fit.
- If the data shows long-range dependence, the sample size required to get a reliable estimate of the slope of the popularity curve will be considerably larger than might be expected for normal Poisson statistics.
- The exponent converges to a stable value for samples of 300 000 or more requests, for all the caches the authors have analysed.
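One way to check this kind of convergence is to refit the exponent on growing prefixes of a single log and see where the value settles. The sketch below assumes the log is already a list of requested file names in arrival order; the function names and the choice of prefix sizes are hypothetical.

```python
import math
from collections import Counter


def fitted_exponent(requests):
    """Slope of the least-squares line through (log rank, log count)."""
    counts = sorted(Counter(requests).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct files to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def exponent_by_sample_size(requests, sizes):
    """Fit the exponent on the first n requests for each n in sizes.

    Sizes larger than the log are skipped rather than truncated.
    """
    return [(n, fitted_exponent(requests[:n]))
            for n in sizes if n <= len(requests)]
```

Plotting the returned pairs shows whether the estimate has stabilised at the sample sizes available for a given cache.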
5. Analysis.
- The authors have been able to obtain samples in excess of 500000 file requests for 5 very different caches.
- The authors show in figures 7 and 8 the popularity curves for these caches, and the curves fitted to the data using the techniques outlined in section 3.
- In table 1 the authors show the estimated value of the exponent in the power law, together with the error interval and the confidence limit established by the R² test.
- FUNET and Spain are national caches; RMPLC and PISA are local caches serving very different communities.
- Error estimates were calculated using several methods; the ones shown were the largest calculated.
6. Discussion.
- The data in section 5 supported the notion that the variation in cache popularity curves is simply due to the hierarchical position of the cache.
- Figure 9 shows cache size plotted against exponent.
- It is hard to imagine a user community more different from the undergraduates, lecturers and researchers at Pisa university.
- The lack of significant differences between caches at similar apparent levels in the hierarchy means that client effects are not significant either.
- These models require an accurate description of real cache behavior so their performance can be accurately assessed.
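The comparison of cache size against exponent can be quantified with a correlation coefficient. The sketch below uses the Pearson coefficient as one possible choice; the source does not say which statistic, if any, the authors computed, and no values from the paper are reproduced here.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Passing the cache sizes as `xs` and the fitted exponents as `ys` gives a value near zero when the exponent does not depend on size; with only a handful of caches the result should be read with caution.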
7. Conclusion.
- The analysis of cache popularity curves requires careful definition of what is to be analysed and, since the data displays significant long-range dependence, very large sample sizes.
- Further data should be analysed to fully confirm the relative independence of the metric.
- The authors would like to thank Pekka Järveläinen for supplying them with anonymised logs for the Funet proxy cache, Simon Rainey, Javier Puche (Centro Superior de Investigaciones Cientificas) and Luigi Rizzo (Pisa).
- 'On the implications of Zipf's Law for web caching'. in 3W3Cache Workshop, Manchester, June 1998. [CUN95] C Cunha, A Bestavros, and M Crovella.
...We show that locality can be characterised with a single parameter, which primarily varies with the topological position of the cache, and is largely independent of the culture of the cache users....
Frequently Asked Questions (11)
Q2. How large are the samples obtained for the five caches?
The authors have been able to obtain samples in excess of 500000 file requests for 5 very different caches.
Q3. What is needed to compare data from different caches reliably?
In order to compare data from different caches reliably it is necessary to ensure that differences are real and not due to insufficiently large samples.
Q4. How can the authors fit an inverse power law curve to cache popularity curves?
With appropriate care it is possible to fit an inverse power law curve to cache popularity curves, with an exponent of between -0.9 and -0.5, and with a high degree of confidence.
Q5. How are cultural differences likely to appear in the popularity curve?
From consideration of the work of Zipf on word use in different cultures, it seems likely that cultural differences will often be expressed through differences in the K factor in the power curve rather than the exponent.
Q6. Is filtering the only factor affecting the exponent of the locality curve?
While filtering is one possible factor affecting the exponent of the locality curve, other factors possibly influence the exponent.
Q7. What does the analysis of cache popularity curves require?
The analysis of cache popularity curves requires careful definition of what is to be analysed and, since the data displays significant long-range dependence, very large sample sizes.
Q8. What is the significance of the power law curves?
The authors demonstrate for the first time in this paper that, with appropriate care in the analysis, it can be shown that whilst the power law curves are not strictly Zipf curves they are still culture independent.
Q9. How much did the fitted exponent vary over the six-month period?
Over these six months the fitted exponent ranged from -0.23 to -1.34 with a mean of -0.5958 and a variance of 0.03 (figure 4), using the 'averaged' ranking method mentioned above.
Q10. What does the exponent of the cache popularity curve depend on?
The exponent does not appear to depend on cache size, on time, or on the culture of the cache users, but only depends on the topological position of the cache in the network.
Q11. What information do cache logs contain?
Cache logs can be extremely comprehensive, detailing time of request, bytes transferred, file name and other useful metrics [e.g. ftp://ircache.nlanr.net/Traces/].