Proceedings ArticleDOI

Web caching and Zipf-like distributions: evidence and implications

21 Mar 1999 - Vol. 1, pp. 126-134
TL;DR: This paper investigates the page request distribution seen by Web proxy caches using traces from a variety of sources, and considers a simple model where Web accesses are independent and the reference probability of the documents follows a Zipf-like distribution, suggesting that the observed properties of hit-ratios and temporal locality are inherent to Web accesses seen by proxies.
Abstract: This paper addresses two unresolved issues about Web caching. The first issue is whether Web requests from a fixed user community are distributed according to Zipf's (1929) law. The second issue relates to a number of studies on the characteristics of Web proxy traces, which have shown that the hit-ratios and temporal locality of the traces exhibit certain asymptotic properties that are uniform across the different sets of traces. In particular, the question is whether these properties are inherent to Web accesses or whether they are simply an artifact of the traces. An answer to these unresolved issues will facilitate both Web cache resource planning and cache hierarchy design. We show that the answers to the two questions are related. We first investigate the page request distribution seen by Web proxy caches using traces from a variety of sources. We find that the distribution does not follow Zipf's law precisely, but instead follows a Zipf-like distribution with the exponent varying from trace to trace. Furthermore, we find that there is only (i) a weak correlation between the access frequency of a Web page and its size and (ii) a weak correlation between a page's access frequency and its rate of change. We then consider a simple model where the Web accesses are independent and the reference probability of the documents follows a Zipf-like distribution. We find that the model yields asymptotic behavior that is consistent with the experimental observations, suggesting that the various observed properties of hit-ratios and temporal locality are indeed inherent to Web accesses observed by proxies. Finally, we revisit Web cache replacement algorithms and show that the algorithm suggested by this simple model performs best on real trace data. The results indicate that while page requests do indeed reveal short-term correlations and other structures, a simple model for an independent request stream following a Zipf-like distribution is sufficient to capture certain asymptotic properties observed at Web proxies.
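To make the model concrete, here is a small Python sketch of an independent request stream with Zipf-like popularity p_i proportional to 1/i^alpha, and the hit ratio of a cache that pins the C most popular pages (the optimal policy when all pages have equal size). All parameter values are illustrative, not taken from the paper's traces.

# Independent-reference model with Zipf-like popularity: requests are
# drawn i.i.d. with p_i ~ 1/i^alpha; a cache holding the C most popular
# pages gives the best possible hit ratio for equal-size pages.
import random

def zipf_like_probs(n_pages, alpha):
    """Reference probabilities p_i proportional to 1/i^alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_hit_ratio(probs, cache_size):
    """Hit ratio of a cache that pins the cache_size most popular pages."""
    return sum(sorted(probs, reverse=True)[:cache_size])

def simulate_hit_ratio(probs, cache_size, n_requests, seed=0):
    """Empirical check: draw independent requests and count hits."""
    rng = random.Random(seed)
    hot = set(range(cache_size))  # pages 0..C-1 are the most popular
    draws = rng.choices(range(len(probs)), weights=probs, k=n_requests)
    return sum(1 for p in draws if p in hot) / n_requests

probs = zipf_like_probs(n_pages=10_000, alpha=0.8)
for c in (100, 1000, 5000):
    print(c, expected_hit_ratio(probs, c), simulate_hit_ratio(probs, c, 100_000))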


Citations
Journal ArticleDOI
TL;DR: A survey of the different security risks that pose a threat to the cloud is presented, and it is argued that a new model aimed at improving features of an existing model must not put other important features of the current model at risk.

2,511 citations

Journal ArticleDOI
TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).
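The core mechanism is easy to sketch: each proxy keeps a compact, periodically refreshed summary of every neighbour's cache directory and consults it before querying. Below is a minimal Bloom-filter version in Python; the hash construction and sizing here are illustrative assumptions, not the protocol's exact encoding (the paper reports directory representations as small as 8 bits per entry).

# Minimal "summary" sketch: a Bloom filter over cached URLs. A false
# positive costs one wasted query; false negatives cannot occur for
# URLs that were added (staleness between updates is a separate issue).
import hashlib

class BloomSummary:
    def __init__(self, n_bits=8 * 1024, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # Check the summary before sending an intercache query.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

summary = BloomSummary()
summary.add("http://example.com/a.html")
print(summary.might_contain("http://example.com/a.html"))  # True
print(summary.might_contain("http://example.com/b.html"))  # almost surely False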

2,174 citations


Cites background from "Web caching and Zipf-like distribut..."

  • ...In the benchmark, each client issues requests following the temporal locality patterns observed in [38], [10], [8], and the inherent cache hit ratio in the request stream can be adjusted....


Journal ArticleDOI
TL;DR: A survey and comparison of various Structured and Unstructured P2P overlay networks is presented; the various schemes are categorized into these two groups in the design spectrum, and the application-level network performance of each group is discussed.
Abstract: Over the Internet today, computing and communications environments are significantly more complex and chaotic than classical distributed systems, lacking any centralized organization or hierarchical control. There has been much interest in emerging Peer-to-Peer (P2P) network overlays because they provide a good substrate for creating large-scale data sharing, content distribution, and application-level multicast applications. These P2P overlay networks attempt to provide a long list of features, such as: selection of nearby peers, redundant storage, efficient search/location of data items, data permanence or guarantees, hierarchical naming, trust and authentication, and anonymity. P2P networks potentially offer an efficient routing architecture that is self-organizing, massively scalable, and robust in the wide-area, combining fault tolerance, load balancing, and explicit notion of locality. In this article we present a survey and comparison of various Structured and Unstructured P2P overlay networks. We categorize the various schemes into these two groups in the design spectrum, and discuss the application-level network performance of each group.

1,638 citations

Journal ArticleDOI
TL;DR: This work shows that the uncoded optimum file assignment is NP-hard, develops a greedy strategy that is provably within a factor 2 of the optimum, and provides an efficient algorithm achieving a provably better approximation ratio of 1 - (1 - 1/d)^d, where d is the maximum number of helpers a user can be connected to.
Abstract: Video on-demand streaming from Internet-based servers is becoming one of the most important services offered by wireless networks today. In order to improve the area spectral efficiency of video transmission in cellular systems, small cells heterogeneous architectures (e.g., femtocells, WiFi off-loading) are being proposed, such that video traffic to nomadic users can be handled by short-range links to the nearest small cell access points (referred to as "helpers"). As the helper deployment density increases, the backhaul capacity becomes the system bottleneck. In order to alleviate such bottleneck we propose a system where helpers with low-rate backhaul but high storage capacity cache popular video files. Files not available from helpers are transmitted by the cellular base station. We analyze the optimum way of assigning files to the helpers, in order to minimize the expected downloading time for files. We distinguish between the uncoded case (where only complete files are stored) and the coded case, where segments of Fountain-encoded versions of the video files are stored at helpers. We show that the uncoded optimum file assignment is NP-hard, and develop a greedy strategy that is provably within a factor 2 of the optimum. Further, for a special case we provide an efficient algorithm achieving a provably better approximation ratio of 1 - (1 - 1/d)^d, where d is the maximum number of helpers a user can be connected to. We also show that the coded optimum cache assignment is a convex problem that can be further reduced to a linear program. We present numerical results comparing the proposed schemes.
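A hedged Python sketch of a greedy uncoded placement in this spirit: repeatedly add the single (helper, file) copy that most increases the chance a user finds a requested file at an in-range helper. The hit-probability objective and the tiny instance are illustrative assumptions, not the paper's exact expected-delay formulation.

# Greedy file placement sketch: at each step, store the copy with the
# largest marginal gain in average hit probability, until no placement
# with positive gain fits within helper storage capacities.
from itertools import product

def greedy_placement(helpers, capacity, users, popularity):
    """helpers: list of helper ids; users: {user: set of in-range helpers};
    popularity: {file: request probability}. Returns {helper: set(files)}."""
    placed = {h: set() for h in helpers}

    def hit_prob(placement):
        # A user's request hits if any in-range helper stores the file.
        total = 0.0
        for user, in_range in users.items():
            for f, p in popularity.items():
                if any(f in placement[h] for h in in_range):
                    total += p
        return total / len(users)

    while True:
        best, best_gain = None, 0.0
        base = hit_prob(placed)
        for h, f in product(helpers, popularity):
            if len(placed[h]) < capacity and f not in placed[h]:
                placed[h].add(f)
                gain = hit_prob(placed) - base
                placed[h].remove(f)
                if gain > best_gain:
                    best, best_gain = (h, f), gain
        if best is None:
            return placed
        placed[best[0]].add(best[1])

helpers = ["h1", "h2"]
users = {"u1": {"h1"}, "u2": {"h1", "h2"}}
popularity = {"a": 0.5, "b": 0.3, "c": 0.2}
print(greedy_placement(helpers, 1, users, popularity))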

1,331 citations

Proceedings ArticleDOI
21 Oct 2001
TL;DR: PAST, as presented in this paper, is a large-scale P2P persistent storage utility based on a self-organizing, Internet-based overlay network of storage nodes that cooperatively route file queries, store multiple replicas of files, and cache additional copies of popular files.
Abstract: This paper presents and evaluates the storage management and caching in PAST, a large-scale peer-to-peer persistent storage utility. PAST is based on a self-organizing, Internet-based overlay network of storage nodes that cooperatively route file queries, store multiple replicas of files, and cache additional copies of popular files. In the PAST system, storage nodes and files are each assigned uniformly distributed identifiers, and replicas of a file are stored at nodes whose identifier matches most closely the file's identifier. This statistical assignment of files to storage nodes approximately balances the number of files stored on each node. However, non-uniform storage node capacities and file sizes require more explicit storage load balancing to permit graceful behavior under high global storage utilization; likewise, non-uniform popularity of files requires caching to minimize fetch distance and to balance the query load. We present and evaluate PAST, with an emphasis on its storage management and caching system. Extensive trace-driven experiments show that the system minimizes fetch distance, that it balances the query load for popular files, and that it displays graceful degradation of performance as the global storage utilization increases beyond 95%.
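The assignment rule is simple to illustrate: hash node names and file names into one identifier space and store a file's k replicas on the k nodes with the numerically closest identifiers. The Python sketch below uses a simplified 32-bit circular id space as an assumption for illustration; PAST itself builds on Pastry's routing with much larger identifiers.

# Closest-identifier replica assignment: because identifiers are
# (approximately) uniform, files spread statistically evenly over nodes.
import hashlib

ID_BITS = 32

def ident(name):
    """Uniformly distributed identifier derived from a name."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "big")

def replica_nodes(file_name, node_names, k=3):
    fid = ident(file_name)
    ring = 1 << ID_BITS
    def circular_distance(node):
        d = abs(ident(node) - fid)
        return min(d, ring - d)  # distance on the circular id space
    return sorted(node_names, key=circular_distance)[:k]

nodes = [f"node{i}" for i in range(20)]
print(replica_nodes("movies/alpha.mpg", nodes))  # the 3 closest nodes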

1,298 citations

References
Proceedings Article
08 Dec 1997
TL;DR: GreedyDual-Size, as discussed by the authors, incorporates locality with cost and size concerns in a simple and non-parameterized fashion for high performance, and can potentially improve the performance of main-memory caching of Web documents.
Abstract: Web caches can not only reduce network traffic and downloading latency, but can also affect the distribution of web traffic over the network through cost-aware caching. This paper introduces GreedyDual-Size, which incorporates locality with cost and size concerns in a simple and non-parameterized fashion for high performance. Trace-driven simulations show that with the appropriate cost definition, GreedyDual-Size outperforms existing web cache replacement algorithms in many aspects, including hit ratios, latency reduction and network cost reduction. In addition, GreedyDual-Size can potentially improve the performance of main-memory caching of Web documents.
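For reference, the GreedyDual-Size rule assigns each cached document p a value H(p) = L + cost(p)/size(p), evicts the document with the smallest H, and raises the inflation value L to the evicted document's H; a hit restores H(p). A compact Python sketch follows; the cost function is whatever definition the caller supplies (e.g. 1 per document for hit ratio, or bytes for network cost).

# GreedyDual-Size: small documents and costly documents get higher H,
# and the inflation value L ages out documents that have not been
# re-referenced since older evictions.
class GreedyDualSize:
    def __init__(self, capacity):
        self.capacity = capacity   # total bytes available
        self.used = 0
        self.L = 0.0
        self.H = {}                # doc -> current value
        self.size = {}             # doc -> bytes

    def access(self, doc, size, cost):
        if doc in self.H:
            self.H[doc] = self.L + cost / size   # refresh value on a hit
            return True
        while self.used + size > self.capacity and self.H:
            victim = min(self.H, key=self.H.get)
            self.L = self.H.pop(victim)          # inflate L to victim's H
            self.used -= self.size.pop(victim)
        if size <= self.capacity:
            self.H[doc] = self.L + cost / size
            self.size[doc] = size
            self.used += size
        return False

cache = GreedyDualSize(capacity=100)
cache.access("a", size=40, cost=1.0)
cache.access("b", size=80, cost=1.0)   # evicts "a" (lowest H)
print(list(cache.H))                   # ['b']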

1,048 citations

Book
01 Oct 1973
Operating Systems Theory — the text cited in this paper for the independent reference model of program behavior and for the result that, when all pages have the same size, the optimal replacement policy is to keep the pages with the highest reference probabilities in the cache.

670 citations


"Web caching and Zipf-like distribut..." refers background in this paper

  • ...model [GCD73] in the early operating system paging studies [Den80]....


  • ...pages have the same size, then the optimal replacement algorithm is to keep those pages with the highest probabilities in the cache [GCD73]....


01 Apr 1995
TL;DR: This paper presents a descriptive statistical summary of traces of actual executions of NCSA Mosaic, and shows that many characteristics of WWW use can be modelled using power-law distributions, including the distribution of document sizes, the popularity of documents as a function of size, and the distribution of user requests for documents.
Abstract: The explosion of WWW traffic necessitates an accurate picture of WWW use, and in particular requires a good understanding of client requests for WWW documents. To address this need, we have collected traces of actual executions of NCSA Mosaic, reflecting over half a million user requests for WWW documents. In this paper we present a descriptive statistical summary of the traces we collected, which identifies a number of trends and reference patterns in WWW use. In particular, we show that many characteristics of WWW use can be modelled using power-law distributions, including the distribution of document sizes, the popularity of documents as a function of size, the distribution of user requests for documents, and the number of references to documents as a function of their overall rank in popularity (Zipf's law). In addition, we show how the power-law distributions derived from our traces can be used to guide system designers interested in caching WWW documents. --- Our client-based traces are available via FTP from http://www.cs.bu.edu/techreports/1995-010-www-client-traces.tar.gz http://www.cs.bu.edu/techreports/1995-010-www-client-traces.a.tar.gz
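A standard way to check such a Zipf-like claim on one's own trace is to rank documents by request count and fit log(count) against log(rank); the negated slope estimates the exponent. A pure-Python illustration on synthetic counts (the trace here is fabricated for demonstration, not from any of the cited datasets):

# Estimate the Zipf exponent alpha by least squares on the log-log
# rank/frequency plot of a request trace.
import math
from collections import Counter

def fit_zipf_alpha(requests):
    counts = sorted(Counter(requests).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return -slope  # alpha estimate

# Synthetic trace: document i requested roughly 1000/i times.
trace = [doc for doc in range(1, 200) for _ in range(1000 // doc)]
print(fit_zipf_alpha(trace))  # close to 1.0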

624 citations


"Web caching and Zipf-like distribut..." refers background or methods or result in this paper

  • ...Many web caching studies reach this conclusion [1], [9], [4], [21], [10], [19], [5], [7]....


  • ...[5] gathered 500,000 web accesses from the Computer Science Department at Boston University and observed that the requests follow an Ω/i^α distribution where α = 0.986, which is quite close to the true Zipf's law....


  • ...Zipf's law to the distribution of web requests [5], [1]....


  • ...Several early studies have supported this claim [9], [5], [1] while other recent studies have suggested otherwise [16], [2]....


Proceedings ArticleDOI
01 Dec 1996
TL;DR: The authors propose models for both temporal and spatial locality of reference in streams of requests arriving at Web servers, show that temporal locality can be characterized by the marginal distribution of the stack distance trace, propose models for typical distributions, and compare their cache performance to the traces.
Abstract: The authors propose models for both temporal and spatial locality of reference in streams of requests arriving at Web servers. They show that simple models based on document popularity alone are insufficient for capturing either temporal or spatial locality. Instead, they rely on an equivalent, but numerical, representation of a reference stream: a stack distance trace. They show that temporal locality can be characterized by the marginal distribution of the stack distance trace, and propose models for typical distributions and compare their cache performance to the traces. They also show that spatial locality in a reference stream can be characterized using the notion of self-similarity. Self-similarity describes long-range correlations in the data set, which is a property that previous researchers have found hard to incorporate into synthetic reference strings. They show that stack distance strings appear to be strongly self-similar, and provide measurements of the degree of self-similarity in the traces. Finally, they discuss methods for generating synthetic Web traces that exhibit the properties of temporal and spatial locality measured in the data.
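A stack distance trace is straightforward to compute: for each reference, count the distinct documents seen since the previous reference to the same document (first touches have infinite distance). A simple quadratic-time Python sketch; production tools use tree-based structures for large traces.

# Stack distances: the marginal distribution of this sequence is what
# the paper uses to characterize temporal locality.
def stack_distances(trace):
    last_seen = {}                  # doc -> index of previous reference
    distances = []
    for i, doc in enumerate(trace):
        if doc in last_seen:
            # distinct documents referenced strictly between the two uses
            distances.append(len(set(trace[last_seen[doc] + 1:i])))
        else:
            distances.append(None)  # cold miss: infinite stack distance
        last_seen[doc] = i
    return distances

print(stack_distances(["a", "b", "c", "a", "b", "b"]))
# [None, None, None, 2, 2, 0]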

529 citations

Journal ArticleDOI
TL;DR: This paper outlines the argument why it is unlikely that anyone will find a cheaper nonlookahead memory policy that delivers significantly better performance and suggests that a working set dispatcher should be considered.
Abstract: A program's working set is the collection of segments (or pages) recently referenced. This concept has led to efficient methods for measuring a program's intrinsic memory demand; it has assisted in understanding and in modeling program behavior; and it has been used as the basis of optimal multiprogrammed memory management. The total cost of a working set dispatcher is no larger than the total cost of other common dispatchers. This paper outlines the argument why it is unlikely that anyone will find a cheaper nonlookahead memory policy that delivers significantly better performance.
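Denning's working set W(t, tau) is the set of distinct pages referenced in the window of the last tau references up to time t; its size is the measured memory demand. A direct Python sketch of that measurement:

# Working-set sizes over a reference string: |W(t, tau)| for each t.
def working_set_sizes(trace, tau):
    return [len(set(trace[max(0, t - tau + 1):t + 1]))
            for t in range(len(trace))]

trace = ["a", "b", "a", "c", "a", "a", "d"]
print(working_set_sizes(trace, tau=3))  # [1, 2, 2, 3, 2, 2, 2]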

405 citations