
Showing papers on "Cache algorithms published in 2002"


Proceedings ArticleDOI
01 Oct 2002
TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.
Abstract: Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.

799 citations
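
A toy model can make the migration policy concrete: on each hit, a line is promoted one bank closer to the processor, so frequently used data ends up in the fastest banks. The bank latencies and the one-line-per-bank set below are illustrative assumptions, not the configurations evaluated in the paper.

```python
# Toy dynamic-NUCA set: a hit promotes the line one bank closer to the
# processor, so hot lines migrate into the low-latency banks over time.
BANK_LATENCY = [4, 8, 12, 16]          # cycles, nearest bank first (assumed values)

class DynamicNUCASet:
    def __init__(self, num_banks=4):
        self.banks = [None] * num_banks  # one resident line per bank (toy model)

    def access(self, tag):
        for i, resident in enumerate(self.banks):
            if resident == tag:
                latency = BANK_LATENCY[i]
                if i > 0:                # promote toward the processor
                    self.banks[i], self.banks[i - 1] = self.banks[i - 1], self.banks[i]
                return latency
        self.banks[-1] = tag             # miss: fill the farthest bank
        return None

s = DynamicNUCASet()
for _ in range(3):
    s.access("A")
print(s.banks)   # "A" has migrated toward the fast end of the set
```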


Journal ArticleDOI
TL;DR: Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.
Abstract: This paper aims at finding fundamental design principles for hierarchical Web caching. An analytical modeling technique is developed to characterize an uncooperative two-level hierarchical caching system where the least recently used (LRU) algorithm is locally run at each cache. With this modeling technique, we are able to identify a characteristic time for each cache, which plays a fundamental role in understanding the caching processes. In particular, a cache can be viewed roughly as a low-pass filter with its cutoff frequency equal to the inverse of the characteristic time. Documents with access frequencies lower than this cutoff frequency have good chances to pass through the cache without cache hits. This viewpoint enables us to take any branch of the cache tree as a tandem of low-pass filters at different cutoff frequencies, which further results in the finding of two fundamental design principles. Finally, to demonstrate how to use the principles to guide the caching algorithm design, we propose a cooperative hierarchical Web caching architecture based on these principles. Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.

512 citations
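
The characteristic-time view can be illustrated with the standard fixed-point approximation for an LRU cache under independent requests: the characteristic time T solves sum_i (1 - exp(-lambda_i * T)) = C, and 1/T acts as the cache's cutoff frequency. The sketch below is that generic approximation with assumed Zipf-like request rates, not the paper's exact model.

```python
# Solve sum_i (1 - exp(-lambda_i * T)) = C for the characteristic time T
# by bisection, then read 1/T as the LRU cache's "cutoff frequency".
import math

def characteristic_time(rates, cache_size, tol=1e-9):
    def filled(T):
        return sum(1.0 - math.exp(-lam * T) for lam in rates)
    lo, hi = 0.0, 1.0
    while filled(hi) < cache_size:       # grow the bracket until it covers C
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if filled(mid) < cache_size:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Zipf-like popularity over 10,000 documents, cache of 1,000 entries (assumed numbers)
rates = [1.0 / (i + 1) for i in range(10_000)]
T = characteristic_time(rates, 1_000)
print("characteristic time:", T, "cutoff frequency:", 1.0 / T)
# Documents requested much less often than 1/T are usually evicted before
# they are requested again, i.e. they fail to "pass" the low-pass filter.
```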


Proceedings ArticleDOI
21 Jul 2002
TL;DR: This paper proposes and evaluates decentralized web caching algorithms for Squirrel, and discovers that it exhibits performance comparable to a centralized web cache in terms of hit ratio, bandwidth usage and latency.
Abstract: This paper presents a decentralized, peer-to-peer web cache called Squirrel. The key idea is to enable web browsers on desktop machines to share their local caches, to form an efficient and scalable web cache, without the need for dedicated hardware and the associated administrative cost. We propose and evaluate decentralized web caching algorithms for Squirrel, and discover that it exhibits performance comparable to a centralized web cache in terms of hit ratio, bandwidth usage and latency. It also achieves the benefits of decentralization, such as being scalable, self-organizing and resilient to node failures, while imposing low overhead on the participating nodes.

429 citations
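
Squirrel's central mechanism is mapping each URL to a "home" peer whose browser cache serves the shared copy of that URL. The sketch below uses a plain hash ring in place of the Pastry overlay the paper builds on, and the peer names are made up for illustration.

```python
# Minimal home-node directory: hash each URL onto a ring of desktop peers;
# the peer owning that point of the ring caches the object for everyone.
import hashlib
from bisect import bisect

class HomeNodeDirectory:
    def __init__(self, peers):
        self.ring = sorted((self._h(p), p) for p in peers)

    @staticmethod
    def _h(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def home_node(self, url):
        keys = [k for k, _ in self.ring]
        idx = bisect(keys, self._h(url)) % len(self.ring)
        return self.ring[idx][1]

peers = [f"desktop-{i}" for i in range(8)]
d = HomeNodeDirectory(peers)
print(d.home_node("http://example.com/index.html"))  # peer that caches this URL
```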


Proceedings ArticleDOI
02 Feb 2002
TL;DR: A scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy is described, which can be used to schedule jobs or to partition the cache to minimize the overall miss-rate.
Abstract: We propose a low overhead, online memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to improve scheduling and partitioning schemes.

325 citations
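
Under LRU, per-stack-position hit counters are exactly the marginal gain of adding one more cache entry, which is how the counters yield miss rate as a function of cache size. The sketch below computes the same quantity in software with a full LRU stack; the paper obtains it with a small set of hardware counters.

```python
# Marginal-gain counters via LRU stack distances: counters[d] counts hits at
# stack depth d, i.e. the extra hits gained by growing the cache from d to d+1.
def marginal_gain_counters(trace, max_size):
    stack, counters = [], [0] * max_size
    misses_beyond = 0
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)          # stack distance (0 = MRU)
            if d < max_size:
                counters[d] += 1
            stack.remove(addr)
        else:
            misses_beyond += 1
        stack.insert(0, addr)              # move/insert at the MRU position
    return counters, misses_beyond

trace = ["a", "b", "c", "a", "b", "d", "a", "c"]
counters, cold = marginal_gain_counters(trace, max_size=4)
total = len(trace)
for size in range(1, 5):
    hits = sum(counters[:size])
    print(f"cache size {size}: miss rate {(total - hits) / total:.2f}")
```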


ReportDOI
10 Jun 2002
TL;DR: In this article, the authors explore the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both.
Abstract: Modern high-end disk arrays often have several gigabytes of cache RAM. Unfortunately, most array caches use management policies which duplicate the same data blocks at both the client and array levels of the cache hierarchy: they are inclusive. Thus, the aggregate cache behaves as if it was only as big as the larger of the client and array caches, instead of as large as the sum of the two. Inclusiveness is wasteful: cache RAM is expensive. We explore the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both. Exclusiveness helps to create the effect of a single, large unified cache. We introduce a DEMOTE operation to transfer data ejected from the client to the array, and explore its effectiveness with simulation studies. We quantify the benefits and overheads of demotions across both synthetic and real-life workloads. The results show that we can obtain useful—sometimes substantial—speedups. During our investigation, we also developed some new cache-insertion algorithms that show promise for multiclient systems, and report on some of their properties.

285 citations
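
The DEMOTE idea can be sketched directly: when the client cache evicts a block it demotes it to the array cache instead of silently dropping it, and a block read from the array into the client is removed from the array, keeping the two levels close to exclusive. Cache sizes and the insertion policy below are simplifying assumptions.

```python
# Two-level exclusive caching with DEMOTE: client evictions are sent to the
# array cache, and array hits move the block up to the client.
from collections import OrderedDict

class LRUCache:
    def __init__(self, size):
        self.size, self.data = size, OrderedDict()
    def touch(self, k):
        self.data.move_to_end(k)
    def insert(self, k):
        """Insert k at the MRU end; return the evicted key, if any."""
        self.data[k] = True
        self.data.move_to_end(k)
        if len(self.data) > self.size:
            return self.data.popitem(last=False)[0]
        return None

client, array = LRUCache(4), LRUCache(4)

def read(block):
    if block in client.data:
        client.touch(block); return "client hit"
    if block in array.data:
        del array.data[block]               # keep the caches exclusive
        hit = "array hit"
    else:
        hit = "disk read"
    evicted = client.insert(block)
    if evicted is not None:
        array.insert(evicted)               # DEMOTE instead of discarding
    return hit

for b in [1, 2, 3, 4, 5, 1]:
    print(b, read(b))
```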


Posted Content
TL;DR: In this article, the idea of cache memory being used as a side-channel which leaks information during the run of a cryptographic algorithm has been investigated, and it has been shown that an attacker may be able to reveal or narrow the possible values of secret information held on the target device.
Abstract: We expand on the idea, proposed by Kelsey et al., of cache memory being used as a side-channel which leaks information during the run of a cryptographic algorithm. By using this side-channel, an attacker may be able to reveal or narrow the possible values of secret information held on the target device. We describe an attack which encrypts 2 chosen plaintexts on the target processor in order to collect cache profiles and then performs around 2 computational steps to recover the key. As well as describing and simulating the theoretical attack, we discuss how hardware and algorithmic alterations can be used to defend against such techniques.

260 citations


Proceedings ArticleDOI
07 Nov 2002
TL;DR: This work proposes an enhanced-clustering cache replacement scheme for use in place of LRU, which improved the request hit ratio dramatically while keeping the small average hops per successful request comparable to LRU.
Abstract: Efficient data retrieval in a peer-to-peer system like Freenet is a challenging problem. We study the impact of cache replacement policy on the performance of Freenet. We find that, with Freenet's LRU (least recently used) cache replacement, there is a steep reduction in the hit ratio with increasing load. Based on intuition from the small-world models and the recent theoretical results by Kleinberg, we propose an enhanced-clustering cache replacement scheme for use in place of LRU. Such a replacement scheme forces the routing tables to resemble neighbor relationships in a small-world acquaintance graph - clustering with light randomness. In our simulation, this new scheme improved the request hit ratio dramatically while keeping the small average hops per successful request comparable to LRU. A simple, highly idealized model of Freenet under clustering with light randomness proves that the expected message delivery time in Freenet is O(log² n) if the routing tables satisfy the small-world model and have size Θ(log² n).

183 citations
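
One way to picture an enhanced-clustering replacement policy: on eviction, a node discards the cached key farthest from its own point in the key space, but with a small probability it evicts a random key instead ("light randomness"). The distance metric, the 10% randomness, and the key space below are illustrative assumptions, not the paper's exact scheme.

```python
# Clustering-with-light-randomness replacement: usually evict the key
# farthest from this node's "specialization", occasionally evict at random.
import random

KEY_SPACE = 2 ** 16

def ring_distance(a, b):
    d = abs(a - b) % KEY_SPACE
    return min(d, KEY_SPACE - d)

class ClusteringCache:
    def __init__(self, node_id, capacity, random_fraction=0.1):
        self.node_id, self.capacity = node_id, capacity
        self.random_fraction = random_fraction
        self.store = {}

    def insert(self, key, value):
        if len(self.store) >= self.capacity and key not in self.store:
            if random.random() < self.random_fraction:
                victim = random.choice(list(self.store))
            else:
                victim = max(self.store,
                             key=lambda k: ring_distance(k, self.node_id))
            del self.store[victim]
        self.store[key] = value

cache = ClusteringCache(node_id=1000, capacity=3)
for k in [100, 60000, 1200, 900, 32000]:
    cache.insert(k, "data")
print(sorted(cache.store))   # cached keys drift toward the node's region of the key space
```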


Journal ArticleDOI
TL;DR: A new performance criterion is introduced, called caching efficiency, and a generic method for location-dependent cache invalidation strategies is proposed, and two cache replacement policies, PA and PAID, are proposed.
Abstract: Mobile location-dependent information services (LDISs) have become increasingly popular in recent years. However, data caching strategies for LDISs have thus far received little attention. In this paper, we study the issues of cache invalidation and cache replacement for location-dependent data under a geometric location model. We introduce a new performance criterion, called caching efficiency, and propose a generic method for location-dependent cache invalidation strategies. In addition, two cache replacement policies, PA and PAID, are proposed. Unlike the conventional replacement policies, PA and PAID take into consideration the valid scope area of a data value. We conduct a series of simulation experiments to study the performance of the proposed caching schemes. The experimental results show that the proposed location-dependent invalidation scheme is very effective and the PA and PAID policies significantly outperform the conventional replacement policies.

172 citations
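
A replacement policy that accounts for the valid scope of location-dependent data can be sketched as a scoring function over cached items: each item is valued by its access probability and the area of its valid scope, and a distance-aware variant discounts items whose scope is far from the client. The exact PA and PAID cost functions are defined in the paper; the formulas below are simplified assumptions used only to show the shape of such a policy.

```python
# Evict the item with the lowest location-aware value. score_pa weighs access
# probability and valid-scope area; score_paid additionally discounts items
# whose valid scope is far from the client (illustrative formulas).
def score_pa(access_prob, valid_area):
    return access_prob * valid_area

def score_paid(access_prob, valid_area, distance_to_scope):
    return access_prob * valid_area / max(distance_to_scope, 1e-6)

def choose_victim(items, use_distance):
    # items: dict id -> (access_prob, valid_area, distance_to_scope)
    def score(entry):
        p, a, d = entry
        return score_paid(p, a, d) if use_distance else score_pa(p, a)
    return min(items, key=lambda k: score(items[k]))   # lowest-valued item is evicted

cache = {
    "restaurant_list": (0.40, 4.0, 5.0),
    "traffic_report":  (0.35, 1.0, 0.1),
    "weather":         (0.25, 9.0, 0.2),
}
print(choose_victim(cache, use_distance=False))  # PA-style choice
print(choose_victim(cache, use_distance=True))   # PAID-style choice
```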


Proceedings ArticleDOI
18 Nov 2002
TL;DR: The architectural control mechanism of the drowsy cache is extended to reduce leakage power consumption of instruction caches without significant impact on execution time, and the results show that data and instruction caches require different control strategies for efficient execution.
Abstract: On-chip caches represent a sizeable fraction of the total power consumption of microprocessors. Although large caches can significantly improve performance, they have the potential to increase power consumption. As feature sizes shrink, the dominant component of this power loss will be leakage. In our previous work we have shown how the drowsy circuit - a simple, state-preserving, low-leakage circuit that relies on voltage scaling for leakage reduction - can be used to reduce the total energy consumption of data caches by more than 50%. In this paper, we extend the architectural control mechanism of the drowsy cache to reduce leakage power consumption of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. To enable drowsy instruction caches, we propose a technique called cache sub-bank prediction which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low leakage drowsy mode. This prediction technique reduces the negative performance impact by 76% compared to the no-prediction policy. Our technique works well even with small predictor sizes and enables an 86% reduction of leakage energy in a 64 K byte instruction cache.

170 citations


Proceedings ArticleDOI
03 Dec 2002
TL;DR: This paper proposes two low-complexity algorithms for selecting the contents of statically-locked caches and evaluates their performances and compares them with those of a state of the art static cache analysis method.
Abstract: Cache memories have been extensively used to bridge the gap between high speed processors and relatively slow main memories. However, they are a source of predictability problems because of their dynamic and adaptive behavior, and thus need special attention to be used in hard real-time systems. A lot of progress has been achieved in the last ten years to statically predict the worst-case behavior of applications with respect to caches in order to determine safe and precise bounds on task worst-case execution times (WCETs) and cache-related preemption delays. An alternative approach to cope with caches in real-time systems is to statically lock their contents such that memory access times and cache-related preemption times are predictable. In this paper, we propose two low-complexity algorithms for selecting the contents of statically-locked caches. We evaluate their performances and compare them with those of a state of the art static cache analysis method.

168 citations
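
A generic content-selection heuristic in the spirit of such low-complexity algorithms: rank memory blocks by how often a profiling run accesses them and greedily lock the hottest blocks, respecting each cache set's associativity. This is a sketch of the general idea, not either of the paper's two specific algorithms.

```python
# Greedy selection of blocks to lock: most-frequently-accessed first, limited
# by the number of ways available in each cache set.
def select_locked_contents(block_access_counts, num_sets, assoc):
    # block_access_counts: {block_number: access_count} from a profiling run
    per_set_free = [assoc] * num_sets
    locked = []
    for block, _count in sorted(block_access_counts.items(),
                                key=lambda kv: kv[1], reverse=True):
        s = block % num_sets                 # set index by simple modulo mapping
        if per_set_free[s] > 0:
            per_set_free[s] -= 1
            locked.append(block)
    return locked

profile = {17: 500, 4: 480, 9: 450, 12: 90, 3: 300}
print(select_locked_contents(profile, num_sets=4, assoc=1))  # [17, 4, 3]
```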


Journal ArticleDOI
01 May 2002
TL;DR: The extent to which detailed timing characteristics of past memory reference events are strongly predictive of future program reference behavior is shown, and a family of time-keeping techniques that optimize behavior based on observations about particular cache time durations, such as the cache access interval or the cache dead time are proposed.
Abstract: Techniques for analyzing and improving memory referencing behavior continue to be important for achieving good overall program performance due to the ever-increasing performance gap between processors and main memory. This paper offers a fresh perspective on the problem of predicting and optimizing memory behavior. Namely, we show quantitatively the extent to which detailed timing characteristics of past memory reference events are strongly predictive of future program reference behavior. We propose a family of time-keeping techniques that optimize behavior based on observations about particular cache time durations, such as the cache access interval or the cache dead time. Timekeeping techniques can be used to build small, simple, and high-accuracy (often 90% or more) predictors for identifying conflict misses, for predicting dead blocks, and even for estimating the time at which the next reference to a cache frame will occur and the address that will be accessed. Based on these predictors, we demonstrate two new and complementary time-based hardware structures: (1) a time-based victim cache that improves performance by only storing conflict miss lines with likely reuse, and (2) a time-based prefetching technique that hones in on the right address to prefetch, and the right time to schedule the prefetch. Our victim cache technique improves performance over previous proposals by better selections of what to place in the victim cache. Our prefetching technique outperforms similar prior hardware prefetching proposals, despite being orders of magnitude smaller. Overall, these techniques improve performance by more than 11% across the SPEC2000 benchmark suite.
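
The timekeeping intuition behind dead-block prediction can be sketched simply: if a block has gone much longer without a reference than the reuse intervals it exhibited while live, it is probably dead. The factor of 2 and the single-interval history below are illustrative assumptions, not the paper's tuned predictor.

```python
# Time-based dead-block predictor: a block is predicted dead once its idle
# time exceeds a multiple of its last observed reuse interval.
class DeadBlockTimer:
    def __init__(self, dead_factor=2.0):
        self.dead_factor = dead_factor
        self.last_access = {}     # block -> cycle of last access
        self.live_interval = {}   # block -> last observed reuse interval

    def access(self, block, now):
        if block in self.last_access:
            self.live_interval[block] = now - self.last_access[block]
        self.last_access[block] = now

    def predicted_dead(self, block, now):
        interval = self.live_interval.get(block)
        if interval is None:
            return False          # no history yet, assume live
        return (now - self.last_access[block]) > self.dead_factor * interval

t = DeadBlockTimer()
t.access("A", 100); t.access("A", 140)      # observed reuse interval: 40 cycles
print(t.predicted_dead("A", 160))           # False: only 20 idle cycles
print(t.predicted_dead("A", 300))           # True: 160 idle cycles > 2 * 40
```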

Proceedings ArticleDOI
07 Nov 2002
TL;DR: The problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays is addressed.
Abstract: In this paper, we address the problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays. We develop a technique to analytically determine the optimal proxy prefix cache allocation to the videos that minimizes the aggregate network bandwidth cost. We integrate proxy caching with traditional server-based reactive transmission schemes such as batching, patching and stream merging to develop a set of proxy-assisted delivery schemes. We quantitatively explore the impact of the choice of transmission scheme, cache allocation policy, proxy cache size, and availability of unicast versus multicast capability, on the resultant transmission cost. Our evaluations show that even a relatively small prefix cache (10%-20% of the video repository) is sufficient to realize substantial savings in transmission cost. We find that carefully designed proxy-assisted reactive transmission schemes can produce significant cost savings even in predominantly unicast environments such as the Internet.

Proceedings ArticleDOI
02 Feb 2002
TL;DR: A hybrid selective-sets-and-ways cache organization is proposed that always offers equal or better resizing granularity than both of previously proposed organizations, and the energy savings from resizing d-cache and i-cache together are investigated.
Abstract: Cache memories account for a significant fraction of a chip's overall energy dissipation. Recent research advocates using "resizable" caches to exploit cache requirement variability in applications to reduce cache size and eliminate energy dissipation in the cache's unused sections with minimal impact on performance. Current proposals for resizable caches fundamentally vary in two design aspects: (1) cache organization, where one organization, referred to as selective-ways, varies the cache's set-associativity, while the other, referred to as selective-sets, varies the number of cache sets, and (2) resizing strategy, where one proposal statically sets the cache size prior to an application's execution, while the other allows for dynamic resizing both within and across applications. In this paper, we compare and contrast, for the first time, the proposed design choices for resizable caches, and evaluate the effectiveness of cache resizings in reducing the overall energy-delay in deep-submicron processors. In addition, we propose a hybrid selective-sets-and-ways cache organization that always offers equal or better resizing granularity than both of the previously proposed organizations. We also investigate the energy savings from resizing d-cache and i-cache together to characterize the interaction between d-cache and i-cache resizings.

Patent
27 Dec 2002
TL;DR: A log-structured write cache for a data storage system, and a method for improving the performance of the storage system, are described; the write cache includes cache lines where write data is temporarily accumulated in a non-volatile state so that it can be sequentially written to the target storage locations at a later time.
Abstract: A log-structured write cache for a data storage system and method for improving the performance of the storage system are described. The system might be a RAID storage array, a disk drive, an optical disk, or a tape storage system. The write cache is preferably implemented in the main storage medium of the system, but can also be provided in other storage components of the system. The write cache includes cache lines where write data is temporarily accumulated in a non-volatile state so that it can be sequentially written to the target storage locations at a later time, thereby improving the overall performance of the system. Meta-data for each cache line is also maintained in the write cache. The meta-data includes the target sector address for each sector in the line and a sequence number that indicates the order in which data is posted to the cache lines. A buffer table entry is provided for each cache line. A hash table is used to search the buffer table for a sector address that is needed at each data read and write operation.
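
The lookup structure the patent describes can be sketched as follows: writes are appended to the current cache line, each line's meta-data records the target sector address of every slot plus a sequence number, and a hash table maps sector addresses to their position in the cache so reads and re-writes can find them. The line size and the Python dict standing in for the hash table are illustrative choices.

```python
# Log-structured write cache: append writes to cache lines, index them by
# target sector address for later reads and for de-staging in order.
class LogStructuredWriteCache:
    SECTORS_PER_LINE = 4

    def __init__(self):
        self.lines = []      # each line: {"seq": n, "sectors": [(addr, data), ...]}
        self.index = {}      # sector address -> (line number, slot)
        self.next_seq = 0

    def write(self, sector_addr, data):
        if not self.lines or len(self.lines[-1]["sectors"]) == self.SECTORS_PER_LINE:
            self.lines.append({"seq": self.next_seq, "sectors": []})
            self.next_seq += 1
        line = self.lines[-1]
        line["sectors"].append((sector_addr, data))
        self.index[sector_addr] = (len(self.lines) - 1, len(line["sectors"]) - 1)

    def read(self, sector_addr):
        pos = self.index.get(sector_addr)
        if pos is None:
            return None      # not in the write cache; go to the target location
        line_no, slot = pos
        return self.lines[line_no]["sectors"][slot][1]

wc = LogStructuredWriteCache()
wc.write(1234, b"new data")
print(wc.read(1234))         # served from the write cache
```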

Journal ArticleDOI
TL;DR: This paper proposes a proactive cache management scheme that not only improves the cache hit ratio, the throughput, and the bandwidth utilization, but also reduces the query delay and the power consumption.
Abstract: Recent work has shown that invalidation report (IR)-based cache management is an attractive approach for mobile environments. However, the IR-based cache invalidation solution has some limitations, such as long query delay, low bandwidth utilization, and it is not suitable for applications where data change frequently. In this paper, we propose a proactive cache management scheme to address these issues. Instead of passively waiting, the clients intelligently prefetch the data that are most likely used in the future. Based on a novel prefetch-access ratio concept, the proposed scheme can dynamically optimize performance or power based on the available resources and the performance requirements. To deal with frequently updated data, different techniques (indexing and caching) are applied to handle different components of the data based on their update frequency. Detailed simulation experiments are carried out to evaluate the proposed methodology. Compared to previous schemes, our solution not only improves the cache hit ratio, the throughput, and the bandwidth utilization, but also reduces the query delay and the power consumption.

Patent
06 Aug 2002
TL;DR: In this article, the cache content storage and replacement policies for a distributed plurality of network edge caches are centrally determined by a content selection server that executes a first process over a bounded content domain against a predefined set of domain content identifiers.
Abstract: A network edge cache management system centrally determines cache content storage and replacement policies for a distributed plurality of network edge caches. The management system includes a content selection server that executes a first process over a bounded content domain against a predefined set of domain content identifiers to produce a meta-content description of the bounded content domain, a second process against the meta-content description to define a plurality of content groups representing respective content sub-sets of the bounded content domain, a third process to associate respective sets of predetermined cache management attributes with the plurality of content groups, and a fourth process to generate a plurality of cache control rule bases selectively storing identifications of the plurality of content groups and corresponding associated sets of the predetermined cache management attributes. The cache control rule bases are distributed to the plurality of network edge cache servers.

Journal ArticleDOI
11 Nov 2002
TL;DR: This paper proposes a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page, and shows that PAX performs well across different memory system designs.
Abstract: Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results (which were obtained without using any indices on the participating relations), when compared to NSM: (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses; (b) range selection queries and updates on memory-resident relations execute 17-25% faster; and (c) TPC-H queries involving I/O execute 11-48% faster. Finally, we show that PAX performs well across different memory system designs.
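
The layout difference can be sketched in a few lines: within a page, PAX groups values of the same attribute together ("minipages"), so a scan over one attribute pulls in cache lines full of useful values instead of whole records. The record schema and page representation below are illustrative, not the paper's on-disk format.

```python
# NSM (row-major) versus PAX (attribute-grouped) placement within a page.
records = [(1, "alice", 30), (2, "bob", 41), (3, "carol", 29)]

def nsm_page(rows):
    # N-ary storage model: records stored one after another
    return [value for row in rows for value in row]

def pax_page(rows):
    # PAX: one minipage per attribute, holding that attribute's values
    num_attrs = len(rows[0])
    return [[row[a] for row in rows] for a in range(num_attrs)]

print(nsm_page(records))   # [1, 'alice', 30, 2, 'bob', 41, 3, 'carol', 29]
print(pax_page(records))   # [[1, 2, 3], ['alice', 'bob', 'carol'], [30, 41, 29]]

# A predicate on the third attribute only touches the third minipage under PAX:
ages = pax_page(records)[2]
selected = [i for i, age in enumerate(ages) if age < 35]
print(selected)            # row positions 0 and 2
```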

Proceedings ArticleDOI
03 Jun 2002
TL;DR: Fractal prefetching B+-Trees (fpB+-Trees), as discussed by the authors, embed cache-optimized trees within disk-optimized trees in order to optimize both cache and I/O performance.
Abstract: B+-Trees have been traditionally optimized for I/O performance with disk pages as tree nodes. Recently, researchers have proposed new types of B+-Trees optimized for CPU cache performance in main memory environments, where the tree node sizes are one or a few cache lines. Unfortunately, due primarily to this large discrepancy in optimal node sizes, existing disk-optimized B+-Trees suffer from poor cache performance while cache-optimized B+-Trees exhibit poor disk performance. In this paper, we propose fractal prefetching B+-Trees (fpB+-Trees), which embed "cache-optimized" trees within "disk-optimized" trees, in order to optimize both cache and I/O performance. We design and evaluate two approaches to breaking disk pages into cache-optimized nodes: disk-first and cache-first. These approaches are somewhat biased in favor of maximizing disk and cache performance, respectively, as demonstrated by our results. Both implementations of fpB+-Trees achieve dramatically better cache performance than disk-optimized B+-Trees: a factor of 1.1-1.8 improvement for search, up to a factor of 4.2 improvement for range scans, and up to a 20-fold improvement for updates, all without significant degradation of I/O performance. In addition, fpB+-Trees accelerate I/O performance for range scans by using jump-pointer arrays to prefetch leaf pages, thereby achieving a speed-up of 2.5-5 on IBM's DB2 Universal Database.

Proceedings ArticleDOI
18 Nov 2002
TL;DR: This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching, and examines using the pointer cache in a wide issue superscalar processor as a value predictor and to aidPrefetching when a chain of pointers is being traversed.
Abstract: Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predict future memory addresses based on previous patterns. Thread-based prefetchers use portions of the actual program code to determine future load addresses for prefetching. This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching. The pointer cache provides, for a given pointer's effective address, the base address of the object pointed to by the pointer. We examine using the pointer cache in a wide issue superscalar processor as a value predictor and to aid prefetching when a chain of pointers is being traversed. When a load misses in the L1 cache, but hits in the pointer cache, the first two cache blocks of the pointed to object are prefetched. In addition, the load's dependencies are broken by using the pointer cache hit as a value prediction. We also examine using the pointer cache to allow speculative precomputation to run farther ahead of the main thread of execution than in prior studies. Previously proposed thread-based prefetchers are limited in how far they can run ahead of the main thread when traversing a chain of recurrent dependent loads. When combined with the pointer cache, a speculative thread can make better progress ahead of the main thread, rapidly traversing data structures in the face of cache misses caused by pointer transitions.
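
The pointer-cache mechanism can be sketched as a table that maps the effective address of a pointer field to the base address of the object it points to; on an L1 miss that hits in this table, the predicted target serves as a value prediction and the first couple of blocks of the target object are prefetched. The block size and table management below are illustrative assumptions.

```python
# Pointer cache sketch: remember pointer transitions (field address -> object
# base) so later misses on the same pointer can be value-predicted and the
# target object prefetched.
BLOCK_SIZE = 64

class PointerCache:
    def __init__(self):
        self.table = {}      # pointer effective address -> pointed-to base address

    def train(self, pointer_addr, loaded_value):
        self.table[pointer_addr] = loaded_value   # record the observed transition

    def on_l1_miss(self, pointer_addr, prefetch):
        target = self.table.get(pointer_addr)
        if target is None:
            return None                            # no prediction available
        prefetch(target)                           # first block of the object
        prefetch(target + BLOCK_SIZE)              # and the next one
        return target                              # usable as a value prediction

pc = PointerCache()
pc.train(pointer_addr=0x1000, loaded_value=0x8000)   # node->next observed once
predicted = pc.on_l1_miss(0x1000, prefetch=lambda a: print(f"prefetch {hex(a)}"))
print(hex(predicted))
```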

Patent
25 Jan 2002
TL;DR: In this article, a system and computer implementable method for updating content on servers coupled to a network is described, which includes updating an origin server with a version of files used to provide content, retrieving data that indicates an action to be performed on one or more cache servers in conjunction with updating the origin server, and performing the action to update entries in the one/more cache servers.
Abstract: A system and computer implementable method for updating content on servers coupled to a network. The method includes updating an origin server with a version of files used to provide content, retrieving data that indicates an action to be performed on one or more cache servers in conjunction with updating the origin server, and performing the action to update entries in the one or more cache servers. Each entry in each cache server is associated with a subset of the content on the origin server and may include an expiration field and/or a time to live field. An example of a subset of content to which a cache entry may be associated is a Web page. Cache servers are not required to poll origin servers to determine whether new content is available. Cache servers may be pre-populated using push or pull techniques.

Journal ArticleDOI
TL;DR: A novel caching scheme that integrates both object placement and replacement policies and which makes caching decisions on all candidate sites in a coordinated fashion is proposed.
Abstract: Web caching is an important technique for reducing Internet access latency, network traffic, and server load. This paper investigates cache management strategies for the en-route web caching environment, where caches are associated with routing nodes in the network. We propose a novel caching scheme that integrates both object placement and replacement policies and which makes caching decisions on all candidate sites in a coordinated fashion. In our scheme, cache status information along the routing path of a request is used in dynamically determining where to cache the requested object and what to replace if there is not enough space. The object placement problem is formulated as an optimization problem and the optimal locations to cache the object are obtained using a low-cost dynamic programming algorithm. Extensive simulation experiments have been performed to evaluate the proposed scheme in terms of a wide range of performance metrics. The results show that the proposed scheme significantly outperforms existing algorithms which consider either object placement or replacement at individual caches only.

Patent
25 Mar 2002
TL;DR: In this article, a second tier cache memory is coupled to each first tier cache memory in a set of first tier caches, and the second tier cache memory includes a data ring interface and a snoop ring interface.
Abstract: A set of cache memory includes a set of first tier cache memory and a second tier cache memory. In the set of first tier cache memory each first tier cache memory is coupled to a compute engine in a set of compute engines. The second tier cache memory is coupled to each first tier cache memory in the set of first tier cache memory. The second tier cache memory includes a data ring interface and a snoop ring interface.

Patent
23 Aug 2002
TL;DR: In this paper, the authors present a method and apparatus for shared cache coherency for a chip multiprocessor or a multi-core system. But they do not specify the cache lines themselves.
Abstract: A method and apparatus for shared cache coherency for a chip multiprocessor or a multiprocessor system. In one embodiment, a multicore processor includes a plurality of processor cores, each having a private cache, and a shared cache. An internal snoop bus is coupled to each private cache and the shared cache to communicate data from each private cache to other private caches and the shared cache. In another embodiment, an apparatus includes a plurality of processor cores and a plurality of caches. One of the plurality of caches maintains cache lines in two different modified states. The first modified state indicates a most recent copy of a modified cache line, and the second modified state indicates a stale copy of the modified cache line.

Patent
03 Dec 2002
TL;DR: In this article, a cache management system comprises a cache adapted store data corresponding to a data source and a cache manager adapted to access a set of rules to determine a frequency for automatically updating the data in the cache.
Abstract: A cache management system comprises a cache adapted store data corresponding to a data source. The cache management system also comprises a cache manager adapted to access a set of rules to determine a frequency for automatically updating the data in the cache. The cache manager is also adapted to automatically communicate with the data source to update the data in the cache corresponding to the determined frequency.

Proceedings ArticleDOI
01 Jan 2002
TL;DR: This work investigates the complexity of finding the optimal placement of objects (or code) in the memory, in the sense that this placement reduces the cache misses to the minimum, and shows that this problem is one of the toughest amongst the interesting algorithmic problems in computer science.
Abstract: The growing gap between the speed of memory access and cache access has made cache misses an influential factor in program efficiency. Much effort has been spent recently on reducing the number of cache misses during program run. This effort includes wise rearranging of program code, cache-conscious data placement, and algorithmic modifications that improve the program cache behavior. In this work we investigate the complexity of finding the optimal placement of objects (or code) in the memory, in the sense that this placement reduces the cache misses to the minimum. We show that this problem is one of the toughest amongst the interesting algorithmic problems in computer science. In particular, suppose one is given a sequence of memory accesses and one has to place the data in the memory so as to minimize the number of cache misses for this sequence. We show that if P ≠ NP, then one cannot efficiently approximate the optimal solution even up to a very liberal approximation ratio. Thus, this problem joins the small family of extremely inapproximable optimization problems. The other two famous members in this family are minimum coloring and maximum clique.

Patent
19 Apr 2002
TL;DR: In this article, a streaming delivery accelerator (SDA) receives content from a content provider, caches at least part of the content, forming a cache file, and streams the cache file to a user.
Abstract: Systems and methods for streaming of multimedia files over a network are described. A streaming delivery accelerator (SDA) receives content from a content provider, caches at least part of the content, forming a cache file, and streams the cache file to a user. The described systems and methods are directed to separate (shred) the content into contiguous cache files suitable for streaming. The shredded cache files may have different transmission bit rates and/or different content, such as audio, text, etc. Checksums can migrate from the content file to the shredded cache files and between different network protocols without the need for recomputing the checksums.

Journal ArticleDOI
TL;DR: The authors present the least-unified value algorithm, which performs better than existing algorithms for replacing nonuniform data objects in wide-area distributed environments.
Abstract: Cache performance depends heavily on replacement algorithms, which dynamically select a suitable subset of objects for caching in a finite space. Developing such algorithms for wide-area distributed environments is challenging because, unlike traditional paging systems, retrieval costs and object sizes are not necessarily uniform. In a uniform caching environment, a replacement algorithm generally seeks to reduce cache misses, usually by replacing an object with the least likelihood of re-reference. In contrast, reducing total cost incurred due to cache misses is more important in nonuniform caching environments. The authors present the least-unified value algorithm, which performs better than existing algorithms for replacing nonuniform data objects in wide-area distributed environments.
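
A value-based replacement policy of this kind can be sketched as follows: each object's worth is its miss penalty per byte (retrieval cost divided by size) weighted by how recently it was referenced, and the object with the least value is evicted. The recency weighting below (1 / age) is an illustrative assumption, not the paper's exact least-unified-value function.

```python
# Value-based replacement for nonuniform objects: evict the item whose
# (cost / size) * recency-weight is smallest.
class ValueBasedCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.items = {}    # key -> (size, cost, last_access_time)

    def value(self, key, now):
        size, cost, last = self.items[key]
        age = max(now - last, 1)
        return (cost / size) * (1.0 / age)

    def insert(self, key, size, cost, now):
        used = sum(s for s, _, _ in self.items.values())
        while self.items and used + size > self.capacity:
            victim = min(self.items, key=lambda k: self.value(k, now))
            used -= self.items.pop(victim)[0]
        self.items[key] = (size, cost, now)

c = ValueBasedCache(capacity_bytes=100)
c.insert("cheap_big", size=80, cost=10, now=0)
c.insert("costly_small", size=20, cost=50, now=5)
c.insert("new_object", size=40, cost=20, now=50)
print(list(c.items))   # the low cost-per-byte object is the one evicted
```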

Proceedings ArticleDOI
18 Nov 2002
TL;DR: This paper proposes the design of the Frequent Value Cache (FVC), a cache in which storing a frequent value requires few bits as they are stored in encoded form while all other values are stored in unencoded form using 32 bits.
Abstract: Recent work has shown that a small number of distinct frequently occurring values often account for a large portion of memory accesses. In this paper we demonstrate how this frequent value phenomenon can be exploited in designing a cache that trades off performance with energy efficiency. We propose the design of the Frequent Value Cache (FVC) in which storing a frequent value requires few bits as they are stored in encoded form while all other values are stored in unencoded form using 32 bits. The data array is partitioned into two arrays such that if a frequent value is accessed only the first data array is accessed; otherwise an additional cycle is needed to access the second data array. Experiments with some of the SPEC95 benchmarks show that on an average a 64 Kb/64-value FVC provides 28.8% reduction in L1 cache energy and 3.38% increase in execution time delay over a conventional 64 Kb cache.
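
The encoding idea can be sketched directly: values that appear in a small table of frequently occurring values are stored as a short index plus a flag, and everything else is stored as a full 32-bit word, which is why only the narrow data array needs to be read on a frequent-value access. The table contents below are an illustrative assumption; the paper derives them by profiling programs.

```python
# Frequent-value encoding: a 64-entry table turns common 32-bit values into
# 6-bit indices; uncommon values keep their full 32-bit representation.
FREQUENT_VALUES = [0, 1, 0xFFFFFFFF, 255, 1024, 4096]   # up to 64 entries (assumed)
FV_INDEX = {v: i for i, v in enumerate(FREQUENT_VALUES)}

def encode(value):
    if value in FV_INDEX:
        return ("encoded", FV_INDEX[value])     # few bits: fits the narrow data array
    return ("full", value & 0xFFFFFFFF)         # 32 bits: needs the second data array

def decode(entry):
    kind, payload = entry
    if kind == "encoded":
        return FREQUENT_VALUES[payload]         # one fast array access
    return payload                              # extra cycle for the wide array

words = [0, 1024, 7, 255]
stored = [encode(w) for w in words]
print(stored)
print([decode(e) for e in stored])
```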

Journal ArticleDOI
TL;DR: It is shown that the approaches proposed in this paper (referred to as selective caching), where only a few frames are cached, can also contribute to significant improvements in the overall performance.
Abstract: Proxy caching has been used to speed up Web browsing and reduce networking costs. In this paper, we study the extension of proxy caching techniques to streaming video applications. A trivial extension consists of storing complete video sequences in the cache. However, this may not be applicable in situations where the video objects are very large and proxy cache space is limited. We show that the approaches proposed in this paper (referred to as selective caching), where only a few frames are cached, can also contribute to significant improvements in the overall performance. In particular, we discuss two network environments for streaming video, namely, quality-of-service (QoS) networks and best-effort networks (Internet). For QoS networks, the video caching goal is to reduce the network bandwidth costs; for best-effort networks, the goal is to increase the robustness of continuous playback against poor network conditions (such as congestion, delay, and loss). Two different selective caching algorithms (SCQ and SCB) are proposed, one for each network scenario, to increase the relevant overall performance metric in each case, while requiring only a fraction of the video stream to be cached. The main contribution of our work is to provide algorithms that are efficient even when the buffer memory available at the client is limited. These algorithms are also scalable so that when changes in the environment occur it is possible, with low complexity, to modify the allocation of cache space to different video sequences.

Proceedings ArticleDOI
22 Sep 2002
TL;DR: This work presents several architectural techniques that exploit the data duplication across the different levels of cache hierarchy, and employs both state-preserving and state-destroying leakage control mechanisms to L2 subblocks when their data also exist in L1.
Abstract: Energy management is important for a spectrum of systems ranging from high-performance architectures to low-end mobile and embedded devices. With the increasing number of transistors, smaller feature sizes, lower supply and threshold voltages, the focus on energy optimization is shifting from dynamic to leakage energy. Leakage energy is of particular concern in dense cache memories that form a major portion of the transistor budget. In this work, we present several architectural techniques that exploit the data duplication across the different levels of cache hierarchy. Specifically, we employ both state-preserving (data-retaining) and state-destroying leakage control mechanisms to L2 subblocks when their data also exist in L1. Using a set of media and array-dominated applications, we demonstrate the effectiveness of the proposed techniques through cycle-accurate simulation. We also compare our schemes with the previously proposed cache decay policy. This comparison indicates that one of our schemes generates competitive results with cache decay.