
Showing papers on "Cache algorithms published in 2004"


Proceedings ArticleDOI
29 Sep 2004
TL;DR: It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
Abstract: This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Second, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
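
As a concrete illustration of the kind of metric and repartitioning step involved, the sketch below normalizes each thread's shared-cache miss count to its miss count when running alone and shifts ways from the least-affected thread to the most-affected one. It is a hedged Python sketch, not the paper's exact metrics or algorithm; the function names, the specific metric, and the one-way step size are illustrative assumptions.

```python
def fairness_metric(shared_misses, alone_misses):
    """Sketch of an execution-time-style fairness metric.

    shared_misses[i]: misses of thread i when sharing the cache
    alone_misses[i]:  misses of thread i when running alone
    A perfectly fair partition affects all threads by the same factor,
    so the spread of the normalized values is zero.
    """
    ratios = [s / a for s, a in zip(shared_misses, alone_misses)]
    return max(ratios) - min(ratios)   # smaller is fairer


def repartition(ways, shared_misses, alone_misses, step=1):
    """One dynamic-partitioning step (illustrative): take `step` ways from the
    thread that suffers least and give them to the one that suffers most."""
    ratios = [s / a for s, a in zip(shared_misses, alone_misses)]
    loser = ratios.index(max(ratios))    # most slowed down relative to running alone
    winner = ratios.index(min(ratios))   # least slowed down
    if ways[winner] > step:
        ways[winner] -= step
        ways[loser] += step
    return ways


if __name__ == "__main__":
    ways = [4, 4]   # an 8-way shared L2 split between two co-scheduled threads
    print(repartition(ways, shared_misses=[900, 300], alone_misses=[300, 250]))
```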

544 citations


Journal ArticleDOI
TL;DR: The results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory and can improve the total IPC significantly over the standard least recently used (LRU) replacement policy.
Abstract: This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
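
The abstract does not spell out the allocation rule, so the following Python sketch shows one common way such miss-driven partitioning is realized: greedy marginal-gain allocation over per-thread miss curves collected at run time. Treat it as an assumption-level illustration rather than the paper's exact scheme; `miss_curves` and the greedy loop are hypothetical.

```python
def partition_ways(miss_curves, total_ways):
    """Greedily hand out cache ways so that each additional way goes to the
    process/thread whose miss count would drop the most.

    miss_curves[t][w] = misses of thread t when given w ways (w = 0..total_ways).
    Returns the number of ways assigned to each thread.
    """
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # marginal benefit of one more way for each thread
        gains = [curve[a] - curve[a + 1] for curve, a in zip(miss_curves, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc


if __name__ == "__main__":
    # Toy miss curves for two threads sharing an 8-way cache.
    t0 = [1000, 700, 500, 400, 350, 330, 320, 315, 312]
    t1 = [800, 780, 770, 760, 300, 200, 150, 120, 110]
    print(partition_ways([t0, t1], total_ways=8))   # per-thread way counts
```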

402 citations


Journal ArticleDOI
02 Mar 2004
TL;DR: An adaptive policy that dynamically adapts to the costs and benefits of cache compression is developed and it is shown that compression can improve performance for memory-intensive commercial workloads by up to 17%.
Abstract: Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks, due to unnecessary decompression overhead, degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.
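
The counter update described above is compact enough to sketch directly: on each L2 reference, the LRU stack depth and the compressed size indicate whether compression would have turned a miss into a hit or merely added decompression latency. The class below is a hedged Python reconstruction of that idea; the penalty/benefit weights and counter width are illustrative assumptions, not the paper's values.

```python
class CompressionPredictor:
    """Single global saturating counter that predicts whether to allocate
    L2 lines in compressed or uncompressed form."""

    def __init__(self, bits=16):
        self.counter = 0
        self.max = (1 << bits) - 1

    def update(self, lru_depth, fits_compressed,
               miss_penalty=400, decompress_penalty=5,
               uncompressed_ways=4, compressed_ways=8):
        # Hit only because of compression: depth 5..8 and the line fits compressed.
        if uncompressed_ways < lru_depth <= compressed_ways and fits_compressed:
            self.counter = min(self.max, self.counter + miss_penalty)
        # Hit within the top 4 ways: compression would only have added latency.
        elif lru_depth <= uncompressed_ways:
            self.counter = max(0, self.counter - decompress_penalty)

    def allocate_compressed(self):
        return self.counter > self.max // 2
```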

304 citations


Journal ArticleDOI
02 Mar 2004
TL;DR: A proposed performance model for superscalar processors consists of a component that models the relationship between instructions issued per cycle and the size of the instruction window under ideal conditions and methods for calculating transient performance penalties due to branch mispredictions, instruction cache misses, and data cache misses.
Abstract: A proposed performance model for superscalar processors consists of 1) a component that models the relationship between instructions issued per cycle and the size of the instruction window under ideal conditions, and 2) methods for calculating transient performance penalties due to branch mispredictions, instruction cache misses, and data cache misses. Using trace-derived data dependence information, data and instruction cache miss rates, and branch misprediction rates as inputs, the model can arrive at performance estimates for a typical superscalar processor that are within 5.8% of detailed simulation on average and within 13% in the worst case. The model also provides insights into the workings of superscalar processors and long-term microarchitecture trends such as pipeline depths and issue widths.
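
A first-order version of such a model simply adds transient penalty cycles to the ideal-window estimate. The Python sketch below is only an illustrative composition of the inputs named in the abstract (ideal IPC, cache miss rates, branch misprediction rate, and their penalties); it is not the paper's actual equations, and all parameter values are made up.

```python
def estimate_ipc(ideal_ipc, insts,
                 branch_mpki, branch_penalty,
                 icache_mpki, icache_penalty,
                 dcache_mpki, dcache_penalty):
    """Toy first-order model: total cycles = ideal cycles plus cycles lost to
    branch mispredictions and cache misses (rates given per 1000 instructions)."""
    ideal_cycles = insts / ideal_ipc
    penalty_cycles = (insts / 1000.0) * (branch_mpki * branch_penalty +
                                         icache_mpki * icache_penalty +
                                         dcache_mpki * dcache_penalty)
    return insts / (ideal_cycles + penalty_cycles)


if __name__ == "__main__":
    print(estimate_ipc(ideal_ipc=3.0, insts=1_000_000,
                       branch_mpki=5, branch_penalty=12,
                       icache_mpki=2, icache_penalty=20,
                       dcache_mpki=10, dcache_penalty=100))
```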

295 citations


Proceedings ArticleDOI
Ravi Iyer1
26 Jun 2004
TL;DR: A new cache management framework (CQoS) is presented that recognizes the heterogeneity in memory access streams, introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity, and assigns and enforces priorities to streams based on latency sensitivity, locality degree, and application performance needs.
Abstract: Cache hierarchies have been traditionally designed for usage by a single application, thread or core. As multi-threaded (MT) and multi-core (CMP) platform architectures emerge and their workloads range from single-threaded and multithreaded applications to complex virtual machines (VMs), a shared cache resource will be consumed by these different entities generating heterogeneous memory access streams exhibiting different locality properties and varying memory sensitivity. As a result, conventional cache management approaches that treat all memory accesses equally are bound to result in inefficient space utilization and poor performance even for applications with good locality properties. To address this problem, this paper presents a new cache management framework (CQoS) that (1) recognizes the heterogeneity in memory access streams, (2) introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity and (3) assigns and enforces priorities to streams based on latency sensitivity, locality degree and application performance needs. To achieve this, we propose CQoS options for priority classification, priority assignment and priority enforcement. We briefly describe CQoS priority classification and assignment options -- ranging from user-driven and developer-driven to compiler-detected and flow-based approaches. Our focus in this paper is on CQoS mechanisms for priority enforcement -- these include (1) selective cache allocation, (2) static/dynamic set partitioning and (3) heterogeneous cache regions. We discuss the architectural design and implementation complexity of these CQoS options. To evaluate the performance trade-offs for these options, we have modeled these CQoS options in a cache simulator and evaluated their performance in CMP platforms running network-intensive server workloads. Our simulation results show the effectiveness of our proposed options and make the case for CQoS in future multi-threaded/multi-core platforms since it improves shared cache efficiency and increases overall system performance as a result.

291 citations


Journal ArticleDOI
Nimrod Megiddo1, Dharmendra S. Modha1
TL;DR: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features.
Abstract: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features. Caching, a fundamental metaphor in modern computing, finds wide application in storage systems, databases, Web servers, middleware, processors, file systems, disk drives, redundant array of independent disks controllers, operating systems, and other applications such as data compression and list updating. In a two-level memory hierarchy, a cache performs faster than auxiliary storage, but it is more expensive. Cost concerns thus usually limit cache size to a fraction of the auxiliary memory's size.
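
The balancing idea can be sketched in a few dozen lines: a recency list and a frequency list share the cache, ghost lists remember recent evictions, and ghost hits steer a target size between the two. The Python class below is a simplified illustration of that adaptation, not the published ARC algorithm; list-handling details (for example, where a ghost hit is reinserted) deviate from the real policy.

```python
from collections import OrderedDict


class AdaptiveCacheSketch:
    """Simplified sketch of ARC-style balancing: recency list T1 and frequency
    list T2 share capacity c; ghost lists B1/B2 adapt the target size p."""

    def __init__(self, c):
        self.c, self.p = c, 0
        self.t1, self.t2 = OrderedDict(), OrderedDict()   # resident entries
        self.b1, self.b2 = OrderedDict(), OrderedDict()   # ghost (evicted) keys

    def _evict(self):
        if self.t1 and len(self.t1) > self.p:   # T1 larger than its target
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        elif self.t2:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None
        else:
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        for ghost in (self.b1, self.b2):        # keep ghost lists bounded
            while len(ghost) > self.c:
                ghost.popitem(last=False)

    def access(self, key):
        """Return True on a hit; on a miss, insert `key`, evicting if needed."""
        if key in self.t1:                      # recency hit: promote
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:                      # frequency hit: refresh
            self.t2.move_to_end(key)
            return True
        target = self.t1                        # plain miss: recency insertion
        if key in self.b1:                      # ghost hit: recency side too small
            self.p = min(self.c, self.p + 1)
            del self.b1[key]
            target = self.t2
        elif key in self.b2:                    # ghost hit: frequency side too small
            self.p = max(0, self.p - 1)
            del self.b2[key]
            target = self.t2
        if len(self.t1) + len(self.t2) >= self.c:
            self._evict()
        target[key] = None
        return False
```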

261 citations


Proceedings Article
31 Mar 2004
TL;DR: A simple and elegant new algorithm, namely, CLOCK with Adaptive Replacement (CAR), that has several advantages over CLOCK: it is scan-resistant, self-tuning and it adaptively and dynamically captures the "recency" and "frequency" features of a workload.
Abstract: CLOCK is a classical cache replacement policy dating back to 1968 that was proposed as a low-complexity approximation to LRU. On every cache hit, the policy LRU needs to move the accessed item to the most recently used position, at which point, to ensure consistency and correctness, it serializes cache hits behind a single global lock. CLOCK eliminates this lock contention, and, hence, can support high concurrency and high throughput environments such as virtual memory (for example, Multics, UNIX, BSD, AIX) and databases (for example, DB2). Unfortunately, CLOCK is still plagued by disadvantages of LRU such as disregard for "frequency", susceptibility to scans, and low performance. As our main contribution, we propose a simple and elegant new algorithm, namely, CLOCK with Adaptive Replacement (CAR), that has several advantages over CLOCK: (i) it is scan-resistant; (ii) it is self-tuning and it adaptively and dynamically captures the "recency" and "frequency" features of a workload; (iii) it uses essentially the same primitives as CLOCK, and, hence, is low-complexity and amenable to a high-concurrency implementation; and (iv) it outperforms CLOCK across a wide range of cache sizes and workloads. The algorithm CAR is inspired by the Adaptive Replacement Cache (ARC) algorithm, and inherits virtually all advantages of ARC including its high performance, but does not serialize cache hits behind a single global lock. As our second contribution, we introduce another novel algorithm, namely, CAR with Temporal filtering (CART), that has all the advantages of CAR, but, in addition, uses a certain temporal filter to distill pages with long-term utility from those with only short-term utility.
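
For reference, the CLOCK baseline that CAR builds on is tiny: pages sit in a circular buffer with a reference bit, a hit merely sets the bit (no list reordering, hence no global lock on the hit path), and the hand clears bits until it finds a victim. The sketch below is plain CLOCK in Python, not CAR itself.

```python
class ClockCache:
    """Minimal CLOCK replacement sketch."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []        # each slot is [key, reference_bit]
        self.index = {}        # key -> slot position
        self.hand = 0

    def access(self, key):
        pos = self.index.get(key)
        if pos is not None:
            self.slots[pos][1] = 1            # hit: just set the reference bit
            return True
        if len(self.slots) < self.capacity:   # cold start: fill an empty slot
            self.index[key] = len(self.slots)
            self.slots.append([key, 1])
            return False
        while True:                           # advance the hand to find a victim
            victim_key, ref = self.slots[self.hand]
            if ref:
                self.slots[self.hand][1] = 0  # give the page a second chance
                self.hand = (self.hand + 1) % self.capacity
            else:
                del self.index[victim_key]
                self.slots[self.hand] = [key, 1]
                self.index[key] = self.hand
                self.hand = (self.hand + 1) % self.capacity
                return False
```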

253 citations


Proceedings ArticleDOI
14 Feb 2004
TL;DR: This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy and investigates the effects of four storage cache write policies on disk energy consumption.
Abstract: Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady’s off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.

252 citations


01 Jan 2004
TL;DR: This work proposes and evaluates a simple significance-based compression scheme that has a low compression and decompression overhead and provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
Abstract: With the widening gap between processor and memory speeds, memory system designers may find cache compression beneficial to increase cache capacity and reduce off-chip bandwidth. Most hardware compression algorithms fall into the dictionary-based category, which depends on building a dictionary and using its entries to encode repeated data values. Such algorithms are effective in compressing large data blocks and files. Cache lines, however, are typically short (32-256 bytes), and a per-line dictionary places a significant overhead that limits the compressibility and increases decompression latency of such algorithms. For such short lines, significance-based compression is an appealing alternative. We propose and evaluate a simple significance-based compression scheme that has a low compression and decompression overhead. This scheme, Frequent Pattern Compression (FPC), compresses individual cache lines on a word-by-word basis by storing common word patterns in a compressed format accompanied by an appropriate prefix. For a 64-byte cache line, compression can be completed in three cycles and decompression in five cycles, assuming 12 FO4 gate delays per cycle. We propose a compressed cache design in which data is stored in compressed form in the L2 caches, but uncompressed in the L1 caches. L2 cache lines are compressed to predetermined sizes that never exceed their original size to reduce decompression overhead. This simple scheme provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
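
To make the per-word idea concrete, the sketch below classifies each 32-bit word into a significance pattern and charges a small prefix plus only the significant bits. The pattern set, prefix width, and demo line are illustrative assumptions; the paper's FPC defines its own specific prefixes and patterns.

```python
def classify_word(word):
    """Return (pattern_name, payload_bits) for one 32-bit word."""
    word &= 0xFFFFFFFF
    signed = word - (1 << 32) if word & 0x80000000 else word
    if word == 0:
        return "zero", 0
    if -(1 << 3) <= signed < (1 << 3):
        return "4-bit sign-extended", 4
    if -(1 << 7) <= signed < (1 << 7):
        return "8-bit sign-extended", 8
    if -(1 << 15) <= signed < (1 << 15):
        return "16-bit sign-extended", 16
    if len(set(word.to_bytes(4, "little"))) == 1:
        return "repeated byte", 8
    return "uncompressed", 32


def compressed_size(line_words, prefix_bits=3):
    """Compressed size of a cache line, in bits, under the sketch above."""
    return sum(prefix_bits + classify_word(w)[1] for w in line_words)


if __name__ == "__main__":
    line = [0, 0, 5, 0xFFFFFFFF, 0x12341234, 0xDEADBEEF] + [0] * 10   # 16 words
    print(compressed_size(line), "bits vs", 16 * 32, "bits uncompressed")
```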

233 citations


Proceedings ArticleDOI
15 Apr 2004
TL;DR: Three new meta algorithms are proposed and compared against the de facto one and a recently proposed one by means of synthetic and trace-driven simulations, leading to improved performance under most simulated scenarios, especially under a low availability of storage.
Abstract: Large-scale hierarchical caches for Web content have been deployed widely in an attempt to reduce delivery delays and bandwidth consumption and also to improve the scalability of content dissemination through the World Wide Web. Irrespective of the specific replacement algorithm employed in each cache, a de facto characteristic of contemporary hierarchical caches is that a hit for a document at an l-level cache leads to the caching of the document in all intermediate caches (levels l-1,..., 1) on the path towards the leaf cache that received the initial request. This paper presents various algorithms that revise this standard behavior and attempt to be more selective in choosing the caches that get to store a local copy of the requested document. As these algorithms operate independently of the actual replacement algorithm running in each individual cache, they are referred to as meta algorithms. Three new meta algorithms are proposed and compared against the de facto one and a recently proposed one by means of synthetic and trace-driven simulations. The best of the new meta algorithms appears to lead to improved performance under most simulated scenarios, especially under a low availability of storage. The latter observation makes the presented meta algorithms particularly favorable for the handling of large data objects such as stored music files or short video clips. Additionally, a simple load balancing algorithm that is based on the concept of meta algorithms is proposed and evaluated. The algorithm is shown to be able to provide effective load balancing, thus possibly addressing the recently discovered "filtering-effect" in hierarchical Web caches.
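
A meta algorithm in this sense sits above the per-cache replacement policy and only decides which caches on the path back to the leaf keep a copy. The Python sketch below contrasts the de facto behavior with one deliberately simple alternative (leaf-only placement) to show the interface; the paper's three proposed meta algorithms are more selective than this, and the function names here are invented for illustration.

```python
def cache_everywhere(path_caches, doc):
    """De facto behavior: a hit at level l stores the document in every
    intermediate cache on the way back down to the leaf."""
    for cache in path_caches:
        cache.add(doc)


def cache_leaf_only(path_caches, doc):
    """A trivial alternative meta algorithm: store only at the leaf cache,
    leaving upper levels free for documents shared by many leaves."""
    if path_caches:
        path_caches[-1].add(doc)


if __name__ == "__main__":
    # Each 'cache' is just a set here; real caches would run their own
    # replacement policy independently of the meta algorithm.
    root, middle, leaf = set(), set(), set()
    cache_leaf_only([root, middle, leaf], "video-clip-42")
    print(root, middle, leaf)
```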

224 citations


Patent
22 Jul 2004
TL;DR: In this paper, the authors propose a method of updating a cache in an integrated circuit comprising: the cache; a processor connected to the cache via a cache bus; a memory interface connected to the cache via a first bus and to the processor via a second bus, the first bus being wider than the second bus or the cache bus.
Abstract: A method of updating a cache in an integrated circuit comprising: the cache; a processor connected to the cache via a cache bus; a memory interface connected to the cache via a first bus and to the processor via a second bus, the first bus being wider than the second bus or the cache bus; and memory connected to the memory interface via a memory bus; the method comprising the steps of: (a) following a cache miss, using the processor to issue a request for first data via a first address, the first data being that associated with the cache miss; (b) in response to the request, using the memory interface to fetch the first data from the memory, and sending the first data to the processor; (c) sending, from the memory interface and via the first bus, the first data and additional data, the additional data being that stored in the memory adjacent the first data; (d) updating the cache with the first data and the additional data via the first bus; and (e) updating flags in the cache associated with the first data and the additional data, such that the updated first data and additional data in the cache is valid

Patent
28 Jun 2004
TL;DR: In this article, a system and method for filtering of web-based content in a proxy cache server environment provides a local network having a client, a directory server and a proxy caching server that caches predetermined Internet-derived web content within the network.
Abstract: A system and method for filtering of web-based content in a proxy cache server environment provides a local network having a client, a directory server and a proxy cache server that caches predetermined Internet-derived web content within the network. When content is requested, it is vended to the client only if it meets predefined user policies for acceptability. These policies are implemented based upon one or more ratings lists provided by content rating vendors. The lists are downloaded to the network in whole or part, and cached for use in determining acceptability of content by a filter application. Ratings can be particularly based upon predetermined content categories. Caching occurs in a host or object cache for rapid access. Only if current ratings are not found in the host or object caches are ratings caches or vendors accessed for ratings. Ratings on requested content are then placed in the host or object cache for subsequent use. Object parsing or other techniques can be used to screen returned content that is unrated or otherwise allowed to pass to ensure that it is appropriate.

Journal ArticleDOI
TL;DR: The results show that data and instruction caches require different control strategies for efficient execution; to enable drowsy instruction caches, a technique called cache subbank prediction is proposed, which selectively wakes up only the necessary parts of the instruction cache while allowing most of the cache to stay in a low-leakage drowsy mode.
Abstract: On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.
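
The "simple microarchitectural techniques" referred to above amount to a periodic global sleep: every N accesses all lines are put into the drowsy state, and touching a drowsy line costs a small wake-up penalty. The Python sketch below models only that policy; the window size, penalty, and cache size are illustrative assumptions rather than the paper's parameters.

```python
import random


def simulate_drowsy(trace, num_lines, window=4000, wake_penalty=1):
    """Toy model of a periodically applied drowsy policy.

    trace: iterable of cache line indices, one per access.
    Returns (extra cycles paid for wake-ups, average fraction of lines drowsy).
    """
    awake = [False] * num_lines
    extra_cycles = 0
    awake_samples = 0
    for i, line in enumerate(trace):
        if i % window == 0:
            awake = [False] * num_lines      # periodic global sleep event
        if not awake[line]:
            extra_cycles += wake_penalty     # wake-up latency on first touch
            awake[line] = True
        awake_samples += sum(awake)
    drowsy_fraction = 1.0 - awake_samples / (len(trace) * num_lines)
    return extra_cycles, drowsy_fraction


if __name__ == "__main__":
    trace = [random.randrange(512) for _ in range(20_000)]
    print(simulate_drowsy(trace, num_lines=512))
```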

Patent
Franklin Davis1, William Beaty
08 Sep 2004
TL;DR: In this paper, a system and method for smart, persistent cache management of received content within a terminal is presented, where received content is tagged with cache directive allowing cache control to determine which of cache storage locations to use for storage of content.
Abstract: A system and method for smart, persistent cache management of received content within a terminal. Received content is tagged with a cache directive allowing cache control to determine which of the cache storage locations to use for storage of the content. Cache control detects the number of instances in which received content correlates to a newer version of purged content and provides the ability to re-classify the cache persistence directive based upon the number of instances.

Proceedings ArticleDOI
02 Apr 2004
TL;DR: The results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations; however, a relatively large gap between LRU and the optimal replacement policy, of up to 50%, indicates that new research aimed at closing the gap is necessary.
Abstract: Replacement policy, one of the key factors determining the effectiveness of a cache, becomes even more important with the latest technological trends toward highly associative caches. State-of-the-art processors employ various policies such as Random, Least Recently Used (LRU), Round-Robin, and PLRU (Pseudo LRU), indicating that there is no common wisdom about the best one. The optimal yet unattainable policy would replace the cache memory block whose next reference is farthest away in the future, among all memory blocks present in the set. In our quest for a replacement policy as close to optimal as possible, we thoroughly explored the design space of existing replacement mechanisms using the SimpleScalar toolset and the SPEC CPU2000 benchmark suite, across a wide range of cache sizes and organizations. In order to better understand the behavior of different policies, we introduced new measures, such as the cumulative distribution of cache hits in the LRU stack. We also dynamically monitored the number of cache misses per 100,000 instructions. Our results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations. However, a relatively large gap between LRU and the optimal replacement policy, of up to 50%, indicates that new research aimed at closing the gap is necessary. The cumulative distribution of cache hits in the LRU stack indicates a very good potential for way prediction using LRU information, since the percentage of hits to the bottom of the LRU stack is relatively high.
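
One of the pseudo-LRU variants in this design space is tree-based PLRU, which approximates LRU with a single bit per internal node of a binary tree over the ways. The Python sketch below (assuming the tree-based variant and an 8-way set) shows victim selection and the update on an access; it is a generic illustration, not code from the paper.

```python
class TreePLRU:
    """Tree-based pseudo-LRU for one set of a `ways`-way cache (power of two).
    Uses ways-1 bits; each bit points toward the 'colder' subtree."""

    def __init__(self, ways=8):
        assert ways & (ways - 1) == 0
        self.ways = ways
        self.bits = [0] * (ways - 1)      # heap-style tree, node 0 is the root

    def victim(self):
        """Follow the bits from the root to the pseudo-LRU way."""
        node = 0
        while node < self.ways - 1:
            node = 2 * node + 1 + self.bits[node]
        return node - (self.ways - 1)

    def touch(self, way):
        """On an access to `way`, flip the path bits to point away from it."""
        node = way + self.ways - 1
        while node:
            parent = (node - 1) // 2
            self.bits[parent] = 0 if node == 2 * parent + 2 else 1
            node = parent


if __name__ == "__main__":
    plru = TreePLRU(8)
    for w in range(8):
        plru.touch(w)
    print(plru.victim())   # 0: the least recently touched way in this sequence
```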

Journal ArticleDOI
TL;DR: This work investigates multiple approaches to effectively manage second-level buffer caches and reports a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, and a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms.
Abstract: Buffer caches are commonly used in servers to reduce the number of slow disk accesses or network messages. These buffer caches form a multilevel buffer cache hierarchy. In such a hierarchy, second-level buffer caches have different access patterns from first-level buffer caches because accesses to a second-level are actually misses from a first-level. Therefore, commonly used cache management algorithms such as the least recently used (LRU) replacement algorithm that work well for single-level buffer caches may not work well for second-level. We investigate multiple approaches to effectively manage second-level buffer caches. In particular, we report our research results in 1) second-level buffer cache access pattern characterization, 2) a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, 3) a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms, and 4) implementation and evaluation of these algorithms in a real storage system connected with commercial database servers (Microsoft SQL server and Oracle) running industrial-strength online transaction processing benchmarks.

Journal ArticleDOI
TL;DR: The problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays is addressed.
Abstract: We address the problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays. We determine the optimal proxy prefix cache allocation to the videos that minimizes the aggregate network bandwidth cost. We integrate proxy caching with traditional server-based reactive transmission schemes such as batching, patching and stream merging to develop a set of proxy-assisted delivery schemes. We quantitatively explore the impact of the choice of transmission scheme, cache allocation policy, proxy cache size, and availability of unicast versus multicast capability, on the resulting transmission cost. Our evaluations show that even a relatively small prefix cache (10%-20% of the video repository) is sufficient to realize substantial savings in transmission cost. We find that carefully designed proxy-assisted reactive transmission schemes can produce significant cost savings even in a predominantly unicast environment such as the Internet.

Patent
28 Sep 2004
TL;DR: In this paper, a playback apparatus and associated method is disclosed for use in a reproducing system, the apparatus including a cache memory configured to store data read from a data source, a cache replacement unit (341) configured to identify certain of the data to be deleted from the cache memory (335) based on a determination of data source data's use in at least two play modes of the apparatus.
Abstract: A playback apparatus and associated method is disclosed for use in a reproducing system, the apparatus including a cache memory configured to store data read from a data source (1); a cache replacement unit (341) configured to identify certain of the data to be deleted from the cache memory (335) based on a determination of the data source data's use in at least two play modes of the apparatus; and a presentation unit (337) configured to obtain data from the cache memory (335) to be presented to a user. The playback apparatus further includes a disc control unit (343) configured to identify data to be read from the data source (1) to be stored in the cache memory (335) based on the current contents of the cache memory (335).

Proceedings ArticleDOI
09 Aug 2004
TL;DR: This work presents an adaptive error coding scheme that treats dirty and clean data cache blocks differently, and presents an early-write-back scheme that enhances the ability to use a less powerful error protection scheme for a longer time without sacrificing reliability.
Abstract: Energy-efficiency and reliability are two major design constraints influencing next generation system designs. In this work, we focus on the interaction between power consumption and reliability considering the on-chip data caches. First, we investigate the impact of two commonly used architectural-level leakage reduction approaches on the data reliability. Our results indicate that the leakage optimization techniques can have very different reliability behavior as compared to an original cache with no leakage optimizations. Next, we investigate on providing data reliability in an energy efficient fashion in the presence of soft-errors. In contrast to current commercial caches that treat and protect all data using the same error detection/correction mechanism, we present an adaptive error coding scheme that treats dirty and clean data cache blocks differently. Furthermore, we present an early-write-back scheme that enhances the ability to use a less powerful error protection scheme for a longer time without sacrificing reliability. Experimental results show that proposed schemes, when used in conjunction, can reduce dynamic energy of error protection components in L1 data cache by 11% on average without impacting the performance or reliability.

Proceedings ArticleDOI
14 Feb 2004
TL;DR: An in-depth analysis of the pathological behavior of cache hashing functions is presented, and two new hashing functions, prime modulo and prime displacement, are proposed that are resistant to pathological behavior and yet are able to eliminate the worst-case conflict behavior in the L2 cache.
Abstract: Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst case conflict behavior, no study has really analyzed the pathological behavior of such hashing functions that often result in performance slowdown. We present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions: prime modulo and prime displacement that are resistant to pathological behavior and yet are able to eliminate the worst case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow add operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory intensive applications. For applications that have nonuniform cache accesses, both prime modulo and prime displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using multiple prime displacement hashing functions in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7%.
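
Both index functions are easy to state, and the Python sketch below gives straightforward reference implementations (the paper additionally shows how to realize them with narrow add operations, which is not reproduced here). The block size, set count, and the prime 17 are illustrative parameters.

```python
def conventional_index(addr, num_sets, block_bits=6):
    return (addr >> block_bits) % num_sets


def prime_modulo_index(addr, prime, block_bits=6):
    """Prime-modulo hashing: index with a prime number of sets (e.g. 2039 out
    of 2048), which spreads strided access patterns far more uniformly."""
    return (addr >> block_bits) % prime


def prime_displacement_index(addr, num_sets, prime=17, block_bits=6):
    """Prime-displacement hashing: add a prime multiple of the tag to the
    conventional set index, modulo the power-of-two number of sets."""
    block = addr >> block_bits
    set_index = block % num_sets
    tag = block // num_sets
    return (set_index + prime * tag) % num_sets


if __name__ == "__main__":
    # A stride equal to the way size maps every block to one set under
    # conventional indexing but is spread out by both prime-based schemes.
    num_sets, stride = 2048, 2048 * 64
    addrs = [i * stride for i in range(16)]
    print(sorted({conventional_index(a, num_sets) for a in addrs}))
    print(sorted({prime_modulo_index(a, 2039) for a in addrs}))
    print(sorted({prime_displacement_index(a, num_sets) for a in addrs}))
```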

Proceedings ArticleDOI
14 Feb 2004
TL;DR: The spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group at runtime, is described; it requires only a small amount of predictor memory to store the predicted patterns.
Abstract: Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.
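
The correlation the SPP exploits, from (instruction address, offset within the line) to the set of blocks in the spatial group that will be touched, maps naturally onto a small prediction table. The Python sketch below is a software toy model; the table size, index hash, and training hook are assumptions rather than the hardware design.

```python
class SpatialPatternPredictor:
    """Toy SPP model: learn, per (load PC, offset-in-line), a bit vector of
    which blocks of a spatial group tend to be referenced together."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [0] * entries            # learned bit vectors

    def _index(self, pc, offset):
        return (pc ^ offset) % self.entries   # tag-less, direct-mapped

    def predict(self, pc, offset):
        """Bit vector of blocks predicted to be touched in this spatial group."""
        return self.table[self._index(pc, offset)]

    def train(self, pc, offset, used_blocks):
        """When the group leaves the cache, record which blocks were used."""
        self.table[self._index(pc, offset)] = used_blocks


if __name__ == "__main__":
    spp = SpatialPatternPredictor()
    spp.train(pc=0x400123, offset=0, used_blocks=0b00001111)
    print(bin(spp.predict(pc=0x400123, offset=0)))    # -> 0b1111
```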

Journal ArticleDOI
TL;DR: The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes and proves that Min-SAUD achieves optimal stretch under some standard assumptions.
Abstract: Data caching at mobile clients is an important technique for improving the performance of wireless data dissemination systems. However, variable data sizes, data updates, limited client resources, and frequent client disconnections make cache management a challenge. We propose a gain-based cache replacement policy, Min-SAUD, for wireless data dissemination when cache consistency must be enforced before a cached item is used. Min-SAUD considers several factors that affect cache performance, namely, access probability, update frequency, data size, retrieval delay, and cache validation cost. The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes. We prove that Min-SAUD achieves optimal stretch under some standard assumptions. Moreover, a series of simulation experiments have been conducted to thoroughly evaluate the performance of Min-SAUD under various system configurations. The simulation results show that, in most cases, the Min-SAUD replacement policy substantially outperforms two existing policies, namely, LRU and SAIU.

Patent
22 Nov 2004
TL;DR: In this paper, the authors propose a cache that keeps regularly accessed disk I/O data within RAM that forms part of a computer system's main memory, depending on the size of the access.
Abstract: The cache keeps regularly accessed disk I/O data within RAM that forms part of a computer system's main memory. The cache operates across a network of computer systems, maintaining cache coherency for the disk I/O devices that are shared by the multiple computer systems within that network. Read access for disk I/O data that is contained within the RAM is returned much faster than would occur if the disk I/O device was accessed directly. The data is held in one of three areas of the RAM for the cache, dependent on the size of the I/O access. The total RAM containing the three areas for the cache does not occupy a fixed amount of a computer's main memory. The RAM for the cache grows to contain more disk I/O data on demand and shrinks when more of the main memory is required by the computer system for other uses. The user of the cache is allowed to specify which size of I/O access is allocated to the three areas for the RAM, along with a limit for the total amount of main memory that will be used by the cache at any one time.

Proceedings ArticleDOI
16 Feb 2004
TL;DR: This work introduces the two-level cache tuner, or TCaT - a heuristic for searching the huge solution space of possible configurations and shows the integrity of the heuristic across multiple memory configurations and even in the presence of hardware/software partitioning.
Abstract: The power consumed by the memory hierarchy of a microprocessor can contribute to as much as 50% of the total microprocessor system power, and is thus a good candidate for optimizations. We present an automated method for tuning two-level caches to embedded applications for reduced energy consumption. The method is applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We introduce the two-level cache tuner, or TCaT - a heuristic for searching the huge solution space of possible configurations. The heuristic interlaces the exploration of the two cache levels and searches the various cache parameters in a specific order based on their impact on energy. We show the integrity of our heuristic across multiple memory configurations and even in the presence of hardware/software partitioning -- a common optimization capable of achieving significant speedups and/or reduced energy consumption. We apply our exploration heuristic to a large set of embedded applications. Our experiments demonstrate the efficacy of our heuristic: on average the heuristic examines only 7% of the possible cache configurations, but results in cache sub-system energy savings of 53%, only 1% more than the optimal cache configuration. In addition, the configured cache achieves an average speedup of 30% over the base cache configuration due to tuning of cache line size to the application's needs.

Patent
30 Jun 2004
TL;DR: In this paper, a cache mechanism in a network redirector transparently intercepts requests to access server files, and if the requested file is locally cached, satisfies the request from the cache when possible, and also fills in a sparse cached file as reads for data in ranges that are missing in the cached file are requested and received from the server.
Abstract: An improved method and system for client-side caching that transparently caches suitable network files for offline use. A cache mechanism in a network redirector transparently intercepts requests to access server files, and if the requested file is locally cached, satisfies the request from the cache when possible. Otherwise the cache mechanism creates a local cache file and satisfies the request from the server, and also fills in a sparse cached file as reads for data in ranges that are missing in the cached file are requested and received from the server. A background process also fills in local files that are sparse, using the existing handle of already open server files, or opening, reading from and closing other server files. Security is also provided by maintaining security information received from the server for files that are in the cache, and using that security information to determine access to the file when offline.

Journal ArticleDOI
TL;DR: This work introduces on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program, completely transparently to the programmer.
Abstract: Memory accesses often account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in prefabricated microprocessor platforms. Tuning those caches to a program is still, however, a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of tunable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. Our heuristic seeks not only to reduce the number of configurations that must be examined, but also traverses the search space in a way that minimizes costly cache flushes. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache saves on average 40% of total memory access energy over a standard nontuned reference cache.

Proceedings ArticleDOI
26 Jun 2004
TL;DR: Results show that PB-LRU without any parameter tuning provides similar or even better performance and energy savings than the previous power-aware algorithm with the best parameter setting for each workload.
Abstract: Energy consumption is an important concern at data centers, where storage systems consume a significant fraction of the total energy. A recent study proposed power-aware storage cache management to provide more opportunities for the underlying disk power management scheme to save energy. However, the on-line algorithm proposed in that study requires cumbersome parameter tuning for each workload and is therefore difficult to apply to real systems. This paper presents a new power-aware on-line algorithm called PB-LRU (Partition-Based LRU) that requires little parameter tuning. Our results with both real system and synthetic workloads show that PB-LRU without any parameter tuning provides similar or even better performance and energy savings than the previous power-aware algorithm with the best parameter setting for each workload.

Proceedings ArticleDOI
19 Apr 2004
TL;DR: This paper uses trace driven simulations to compare traditional cache replacement policies with new policies that try to exploit characteristics of the P2P file-sharing traffic generated by applications using the FastTrack protocol.
Abstract: Peer-to-peer (P2P) file-sharing applications generate a large part if not most of today's Internet traffic. The large volume of this traffic (thus the high potential benefits of caching) and the large cache sizes required (thus nontrivial costs associated with caching) only underline that efficient cache replacement policies are important in this case. P2P file-sharing traffic has several characteristics that distinguish it from well studied Web traffic and that require a focused study of efficient cache management policies. This paper uses trace driven simulations to compare traditional cache replacement policies with new policies that try to exploit characteristics of the P2P file-sharing traffic generated by applications using the FastTrack protocol.

Proceedings ArticleDOI
27 Jun 2004
TL;DR: The perhaps surprising result is that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, which gives some theoretical justification for designing machines with shared caches.
Abstract: We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp. We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp ≥ C1 + pd then Mp ≤ M1. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches. We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.
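
Stated compactly in the notation of the abstract (C1, M1 for the sequential depth-first schedule, Cp, Mp for the greedy parallel depth-first schedule on p processors sharing one cache, and d the depth of the computation DAG), the main bound is:

```latex
\[
  C_p \;\ge\; C_1 + p\,d \quad\Longrightarrow\quad M_p \;\le\; M_1 .
\]
```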

Patent
24 Jan 2004
TL;DR: In this article, a method of caching commands in microprocessors having a plurality of arithmetic units and in modules having a two- or multidimensional cell arrangement is provided, which includes combining the plurality of cells and arithmetic units to form groups, assigning a cache unit to a group, and connecting the cache unit with a higher level unit via a tree structure.
Abstract: A method of caching commands in microprocessors having a plurality of arithmetic units and in modules having a two- or multidimensional cell arrangement is provided. The method includes combining a plurality of cells and arithmetic units to form a plurality of groups, assigning a cache unit to a group, and connecting the cache unit to a higher level unit via a tree structure. The cache unit may send requests for required commands to the higher level cache unit, which may return a command sequence including the required command, if the higher level cache unit holds the first command sequence including the required command in the higher level cache unit's local memory.