
Showing papers on "Cache invalidation published in 2004"


Proceedings ArticleDOI
29 Sep 2004
TL;DR: It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
Abstract: This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Second, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.

544 citations
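
As a rough illustration of what an execution-time fairness metric of this kind measures, the sketch below compares each co-scheduled thread's slowdown against its time running alone; the specific function is illustrative and not necessarily one of the five metrics proposed in the paper.

```python
# Illustrative sketch of an execution-time fairness measure for co-scheduled
# threads: each thread's slowdown is its shared-cache execution time divided
# by its time when running alone; perfect fairness means equal slowdowns.
# The concrete metrics in the paper may differ -- this is only an example.

def slowdowns(shared_times, alone_times):
    """Per-thread slowdown under cache sharing (>= 1.0 in the common case)."""
    return [ts / ta for ts, ta in zip(shared_times, alone_times)]

def fairness_gap(shared_times, alone_times):
    """0.0 means perfectly uniform slowdowns; larger values mean less fair."""
    s = slowdowns(shared_times, alone_times)
    return max(s) - min(s)

if __name__ == "__main__":
    # Hypothetical numbers: thread A is barely slowed down, thread B is badly hurt.
    print(fairness_gap(shared_times=[105.0, 180.0], alone_times=[100.0, 100.0]))  # about 0.75
```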


Journal ArticleDOI
TL;DR: The results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory and can improve the total IPC significantly over the standard least recently used (LRU) replacement policy.
Abstract: This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.

402 citations
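
The sketch below illustrates one common way to turn per-thread miss counts into a cache partition: greedily hand out ways to whichever thread gains the most from the next one. The miss-rate curves and the greedy rule are assumptions for illustration; the paper's run-time scheme may allocate differently.

```python
# Sketch of greedy way-partitioning for a shared set-associative cache:
# give each next way to the thread whose miss count would drop the most.
# In a real scheme the miss-rate curves come from run-time counters;
# here they are hypothetical numbers.

def partition_ways(miss_curves, total_ways):
    """miss_curves[t][w] = estimated misses for thread t with w ways (w = 0..total_ways)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Marginal benefit of one more way for each thread.
        gains = [miss_curves[t][alloc[t]] - miss_curves[t][alloc[t] + 1]
                 for t in range(len(miss_curves))]
        winner = max(range(len(gains)), key=lambda t: gains[t])
        alloc[winner] += 1
    return alloc

if __name__ == "__main__":
    curves = [
        [900, 500, 300, 250, 240, 235, 233, 232, 232],  # thread 0: cache-hungry
        [400, 380, 370, 365, 363, 362, 362, 362, 362],  # thread 1: small working set
    ]
    print(partition_ways(curves, total_ways=8))  # [5, 3]
```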


Journal ArticleDOI
02 Mar 2004
TL;DR: An adaptive policy that dynamically adapts to the costs and benefits of cache compression is developed and it is shown that compression can improve performance for memory-intensive commercial workloads by up to 17%.
Abstract: Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks, due to unnecessary decompression overhead, degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.

304 citations
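
The sketch below shows the kind of global saturating-counter update the abstract describes: an L2 reference whose LRU stack depth lies beyond the uncompressed capacity (but within the compressed capacity) argues for compression, while a shallow hit argues against it. The counter width, the threshold, and the omission of the per-line compressed-size check are assumptions.

```python
# Sketch of the adaptive-compression decision: on each L2 reference, compare
# the referenced block's LRU stack depth against the space available for
# uncompressed versus compressed lines, and nudge a single global saturating
# counter accordingly.  Counter range and penalties are assumed values.

COUNTER_MAX = 1 << 16          # saturating counter range (assumed)
UNCOMPRESSED_SLOTS = 4         # lines per set if nothing is compressed
COMPRESSED_SLOTS = 8           # lines per set if everything is compressed

class CompressionPredictor:
    def __init__(self):
        self.counter = COUNTER_MAX // 2

    def update(self, lru_depth, decompression_penalty, miss_penalty):
        """lru_depth: 0-based stack depth of the referenced line in its set."""
        if UNCOMPRESSED_SLOTS <= lru_depth < COMPRESSED_SLOTS:
            # Compression turned (or would have turned) this miss into a hit.
            self.counter = min(COUNTER_MAX, self.counter + miss_penalty)
        elif lru_depth < UNCOMPRESSED_SLOTS:
            # Hit would have happened anyway; compression only costs latency.
            self.counter = max(0, self.counter - decompression_penalty)

    def store_compressed(self):
        return self.counter > COUNTER_MAX // 2
```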


Journal ArticleDOI
Nimrod Megiddo1, Dharmendra S. Modha1
TL;DR: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features.
Abstract: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features. Caching, a fundamental metaphor in modern computing, finds wide application in storage systems, databases, Web servers, middleware, processors, file systems, disk drives, redundant array of independent disks controllers, operating systems, and other applications such as data compression and list updating. In a two-level memory hierarchy, a cache performs faster than auxiliary storage, but it is more expensive. Cost concerns thus usually limit cache size to a fraction of the auxiliary memory's size.

261 citations
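
The heart of the adaptive replacement cache is the way it shifts capacity between a recency list (T1) and a frequency list (T2) using hits in their ghost lists (B1, B2). The sketch below gives a common formulation of just that adaptation rule; the full replacement logic is more involved.

```python
# Sketch of the adaptation at the heart of an adaptive replacement cache:
# the cache is split into a recency list T1 and a frequency list T2, with
# ghost lists B1/B2 remembering recently evicted keys.  A hit in B1 says
# "recency deserves more room"; a hit in B2 says "frequency does".  This is
# only the adaptation of the target size p, not the full algorithm.

def adapt_target(p, cache_size, hit_in_b1, hit_in_b2, len_b1, len_b2):
    """Return the new target size p for the recency list T1."""
    if hit_in_b1:
        delta = 1 if len_b1 >= len_b2 else len_b2 // max(1, len_b1)
        return min(cache_size, p + delta)
    if hit_in_b2:
        delta = 1 if len_b2 >= len_b1 else len_b1 // max(1, len_b2)
        return max(0, p - delta)
    return p
```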


Proceedings ArticleDOI
14 Feb 2004
TL;DR: This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy and investigates the effects of four storage cache write policies on disk energy consumption.
Abstract: Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady’s off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.

252 citations
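
The sketch below illustrates the general flavor of a power-aware eviction preference: keep blocks whose disks are idle or spun down, and prefer to evict blocks whose disks are already spinning. It is not the paper's offline greedy algorithm or its online policy, just a hedged illustration of the idea.

```python
# Illustrative sketch of a power-aware eviction preference (not the paper's
# exact algorithms): among LRU-ordered victim candidates, prefer to evict a
# block whose home disk is currently spinning, so blocks from spun-down disks
# stay cached and those disks can remain idle longer.

class Block:
    def __init__(self, key, disk):
        self.key, self.disk = key, disk

def pick_victim(candidates, disk_is_active):
    """candidates: blocks ordered from least to most recently used.
    disk_is_active: maps a block's disk id to True if that disk is spinning."""
    for block in candidates:
        if disk_is_active[block.disk]:
            return block          # evicting it will not wake up an idle disk
    return candidates[0]          # fall back to plain LRU

if __name__ == "__main__":
    blocks = [Block("a", 0), Block("b", 1), Block("c", 0)]
    print(pick_victim(blocks, {0: False, 1: True}).key)  # "b"
```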


01 Jan 2004
TL;DR: This work proposes and evaluates a simple significance-based compression scheme that has a low compression and decompression overhead and provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
Abstract: With the widening gap between processor and memory speeds, memory system designers may find cache compression beneficial to increase cache capacity and reduce off-chip bandwidth. Most hardware compression algorithms fall into the dictionary-based category, which depend on building a dictionary and using its entries to encode repeated data values. Such algorithms are effective in compressing large data blocks and files. Cache lines, however, are typically short (32-256 bytes), and a per-line dictionary places a significant overhead that limits the compressibility and increases the decompression latency of such algorithms. For such short lines, significance-based compression is an appealing alternative. We propose and evaluate a simple significance-based compression scheme that has a low compression and decompression overhead. This scheme, Frequent Pattern Compression (FPC), compresses individual cache lines on a word-by-word basis by storing common word patterns in a compressed format accompanied by an appropriate prefix. For a 64-byte cache line, compression can be completed in three cycles and decompression in five cycles, assuming 12 FO4 gate delays per cycle. We propose a compressed cache design in which data is stored in a compressed form in the L2 caches, but uncompressed in the L1 caches. L2 cache lines are compressed to predetermined sizes that never exceed their original size to reduce decompression overhead. This simple scheme provides comparable compression ratios to more complex schemes that have higher cache hit latencies.

233 citations
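
The sketch below classifies 32-bit words into frequent patterns in the spirit of significance-based compression; the pattern set and the bit budgets are illustrative and may not match FPC's exact prefix encoding.

```python
# Sketch of significance-based, word-by-word compression: each 32-bit word is
# tagged with a small prefix naming a frequent pattern and only the significant
# bits are stored.  The pattern set below is illustrative, not necessarily the
# paper's exact encoding.

def classify_word(w):
    """Return (pattern_name, stored_bits) for a 32-bit word w."""
    w &= 0xFFFFFFFF
    signed = w - (1 << 32) if w & 0x80000000 else w
    if w == 0:
        return ("zero", 0)
    if -8 <= signed < 8:
        return ("4-bit sign-extended", 4)
    if -128 <= signed < 128:
        return ("8-bit sign-extended", 8)
    if -32768 <= signed < 32768:
        return ("16-bit sign-extended", 16)
    if w & 0xFFFF == 0:
        return ("halfword padded with zeros", 16)
    return ("uncompressed", 32)

if __name__ == "__main__":
    line = [0, 0, 5, 0xFFFFFFF0, 0x00120000, 0xDEADBEEF]
    for w in line:
        print(hex(w), classify_word(w))
```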


Journal ArticleDOI
TL;DR: The results show that data and instruction caches require different control strategies for efficient execution; for instruction caches, a technique called cache subbank prediction is proposed, which selectively wakes up only the necessary parts of the instruction cache while allowing most of the cache to stay in a low-leakage drowsy mode.
Abstract: On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.

189 citations
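
A minimal sketch of the "simple" policy family the abstract refers to: periodically put every line into the drowsy state and wake a line, at a small cost, when it is next accessed. The window length and wake-up penalty are assumptions.

```python
# Sketch of a simple drowsy-cache policy: every UPDATE_WINDOW cycles, all
# lines are put into the low-leakage drowsy state; touching a drowsy line
# wakes it at the cost of an extra cycle or two.

UPDATE_WINDOW = 4000   # cycles between mass drowsy transitions (assumed)
WAKEUP_PENALTY = 1     # extra cycles to restore full voltage (assumed)

class DrowsyCacheModel:
    def __init__(self, num_lines):
        self.drowsy = [False] * num_lines
        self.extra_cycles = 0

    def tick(self, cycle):
        if cycle % UPDATE_WINDOW == 0:
            self.drowsy = [True] * len(self.drowsy)   # put everything to sleep

    def access(self, line):
        if self.drowsy[line]:
            self.extra_cycles += WAKEUP_PENALTY       # wake the line up
            self.drowsy[line] = False
```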


Patent
Franklin Davis1, William Beaty
08 Sep 2004
TL;DR: In this paper, a system and method for smart, persistent cache management of received content within a terminal is presented, where received content is tagged with a cache directive allowing the cache control to determine which of the cache storage locations to use for storing the content.
Abstract: A system and method for smart, persistent cache management of received content within a terminal. Received content is tagged with a cache directive, allowing the cache control to determine which of the cache storage locations to use for storage of the content. The cache control detects the number of instances in which received content corresponds to a newer version of purged content and provides the ability to re-classify the cache persistence directive based upon the number of instances.

160 citations


Proceedings ArticleDOI
02 Apr 2004
TL;DR: The results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations, however, a relatively large gap between LRU and optimal replacement policy, of up to 50%, indicates that new research aimed to close the gap is necessary.
Abstract: Replacement policy, one of the key factors determining the effectiveness of a cache, becomes even more important with the latest technological trends toward highly associative caches. State-of-the-art processors employ various policies such as Random, Least Recently Used (LRU), Round-Robin, and PLRU (Pseudo LRU), indicating that there is no common wisdom about the best one. An optimal yet unattainable policy would replace the cache memory block whose next reference is the farthest away in the future, among all memory blocks present in the set. In our quest for a replacement policy as close to optimal as possible, we thoroughly explored the design space of existing replacement mechanisms using the SimpleScalar toolset and the SPEC CPU2000 benchmark suite, across a wide range of cache sizes and organizations. In order to better understand the behavior of different policies, we introduced new measures, such as the cumulative distribution of cache hits in the LRU stack. We also dynamically monitored the number of cache misses per 100000 instructions. Our results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations. However, a relatively large gap between LRU and the optimal replacement policy, of up to 50%, indicates that new research aimed at closing the gap is necessary. The cumulative distribution of cache hits in the LRU stack indicates a very good potential for way prediction using LRU information, since the percentage of hits to the bottom of the LRU stack is relatively high.

158 citations
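
For reference, the sketch below implements tree-based pseudo-LRU for a single 4-way set, one PLRU variant of the kind the paper evaluates: three bits steer victim selection toward the less recently used half of the set.

```python
# Sketch of tree-based pseudo-LRU (PLRU) for one 4-way set: three bits form a
# binary tree; on a hit, the bits on the path are pointed away from the used
# way, and the victim is found by following the bits.  This is one common
# PLRU variant; the paper compares several.

class TreePLRU4:
    """One 4-way set; each tree bit points toward the next victim's half."""
    def __init__(self):
        self.root = 0        # 0: victim on the left (ways 0/1), 1: right (2/3)
        self.left = 0        # 0: victim is way 0, 1: way 1
        self.right = 0       # 0: victim is way 2, 1: way 3

    def touch(self, way):
        # Point every bit on the path away from the way that was just used.
        if way < 2:
            self.root = 1
            self.left = 1 if way == 0 else 0
        else:
            self.root = 0
            self.right = 1 if way == 2 else 0

    def victim(self):
        if self.root == 0:
            return 0 if self.left == 0 else 1
        return 2 if self.right == 0 else 3

if __name__ == "__main__":
    s = TreePLRU4()
    for way in (0, 1, 2, 3):
        s.touch(way)
    print(s.victim())   # 0: the least recently touched way
```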


Journal ArticleDOI
TL;DR: This work investigates multiple approaches to effectively manage second-level buffer caches and reports a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-levels buffer caches, and a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second- level buffer cache hit ratios over corresponding local algorithms.
Abstract: Buffer caches are commonly used in servers to reduce the number of slow disk accesses or network messages. These buffer caches form a multilevel buffer cache hierarchy. In such a hierarchy, second-level buffer caches have different access patterns from first-level buffer caches because accesses to a second-level are actually misses from a first-level. Therefore, commonly used cache management algorithms such as the least recently used (LRU) replacement algorithm that work well for single-level buffer caches may not work well for second-level. We investigate multiple approaches to effectively manage second-level buffer caches. In particular, we report our research results in 1) second-level buffer cache access pattern characterization, 2) a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, 3) a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms, and 4) implementation and evaluation of these algorithms in a real storage system connected with commercial database servers (Microsoft SQL server and Oracle) running industrial-strength online transaction processing benchmarks.

150 citations
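
A condensed sketch of the multi-queue idea for second-level buffer caches is given below: blocks live in several LRU queues indexed by the logarithm of their reference count, and the victim comes from the lowest non-empty queue. The lifetime-based demotion and the history queue of the full MQ algorithm are omitted here.

```python
# Condensed sketch of the multi-queue (MQ) idea: blocks are kept in several
# LRU queues by access frequency, so frequently re-referenced blocks (which
# dominate second-level accesses) are retained longer.  Promotion uses
# log2(reference count); demotion on lifetime expiry and the out-queue of the
# full algorithm are left out of this sketch.

from collections import OrderedDict
from math import log2

NUM_QUEUES = 4

class MQCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queues = [OrderedDict() for _ in range(NUM_QUEUES)]  # key -> ref count
        self.size = 0

    def _queue_for(self, refs):
        return min(int(log2(refs)), NUM_QUEUES - 1)

    def access(self, key):
        for q in self.queues:
            if key in q:
                refs = q.pop(key) + 1
                self.queues[self._queue_for(refs)][key] = refs   # promote, make MRU
                return True
        if self.size == self.capacity:
            self._evict()
        self.queues[0][key] = 1
        self.size += 1
        return False

    def _evict(self):
        for q in self.queues:                       # lowest non-empty queue first
            if q:
                q.popitem(last=False)               # its LRU block is the victim
                self.size -= 1
                return
```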


Journal ArticleDOI
TL;DR: The problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays is addressed.
Abstract: We address the problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays. We determine the optimal proxy prefix cache allocation to the videos that minimizes the aggregate network bandwidth cost. We integrate proxy caching with traditional server-based reactive transmission schemes such as batching, patching and stream merging to develop a set of proxy-assisted delivery schemes. We quantitatively explore the impact of the choice of transmission scheme, cache allocation policy, proxy cache size, and availability of unicast versus multicast capability, on the resulting transmission cost. Our evaluations show that even a relatively small prefix cache (10%-20% of the video repository) is sufficient to realize substantial savings in transmission cost. We find that carefully designed proxy-assisted reactive transmission schemes can produce significant cost savings even in a predominantly unicast environment such as the Internet.

Patent
28 Sep 2004
TL;DR: In this paper, a playback apparatus and associated method is disclosed for use in a reproducing system, the apparatus including a cache memory configured to store data read from a data source, a cache replacement unit (341) configured to identify certain of the data to be deleted from the cache memory (335) based on a determination of data source data's use in at least two play modes of the apparatus.
Abstract: A playback apparatus and associated method is disclosed for use in a reproducing system, the apparatus including a cache memory configured to store data read from a data source (1); a cache replacement unit (341) configured to identify certain of the data to be deleted from the cache memory (335) based on a determination of the data source data's use in at least two play modes of the apparatus; and a presentation unit (337) configured to obtain data from the cache memory (335) to be presented to a user. The playback apparatus further includes a disc control unit (343) configured to identify data to be read from the data source (1) to be stored in the cache memory (335) based on the current contents of the cache memory (335).

Proceedings ArticleDOI
14 Feb 2004
TL;DR: An in-depth analysis of the pathological behavior of cache hashing functions is presented, and two new hashing functions, prime modulo and prime displacement, are proposed that are resistant to pathological behavior and yet are able to eliminate the worst case conflict behavior in the L2 cache.
Abstract: Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst case conflict behavior, no study has really analyzed the pathological behavior of such hashing functions that often result in performance slowdown. We present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions: prime modulo and prime displacement that are resistant to pathological behavior and yet are able to eliminate the worst case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow add operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory intensive applications. For applications that have nonuniform cache accesses, both prime modulo and prime displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using multiple prime displacement hashing functions in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7%.
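
The sketch below contrasts conventional power-of-two indexing with prime-modulo indexing; the paper realizes the modulo with narrow add operations in hardware, whereas plain integer modulo is used here for clarity, and the set count is an arbitrary prime.

```python
# Sketch of prime-modulo cache indexing: instead of taking the set index from
# a power-of-two slice of the block address, index into a prime number of
# sets, which breaks up the strided conflict patterns that defeat conventional
# indexing.

BLOCK_BYTES = 64
NUM_SETS = 509          # a prime just below 512

def conventional_index(addr):
    return (addr // BLOCK_BYTES) % 512

def prime_modulo_index(addr):
    return (addr // BLOCK_BYTES) % NUM_SETS

if __name__ == "__main__":
    # A 32 KiB stride maps every access to the same set conventionally,
    # but spreads them out under prime-modulo indexing.
    addrs = [i * 32768 for i in range(8)]
    print({conventional_index(a) for a in addrs})   # one conflicting set
    print({prime_modulo_index(a) for a in addrs})   # eight distinct sets
```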

Proceedings ArticleDOI
14 Feb 2004
TL;DR: The spatial pattern predictor (SPP) is described, a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group at runtime, and requires only a small amount of predictor memory to store the predicted patterns.
Abstract: Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%; (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation; and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.
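
A sketch of an SPP-style predictor table is shown below: it is indexed by the instruction address combined with the reference's offset within the line and stores a bit vector of the sub-blocks used last time. The table size, group geometry, and index hash are assumptions.

```python
# Sketch of a spatial pattern predictor (SPP) table: indexed by the missing
# instruction's address combined with the reference's offset inside the cache
# line, it stores a bit vector saying which sub-blocks of the spatial group
# were used last time.  Sizes and the hash below are illustrative.

TABLE_ENTRIES = 256
SUBBLOCKS = 8            # sub-blocks per spatial group (assumed)

class SpatialPatternPredictor:
    def __init__(self):
        self.table = [(1 << SUBBLOCKS) - 1] * TABLE_ENTRIES  # default: fetch all

    def _index(self, pc, offset):
        return (pc ^ offset) % TABLE_ENTRIES

    def predict(self, pc, offset):
        """Bit i set => fetch sub-block i of the spatial group."""
        return self.table[self._index(pc, offset)]

    def train(self, pc, offset, used_bitvector):
        """Record which sub-blocks were actually touched while resident."""
        self.table[self._index(pc, offset)] = used_bitvector
```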

Journal ArticleDOI
TL;DR: The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes and proves that Min-SAUD achieves optimal stretch under some standard assumptions.
Abstract: Data caching at mobile clients is an important technique for improving the performance of wireless data dissemination systems. However, variable data sizes, data updates, limited client resources, and frequent client disconnections make cache management a challenge. We propose a gain-based cache replacement policy, Min-SAUD, for wireless data dissemination when cache consistency must be enforced before a cached item is used. Min-SAUD considers several factors that affect cache performance, namely, access probability, update frequency, data size, retrieval delay, and cache validation cost. The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes. We prove that Min-SAUD achieves optimal stretch under some standard assumptions. Moreover, a series of simulation experiments have been conducted to thoroughly evaluate the performance of Min-SAUD under various system configurations. The simulation results show that, in most cases, the Min-SAUD replacement policy substantially outperforms two existing policies, namely, LRU and SAIU.

Patent
22 Nov 2004
TL;DR: In this paper, the authors propose a cache that keeps regularly accessed disk I/O data within RAM that forms part of a computer systems main memory, depending on the size of the access.
Abstract: The cache keeps regularly accessed disk I/O data within RAM that forms part of a computer systems main memory. The cache operates across a network of computers systems, maintaining cache coherency for the disk I/O devices that are shared by the multiple computer systems within that network. Read access for disk I/O data that is contained within the RAM is returned much faster than would occur if the disk I/O device was accessed directly. The data is held in one of three areas of the RAM for the cache, dependent on the size of the I/O access. The total RAM containing the three areas for the cache does not occupy a fixed amount of a computers main memory. The RAM for the cache grows to contain more disk I/O data on demand and shrinks when more of the main memory is required by the computer system for other uses. The user of the cache is allowed to specify which size of I/O access is allocated to the three areas for the RAM, along with a limit for the total amount of main memory that will be used by the cache at any one time.

Proceedings ArticleDOI
16 Feb 2004
TL;DR: This work introduces the two-level cache tuner, or TCaT - a heuristic for searching the huge solution space of possible configurations and shows the integrity of the heuristic across multiple memory configurations and even in the presence of hardware/software partitioning.
Abstract: The power consumed by the memory hierarchy of a microprocessor can contribute to as much as 50% of the total microprocessor system power, and is thus a good candidate for optimizations. We present an automated method for tuning two-level caches to embedded applications for reduced energy consumption. The method is applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We introduce the two-level cache tuner, or TCaT - a heuristic for searching the huge solution space of possible configurations. The heuristic interlaces the exploration of the two cache levels and searches the various cache parameters in a specific order based on their impact on energy. We show the integrity of our heuristic across multiple memory configurations and even in the presence of hardware/software partitioning -- a common optimization capable of achieving significant speedups and/or reduced energy consumption. We apply our exploration heuristic to a large set of embedded applications. Our experiments demonstrate the efficacy of our heuristic: on average the heuristic examines only 7% of the possible cache configurations, but results in cache sub-system energy savings of 53%, only 1% more than the optimal cache configuration. In addition, the configured cache achieves an average speedup of 30% over the base cache configuration due to tuning of cache line size to the application's needs.
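
The sketch below captures the interlaced search idea: tune the parameter with the largest energy impact first, alternating between the two cache levels. The candidate values, the toy energy model, and the exact ordering are assumptions rather than TCaT's published heuristic.

```python
# Sketch of an interlaced two-level cache-tuning search: parameters are
# explored in decreasing order of energy impact (size, then line size, then
# associativity), alternating between the L1 and L2 caches, keeping the best
# value of each parameter.  Candidate values and the energy() callback are
# placeholders.

CANDIDATES = {
    "size":  [2048, 4096, 8192, 16384],
    "line":  [16, 32, 64],
    "assoc": [1, 2, 4],
}

def tune_two_level(energy):
    """energy(l1_cfg, l2_cfg) -> estimated energy for the application."""
    cfg = {"L1": {"size": 2048, "line": 16, "assoc": 1},
           "L2": {"size": 2048, "line": 16, "assoc": 1}}

    def cost(level, param, value):
        trial = {lvl: dict(c) for lvl, c in cfg.items()}
        trial[level][param] = value
        return energy(trial["L1"], trial["L2"])

    for param in ("size", "line", "assoc"):          # biggest energy impact first
        for level in ("L1", "L2"):                   # interlace the two cache levels
            cfg[level][param] = min(CANDIDATES[param],
                                    key=lambda v: cost(level, param, v))
    return cfg

if __name__ == "__main__":
    # Toy energy model favouring a mid-sized L1 and a large L2 (purely illustrative).
    toy = lambda l1, l2: abs(l1["size"] - 8192) + abs(l2["size"] - 16384) + l1["line"] + l2["assoc"]
    print(tune_two_level(toy))
```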

Patent
30 Jun 2004
TL;DR: In this paper, a cache mechanism in a network redirector transparently intercepts requests to access server files, and if the requested file is locally cached, satisfies the request from the cache when possible, and also fills in a sparse cached file as reads for data in ranges that are missing in the cached file are requested and received from the server.
Abstract: An improved method and system for client-side caching that transparently caches suitable network files for offline use. A cache mechanism in a network redirector transparently intercepts requests to access server files, and if the requested file is locally cached, satisfies the request from the cache when possible. Otherwise the cache mechanism creates a local cache file and satisfies the request from the server, and also fills in a sparse cached file as reads for data in ranges that are missing in the cached file are requested and received from the server. A background process also fills in local files that are sparse, using the existing handle of already open server files, or opening, reading from and closing other server files. Security is also provided by maintaining security information received from the server for files that are in the cache, and using that security information to determine access to the file when offline.

Journal ArticleDOI
TL;DR: This work introduces on-chip hardware implementing an efficient cache tuning heuristic that can automatically and dynamically tune the cache to an executing program, completely transparently to the programmer.
Abstract: Memory accesses often account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in prefabricated microprocessor platforms. Tuning those caches to a program is still, however, a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of tunable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. Our heuristic seeks not only to reduce the number of configurations that must be examined, but also traverses the search space in a way that minimizes costly cache flushes. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache saves on average 40% of total memory access energy over a standard nontuned reference cache.

Proceedings ArticleDOI
27 Jun 2004
TL;DR: The perhaps surprising result is given that, for sufficiently parallel computations, the shared cache need only be an additive size larger than the single-processor cache, which provides some theoretical justification for designing machines with shared caches.
Abstract: We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp. We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp ≥ C1 + pd then Mp ≤ M1. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches. We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.

Patent
24 Jan 2004
TL;DR: In this article, a method of caching commands in microprocessors having a plurality of arithmetic units and in modules having a two- or multidimensional cell arrangement is provided, which includes combining the plurality of cells and arithmetic units to form groups, assigning a cache unit to a group, and connecting the cache unit with a higher level unit via a tree structure.
Abstract: A method of caching commands in microprocessors having a plurality of arithmetic units and in modules having a two- or multidimensional cell arrangement is provided. The method includes combining a plurality of cells and arithmetic units to form a plurality of groups, assigning a cache unit to a group, and connecting the cache unit to a higher level unit via a tree structure. The cache unit may send requests for required commands to the higher level cache unit, which may return a command sequence including the required command, if the higher level cache unit holds the first command sequence including the required command in the higher level cache unit's local memory.

Book ChapterDOI
25 Oct 2004
TL;DR: This paper proposes a different cache architecture, intended to ease WCET analysis, where the cache stores complete methods and cache misses occur only on method invocation and return.
Abstract: Cache memories are mandatory to bridge the growing gap between CPU speed and main memory access time. Standard cache organizations improve the average execution time but are difficult to predict for worst case execution time (WCET) analysis. This paper proposes a different cache architecture, intended to ease WCET analysis. The cache stores complete methods and cache misses occur only on method invocation and return. Cache block replacement depends on the call tree, instead of instruction addresses.

Proceedings ArticleDOI
23 Mar 2004
TL;DR: A cache signature scheme is devised for COCA that provides hints for the mobile clients to determine whether a required data item is cached by their neighboring peers based on their local state, and is shown to be capable of effectively reducing the number of server requests and power consumption.
Abstract: Caching is a key technique for improving the data retrieval performance of mobile clients in mobile environments. The emergence of robust and reliable peer-to-peer (P2P) technologies now brings to reality what we call "cooperative caching " in which mobile clients can access data items from the cache in their neighboring peers. We discuss cooperative caching in mobile environments and proposes a cooperative caching scheme for mobile systems, called COCA. In COCA, we identify two types of mobile clients: low activity and high activity. They are referred to as low activity mobile client/host (LAM) and high activity mobile client/host (HAM) respectively. Both LAM and HAM can share their cache. The server replicates appropriate data items to LAMs so that HAMs can take advantages of the LAM replicas. The performance of pure COCA and COCA with the data replication scheme is evaluated through a number of simulated experiments which show that COCA significantly reduces both the server workload in terms of number of requests and the access miss ratio when the MHs reside outside of the service area. The COCA with the data replication scheme can improve the overall system performance in other aspects as well.
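
A minimal sketch of the cooperative lookup order is given below: local cache, then neighboring peers, then the server. The LAM/HAM distinction and the replication scheme are omitted, and the peer interface is a placeholder.

```python
# Sketch of a cooperative-cache lookup of the COCA flavor: a mobile client
# checks its own cache, then asks neighboring peers, and only contacts the
# server as a last resort.  Peer discovery is abstracted behind `peers`.

def lookup(item_id, local_cache, peers, fetch_from_server):
    if item_id in local_cache:                      # 1. local hit
        return local_cache[item_id], "local"
    for peer in peers:                              # 2. peer hit (P2P request)
        value = peer.get(item_id)
        if value is not None:
            local_cache[item_id] = value
            return value, "peer"
    value = fetch_from_server(item_id)              # 3. server request
    local_cache[item_id] = value
    return value, "server"

if __name__ == "__main__":
    peers = [{"x": 1}, {"y": 2}]
    print(lookup("y", {}, peers, lambda i: 42))     # (2, 'peer')
    print(lookup("z", {}, peers, lambda i: 42))     # (42, 'server')
```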

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This work answers the question "Why does a database system incur so many instruction cache misses" and proposes techniques to buffer database operations during query execution to avoid instruction cache thrashing.
Abstract: As more and more query processing work can be done in main memory, memory access is becoming a significant cost component of database operations. Recent database research has shown that most of the memory stalls are due to second-level cache data misses and first-level instruction cache misses. While a lot of research has focused on reducing the data cache misses, relatively little research has been done on improving the instruction cache performance of database systems. We first answer the question "Why does a database system incur so many instruction cache misses?" We demonstrate that current demand-pull pipelined query execution engines suffer from significant instruction cache thrashing between different operators. We propose techniques to buffer database operations during query execution to avoid instruction cache thrashing. We implement a new light-weight "buffer" operator and study various factors which may affect the cache performance. We also introduce a plan refinement algorithm that considers the query plan and decides whether it is beneficial to add additional "buffer" operators and where to put them. The benefit is mainly from better instruction locality and better hardware branch prediction. Our techniques can be easily integrated into current database systems without significant changes. Our experiments in a memory-resident PostgreSQL database system show that buffering techniques can reduce the number of instruction cache misses by up to 80% and improve query performance by up to 15%.
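
The sketch below shows a buffer operator for a demand-pull (iterator-style) engine: it drains a batch of tuples from its child before the parent consumes any, so each operator's code stays resident in the instruction cache for a whole batch. The batch size is an assumption, and the real operator and plan-refinement logic are more involved.

```python
# Sketch of a "buffer" operator for a demand-pull query engine: the buffer sits
# between a parent and a child operator and drains a whole batch of tuples from
# the child before the parent consumes any of them, which keeps the child's
# code hot in the instruction cache for the entire batch.

BATCH_SIZE = 512    # assumed batch size

class BufferOperator:
    def __init__(self, child_iter):
        self.child = child_iter
        self.batch = []

    def __iter__(self):
        return self

    def __next__(self):
        if not self.batch:
            # Run the child operator in a tight loop: good instruction locality.
            for tup in self.child:
                self.batch.append(tup)
                if len(self.batch) == BATCH_SIZE:
                    break
            if not self.batch:
                raise StopIteration
            self.batch.reverse()          # so pop() returns tuples in order
        return self.batch.pop()

if __name__ == "__main__":
    scan = iter(range(1000))              # stand-in for a child scan operator
    print(sum(BufferOperator(scan)))      # 499500: same result, batched pulls
```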

Patent
29 Sep 2004
TL;DR: In this paper, a system and method for providing dynamic mobile cache for mobile computing devices is described, where a cache is created at a server at the time a communication session between the server and a client is initiated.
Abstract: A system and method are described for providing a dynamic mobile cache for mobile computing devices. In one embodiment, a cache is created at a server at the time a communication session between the server and a client is initiated. The server then determines whether the client requires the cache. If it is determined that the client requires the cache, the server provides the cache to the client.

Proceedings ArticleDOI
16 Feb 2004
TL;DR: This work introduces on-chip hardware implementing an efficient cache tuning heuristic that can automatically and dynamically tune the cache to an executing program, completely transparently to the programmer.
Abstract: Memory accesses can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in pre-fabricated microprocessor platforms. Tuning those caches to a program is still, however, a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is power and performance costly. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.

Patent
Sailesh Kottapalli1
29 Dec 2004
TL;DR: In this paper, an apparatus and method for fairly accessing a shared cache with multiple resources, such as multiple cores, multiple threads, or both, are described, in which each resource sharing the cache is assigned a static portion of the cache and a dynamic portion.
Abstract: An apparatus and method for fairly accessing a shared cache with multiple resources, such as multiple cores, multiple threads, or both are herein described. A resource within a microprocessor sharing access to a cache is assigned a static portion of the cache and a dynamic portion. The resource is blocked from victimizing static portions assigned to other resources, yet, allowed to victimize the static portion assigned to the resource and the dynamically shared portion. If the resource does not access the cache enough times over a period of time, the static portion assigned to the resource is reassigned to the dynamically shared portion.
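
The sketch below illustrates the victim-selection restriction the abstract describes: a core may victimize only its own static ways plus the dynamically shared ways, and an inactive core's static ways are folded back into the shared pool. Way counts and the inactivity threshold are assumptions.

```python
# Sketch of static/dynamic sharing of a set-associative cache: each core owns
# some "static" ways of a set and shares the remaining "dynamic" ways; on a
# miss it may only victimize its own static ways or the shared ways.  If a
# core barely uses the cache, its static ways rejoin the shared pool.

NUM_WAYS = 8
STATIC_WAYS = {0: {0, 1}, 1: {2, 3}}            # per-core static ways (assumed)
DYNAMIC_WAYS = set(range(NUM_WAYS)) - {w for ws in STATIC_WAYS.values() for w in ws}

def eligible_victim_ways(core_id):
    """Ways this core is allowed to victimize on a miss."""
    return STATIC_WAYS.get(core_id, set()) | DYNAMIC_WAYS

def reassign_if_idle(core_id, accesses_in_window, threshold=100):
    """If a core barely used the cache, fold its static ways into the shared pool."""
    if accesses_in_window < threshold:
        DYNAMIC_WAYS.update(STATIC_WAYS.pop(core_id, set()))

if __name__ == "__main__":
    print(sorted(eligible_victim_ways(0)))      # [0, 1, 4, 5, 6, 7]
    reassign_if_idle(1, accesses_in_window=3)
    print(sorted(eligible_victim_ways(0)))      # [0, 1, 2, 3, 4, 5, 6, 7]
```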

Patent
18 Jun 2004
TL;DR: In this paper, a server provides Web responses that can include content from data tables in a database, and assigns a database cache dependency to at least a portion of a constructed Web response based on commands executed during construction of the Web response.
Abstract: A server provides Web responses that can include content from data tables in a database. The server maintains a cache (e.g., in system memory) that can store content (including content from data tables) so as to increase the efficiency of subsequently providing the same content to satisfy client Web requests. The server monitors data tables for changes and, when a change in a particular data table occurs, invalidates cached entries that depend on a particular data table. Further, in response to a client Web request for a Web response, the server assigns a database cache dependency to at least a portion of a constructed Web response (e.g., to content retrieved from a data table) based on commands executed during construction of the Web response. The at least a portion of the constructed Web response is subsequently cached in a cache location at the server.
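
The sketch below shows the table-dependency bookkeeping the abstract describes: cached response fragments record the data tables they were built from, and a table-change notification invalidates every dependent entry. The change-monitoring mechanism itself is abstracted away.

```python
# Sketch of table-based cache dependencies: each cached response fragment
# records the data tables it was built from; when a monitored table changes,
# every dependent entry is invalidated.

from collections import defaultdict

class DependencyCache:
    def __init__(self):
        self.entries = {}                          # key -> cached content
        self.by_table = defaultdict(set)           # table name -> dependent keys

    def put(self, key, content, depends_on_tables):
        self.entries[key] = content
        for table in depends_on_tables:
            self.by_table[table].add(key)

    def get(self, key):
        return self.entries.get(key)               # None means miss (or invalidated)

    def table_changed(self, table):
        for key in self.by_table.pop(table, set()):
            self.entries.pop(key, None)            # invalidate dependent responses

if __name__ == "__main__":
    cache = DependencyCache()
    cache.put("/products", "<html>...</html>", depends_on_tables=["Products"])
    cache.table_changed("Products")
    print(cache.get("/products"))                  # None: entry was invalidated
```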

Patent
16 Jun 2004
TL;DR: An ontology-based ad hoc service discovery system includes a local service cache, a cache manager, a service description unit, a query processor, a semantic inference unit and a node daemon as mentioned in this paper.
Abstract: An ontology-based ad hoc service discovery system includes a local service cache, a cache manager, a service description unit, a query processor, a service semantic inference unit and a node daemon. The local service cache restores a service ontology by collecting class information of all services advertised on an ad hoc network and stores the service ontology. The cache manager manages the local service cache and performs various preset operations on the cache. The service description unit stores a description of a corresponding service for use in initializing the local service cache. The query processor starts performing a semantic based service query protocol by receiving a service query from a user or an application program. The service semantic inference unit inspects whether the service query transmitted from a client is coincident with the content of the service. The node daemon performs a service cache synchronization protocol with neighboring nodes.

Journal ArticleDOI
TL;DR: A zero-aware SRAM cell with an asymmetric inverter pair, called the ZA cell, is proposed to minimize cache power consumption when writing "0"; the ZA cell is especially attractive in data caches, which exhibit a high write-zero rate.
Abstract: Most microprocessors employ on-chip caches to bridge the performance gap between the processor and the main memory. However, cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the values written to the cache are "0", in this paper we propose a zero-aware SRAM cell with an asymmetric inverter pair, called the ZA cell, to minimize the cache power consumption in writing "0". The ZA cell uses a circuit-level technique, which is software independent and orthogonal to other low-power techniques at the architecture level. Compared to the conventional SRAM cell, the experimental results based on the SPEC2000 and MediaBench traces show that, without compromising either performance or stability, the ZA cell can reduce the average cache write power consumption by over 60% for both the baseline instruction and data caches. In particular, the ZA cell is attractive in data caches, which exhibit a high write-zero rate.