Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path to catch conflict misses, and prefetched data is placed in stream buffers rather than in the cache itself.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
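The victim-caching mechanism described above is easy to model. Below is a minimal sketch in Python: a direct-mapped L1 backed by a tiny fully-associative LRU victim cache that holds evicted lines. The class names, sizes, and single-word line granularity are illustrative assumptions, not details from the paper.

```python
from collections import OrderedDict

class DirectMappedCache:
    """Direct-mapped cache of `n_sets` one-line sets (line granularity only)."""
    def __init__(self, n_sets):
        self.n_sets = n_sets
        self.lines = [None] * n_sets          # one tag per set, or None

    def lookup(self, addr):
        idx, tag = addr % self.n_sets, addr // self.n_sets
        return self.lines[idx] == tag

    def fill(self, addr):
        """Install `addr`; return the evicted line's address, if any."""
        idx, tag = addr % self.n_sets, addr // self.n_sets
        old = self.lines[idx]
        self.lines[idx] = tag
        return None if old is None else old * self.n_sets + idx

class VictimCache:
    """Tiny fully-associative LRU cache holding lines evicted from L1."""
    def __init__(self, n_entries=4):
        self.n_entries = n_entries
        self.entries = OrderedDict()          # addr -> True, in LRU order

    def swap_in(self, addr):
        """Return True (and remove the entry) if `addr` hits here."""
        if addr in self.entries:
            del self.entries[addr]
            return True
        return False

    def insert(self, addr):
        self.entries[addr] = True
        if len(self.entries) > self.n_entries:
            self.entries.popitem(last=False)  # drop the LRU entry

def access(l1, victim, addr, stats):
    if l1.lookup(addr):
        stats["hits"] += 1
    elif victim.swap_in(addr):                # one-cycle penalty in the paper
        stats["victim_hits"] += 1
        evicted = l1.fill(addr)
        if evicted is not None:
            victim.insert(evicted)            # swap: the L1 victim goes back
    else:
        stats["misses"] += 1
        evicted = l1.fill(addr)
        if evicted is not None:
            victim.insert(evicted)

stats = {"hits": 0, "victim_hits": 0, "misses": 0}
l1, vc = DirectMappedCache(64), VictimCache(4)
# Two addresses that conflict in a direct-mapped cache but coexist via the victim cache:
for addr in [0, 64, 0, 64, 0, 64]:
    access(l1, vc, addr, stats)
print(stats)   # {'hits': 0, 'victim_hits': 4, 'misses': 2}
```

In this conflict pattern a direct-mapped cache alone would miss on every access; the small victim cache converts all but the two cold misses into short-penalty victim hits, which illustrates the behavior the paper exploits.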


Citations
Proceedings ArticleDOI
H.C. Young, E.J. Shekita, S. Ong, L. Hu, W.W. Hsu
31 May 1995
TL;DR: An implementation-independent instruction-initiated prefetch mechanism for I-cache and an automatic prefetch mechanism for D-cache to hide the memory latency associated with cache misses are described.
Abstract: Cache misses are becoming relatively more expensive in modern processors. This is largely due to the fact that processor clock rates are increasing faster than the latency of main memory is improving. Prefetch has been used to hide memory latency. There are at least two kinds of prefetches - automatic prefetch and instruction-initiated prefetch. This paper describes an implementation-independent instruction-initiated prefetch mechanism for I-cache and an automatic prefetch mechanism for D-cache to hide the memory latency associated with cache misses. Simulation results taken from execution traces of 5 commercial relational database management systems are used to illustrate the potential benefit of the proposed mechanisms.
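As a companion to the abstract's distinction between automatic and instruction-initiated prefetch, here is a minimal sketch of the automatic flavor: a sequential next-line prefetcher with a one-entry prefetch buffer. The single-entry buffer and the promotion policy are simplifying assumptions for illustration, not the mechanism proposed in the paper.

```python
class SequentialPrefetcher:
    """Toy model of 'automatic' next-line prefetch: on a demand miss for
    line L, also fetch line L+1 into a small prefetch buffer, so a later
    sequential access finds it without a full memory round trip."""
    def __init__(self):
        self.cache = set()        # lines resident in the cache
        self.buffer = None        # one prefetched line (a simplification)

    def access(self, line):
        if line in self.cache:
            return "hit"
        if line == self.buffer:   # prefetch buffer hit: promote into cache
            self.cache.add(line)
            self.buffer = line + 1
            return "buffer-hit"
        self.cache.add(line)      # demand miss: fetch the line...
        self.buffer = line + 1    # ...and prefetch its successor
        return "miss"

p = SequentialPrefetcher()
print([p.access(l) for l in [10, 11, 12, 13]])
# ['miss', 'buffer-hit', 'buffer-hit', 'buffer-hit'] -- one miss starts the stream
```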

12 citations

Proceedings ArticleDOI
19 Mar 2006
TL;DR: MESA is a novel cache indexing scheme that integrates dynamic page coloring with static skewed associativity to reduce conflicts in L2/L3 caches with a small degree of associativity and can provide as much as 76% improvement in IPC.
Abstract: The paper proposes MESA (Multicoloring with Embedded Skewed Associativity), a novel cache indexing scheme that integrates dynamic page coloring with static skewed associativity to reduce conflicts in L2/L3 caches with a small degree of associativity. MESA associates multiple cache pages (colors) with each virtual memory page and uses two-level skewed associativity, first to map a page to a different color in each bank of the cache, and then to disperse the lines of a page across the banks and within the colors of the page. MESA is a multi-grained cache indexing scheme that combines the best of two worlds, page coloring and skewed associativity. We also propose a novel cache management scheme based on page remapping, which uses cache miss imbalance between colors in each bank as the metric to track conflicts and trigger remapping. We evaluate MESA using 24 benchmarks from multiple application domains and with various degrees of sensitivity to conflict misses, on both an in-order issue processor (using complete system simulation) and an out-of-order issue processor (using SimpleScalar). MESA outperforms skewed associativity, prime modulo hashing, and dynamic page coloring schemes proposed earlier. Compared to a 4-way associative cache, MESA can provide as much as 76% improvement in IPC.
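MESA's actual index functions are not spelled out in the abstract, but the skewed-associativity ingredient can be illustrated with a toy hash: each bank mixes the tag bits into the set index differently, so lines that collide under conventional indexing are dispersed across sets in the other banks. The rotation-based mixing and the sizes below are assumptions for illustration only.

```python
def skewed_indices(line_addr, n_banks=4, n_sets=256):
    """Illustrative skewed-associativity hash (not MESA's actual functions):
    each bank XORs a differently-rotated copy of the tag bits into the set
    index, so two lines that conflict in one bank usually map to different
    sets in the others."""
    set_bits = line_addr % n_sets
    tag_bits = line_addr // n_sets
    out = []
    for bank in range(n_banks):
        # rotate the low 8 tag bits by a bank-specific amount, then XOR in
        rot = ((tag_bits << bank) | (tag_bits >> (8 - bank))) & (n_sets - 1)
        out.append(set_bits ^ rot)
    return out

a, b = 0x1200, 0x5200          # identical set bits, different tags
print(skewed_indices(a))       # [18, 36, 72, 144]
print(skewed_indices(b))       # [82, 164, 73, 146]
# A conventional cache would place both lines in set 0 and conflict;
# the skewed hashes disperse them to different sets in every bank.
```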

12 citations

01 Jan 1998
TL;DR: The focus of this work is on improving the cache performance for decision support system workloads where data fits mostly or completely in main memory and the first public domain cache-oriented query optimizer is proposed.

12 citations

Journal ArticleDOI
TL;DR: This article presents the design of a simple hardware-controlled, high performance cache system that offers high performance with low power consumption and low hardware cost, and a simple dynamic fetching mechanism with different fetch sizes.
Abstract: This article presents the design of a simple hardware-controlled, high performance cache system. The design supports fast access time, optimal utilization of temporal and spatial localities adaptive to given applications, and a simple dynamic fetching mechanism with different fetch sizes. Support for dynamically varying the fetch size makes the cache equally effective for general-purpose as well as multimedia applications. Our cache organization and operational mechanism are especially designed to maximize temporal locality and spatial locality, selectively and adaptively. Simulation shows that the average memory access time of the proposed cache is equal to that of a conventional direct-mapped cache with eight times as much space. In addition, the simulations show that our cache achieves better performance than a 2-way or 4-way set associative cache with twice as much space. The average miss ratio, compared with the victim cache with 32-byte block size, is improved by about 41% or 60% for general applications and multimedia applications, respectively. It is also shown that power consumption of the proposed cache is around 10% to 60% lower than other cache systems that we examine. Our cache system thus offers high performance with low power consumption and low hardware cost.
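The abstract's dynamically varying fetch size can be sketched as a simple feedback loop: a saturating counter rewards fetch widths whose extra sub-blocks actually get used. The counter width, thresholds, and candidate sizes below are assumptions for illustration, not the article's design.

```python
class AdaptiveFetch:
    """Toy model of fetch-size adaptation: a saturating counter tracks how
    often a neighbor brought in by a wide fetch is touched soon afterwards;
    high spatial locality widens the next fetch, low locality narrows it."""
    SIZES = (16, 32, 64, 128)          # candidate fetch sizes in bytes

    def __init__(self):
        self.counter = 3               # 3-bit saturating counter, starts mid-range
        self.recent = set()            # extra lines brought in by the last fetch

    def fetch_size(self):
        return self.SIZES[self.counter // 2]     # map counter 0..7 to a size

    def on_miss(self, line):
        if line in self.recent:        # an over-fetched neighbor was used:
            self.counter = min(7, self.counter + 1)   # reward wide fetches
        else:
            self.counter = max(0, self.counter - 1)   # otherwise narrow them
        size = self.fetch_size()
        n_lines = size // 16           # assuming 16-byte sub-blocks
        self.recent = {line + i for i in range(1, n_lines)}
        return size

af = AdaptiveFetch()
# A streaming pattern keeps hitting over-fetched neighbors, so fetches widen:
print([af.on_miss(l) for l in range(0, 8)])
# [32, 32, 64, 64, 128, 128, 128, 128]
```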

12 citations

Proceedings ArticleDOI
27 Oct 2010
TL;DR: An Adaptive Insertion and Re-reference Prediction (AI-RRP) policy is proposed that evicts data based on both the re-reference prediction value and access recency information, making the replacement policy more adaptive across different workloads and different phases of execution.
Abstract: Previous research shows that the LRU replacement policy is not efficient when applications exhibit a distant re-reference interval. The recently proposed RRIP policy improves performance for such workloads. However, RRIP lacks access recency information, which can keep the replacement policy from making accurate predictions. Consequently, RRIP is not robust for recency-friendly workloads. This paper proposes an Adaptive Insertion and Re-reference Prediction (AI-RRP) policy which evicts data based on both the re-reference prediction value and access recency information. To make the replacement policy more adaptive across different workloads and different phases during execution, Dynamic AI-RRP (DAI-RRP) is proposed, which adjusts the insertion position and prediction value for different access patterns. Simulation results show DAI-RRP reduces CPI over LRU and Dynamic RRIP by an average of 8.3% and 4.1% respectively on a single-core processor with a 1MB 16-way set-associative last-level cache (LLC). Evaluations on a quad-core CMP with a 4MB shared LLC show that DAI-RRP outperforms LRU and Dynamic RRIP (DRRIP) on the weighted speedup metric by an average of 13.2% and 26.7% respectively. Furthermore, compared to LRU, DAI-RRP requires similar hardware, or even less hardware for high-associativity caches.
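For context, here is a minimal sketch of the baseline static RRIP (SRRIP) mechanism that DAI-RRP extends: each block carries a small re-reference prediction value (RRPV), a hit predicts near-immediate reuse, and the victim is a block predicted for distant reuse. DAI-RRP's recency-aware insertion and dynamic adaptation are omitted; the set size and the 2-bit RRPV below are illustrative.

```python
M_BITS = 2
MAX_RRPV = (1 << M_BITS) - 1           # 3: predicted "distant" re-reference

class SRRIPSet:
    """One cache set under baseline static RRIP (SRRIP). Victims are blocks
    whose RRPV has aged to the maximum ("distant re-reference") value."""
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = {}               # tag -> RRPV

    def access(self, tag):
        if tag in self.blocks:
            self.blocks[tag] = 0       # hit: predict near-immediate reuse
            return "hit"
        if len(self.blocks) >= self.ways:
            # age all blocks until one reaches MAX_RRPV, then evict it
            while not any(v == MAX_RRPV for v in self.blocks.values()):
                for t in self.blocks:
                    self.blocks[t] += 1
            victim = next(t for t, v in self.blocks.items() if v == MAX_RRPV)
            del self.blocks[victim]
        self.blocks[tag] = MAX_RRPV - 1   # insert with "long" (not distant) RRPV
        return "miss"

s = SRRIPSet(ways=2)
for t in ["a", "b", "a", "c", "a"]:   # "a" is reused, so RRIP keeps it resident
    print(t, s.access(t), dict(s.blocks))
```

The trace shows the key property the paper builds on: the reused block "a" survives the insertion of "c" because its RRPV was reset on the hit, while the never-reused "b" ages out first.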

12 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This paper discusses cache design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
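The inclusion property itself is simple to state and check: every line resident in the first-level cache must also be resident in the second level, so coherence traffic can be filtered at L2 alone. A toy checker over sets of line addresses follows; real designs must also maintain inclusion across differing block sizes, which is what the paper analyzes.

```python
def check_inclusion(l1_lines, l2_lines):
    """Return the set of L1-resident lines missing from L2.
    An empty result means the multilevel inclusion property holds."""
    return l1_lines - l2_lines

l1 = {0x100, 0x140, 0x180}
l2 = {0x100, 0x140, 0x180, 0x1c0, 0x200}
print(check_inclusion(l1, l2))   # set() -> inclusion holds

# Evicting a line from L2 without back-invalidating L1 breaks inclusion:
l2.discard(0x140)
print(check_inclusion(l1, l2))   # {0x140} -> L2 eviction must invalidate the L1 copy
```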

236 citations