Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: miss and victim caching, which place a small fully-associative cache between a cache and its refill path, and stream buffers, which hold prefetched lines outside the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
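
To make the victim-caching mechanism concrete, here is a minimal trace-driven sketch in Python. It is an illustration under assumed parameters (a 64-set direct-mapped L1 with 32-byte lines and a 4-entry victim cache), not the paper's hardware design: an L1 miss that hits the small fully-associative victim cache swaps the two lines, while a full miss pushes the evicted L1 line, the victim, into that buffer.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Toy model: a direct-mapped L1 backed by a small fully-associative
    victim cache. An L1 miss that hits in the victim cache swaps the two
    lines (the one-cycle case in the paper); a full miss fetches from the
    next level and pushes the evicted L1 line, the victim, into the buffer."""

    def __init__(self, num_sets=64, line_size=32, victim_entries=4):
        self.num_sets, self.line_size = num_sets, line_size
        self.victim_entries = victim_entries
        self.l1 = [None] * num_sets          # one line (tag) per set
        self.victim = OrderedDict()          # line -> True, kept in LRU order
        self.hits = self.victim_hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        idx = line % self.num_sets
        if self.l1[idx] == line:
            self.hits += 1
            return "hit"
        evicted = self.l1[idx]
        self.l1[idx] = line
        if line in self.victim:              # conflict miss caught by the victim cache
            self.victim_hits += 1
            del self.victim[line]
            result = "victim-hit"
        else:
            self.misses += 1                 # fetched from the next memory level
            result = "miss"
        if evicted is not None:
            self.victim[evicted] = True      # the displaced line becomes the new victim
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # drop the LRU victim entry
        return result

# Two lines that map to the same set ping-pong; the victim cache absorbs the
# conflict misses a plain direct-mapped cache would keep taking.
cache = DirectMappedWithVictim()
for _ in range(100):
    for a in (0x0000, 0x8000):               # same set index, different tags
        cache.access(a)
print(cache.hits, cache.victim_hits, cache.misses)
```

The conflict loop at the end makes two lines that share a set ping-pong; after warm-up every reference is either an L1 hit or a victim-cache hit, which is the conflict-miss removal the paper measures for victim caches of 1 to 5 entries.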


Citations
Proceedings ArticleDOI
C.J. Choi, Gi-Ho Park, Ji Hyun Lee, Woo-Chan Park, Tack-Don Han
14 May 2000
TL;DR: Performance evaluation is carried out through trace-driven simulation using DineroIII, and the results reveal that a victim cache is also cost-effective for texture mapping.
Abstract: Texture mapping is commonly used to make images realistic in most current graphics systems. Texture mapping, however, requires high memory bandwidth and low memory latency to obtain good performance. Recently, a few studies have used a cache memory system for texture mapping in order to overcome these problems, and they show that a cache is useful for texture mapping. The miss distribution of the texture cache is analyzed, and we find that quite a few conflict misses occur periodically. Considering this fact, cache organizations known to be effective at reducing conflict misses, such as the victim cache, half-and-half cache, and cooperative cache, are evaluated and compared. Performance evaluation is carried out through trace-driven simulation using DineroIII, and the results reveal that the victim cache is also cost-effective for texture mapping.
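
As a rough intuition for why texture-cache conflict misses cluster periodically, the hypothetical snippet below computes direct-mapped set indices for a walk down one texture column; whenever the row pitch is a multiple of the cache's set count times line size, vertically adjacent texels collide in the same set. The geometry and pitch are illustrative assumptions, not values from the paper.

```python
# Illustrative only: direct-mapped set indices for a walk down one texture
# column. The cache geometry and row pitch are assumed values, not the paper's.
num_sets, line_size = 128, 32
row_pitch = 4096                                  # bytes per texture row (assumed)
column = [y * row_pitch for y in range(8)]        # vertically adjacent texels
print([(addr // line_size) % num_sets for addr in column])
# -> [0, 0, 0, 0, 0, 0, 0, 0]: every texel in the column maps to the same set,
#    so a plain direct-mapped texture cache thrashes on 2D access patterns while
#    a small victim cache can absorb the repeated conflicts.
```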

11 citations

Journal ArticleDOI
17 Sep 2005
TL;DR: It is shown that cache memories for embedded applications can be designed to increase performance while reducing area and energy consumption, and that such a split data cache can also benefit embedded applications.
Abstract: In this paper we show that cache memories for embedded applications can be designed to increase performance while reducing area and energy consumption. Previously we have shown that separating the data cache into an array cache and a scalar cache can lead to significant performance improvements for scientific benchmarks. In this paper we show that such a split data cache can also benefit embedded applications. To further improve the split cache organization, we augment the scalar cache with a small victim cache and the array cache with a small stream buffer. This "integrated" cache organization can lead to a 43% reduction in overall cache size, a 37% reduction in access time, and a 63% reduction in power consumption when compared to a unified 2-way set-associative data cache for media benchmarks from the MiBench suite.
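
One way to picture the "integrated" organization is a front end that steers each reference either to a scalar cache or to an array cache fed by a sequential stream buffer. The sketch below is an assumption-laden Python model (FIFO caches, a single-line stream buffer, an explicit is_array flag standing in for however the real design classifies references, and the victim cache omitted for brevity), meant only to show the steering and the prefetch restart, not the authors' implementation.

```python
LINE = 32

class TinyCache:
    """FIFO-replacement cache of a fixed number of lines (sketch only)."""
    def __init__(self, capacity_lines):
        self.capacity, self.store = capacity_lines, {}
        self.hits = self.misses = 0
    def access(self, addr):
        ln = addr // LINE
        if ln in self.store:
            self.hits += 1
            return True
        self.misses += 1
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))   # FIFO: drop the oldest line
        self.store[ln] = True
        return False

scalar_cache, array_cache = TinyCache(32), TinyCache(64)
stream_next = None            # line number the stream buffer currently holds
stream_hits = memory_fetches = 0

def access(addr, is_array):
    global stream_next, stream_hits, memory_fetches
    if not is_array:
        if not scalar_cache.access(addr):
            memory_fetches += 1
        return
    ln = addr // LINE
    if not array_cache.access(addr):
        if ln == stream_next:
            stream_hits += 1              # miss serviced by the prefetched line
        else:
            memory_fetches += 1           # miss goes to the next memory level
        stream_next = ln + 1              # (re)start sequential prefetching

# Sequential array sweep interleaved with a recurring scalar reference.
for i in range(0, 16 * 1024, 4):
    access(0x100000 + i, is_array=True)
    access(0x1000, is_array=False)
print("stream-buffer hits:", stream_hits, "memory fetches:", memory_fetches)
```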

11 citations

Proceedings ArticleDOI
11 Oct 2004
TL;DR: The simulation results show that the proposed architecture can reduce per-access power consumption by 59% over conventional set-associative caches with a negligible average performance loss of 0.06%.
Abstract: This paper proposes a power-aware cache block allocation algorithm for a way-selective set-associative cache on embedded systems to reduce energy consumption without additional delay or performance degradation. To this end, way-selection logic and a specialized replacement policy are designed so that only one way of the set-associative cache is enabled, as in a direct-mapped cache. Overall cache access time remains almost the same as that of a conventional set-associative cache, even with the additional way-selection logic. Because the data array can be accessed without waiting for tag comparison, the multiplexer delay can be removed entirely. The simulation results show that the proposed architecture can reduce per-access power consumption by 59% over conventional set-associative caches with a negligible average performance loss of 0.06%.
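
The way-selection idea can be sketched as follows; the selection rule (low tag bits pick the way) and the sizes are assumptions made for illustration, since the paper's actual selection logic and replacement policy are not reproduced here. The point is that each access enables a single way, as a direct-mapped access would, while conflicting blocks can still coexist in different ways of the same set.

```python
# Toy model of a way-selective set-associative cache (assumed selection rule
# and sizes). Each access powers up only the selected way's tag and data
# arrays; replacement is confined to that same way.
LINE, SETS, WAYS = 32, 64, 4
tags = [[None] * WAYS for _ in range(SETS)]
hits = misses = 0

def access(addr):
    global hits, misses
    ln = addr // LINE
    s, tag = ln % SETS, ln // SETS
    way = tag % WAYS              # way-selection logic: chosen before any tag compare
    if tags[s][way] == tag:       # only this one way is enabled and compared
        hits += 1
    else:
        misses += 1
        tags[s][way] = tag        # replacement is confined to the selected way

# Three blocks that share set 0 land in different ways, so the second pass hits.
for a in (0x0000, 0x0800, 0x1000, 0x0000, 0x0800, 0x1000):
    access(a)
print(hits, misses)
```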

11 citations

Journal ArticleDOI
01 Oct 2012
TL;DR: An Algorithm-level Feedback-controlled Adaptive (AFA) data prefetcher is proposed; it provides algorithm-level adaptation, dynamically switching to appropriate prefetching algorithms at runtime, and achieves considerable IPC improvement on 21 representative SPEC CPU benchmarks.
Abstract: The rapid advance of processor architectures, such as the emergence of multicore designs, and the substantially increased on-chip computing capability have put more pressure than ever on sluggish memory systems. In the meantime, many applications are becoming more and more data intensive. Data-access delay, not processor speed, has become the leading performance bottleneck of high-performance computing. Data prefetching is an effective way to accelerate applications' data access and bridge the growing gap between computing speed and data-access speed. Existing prefetching schemes, however, are generally conservative, owing to earlier concerns about power consumption, and their effectiveness drops when an application's access pattern changes. In this study, we propose an Algorithm-level Feedback-controlled Adaptive (AFA) data prefetcher to address these issues. The AFA prefetcher is based on the Data-Access History Cache, a hardware structure specifically designed for data-access acceleration. It provides algorithm-level adaptation and is capable of dynamically switching to appropriate prefetching algorithms at runtime. We have conducted extensive simulation with the SimpleScalar simulator to validate the design and analyze the performance gain. The simulation results show that the AFA prefetcher is effective and achieves considerable IPC (Instructions Per Cycle) improvement for 21 representative SPEC CPU benchmarks.
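
The feedback-control loop can be sketched independently of the Data-Access History Cache: run one prefetching algorithm at a time, score how many of its prefetches were actually consumed over an epoch, and switch algorithms when accuracy drops. The candidate algorithms, epoch length, and threshold below are assumptions for illustration, not the AFA design.

```python
# Sketch of algorithm-level feedback control for prefetching (assumed candidate
# algorithms, epoch length, and threshold; the Data-Access History Cache is not
# modelled here).
class NextLine:
    def predict(self, addr, history):
        return addr + 64                       # next sequential line (64 B lines assumed)

class Stride:
    def predict(self, addr, history):
        if len(history) >= 2:
            return addr + (history[-1] - history[-2])
        return addr + 64

class AdaptivePrefetcher:
    def __init__(self, epoch=1000, threshold=0.5):
        self.algos = [NextLine(), Stride()]
        self.current = 0
        self.epoch, self.threshold = epoch, threshold
        self.accesses, self.useful = 0, 0
        self.issued, self.history = set(), []

    def on_access(self, addr):
        if addr in self.issued:                # a previously prefetched address is used
            self.useful += 1
            self.issued.discard(addr)
        self.issued.add(self.algos[self.current].predict(addr, self.history))
        self.history = (self.history + [addr])[-4:]
        self.accesses += 1
        if self.accesses % self.epoch == 0:    # feedback point: evaluate and maybe switch
            if self.useful / self.epoch < self.threshold:
                self.current = (self.current + 1) % len(self.algos)
            self.useful = 0
            self.issued.clear()

pf = AdaptivePrefetcher()
for a in range(0, 4_000_000, 256):             # a strided stream defeats next-line,
    pf.on_access(a)                            # so feedback switches to the stride algorithm
print(type(pf.algos[pf.current]).__name__)
```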

11 citations

Journal ArticleDOI
TL;DR: In this article, the on-line caching problem in a restricted cache where each memory item can be placed in only a restricted subset of cache locations has been studied, and the results show that restricted caches are significantly more complex than identical caches.
Abstract: We study the on-line caching problem in a restricted cache where each memory item can be placed in only a restricted subset of cache locations. Examples of restricted caches in practice include victim caches, assist caches, and skew caches. To the best of our knowledge, all previous on-line caching studies have considered on-line caching in identical or fully-associative caches where every memory item can be placed in any cache location. In this paper, we focus on companion caches, a simple restricted cache that includes victim caches and assist caches as special cases. Our results show that restricted caches are significantly more complex than identical caches. For example, we show that the commonly studied Least Recently Used algorithm is not competitive unless cache reorganization is allowed, while the performance of the First In First Out algorithm is competitive but not optimal. We also present two near-optimal algorithms for this problem as well as lower bound arguments.
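
The placement restriction under study can be made concrete with a small structural model: each item may live either in its one designated main-cache location or anywhere in a small fully-associative companion buffer. The sketch below encodes only that structure, with a placeholder FIFO spill policy; it does not model the competitive-analysis results.

```python
# Structural model of a companion cache: item x may occupy only its designated
# main-cache slot, or any slot of a small fully-associative companion buffer
# (victim and assist caches are special cases). The FIFO spill policy is a
# placeholder chosen only for the sketch.
from collections import deque

class CompanionCache:
    def __init__(self, main_slots=8, companion_slots=2):
        self.main = [None] * main_slots
        self.companion = deque(maxlen=companion_slots)   # oldest entry falls out when full
        self.main_slots = main_slots

    def loc(self, item):
        return hash(item) % self.main_slots              # the single allowed main slot

    def access(self, item):
        slot = self.loc(item)
        if self.main[slot] == item or item in self.companion:
            return "hit"
        if self.main[slot] is not None:                  # spill the displaced item
            self.companion.append(self.main[slot])
        self.main[slot] = item
        return "miss"

cc = CompanionCache()
print([cc.access(x) for x in (0, 8, 0, 8)])   # 0 and 8 share a main slot
# -> ['miss', 'miss', 'hit', 'hit']: the companion buffer holds the displaced item
```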

11 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This paper surveys cache memory design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
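
The scheme summarized here is essentially always-prefetch: on every reference, the next sequential line is fetched as well if it is not already resident. A minimal sketch, with an assumed fully-associative LRU cache to keep it short:

```python
from collections import OrderedDict

LINE, CAPACITY = 32, 256          # assumed line size and cache size in lines
cache = OrderedDict()             # line -> True, kept in LRU order
demand_misses = prefetch_fetches = 0

def install(line):
    cache[line] = True
    cache.move_to_end(line)
    if len(cache) > CAPACITY:
        cache.popitem(last=False)            # evict the LRU line

def access(addr):
    global demand_misses, prefetch_fetches
    line = addr // LINE
    if line not in cache:
        demand_misses += 1                   # would stall the processor
    install(line)
    if line + 1 not in cache:                # prefetch on every reference
        prefetch_fetches += 1                # ideally overlapped with execution
        install(line + 1)

for a in range(0, 64 * 1024, 4):             # a sequential sweep
    access(a)
print("demand misses:", demand_misses, "prefetches issued:", prefetch_fetches)
```

On the sequential sweep the demand misses nearly vanish because every line after the first is brought in ahead of its first use, which is where the quoted 10 to 25 percent effective-speed gain comes from when the prefetch traffic can be overlapped.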

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
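
The inclusion property requires that every block resident in an upper-level cache also be resident in the level below it; a common way to maintain it is to back-invalidate the upper level whenever the lower level evicts. The sketch below illustrates that invariant for equal block sizes, a simplifying assumption; the paper's conditions also cover different block sizes per level.

```python
from collections import OrderedDict

class Level:
    """One cache level holding whole blocks with LRU replacement (sketch only)."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()
    def touch(self, block):
        self.blocks[block] = True
        self.blocks.move_to_end(block)
    def insert(self, block):
        self.touch(block)
        if len(self.blocks) > self.capacity:
            victim, _ = self.blocks.popitem(last=False)   # evict the LRU block
            return victim
        return None

l1, l2 = Level(4), Level(16)

def access(block):
    if block in l1.blocks:
        l1.touch(block)
        l2.touch(block)                       # keep L2's recency roughly in step
    else:
        if block not in l2.blocks:
            evicted = l2.insert(block)
            if evicted is not None and evicted in l1.blocks:
                del l1.blocks[evicted]        # back-invalidate L1 to preserve inclusion
        else:
            l2.touch(block)
        l1.insert(block)
    assert set(l1.blocks) <= set(l2.blocks)   # the inclusion invariant always holds

for b in list(range(20)) + list(range(20)):
    access(b)
print(sorted(l1.blocks), len(l2.blocks))
```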

236 citations