Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: miss and victim caching place a small fully-associative cache between a cache and its refill path, and stream buffers place prefetched data in a buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
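The victim-caching mechanism is simple enough to sketch in code. The following C fragment is a minimal illustration, not the paper's hardware: a direct-mapped L1 backed by a small fully-associative victim cache with FIFO replacement, where a victim-cache hit swaps the two lines and a full miss saves the evicted L1 line. All sizes, types, and names are invented for the sketch.

#include <stdbool.h>
#include <stdint.h>

#define L1_LINES    256  /* direct-mapped first-level cache */
#define VICTIM_WAYS 4    /* small fully-associative victim cache */

typedef struct { bool valid; uint32_t line_addr; } Line;

static Line l1[L1_LINES];
static Line victim[VICTIM_WAYS];
static int  victim_next;  /* FIFO replacement pointer */

/* Returns 0 for an L1 hit, 1 for a victim-cache hit (one-cycle
 * penalty), 2 for a full miss served by the next level. */
int access_cache(uint32_t line_addr)
{
    uint32_t idx = line_addr % L1_LINES;

    if (l1[idx].valid && l1[idx].line_addr == line_addr)
        return 0;                           /* L1 hit */

    for (int w = 0; w < VICTIM_WAYS; w++) {
        if (victim[w].valid && victim[w].line_addr == line_addr) {
            Line tmp = l1[idx];             /* swap: the requested line */
            l1[idx] = victim[w];            /* enters L1, and the line  */
            victim[w] = tmp;                /* it displaces is retained */
            return 1;
        }
    }

    if (l1[idx].valid) {                    /* full miss: keep the victim */
        victim[victim_next] = l1[idx];
        victim_next = (victim_next + 1) % VICTIM_WAYS;
    }
    l1[idx].valid = true;
    l1[idx].line_addr = line_addr;          /* refill from next level */
    return 2;
}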


Citations
Proceedings ArticleDOI
01 Feb 1997
TL;DR: Detailed simulation results show that the replacement traffic is reduced substantially for both approaches, indicating that breaking inclusion is an efficient way to bound the sensitivity to high memory pressure in COMA machines.
Abstract: In a multiprocessor with a Cache-Only Memory Architecture (COMA) all available memory is used to form large cache memories called attraction memories. These large caches help to satisfy shared memory accesses locally, reducing the need for node-external communication. However, since a COMA has no back-up main memory, blocks replaced from one attraction memory must be relocated into another attraction memory. To keep memory overhead low, it is desirable to have most of the memory space filled with unique data. This leaves little space for replication of cache blocks, with the result that replacement traffic may become excessive. We have studied two schemes for removing the traditional demand for full inclusion between the lower-level caches and the attraction memory: the loose-inclusion and no-inclusion schemes. They differ in efficiency but also in implementation cost. Detailed simulation results show that the replacement traffic is reduced substantially for both approaches, indicating that breaking inclusion is an efficient way to bound the sensitivity to high memory pressure in COMA machines.
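The inclusion trade-off at the heart of that paper can be sketched as follows. This C fragment is only an illustration of the idea, with hypothetical helper functions standing in for the machine's relocation and invalidation mechanisms; the paper's loose-inclusion and no-inclusion schemes differ in bookkeeping detail not modeled here.

#include <stdint.h>

typedef enum { FULL_INCLUSION, NO_INCLUSION } Policy;

/* Hypothetical helpers standing in for the machine's mechanisms. */
int  pick_node_with_space(int node, uint32_t block);
void relocate(uint32_t block, int from, int to);
void invalidate_in_lower_caches(int node, uint32_t block);

/* On an attraction-memory (AM) replacement: a COMA has no backing
 * main memory, so a replaced master copy must be relocated into
 * another node's attraction memory. */
void replace_am_block(int node, uint32_t block, Policy p)
{
    int target = pick_node_with_space(node, block);
    relocate(block, node, target);

    if (p == FULL_INCLUSION) {
        /* Traditional demand: once the block leaves this node's AM,
         * it may not survive in the lower-level caches, so purge it.
         * Every AM replacement therefore also costs invalidations. */
        invalidate_in_lower_caches(node, block);
    }
    /* Under NO_INCLUSION the lower-level caches keep their copies,
     * which is what cuts replacement traffic under memory pressure. */
}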

9 citations

Proceedings ArticleDOI
03 Jan 1996
TL;DR: Preliminary simulation results show that the prefetching approach, combined with aggressive consistency, can substantially improve the performance of DSM systems.
Abstract: Data prefetching is a technique where a processing unit issues one or more non-blocking load operations before the actual data items are required. The access latency of prefetching can be alleviated by overlapping it with other executions which are independent of the prefetched data. In distributed shared memory (DSM) systems, remote memory accesses take much longer than local ones, and hence data prefetching should be effective for such systems. However, to our knowledge, relatively little research has been done for data prefetching on DSM systems. This paper is concerned with issues of supporting data prefetching on DSM systems. Our approach is to develop a new memory consistency semantic (MCS) model under which the prefetchable shared data objects, as well as the best moment to launch a prefetching operation, can be easily identified. Our new MCS, called aggressive consistency, utilizes the coherence-on-demand concept and supports a special synchronization operation called SYNC, which also acts as the prefetching indicator. Preliminary simulation results show that our prefetching approach, combined with aggressive consistency, can substantially improve the performance of DSM systems.
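The general shape of data prefetching, issuing a non-blocking fetch early and overlapping it with independent work, can be sketched with GCC's __builtin_prefetch. The paper's SYNC operation and aggressive-consistency model are not modeled here, and the prefetch distance of 8 is an arbitrary assumption.

/* Sketch of software data prefetching: request data needed in a few
 * iterations, then do independent work while the fetch is in flight.
 * Prefetching past the end of the array is harmless: prefetch
 * instructions do not fault. */
void process(const double *remote, double *local, int n)
{
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&remote[i + 8], 0, 1);  /* read, in advance */

        /* Independent computation overlaps the outstanding prefetch. */
        local[i] = local[i] * 2.0 + 1.0;

        /* By now, earlier prefetches have (ideally) brought
         * remote[i] close to the processor. */
        local[i] += remote[i];
    }
}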

9 citations

Proceedings ArticleDOI
04 Jan 1995
TL;DR: A comparison of the features and benefits of loop unrolling, software pipelining, and software cache prefetching in superscalar and superpipelined machines, concluding that loop unrolling with static scheduling of loads produces significant performance improvement at lower latencies.
Abstract: Software oriented techniques to hide memory latency in superscalar and superpipelined machines include loop unrolling, software pipelining, and software cache prefetching. Issuing the data fetch request prior to actual need for the data allows overlap of accessing with useful computations. Loop unrolling and software pipelining do not necessitate microarchitecture or instruction set architecture changes, whereas software controlled prefetching does. While studies on the benefits of the individual techniques have been done, no study evaluates all of these techniques within a consistent framework. This paper attempts to remedy this by providing a comparative evaluation of the features and benefits of the techniques. Loop unrolling and static scheduling of loads are seen to produce significant improvement in performance at lower latencies. Software pipelining is observed to be better than software controlled prefetching at lower latencies, but at higher latencies, software prefetching outperforms software pipelining. Aggressive prefetching beyond conditional branches can detrimentally affect performance by increasing the memory bandwidth requirements and bus traffic.
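The two software techniques that the comparison favors at different latencies can be sketched in C. These loops are illustrative only; the unroll factor and prefetch distance are arbitrary assumptions, and __builtin_prefetch is a GCC extension standing in for a machine prefetch instruction.

/* Unrolling with static load scheduling: four independent loads are
 * issued together so their latencies overlap. */
long sum_unrolled(const int *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        int x0 = a[i],     x1 = a[i + 1];   /* loads scheduled early */
        int x2 = a[i + 2], x3 = a[i + 3];
        s0 += x0; s1 += x1; s2 += x2; s3 += x3;
    }
    for (; i < n; i++) s0 += a[i];          /* remainder loop */
    return s0 + s1 + s2 + s3;
}

/* Software-controlled prefetching: fetch a future line while summing
 * the current element; per the paper, this variant pays off at the
 * higher memory latencies. */
long sum_prefetch(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}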

9 citations

Book ChapterDOI
01 Jan 2004
TL;DR: Four methods for tuning a microprocessor's cache subsystem to the needs of any executing application in low-energy embedded systems are discussed, and it is shown that a victim buffer can be very effective as a configurable parameter in a memory hierarchy.
Abstract: The power consumed by the memory hierarchy of a microprocessor can contribute as much as 50% of the total microprocessor system power, and is thus a good candidate for power and energy optimizations. We discuss four methods for tuning a microprocessor's cache subsystem to the needs of any executing application for low-energy embedded systems. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune a configurable level-one cache's total size, associativity, and line size to an executing application. We extend the single-level cache tuning heuristic for a two-level cache using a methodology applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We show that a victim buffer can be very effective as a configurable parameter in a memory hierarchy. We reduce static energy dissipation of the on-chip data cache by compressing the frequent values that widely exist in a data cache memory.
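The flavor of such a tuning heuristic can be sketched as a one-parameter-at-a-time search. This is an illustrative reconstruction, not the chapter's algorithm: measure_energy() is a hypothetical stand-in for running or simulating the application on a candidate configuration, and the parameter values are invented.

#include <float.h>

typedef struct { int size_kb; int assoc; int line_bytes; } CacheCfg;

double measure_energy(CacheCfg c);   /* hypothetical: run/simulate app */

CacheCfg tune_cache(void)
{
    static const int sizes[] = { 2, 4, 8 };
    static const int ways[]  = { 1, 2, 4 };
    static const int lines[] = { 16, 32, 64 };

    CacheCfg best = { 2, 1, 16 };
    double best_e = DBL_MAX;

    /* Tune one parameter at a time instead of exhaustively
     * enumerating all combinations: size, then associativity,
     * then line size. */
    for (int i = 0; i < 3; i++) {
        CacheCfg c = best; c.size_kb = sizes[i];
        double e = measure_energy(c);
        if (e < best_e) { best_e = e; best = c; }
    }
    for (int i = 0; i < 3; i++) {
        CacheCfg c = best; c.assoc = ways[i];
        double e = measure_energy(c);
        if (e < best_e) { best_e = e; best = c; }
    }
    for (int i = 0; i < 3; i++) {
        CacheCfg c = best; c.line_bytes = lines[i];
        double e = measure_energy(c);
        if (e < best_e) { best_e = e; best = c; }
    }
    return best;
}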

9 citations

Proceedings ArticleDOI
10 Jan 2002
TL;DR: This paper analyzes L0 instruction cache miss patterns and proposes an effective L0 instruction cache management scheme based on history-based prediction that reduces the performance degradation in L0 caches by more than 95% while maintaining the energy advantage, as shown by a lower energy-delay product.
Abstract: Advances in semiconductor technology have several impacts on processor design. One impact is that faster clock rates and slower wires will limit the number of transistors reachable in a single cycle. Another impact is that power management is becoming a design constraint due to increases in power density. A small L0 cache on top of a traditional L1 cache has the advantages of shorter access time and lower power consumption. The downside of an L0 cache is possible performance loss in the case of cache misses. In this paper, we have analyzed L0 instruction cache miss patterns and have proposed an effective L0 instruction cache management scheme through history-based prediction. For SPEC2000 benchmarks, the prediction hit rate is as high as 99% and the average hit rate is more than 93%. Compared to other L0 instruction cache management schemes, our scheme reduces the performance degradation in L0 caches by more than 95% while maintaining the energy advantage, as shown by a lower energy-delay product.
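History-based predictors of this kind are commonly built from small saturating-counter tables. The following C sketch is an assumption-laden illustration, not the paper's scheme: a table of two-bit counters, indexed by fetch address, predicts whether the next fetch will hit in L0, and predicted misses bypass L0 to avoid its miss penalty. Table size and indexing are invented.

#include <stdbool.h>
#include <stdint.h>

#define TABLE 1024
static uint8_t ctr[TABLE];   /* 0..3; >= 2 means "predict L0 hit" */

/* If true, fetch from L0; otherwise go straight to L1 and skip the
 * L0 miss penalty. */
bool predict_l0_hit(uint32_t fetch_addr)
{
    return ctr[fetch_addr % TABLE] >= 2;
}

/* Train the counter with the actual outcome of each fetch. */
void update_predictor(uint32_t fetch_addr, bool l0_hit)
{
    uint8_t *c = &ctr[fetch_addr % TABLE];
    if (l0_hit)  { if (*c < 3) (*c)++; }
    else         { if (*c > 0) (*c)--; }
}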

9 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
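Parameters like these are typically evaluated with trace-driven simulation. The C sketch below (cache size, array bounds, and the demand-fetch policy are all assumptions for illustration, not the survey's methodology) computes the miss ratio of a direct-mapped cache over an address trace for a given line size.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_BYTES (8 * 1024)

/* Miss ratio of a direct-mapped, demand-fetch cache over a trace of
 * byte addresses; valid for line_bytes >= 2 (at most 4096 lines). */
double miss_ratio(const uint32_t *trace, int n, int line_bytes)
{
    int lines = CACHE_BYTES / line_bytes;
    uint32_t tags[4096];
    bool valid[4096] = { false };
    int misses = 0;

    for (int i = 0; i < n; i++) {
        uint32_t line = trace[i] / line_bytes;  /* address -> line */
        int idx = line % lines;                 /* direct mapping  */
        if (!valid[idx] || tags[idx] != line) {
            misses++;                           /* fetch on demand */
            valid[idx] = true;
            tags[idx] = line;
        }
    }
    return (double)misses / n;
}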

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks, and the average degree of superpipelining metric is introduced; simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
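A back-of-the-envelope model, with assumed numbers rather than the paper's data, shows how hiding miss latency translates into speedups of roughly that magnitude.

#include <stdio.h>

int main(void)
{
    /* Assumed figures, not from the paper: base CPI 1.0, 5% of
     * instructions miss, 10-cycle miss penalty. Prefetching hides
     * some fraction of the penalty. */
    double base_cpi = 1.0, miss_rate = 0.05, penalty = 10.0;
    double cpi = base_cpi + miss_rate * penalty;   /* 1.5 */

    for (double hidden = 0.25; hidden <= 1.0; hidden += 0.25) {
        double cpi_pf = base_cpi + miss_rate * penalty * (1.0 - hidden);
        printf("hide %.0f%% of penalty -> speedup %.2fx\n",
               hidden * 100.0, cpi / cpi_pf);
    }
    /* Hiding 25-50% of the miss penalty gives roughly 1.09x-1.20x,
     * i.e. on the order of the 10-25% range reported. */
    return 0;
}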

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
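Enforcing inclusion amounts to back-invalidating upper-level copies whenever a block leaves the lower level. The C sketch below illustrates this for equal block sizes at both levels (the paper also derives conditions for unequal block sizes); all structures and sizes are invented for the sketch.

#include <stdbool.h>
#include <stdint.h>

#define L1_LINES 256
#define L2_LINES 4096

typedef struct { bool valid; uint32_t line; } Entry;
static Entry l1[L1_LINES], l2[L2_LINES];

/* Back-invalidate the L1 copy so the L2 contents always cover the
 * L1's: this is the inclusion property that lets a snoop consult
 * only the L2. */
static void l2_evict(uint32_t line)
{
    int idx = line % L1_LINES;
    if (l1[idx].valid && l1[idx].line == line)
        l1[idx].valid = false;
}

/* An L2 replacement triggers the back-invalidation above before the
 * new line is installed. */
void l2_fill(uint32_t line)
{
    int idx = line % L2_LINES;
    if (l2[idx].valid && l2[idx].line != line)
        l2_evict(l2[idx].line);
    l2[idx].valid = true;
    l2[idx].line = line;
}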

236 citations