Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative miss (or victim) cache placed between a cache and its refill path, and stream buffers that place prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
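To make the victim-caching mechanism concrete, here is a minimal trace-driven sketch. The cache sizes, the LRU policy in the victim cache, and the trace format are illustrative assumptions; this is not a reproduction of the paper's hardware.

```python
from collections import OrderedDict

LINE = 32  # bytes per line; all sizes here are illustrative

class DirectMappedCache:
    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.tags = [None] * n_lines

    def probe(self, addr):
        """Return (hit, victim_line_addr); fills on a miss."""
        line = addr // LINE
        idx, tag = line % self.n_lines, line // self.n_lines
        if self.tags[idx] == tag:
            return True, None
        old = self.tags[idx]
        self.tags[idx] = tag
        victim = None if old is None else old * self.n_lines + idx
        return False, victim

class VictimCache:
    """Small fully-associative LRU buffer for lines evicted from L1."""
    def __init__(self, n_entries):
        self.n_entries, self.lines = n_entries, OrderedDict()

    def probe(self, line):
        return self.lines.pop(line, None) is not None  # hit: line returns to L1

    def insert(self, line):
        self.lines[line] = True
        if len(self.lines) > self.n_entries:
            self.lines.popitem(last=False)  # evict the LRU victim

def full_misses(trace, l1_lines=64, victim_entries=4):
    l1, vc = DirectMappedCache(l1_lines), VictimCache(victim_entries)
    misses = 0
    for addr in trace:
        hit, victim = l1.probe(addr)
        if hit:
            continue
        if victim is not None:
            vc.insert(victim)        # evicted line, not the requested one
        if not vc.probe(addr // LINE):
            misses += 1              # a victim-cache hit costs ~1 cycle instead
    return misses
```

Sweeping victim_entries from 1 to 5 on a trace where two arrays collide in the same L1 sets shows the mapping conflict misses the abstract describes being absorbed.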


Citations
More filters
Proceedings ArticleDOI
08 Oct 2002
TL;DR: A data filtering method is proposed to reduce the power consumption of high-end processors with multiple execution cores; it improves performance by as much as 76.7% for a processor with 16 execution cores.
Abstract: We propose and evaluate a data filtering method to reduce the power consumption of high-end processors with multiple execution cores. Although the proposed method can be applied to a wide variety of multi-processor systems including MPPs, SMPs and any type of single-chip multiprocessor, we concentrate on Network Processors. The proposed method uses an execution unit called Data Filtering Engine that processes data with low temporal locality before it is placed on the system bus. The execution cores use locality to decide which load instructions have low temporal locality and which portion of the surrounding code should be off-loaded to the data filtering engine. Our technique reduces the power consumption, because a) the low temporal data is processed on the data filtering engine before it is placed onto the high capacitance system bus, and b) the conflict misses caused by low temporal data are reduced, resulting in fewer accesses to the L2 cache. Specifically, we show that our technique reduces the bus accesses in representative applications by as much as 46.8% (26.5% on average) and reduces the overall power by as much as 15.6% (8.6% on average) on a single-core processor. It also improves the performance by as much as 76.7% (29.7% on average) for a processor with 16 execution cores.
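The pivotal step in this scheme is deciding which loads have low temporal locality. Below is a minimal profiling sketch of that decision; the reuse-distance threshold and the 10% cutoff are illustrative assumptions, not the paper's mechanism.

```python
def classify_loads(trace, reuse_threshold=256):
    """Label each load PC as low- or high-temporal-locality.

    trace: iterable of (pc, addr) pairs from a profiling run.
    A load whose addresses are rarely touched again within
    reuse_threshold accesses is a candidate for off-loading
    to a filtering engine instead of allocating cache lines.
    """
    last_use = {}   # addr -> position of its last access
    stats = {}      # pc -> [reuses_within_threshold, total_accesses]
    for pos, (pc, addr) in enumerate(trace):
        reused_soon = addr in last_use and pos - last_use[addr] <= reuse_threshold
        s = stats.setdefault(pc, [0, 0])
        s[0] += reused_soon
        s[1] += 1
        last_use[addr] = pos
    # low locality = under 10% of this PC's accesses are reused soon
    return {pc: (r / t) < 0.1 for pc, (r, t) in stats.items()}
```

A PC flagged low-locality here would have its surrounding code off-loaded to the filtering engine rather than placing its data in the cache hierarchy.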

22 citations

Journal ArticleDOI
TL;DR: A linear relationship is shown between overall performance and three metrics: percentage of misses prefetched, percentage of unused prefetches, and bandwidth; under ideal conditions, prefetching can remove nearly all of the stalls associated with cache misses.
Abstract: We formulate a new approach for evaluating a prefetching algorithm. We first carry out a profiling run of a program to identify all of the misses and corresponding locations in the program where prefetches for the misses can be initiated. We then systematically control the number of misses that are prefetched, the timeliness of these prefetches, and the number of unused prefetches. We validate the accuracy of our approach by comparing it to one based on a Markov prefetch algorithm. This allows us to measure the potential benefit that any application can receive from prefetching and to analyze application behavior under conditions that cannot be explored with any known prefetching algorithm. Next, we analyze a system parameter that is vital to prefetching performance, the line transfer interval, which is the number of processor cycles required to transfer a cache line. This interval is determined by technology and bandwidth. We show that under ideal conditions, prefetching can remove nearly all of the stalls associated with cache misses. Unfortunately, real processor implementations are less than ideal. In particular, the trend in processor frequency is outrunning on-chip and off-chip bandwidths. Today, it is not uncommon for processor frequency to be three or four times bus frequency. Under these conditions, we show that nearly all of the performance benefits derived from prefetching are eroded, and in many cases prefetching actually degrades performance. We carry out quantitative and qualitative analyses of these tradeoffs and show that there is a linear relationship between overall performance and three metrics: percentage of misses prefetched, percentage of unused prefetches, and bandwidth.
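A back-of-the-envelope stall model makes the tradeoff concrete. This is a sketch with invented cycle counts, not the paper's model; it only illustrates how miss coverage, unused prefetches, and the line transfer interval interact.

```python
def stall_cycles(misses, frac_prefetched, frac_unused,
                 miss_penalty=100, line_transfer_interval=8):
    """Rough stall estimate for a controlled-prefetch experiment.

    misses:          demand misses observed in a profiling run
    frac_prefetched: fraction of those misses covered by prefetches
    frac_unused:     unused prefetches, as a fraction of all prefetches
    Every prefetch, useful or not, occupies the bus for one line
    transfer, which is how limited bandwidth erodes the benefit.
    """
    covered = misses * frac_prefetched
    prefetches = covered / max(1e-9, 1.0 - frac_unused)
    remaining_stalls = (misses - covered) * miss_penalty
    bus_occupancy = prefetches * line_transfer_interval
    return remaining_stalls + bus_occupancy
```

Raising line_transfer_interval relative to miss_penalty reproduces the regime the authors describe, where processor frequency outruns bus frequency and prefetching stops paying off.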

22 citations

Proceedings ArticleDOI
22 Sep 2002
TL;DR: This paper presents a hybrid approach to the prefetching problem that combines both software and hardware prefetching in a cost-effective way, requiring very little hardware support and minimally impacting the design of the processor pipeline.
Abstract: Ever increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge this gap by fetching data in advance into both the L1 cache and the register file. Our main contribution in this paper is a hybrid approach to the prefetching problem that combines both software and hardware prefetching in a cost-effective way, requiring very little hardware support and minimally impacting the design of the processor pipeline. The prefetcher is built on top of a static memory instruction bypassing mechanism, which is in charge of bringing prefetched values into the register file. In this paper we also present a thorough analysis of the limits of both prefetching and memory instruction bypassing. We also compare our prefetching technique with a prior speculative proposal that attacked the same problem, and we show that at much lower cost, our hybrid solution is better than a realistic implementation of speculative prefetching and bypassing. On average, our hybrid implementation achieves a 13% speed-up over a version with software prefetching in a subset of numerical applications, and an average of 43% over a version with no software prefetching (up to 102% for specific benchmarks).
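For the software half of such a hybrid, the standard question is how far ahead a prefetch must be issued. A generic sketch of that calculation (the classic software-prefetch rule, not the paper's bypassing mechanism):

```python
import math

def prefetch_distance(mem_latency_cycles, loop_body_cycles):
    """Iterations of lookahead needed so prefetched data arrives
    before the loop body consumes it."""
    return math.ceil(mem_latency_cycles / loop_body_cycles)

# illustrative numbers: 200-cycle memory latency, 25-cycle loop body
# => prefetch a[i + 8] while computing on a[i]
assert prefetch_distance(200, 25) == 8
```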

22 citations

Journal Article
TL;DR: A multi-level prefetching framework with three setups is introduced, aimed respectively at minimizing cost, minimizing losses in individual applications, or maximizing performance at moderate cost.
Abstract: We introduce a multi-level prefetching framework with three setups, aimed respectively at minimizing cost (Mincost), minimizing losses in individual applications (Minloss), or maximizing performance at moderate cost (Maxperf). Performance is boosted in all cases by a sequential tagged prefetcher in the L1 cache, with an effective static degree policy. In both cache levels (L1 and L2), we also apply prefetch filters. In the L2 cache we use a novel adaptive policy that selects the best prefetching degree within a fixed set of values, by tracking the performance gradient. Mincost resorts to sequential tagged prefetching in the L2 cache as well. Minloss relies on an accurate, home-made, correlating prefetcher (PDFCM, a Differential Finite Context Method prefetcher). Maxperf maximizes performance at the expense of slight performance losses in a small number of benchmarks, by integrating a sequential tagged prefetcher with PDFCM in the L2 cache.
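The adaptive L2 degree policy can be pictured as a small hill-climbing controller. The degree set, the epoch metric, and the update rule below are assumptions for illustration, not the authors' exact policy.

```python
class AdaptiveDegree:
    """Select a prefetch degree from a fixed set by tracking a
    performance gradient across measurement epochs."""
    DEGREES = (1, 2, 4, 8)

    def __init__(self):
        self.i = 0              # index into DEGREES
        self.prev_perf = None
        self.direction = +1     # currently exploring upward

    @property
    def degree(self):
        return self.DEGREES[self.i]

    def end_epoch(self, perf):
        # perf: e.g. IPC measured over the last interval
        if self.prev_perf is not None and perf < self.prev_perf:
            self.direction = -self.direction  # gradient turned negative
        self.i = min(len(self.DEGREES) - 1, max(0, self.i + self.direction))
        self.prev_perf = perf
```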

22 citations

Journal ArticleDOI
TL;DR: Wrong-path memory references are shown to significantly affect the IPC (instructions per cycle) performance of a processor, and modeling them is shown to be crucial for accurately estimating the performance improvement provided by runahead execution.
Abstract: High-performance, out-of-order execution processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do not change the architectural state of the processor, they affect the arrangement of data in the memory hierarchy. This paper examines the effects of wrong-path memory references on processor performance. It is shown that these references significantly affect the IPC (instructions per cycle) performance of a processor. Not modeling them leads to errors of up to 10 percent (4 percent on average) in IPC estimates for the SPEC CPU2000 integer benchmarks on an out-of-order processor and errors of up to 63 percent on a runahead-execution processor. In general, the error in the IPC increases with increasing memory latency and instruction window size. We find that wrong-path references are usually beneficial for performance because they prefetch data that is used by later correct-path references. L2 cache pollution is found to be the most significant negative effect of wrong-path references. Code examples are shown to provide insights into how wrong-path references affect performance. We also show that it is crucial to model wrong-path references to accurately estimate the performance improvement provided by runahead execution.
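The modeling point can be illustrated with a toy trace-driven sketch: whether wrong-path loads are simulated changes the correct-path miss count, because they both prefetch useful data and pollute the cache. The trace format and cache object are assumptions; any probe/fill model (such as the direct-mapped sketch earlier on this page) works.

```python
def correct_path_misses(trace, cache, model_wrong_path=True):
    """trace: iterable of (addr, on_correct_path) pairs.
    cache: object with probe(addr) -> (hit, victim), filling on miss."""
    misses = 0
    for addr, correct in trace:
        if not correct and not model_wrong_path:
            continue                 # a simulator that ignores wrong-path work
        hit, _ = cache.probe(addr)   # wrong-path fills may help or pollute
        if correct and not hit:
            misses += 1              # only correct-path misses are reported
    return misses
```

Comparing the two settings of model_wrong_path on the same trace shows the estimation error the abstract quantifies.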

22 citations

References
More filters
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This paper investigates cache design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
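Several of the listed design parameters (cache size, line size, placement) meet in the address decomposition a direct-mapped cache performs on every access. A small sketch with illustrative sizes:

```python
def split_address(addr, cache_bytes=8192, line_bytes=32):
    """Tag/index/offset decomposition for a direct-mapped cache.
    The sizes are illustrative; the survey discusses how cache size
    and line size trade off against each other."""
    n_lines = cache_bytes // line_bytes
    offset = addr % line_bytes                  # byte within the line
    index = (addr // line_bytes) % n_lines      # which cache line
    tag = addr // (line_bytes * n_lines)        # identity check on a hit
    return tag, index, offset
```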

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
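As a hedged reading of the metric, assuming it is a dynamic-frequency-weighted average of operation latencies (the operation mix below is invented for illustration):

```python
def avg_degree_of_superpipelining(op_mix):
    """op_mix: (dynamic_frequency, latency_in_cycles) pairs."""
    total = sum(f for f, _ in op_mix)
    return sum(f * lat for f, lat in op_mix) / total

# invented mix: 40% ALU (1 cycle), 30% loads (2), 20% branches (2), 10% FP (3)
print(avg_degree_of_superpipelining([(0.4, 1), (0.3, 2), (0.2, 2), (0.1, 3)]))
# ≈ 1.7: the machine already behaves as if superpipelined by that degree
```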

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
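The 10 to 25 percent figure is easy to sanity-check with a simple CPI model; the miss rate and penalty below are illustrative, not taken from the cited paper.

```python
def effective_speedup(miss_rate, miss_penalty, base_cpi=1.0,
                      prefetch_coverage=1.0):
    """Back-of-the-envelope: speedup if prefetching hides
    prefetch_coverage of all miss stalls."""
    cpi = base_cpi + miss_rate * miss_penalty
    cpi_pf = base_cpi + miss_rate * miss_penalty * (1 - prefetch_coverage)
    return cpi / cpi_pf

# e.g. 2% miss rate, 10-cycle penalty: hiding all stalls gives ~1.2x,
# i.e. an effective CPU speed increase in the range the abstract cites
print(effective_speedup(0.02, 10))  # ≈ 1.2
```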

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
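The property itself is easy to state operationally: every block resident in L1 must be contained in a block resident in L2. A snapshot checker, assuming block addresses are given in units of each level's own block size (with L2 blocks at least as large as L1 blocks, as the paper allows):

```python
def inclusion_holds(l1_blocks, l2_blocks, l1_block_bytes, l2_block_bytes):
    """l1_blocks / l2_blocks: sets of resident block addresses,
    each in units of that level's block size."""
    assert l2_block_bytes % l1_block_bytes == 0
    ratio = l2_block_bytes // l1_block_bytes
    # each L1 block b falls inside L2 block b // ratio
    return all((b // ratio) in l2_blocks for b in l1_blocks)
```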

236 citations