Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache (a miss or victim cache) is placed between a cache and its refill path, and stream buffers hold prefetched data outside the cache itself.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
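The victim-caching mechanism described above is easy to picture in a few lines of code. The following minimal sketch (not from the paper) simulates a direct-mapped first-level cache backed by a small fully-associative victim cache; the line size, number of sets, victim-cache depth, and LRU replacement are illustrative assumptions.

```python
from collections import OrderedDict

LINE_SIZE = 32        # bytes per cache line (assumed)
NUM_SETS = 256        # direct-mapped: one line per set (assumed geometry)
VICTIM_ENTRIES = 4    # small fully-associative victim cache

cache = [None] * NUM_SETS   # line address stored per set, or None
victim = OrderedDict()      # line address -> True, kept in LRU order

def access(addr):
    """Classify one reference as 'hit', 'victim_hit', or 'miss'."""
    line = addr // LINE_SIZE
    index = line % NUM_SETS

    if cache[index] == line:
        return "hit"

    if line in victim:
        # Swap: promote the line out of the victim cache and demote the
        # line it displaces, so the penalty is roughly one cycle.
        victim.pop(line)
        if cache[index] is not None:
            victim[cache[index]] = True
        cache[index] = line
        while len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)   # evict the LRU victim entry
        return "victim_hit"

    # Real miss: fetch from the next level and save the displaced line as the victim.
    if cache[index] is not None:
        victim[cache[index]] = True
        while len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)
    cache[index] = line
    return "miss"
```

Two addresses that map to the same set now ping-pong between the cache and the victim cache instead of repeatedly missing to the next level, which is exactly the conflict-miss behaviour the abstract targets.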
Citations
01 Jun 1998
TL;DR: A novel compiler algorithm which automatically inserts instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance is proposed, which results in speedups ranging from 9.4% to 18.5% over the original execution time on an out-of-order superscalar processor.
Abstract: Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.
5 citations
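The cooperative scheme above combines aggressive sequential hardware prefetching with a filter that keeps useless prefetches out of the cache. The sketch below shows one way such a filtered sequential prefetcher could look; the prefetch degree and the "suppress lines that were prefetched before but never used" heuristic are stand-in assumptions, not the authors' mechanism.

```python
PREFETCH_DEGREE = 4   # how many sequential lines to fetch ahead (assumed)

def sequential_prefetch(miss_line, cache_lines, prefetch_buffer, filter_table):
    """On an I-cache miss, queue the next few sequential lines, skipping lines
    that are already cached, already queued, or repeatedly fetched without use."""
    for i in range(1, PREFETCH_DEGREE + 1):
        line = miss_line + i
        if line in cache_lines or line in prefetch_buffer:
            continue                        # would only duplicate resident data
        if filter_table.get(line, 0) >= 2:
            continue                        # prefetched twice before, never used
        prefetch_buffer.add(line)           # issue the prefetch to the next level

def on_unused_eviction(line, filter_table):
    """Record prefetched lines that were evicted without ever being referenced."""
    filter_table[line] = filter_table.get(line, 0) + 1

cache_lines, prefetch_buffer, filter_table = {100, 101}, set(), {}
sequential_prefetch(99, cache_lines, prefetch_buffer, filter_table)
print(sorted(prefetch_buffer))   # -> [102, 103]
```

The compiler-inserted prefetch instructions for branch and call targets would sit on top of this, covering the non-sequential accesses that a purely sequential prefetcher cannot reach.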
07 Jun 2008
TL;DR: This work proposes simple history-based classifiers that track commit stalls suffered by loads to help identify a small set of loads, referred to as LIMCOS (Loads Incurring Majority of COmmit Stalls), and demonstrates that the proposed classification criterion performs better than other existing criteria like criticality and delinquent loads.
Abstract: Loads that miss in the L1 or L2 caches and wait for their data at the head of the ROB cause significant slowdown in the form of commit stalls. We identify that most of these commit stalls are caused by a small set of loads, referred to as LIMCOS (Loads Incurring Majority of COmmit Stalls). We propose simple history-based classifiers that track commit stalls suffered by loads to help us identify this small set of loads. We study an application of these classifiers to prefetching. The classifiers are used to train the prefetcher to focus on the misses suffered by LIMCOS. This, referred to as focused prefetching, results in a 9.8% gain in IPC over a naive GHB-based delta correlation prefetcher along with a 20.3% reduction in memory traffic for a set of 17 memory-intensive SPEC2000 benchmarks. Another important impact of focused prefetching is a 61% improvement in the accuracy of prefetches. We demonstrate that the proposed classification criterion performs better than other existing criteria like criticality and delinquent loads. We also show that the criterion of focusing on commit stalls is robust across cache levels and can be applied to any prefetcher without modifications to the prefetcher.
5 citations
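The history-based classifier in this abstract boils down to a per-load-PC stall counter. A minimal sketch of that idea follows; the stall threshold and the exact training hook are assumptions made for illustration.

```python
STALL_THRESHOLD = 16   # commit-stall cycles before a load PC is flagged (assumed)

class CommitStallClassifier:
    """Track commit stalls per load PC and flag the small set causing most of them."""
    def __init__(self):
        self.stall_cycles = {}   # load PC -> accumulated commit-stall cycles

    def record_commit_stall(self, load_pc, cycles_blocked_at_rob_head):
        self.stall_cycles[load_pc] = (
            self.stall_cycles.get(load_pc, 0) + cycles_blocked_at_rob_head)

    def is_limcos(self, load_pc):
        return self.stall_cycles.get(load_pc, 0) >= STALL_THRESHOLD

# Focused prefetching: only misses from flagged loads train the prefetcher, e.g.
#   if classifier.is_limcos(pc): prefetcher.train(pc, miss_addr)
```

Because training is restricted to the few loads that actually block commit, the prefetcher's history stays focused and its prefetches become more accurate, which is the 61% accuracy improvement the abstract reports.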
14 Apr 2004
TL;DR: Experimental results show that the BCC and SCC cache reduce the amount of traffic significantly in many cases and overall they incur the same number of cache misses as the direct-mapped cache.
Abstract: Off-chip memory accesses are a major source of power consumption in embedded processors. In order to reduce the amount of traffic between the processor and the off-chip memory as well as to hide the memory latency, nearly all embedded processors have a cache on the same die as the processor core. Because small caches dissipate less power and are cheaper than large caches, a small cache is preferable to a large cache. Furthermore, because set-associative caches consume more power than direct-mapped caches, a direct-mapped cache is preferable to a set-associative one. Small, direct-mapped caches generally incur many conflict misses, however. In this paper we propose and evaluate a structure called the Conflict Detection Table (CDT). This table can be used to determine if a memory access is expected to hit the cache. If a hit is expected and a miss occurs, then a conflict is detected and appropriate action can be taken. In addition, we propose two cache structures that employ this technique: the Bypass in Case of Conflict (BCC) cache and the Sub-block in Case of Conflict (SCC) cache. The BCC cache bypasses the cache when a conflict is detected, whereas the SCC cache fetches a sub-block of the missing cache block in such a case. Experimental results using several embedded workloads show that the BCC and SCC cache reduce the amount of traffic significantly in many cases. Furthermore, overall they incur the same number of cache misses as the direct-mapped cache. This shows that the BCC and SCC cache reduce the amount of power consumed with a negligible reduction in performance.
5 citations
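The Conflict Detection Table described above predicts whether an access should hit; a miss that was expected to hit is treated as a conflict. The sketch below illustrates the Bypass-in-Case-of-Conflict variant; the table size, the recency-based "expected hit" test, and the eviction policy are illustrative assumptions rather than the paper's exact design.

```python
LINE_SIZE = 32
NUM_SETS = 128
CDT_ENTRIES = 64   # capacity of the conflict-detection table (assumed)

cache_tags = [None] * NUM_SETS
cdt = {}           # recently referenced lines, insertion-ordered

def bcc_access(addr):
    """If the CDT says the line was seen recently but the cache missed,
    flag a conflict and serve the request without allocating (bypass)."""
    line = addr // LINE_SIZE
    index = line % NUM_SETS

    hit = (cache_tags[index] == line)
    expected_hit = line in cdt

    cdt[line] = True                     # remember this line as recently used
    if len(cdt) > CDT_ENTRIES:
        cdt.pop(next(iter(cdt)))         # drop the oldest entry

    if hit:
        return "hit"
    if expected_hit:
        return "bypass"                  # conflict: keep the resident line in place
    cache_tags[index] = line             # ordinary miss: allocate as usual
    return "miss"
```

The SCC variant would differ only in the conflict branch: instead of skipping allocation entirely, it would fetch just a sub-block of the missing cache block.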
01 Aug 2017
TL;DR: This paper proposes a mechanistic model of memory-level parallelism that is more accurate than existing models and speeds up cache performance forecasting by about 38 times compared to Gem5 cycle-accurate simulations.
Abstract: Non-blocking caches, which are commonly used in modern out-of-order processors, can handle multiple outstanding memory requests simultaneously to reduce the penalties of long-latency cache misses. Memory-level parallelism (MLP), which refers to the number of memory requests concurrently held by Miss Status Handling Registers (MSHRs), is an indispensable factor in estimating cache performance. To estimate MLP efficiently, previous research oversimplified the factors that need to be considered when constructing analytical models, especially the influence of the cache miss rate. By quantifying these cache miss-rate effects, this paper proposes a mechanistic model of memory-level parallelism that is more accurate than existing models. Fifteen benchmarks, chosen from Mobybench 2.0, Mibench 1.0, and MediaBench II, are used to evaluate the accuracy of the model. Compared to Gem5 cycle-accurate simulation results, the largest root mean square error is less than 11%, while the average is around 7%. Meanwhile, the cache performance forecasting process is sped up by about 38 times compared to the Gem5 cycle-accurate simulations.
5 citations
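MLP in this abstract is defined through MSHR occupancy. A small sketch of the usual way that quantity is measured follows; treating MLP as the average number of outstanding misses over the cycles in which at least one miss is outstanding is a common convention and an assumption here, not a statement of the paper's model.

```python
def average_mlp(mshr_occupancy_per_cycle):
    """Mean number of in-flight misses, counted only over miss cycles."""
    busy = [n for n in mshr_occupancy_per_cycle if n > 0]
    return sum(busy) / len(busy) if busy else 0.0

# Three cycles with 2, 4, and 3 misses in flight, then two idle cycles.
print(average_mlp([2, 4, 3, 0, 0]))   # -> 3.0
```

A mechanistic model predicts this quantity analytically, from parameters such as the miss rate, instead of obtaining it from a cycle-accurate simulation, which is where the roughly 38x speedup over Gem5 comes from.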
TL;DR: A prefetching-based memory traffic-clustering scheme was shown to reduce the power and energy consumption of DRAM and improve its performance by an average of 0.2%, 28.9%, and 15.7%, respectively, for memory-intensive programs.
Abstract: DRAM is one of the main sources of energy consumption in computer systems. Thus, reducing the energy consumption of DRAM can prolong the lifetime of battery-operated embedded/mobile systems. To this end, we propose a DRAM energy-aware prefetching scheme to increase row buffer hits and idle periods of DRAM by clustering its accesses. Although prefetching schemes have traditionally been used to improve the system performance, utilizing them for the energy conservation of DRAM has yet to be investigated. For such energy conservation, our scheme accurately predicts and clusters potential future DRAM accesses. Clustered DRAM accesses exploit a popular first-ready first-come first-serve memory request scheduling and a power-down mode of DRAM more effectively; the probability of row buffer hits and idle periods is significantly increased by our clustering scheme. As a result, large amounts of row activation and idle energy consumption, which are major energy consumption factors in modern DRAM, can be saved. Our prefetching-based memory traffic-clustering scheme was shown to reduce the power and energy consumption of DRAM and improve its performance by an average of 0.2%, 28.9%, and 15.7%, respectively, for memory-intensive programs.
5 citations
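The benefit of clustering DRAM accesses comes from row-buffer locality: accesses that land in the currently open row avoid a fresh activation. The toy example below, with an assumed row size and a single bank under an open-row policy, shows how reordering the same requests changes the row-buffer hit count.

```python
ROW_SIZE = 8 * 1024   # bytes per DRAM row (assumed)

def row_buffer_hits(addresses):
    """Count row-buffer hits for one bank with an open-row policy."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr // ROW_SIZE
        if row == open_row:
            hits += 1            # no row activation energy needed
        open_row = row
    return hits

scattered = [0, 9000, 100, 9100, 200, 9200]   # alternating rows: 0 hits
clustered = sorted(scattered)                 # grouped by row: 4 hits
print(row_buffer_hits(scattered), row_buffer_hits(clustered))
```

Prefetching is what lets the memory system see (and therefore cluster) future accesses early; the longer idle gaps between clusters can then be spent in the DRAM power-down mode.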
References
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
1,614 citations
01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
467 citations
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks; an average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
316 citations
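The "average degree of superpipelining" metric can be read as the average operation latency weighted by each operation class's dynamic frequency; that reading, and the instruction mix below, are assumptions used only to make the idea concrete.

```python
def average_degree_of_superpipelining(op_mix):
    """Latency of each operation class weighted by its dynamic frequency."""
    return sum(freq * latency for freq, latency in op_mix.values())

# Hypothetical mix: {class: (dynamic frequency, latency in cycles)}
mix = {"alu": (0.6, 1), "load": (0.25, 2), "branch": (0.1, 2), "fp": (0.05, 4)}
print(average_degree_of_superpipelining(mix))   # -> 1.5
```

A value well above 1 means the base machine already overlaps operations in time, so adding parallel issue on top yields less additional instruction-level parallelism, which is the sense in which superscalar and superpipelined machines are roughly equivalent.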
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
315 citations
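The 10 to 25 percent figure is the kind of number that falls out of simple CPI arithmetic once part of the miss penalty is hidden. The sketch below is illustrative arithmetic only; the miss rate, penalty, and hidden fraction are assumed values, not data from the paper.

```python
def effective_speedup(miss_rate, miss_penalty, fraction_hidden, base_cpi=1.0):
    """Speedup from hiding part of the miss penalty with prefetching."""
    cpi_without = base_cpi + miss_rate * miss_penalty
    cpi_with = base_cpi + miss_rate * miss_penalty * (1.0 - fraction_hidden)
    return cpi_without / cpi_with

# 2% miss rate, 20-cycle penalty, 60% of the penalty hidden -> about 1.21x
print(round(effective_speedup(0.02, 20, 0.6), 2))
```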
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
236 citations
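Enforcing inclusion mostly amounts to back-invalidation: when the second-level cache evicts a line, any first-level copy must be removed as well. The toy model below assumes equal block sizes at both levels and a simple LRU second level, which is narrower than the paper's treatment (it also covers differing block sizes and set-associative organizations).

```python
class InclusiveTwoLevel:
    """Toy two-level hierarchy in which every L1 line is also present in L2."""
    def __init__(self, l2_capacity_lines=8):
        self.l1 = set()
        self.l2 = []                   # ordered list, front = LRU
        self.l2_capacity = l2_capacity_lines

    def fill(self, line):
        """Bring a line into both levels, preserving inclusion on L2 eviction."""
        if line in self.l2:
            self.l2.remove(line)
        self.l2.append(line)           # insert or refresh as most recently used
        if len(self.l2) > self.l2_capacity:
            evicted = self.l2.pop(0)
            self.l1.discard(evicted)   # back-invalidate L1 to keep inclusion
        self.l1.add(line)
```

With inclusion in place, a bus snoop only has to check the larger second-level cache; a miss there guarantees the line is not in the first-level cache either, which is the coherence-complexity reduction the abstract refers to.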