Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented, including a small fully-associative cache placed between a cache and its refill path, and prefetch buffers whose prefetched data is placed in the buffers rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
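To make the victim-caching mechanism concrete, here is a minimal simulation sketch in Python; the line size, set count, and victim-cache capacity are arbitrary assumptions, not the paper's parameters.

from collections import OrderedDict

LINE = 16      # assumed line size in bytes
SETS = 256     # assumed number of direct-mapped sets
VICTIMS = 4    # victim-cache entries (the paper finds 1 to 5 effective)

cache = [None] * SETS       # direct-mapped cache: one tag per set
victim = OrderedDict()      # small fully-associative cache, LRU-ordered

def access(addr):
    tag = addr // LINE
    idx = tag % SETS
    if cache[idx] == tag:
        return "hit"
    if tag in victim:
        # One-cycle victim hit: swap the victim line with the line
        # currently occupying the conflicting direct-mapped set.
        del victim[tag]
        evicted, cache[idx] = cache[idx], tag
        if evicted is not None:
            victim[evicted] = True
        return "victim hit"
    # Real miss: fetch from the next level; the displaced victim
    # (not the requested line, as plain miss caching would do)
    # goes into the fully-associative buffer.
    evicted, cache[idx] = cache[idx], tag
    if evicted is not None:
        victim[evicted] = True
        if len(victim) > VICTIMS:
            victim.popitem(last=False)   # drop the LRU victim entry
    return "miss"

Miss caching differs only in the final step: the requested line itself, rather than the displaced victim, is inserted into the small fully-associative cache, which wastes an entry duplicating data already present in the main cache.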
Citations
01 Jan 2000
TL;DR: The concept of a dynamic and adaptive cache (DAC), two new prefetch policies, and the design of an instruction DAC, called the DAC³, which dynamically changes its prefetch policy at runtime in response to process execution characteristics, are introduced.
Abstract: This paper begins an exploration of the applicability of traditional prefetching policies in multiprocessor architectures. In particular, the effectiveness of prefetching policies as a function of both the quality of the prefetching and the consumption of processor-to-memory bandwidth is an issue of interest. Addressing this issue, the concept of a dynamic and adaptive cache (DAC), two new prefetch policies, and the design of an instruction DAC, called the DAC³, which dynamically changes its prefetch policy at runtime in response to process execution characteristics, are introduced. In addition, a detailed performance analysis of the DAC³ and the two new prefetch policies, which the DAC³ uses, is presented; the performance of the DAC³ is compared to that of the SSB prefetch instruction cache, which is based on Jouppi's sequential stream buffer design. This performance analysis is based on a new metric called CompositeCPI, which captures the usefulness of prefetches and their cost in terms of consumed memory bandwidth. The performance analysis indicates that, for the cache configurations and multiprogram workloads studied, the DAC³ is superior to the SSB instruction prefetch cache.
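As a schematic of what "dynamically changes its prefetch policy at runtime" can mean, here is a hedged Python sketch; the two policies, the sampling window, and the 50% threshold are invented for illustration and are not the DAC³ design.

class AdaptivePrefetcher:
    """Switch between an aggressive and a conservative sequential
    prefetch policy based on observed prefetch usefulness."""
    WINDOW = 1024                 # assumed re-evaluation interval

    def __init__(self):
        self.issued = 0
        self.useful = 0
        self.degree = 4           # start aggressive: 4 lines ahead

    def on_prefetched_line_used(self):
        self.useful += 1          # feedback: a prefetched line got used

    def lines_to_prefetch(self, miss_line):
        self.issued += 1
        if self.issued % self.WINDOW == 0:
            accuracy = self.useful / self.WINDOW
            self.degree = 4 if accuracy > 0.5 else 1
            self.useful = 0
        return [miss_line + i for i in range(1, self.degree + 1)]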
8 citations
01 Sep 2006
TL;DR: In this article, the architecture and design of the pipelined execution unit of a 32-bit RISC processor are described in detail; the design is modeled in Verilog HDL, and the functional verification policies adopted for it are described thoroughly.
Abstract: The paper describes the architecture and design of the pipelined execution unit of a 32-bit RISC processor. The blocks in the different pipeline stages are organized so that the pipeline can be clocked at a high frequency. Control and forwarding of data flow among the stages are handled by dedicated hardware logic. The different blocks of the execution unit and the dependencies among them are explained in detail with the help of relevant block diagrams. The design has been modeled in Verilog HDL, and the functional verification policies adopted for it are described thoroughly. Synthesis of the design was carried out in a 0.13-micron standard-cell technology; with the slow timing library, the reported frequency of operation is 714 MHz at synthesis level.
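The "control and forwarding handled by dedicated hardware logic" can be illustrated with a small software analogue; the pipeline latch names (EX/MEM, MEM/WB) follow the textbook convention and are assumed here rather than taken from the paper.

def forward_operand(reg, regfile, ex_mem, mem_wb):
    """Select an ALU operand, bypassing the register file when a
    newer in-flight instruction is about to write the same register.
    ex_mem and mem_wb are (dest_register, value) pairs or None."""
    if ex_mem is not None and ex_mem[0] == reg:
        return ex_mem[1]      # forward from the EX/MEM pipeline latch
    if mem_wb is not None and mem_wb[0] == reg:
        return mem_wb[1]      # forward from the MEM/WB pipeline latch
    return regfile[reg]       # no hazard: read the register file

Priority matters here: the EX/MEM latch holds the younger result, so it must be checked before MEM/WB.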
8 citations
30 Nov 2011
TL;DR: Cache-Convection-Control-based Prefetch Optimization (CCCPO), as discussed by the authors, exploits the characteristics of cache line reuse and controls the prefetcher aggressiveness.
Abstract: One of the significant issues in processor architecture is overcoming memory latency. Prefetching can greatly improve cache performance; however, it has the drawback of cache pollution unless its aggressiveness is properly set. Although several techniques for prefetcher throttling that use accuracy as a metric have been proposed, their robustness was not sufficient, due to variations between program working set sizes and cache capacities. In this paper, we revisit cache behavior from the viewpoint of data lifetime in a cache with prefetching. Based on this observation, Cache-Convection-Control-based Prefetch Optimization (CCCPO) is proposed, which exploits the characteristics of cache line reuse and controls the prefetcher aggressiveness. Evaluation results showed that this novel approach achieved a 4.6% improvement over the most recent prefetcher throttling algorithms in the geometric mean of the SPEC CPU 2006 benchmark suite with a 256KB LLC.
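A minimal sketch of throttling keyed to cache pollution rather than accuracy alone, in the spirit of the lifetime argument above; the counters and the degree bounds are illustrative assumptions, not CCCPO itself.

def adjust_degree(degree, evicted_unused, used):
    """Lower the prefetch degree when most prefetched lines die
    unused (pollution); raise it when prefetches tend to be used
    before eviction."""
    if evicted_unused > used:
        return max(1, degree - 1)   # back off: prefetches pollute the cache
    return min(8, degree + 1)       # lines survive long enough to be used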
8 citations
TL;DR: The combination of the proposed prefetch-aware router and congestion-sensitive prefetch control improves the performance of benchmark applications by 11-12% with out-of-order cores, and 21-22% with SMT cores on average, up to 37% on some of the workloads.
Abstract: Hardware prefetching has become an essential technique in high performance processors to hide long external memory latencies. In multi-core architectures with cores communicating through a shared on-chip network, traffic generated by the prefetchers can account for up to 60% of the total on-chip network traffic. However, the distinct characteristics of prefetch traffic have not been considered in on-chip network design. In addition, prefetchers have been oblivious to network congestion. In this work, we investigate the interactions between prefetchers and on-chip networks, exploiting the synergy of these two components in multi-cores. First, we explore the design space of prefetch-aware on-chip networks. Considering the difference between prefetch and non-prefetch packets, we propose a priority-based router design, which selects non-prefetch packets over prefetch packets. Second, we investigate network-aware prefetcher designs. We propose a prefetch control mechanism sensitive to network congestion, throttling prefetch requests based on the current network congestion. Our evaluation with full system simulations shows that the combination of the proposed prefetch-aware router and congestion-sensitive prefetch control improves the performance of benchmark applications by 11-12% with out-of-order cores and 21-22% with SMT cores on average, and by up to 37% on some of the workloads.
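A rough Python sketch of the two ideas combined, priority arbitration plus congestion throttling; the utilization threshold and the two-level priority scheme are assumptions for illustration, not the proposed router design.

import heapq
import itertools

PRIORITY = {"demand": 0, "prefetch": 1}   # lower value wins arbitration
CONGESTED = 0.7                           # assumed utilization threshold
_seq = itertools.count()                  # tie-breaker for equal priorities

def inject(kind, packet, utilization, queue):
    """Queue a packet for injection into the on-chip network.
    Demand packets always enter and always win arbitration over
    prefetch packets; prefetches are dropped under congestion."""
    if kind == "prefetch" and utilization > CONGESTED:
        return False                      # throttle: drop the prefetch
    heapq.heappush(queue, (PRIORITY[kind], next(_seq), packet))
    return True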
8 citations
References
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This paper surveys cache design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
1,614 citations
01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
467 citations
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
316 citations
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
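One common reading of "prefetching all memory references" is a prefetch-always (next-line lookahead) policy: on every reference, the line after the one being accessed is fetched if absent. A sketch under that assumption, with an invented cache and memory interface:

def access_with_prefetch(addr, cache, fetch, line_size=64):
    """Prefetch-always: every access also brings in the next line.
    `cache` maps line numbers to data; `fetch(n)` reads line n from
    the next level (both are assumed interfaces for illustration)."""
    cur = addr // line_size
    if cur not in cache:
        cache[cur] = fetch(cur)       # demand fetch on a miss
    nxt = cur + 1
    if nxt not in cache:
        cache[nxt] = fetch(nxt)       # unconditional next-line prefetch
    return cache[cur]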
315 citations
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies; a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
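The simplest consequence of imposing inclusion can be sketched as back-invalidation on second-level evictions; this sketch assumes equal block sizes at both levels, whereas the paper's conditions also cover differing block sizes.

def evict_from_l2(block, l1, l2):
    """Maintain multilevel inclusion (L1 contents are a subset of L2):
    whatever leaves L2 must not remain in L1, so every L2 eviction
    back-invalidates the corresponding L1 block. l1 and l2 are sets
    of resident block addresses in this sketch."""
    l2.discard(block)
    l1.discard(block)   # back-invalidation preserves the inclusion property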
236 citations