Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: miss and victim caching place a small fully-associative cache between a cache and its refill path, while stream buffers hold prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.

Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.

Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.

Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.

Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
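To make the victim-caching mechanism concrete, here is a minimal sketch, not from the paper; the line size, cache sizes, and LRU policy are illustrative assumptions. It models a direct-mapped L1 backed by a small fully-associative victim cache:

```python
from collections import OrderedDict

LINE_SIZE = 32          # illustrative line size in bytes
L1_LINES = 256          # direct-mapped L1 with 256 lines
VICTIM_ENTRIES = 4      # small fully-associative victim cache (1-5 entries in the paper)

l1 = [None] * L1_LINES  # l1[set_index] holds the line address currently cached there
victim = OrderedDict()  # line address -> True, insertion order used as LRU order

def access(addr):
    """Classify one reference as 'l1_hit', 'victim_hit', or 'miss'."""
    line = addr // LINE_SIZE
    index = line % L1_LINES

    if l1[index] == line:                  # ordinary direct-mapped hit
        return "l1_hit"

    if line in victim:                     # hit in the victim cache: swap with the L1 line,
        victim.pop(line)                   # so a conflicting pair can ping-pong cheaply
        if l1[index] is not None:
            victim[l1[index]] = True
        l1[index] = line
        return "victim_hit"

    # True miss: the displaced L1 line (the "victim"), not the requested line,
    # is what gets loaded into the small fully-associative cache.
    if l1[index] is not None:
        if len(victim) >= VICTIM_ENTRIES:
            victim.popitem(last=False)     # evict least-recently-inserted victim entry
        victim[l1[index]] = True
    l1[index] = line                       # refill L1 from the next level
    return "miss"
```

The point the sketch captures is that two lines which conflict in the same L1 set can thereafter be exchanged with the victim cache at a small, one-cycle-style penalty instead of a full refill.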


Citations
01 Jan 1999
TL;DR: A method called DPP (dynamic procedure placement) places procedures at run time for good i-cache performance; implemented for Digital Unix on Alpha servers, it uses a run-time loader similar to a shared-library loader and is completely transparent.
Abstract: Commercial applications such as database servers often have very large instruction footprints and consequently are frequently stalled due to instruction cache misses. A large fraction of the i-cache misses are typically due to conflicts in the relatively small direct-mapped on-chip instruction caches. A variety of tools have been developed to try to order the procedures of an application to minimize these conflicts. Such tools often make use of profile information to place procedures so that procedures that frequently call each other do not conflict in the i-cache. However, users often avoid using any kind of tool that requires them to do extra profiling and linking steps to optimize their application. In addition, any tool that does a static layout of procedures (whether using profiling information or not) cannot adapt to varying application workloads that cause very different application behavior. We have developed a method called DPP (dynamic procedure placement) for placing procedures at run time for good i-cache performance and have implemented it for Digital Unix on Alpha servers. Our system uses a run-time loader similar to a shared-library loader and is completely transparent. Our method efficiently invokes the loader at procedure calls and also correctly handles indirect procedure calls. We have developed a variety of extensions, including run-time code optimizations and a way of restarting the procedure placement while an application is running. In this paper, we describe our method, explain the difficult implementation issues, discuss our extensions, and give performance results for a variety of benchmarks and the Oracle database server. We also provide detailed simulation results using the SimOS-Alpha full-machine simulator.
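The authors' implementation lives inside the operating system's run-time loader, but the placement policy itself can be sketched abstractly as lazy, call-order placement. All names, sizes, and the sequential-placement rule below are hypothetical simplifications, not the authors' code:

```python
ICACHE_SIZE = 8 * 1024        # hypothetical direct-mapped i-cache size in bytes
LINE_SIZE = 32                # hypothetical i-cache line size

placement_cursor = 0          # next free offset in the run-time placement region
placed = {}                   # procedure name -> offset where its code was copied

def place_on_first_call(name, code_size):
    """Lazily place a procedure the first time it is called, DPP-style.

    Procedures that call each other end up adjacent in the placement region,
    so they map to different i-cache sets and stop conflicting.
    """
    global placement_cursor
    if name not in placed:
        placed[name] = placement_cursor   # a real system would copy the code here
        placement_cursor += code_size
    return placed[name]

def icache_set(offset):
    """I-cache set a placed byte maps to, for reasoning about conflicts."""
    return (offset % ICACHE_SIZE) // LINE_SIZE
```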

16 citations

Journal ArticleDOI
TL;DR: This paper proposes the addition of Thread Row Buffers (TRBs) to DRAM memories, thereby increasing DRAM efficiency by avoiding alternate accesses to a limited number of rows and allowing the implementation of a memory scheduler not bound to the throughput-isolation tradeoff.
Abstract: The widespread adoption of chip multiprocessors in recent years has increased the number of applications simultaneously accessing DRAM memories. Therefore, memory access patterns have also changed, and this has reduced row buffer locality significantly, degrading performance and energy efficiency. Furthermore, concurrent execution of applications has shown the need for performance isolation among threads in the memory controller to enforce quality of service in virtualized environments. Existing DRAM memories, however, enforce a tradeoff between throughput and isolation. To solve these problems, this paper proposes the addition of Thread Row Buffers (TRBs) to DRAM memories. TRBs keep an active row per thread, thereby increasing DRAM efficiency by avoiding alternate accesses to a limited number of rows and allowing the implementation of a memory scheduler not bound to the throughput-isolation tradeoff. Thread Row Buffers with Service Partitioning (TRB-SP) increase the row hit-rate by 38 percent with respect to FR-FCFS and by 11 percent with respect to Cache DRAM. This, in turn, increases overall performance by 17 and 7 percent, respectively. TRB-SP is also able to reduce the standard deviation of the memory access time of an application by 40 percent over FR-FCFS, 31 percent over PAR-BS, and 42 percent over Cache DRAM.
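A toy model of the idea, assuming a simplified interface in which each thread keeps its own open row per bank; the class and method names are illustrative, not the paper's hardware:

```python
from collections import defaultdict

class TRBBank:
    """Toy model of one DRAM bank with a per-thread row buffer (TRB-style)."""

    def __init__(self):
        self.open_row = defaultdict(lambda: None)   # thread id -> currently open row

    def access(self, thread_id, row):
        if self.open_row[thread_id] == row:
            return "row_hit"            # served from this thread's own row buffer
        self.open_row[thread_id] = row  # activate the row into the thread's buffer
        return "row_miss"

# With a single shared row buffer, two threads streaming over different rows would
# force an activate on nearly every access; with TRBs each thread keeps its row open.
bank = TRBBank()
print(bank.access(0, 10), bank.access(1, 20), bank.access(0, 10))  # row_miss row_miss row_hit
```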

16 citations

01 Jan 2004
TL;DR: This dissertation proposes a hardware/software cooperative approach that combines the global yet imperfect view of the compiler with the timely yet narrow-scope context of the hardware to overcome the memory performance bottleneck.
Abstract: The memory system remains a major performance bottleneck in modern and future architectures. In this dissertation, we propose a hardware/software cooperative approach and demonstrate its effectiveness. This approach combines the global yet imperfect view of the compiler with the timely yet narrow-scope context of the hardware. It relies on a light-weight extension to the instruction set architecture to convey compile-time knowledge (hints) to the hardware. The hardware then uses these hints to make better decisions. Our work shows that a cooperative hardware/software approach to (1) cache replacement, (2) prefetching, and (3) their combination eliminates or tolerates much of the memory performance bottleneck. (1) Our work enhances cache replacement decisions using compiler hints. The compiler detects which data will or will not be reused and annotates loads accordingly. The compiler sets one bit (the evict-me bit) to denote a preferred eviction candidate. On a miss, the cache replacement algorithm preferentially replaces a cache line with its evict-me bit set. Otherwise, it follows the LRU policy. The evict-me replacement scheme improves cache replacement decisions and is effective in both L1 and L2 caches. (2) We also use compiler hints to direct aggressive hardware region prefetching and content-aware pointer prefetching. The original SRP (scheduled region prefetching) engine queues prefetching requests on every outstanding L2 miss and tolerates latencies at the cost of dramatically increasing the memory traffic. GRP (guided region prefetching) enhances SRP by restricting prefetching to compiler-marked loads. Our compiler algorithms effectively mark spatial reuses across the SPEC CPU2000 benchmarks, and thus GRP achieves the performance of SRP with only one eighth of the additional traffic. (3) The evict-me cache replacement scheme helps alleviate the side effects of cache pollution introduced by useless region prefetches. The combination of evict-me caching and region prefetching further improves cache performance. These results demonstrate significant promise for overcoming the memory bottleneck with cooperative hardware/software techniques.
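The evict-me replacement decision described above can be sketched as follows; the per-line dictionary layout is an illustrative assumption, not the dissertation's hardware state:

```python
def choose_victim(set_lines):
    """Pick a replacement victim for one cache set (evict-me bit with LRU fallback).

    set_lines: list of dicts like {"tag": ..., "evict_me": bool, "last_used": int},
    an illustrative structure standing in for the per-line cache state.
    """
    marked = [line for line in set_lines if line["evict_me"]]
    if marked:
        # Prefer a compiler-marked line: it was annotated as unlikely to be reused.
        return min(marked, key=lambda line: line["last_used"])
    # Otherwise fall back to ordinary LRU replacement.
    return min(set_lines, key=lambda line: line["last_used"])
```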

16 citations

Proceedings ArticleDOI
16 Sep 2002
TL;DR: This paper proposes a novel framework in which data prefetchers are trained off-line using smart learning algorithms to produce prediction models that capture hidden memory access patterns; the authors believe the framework is amenable to other predictors and can be applied as a phase of a profiling-optimizing compiler.
Abstract: An important technique for alleviating the memory bottleneck is data prefetching. Data prefetching solutions have been proposed ranging from pure software approaches, which insert prefetch instructions through program analysis, to purely hardware mechanisms. The degree of success of those techniques depends on the nature of the application. The need for innovative approaches is growing rapidly with the introduction of applications, such as object-oriented applications, that show dynamically changing memory access behavior. In this paper, we propose a novel framework for the use of data prefetchers that are trained off-line using smart learning algorithms to produce prediction models that capture hidden memory access patterns. Once built, those prediction models are loaded into a data prefetching unit in the CPU at the appropriate point during runtime to drive the prefetching. On average, using a table of about 8 KB, our proposed learning method achieved a prediction accuracy of about 68%, and performance was boosted by about 37% on the benchmarks we tested. Furthermore, we believe our proposed framework is amenable to other predictors and can be applied as a phase of the profiling-optimizing compiler.
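One plausible shape for such an off-line-trained prediction table is a delta-history model loaded into a prefetch unit at run time. The table encoding and API below are assumptions for illustration, not the paper's design:

```python
class TablePrefetcher:
    """Toy table-driven prefetcher loaded with an off-line-trained model."""

    def __init__(self, model, history_len=2):
        self.model = model              # dict: recent address deltas -> predicted next delta
        self.history_len = history_len
        self.history = ()
        self.last_addr = None

    def observe(self, addr):
        """Feed one miss address; return a prefetch address, or None if no prediction."""
        prefetch = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            self.history = (self.history + (delta,))[-self.history_len:]
            predicted = self.model.get(self.history)
            if predicted is not None:
                prefetch = addr + predicted
        self.last_addr = addr
        return prefetch

# Model "trained" off-line on a strided pattern: after two +64 deltas, expect another +64.
pf = TablePrefetcher({(64, 64): 64})
for a in (0, 64, 128, 192):
    print(pf.observe(a))   # None, None, 192, 256
```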

16 citations

09 Aug 1993
TL;DR: This paper describes the use of hardware-assisted access ordering on a uniprocessor system; the technique combines compile-time detection of memory access patterns with a memory subsystem that decouples the order of requests generated by the processor from the order issued to the memory system.
Abstract: Memory bandwidth is rapidly becoming the performance bottleneck in the application of high-performance microprocessors to vector-like algorithms, including the “grand challenge” scientific problems. Caching is not the sole solution for these applications due to the poor temporal and spatial locality of their data accesses. Moreover, the nature of memories themselves has changed. Achieving greater bandwidth requires exploiting the characteristics of memory components “on the other side of the cache”; they should not be treated as uniform access-time RAM. This paper describes the use of hardware-assisted access ordering on a uniprocessor system. Our technique combines compile-time detection of memory access patterns with a memory subsystem that decouples the order of requests generated by the processor from the order issued to the memory system. This decoupling permits the requests to be issued in an order that optimizes use of the memory system. We present numerous simulation results showing significant speedup on important scientific kernels.
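A minimal sketch of the decoupling idea, assuming a simple row-granular address mapping; the constants and function are illustrative, not the paper's memory subsystem. Buffered stream requests are issued grouped by DRAM row rather than in program order:

```python
from collections import defaultdict

ROW_BYTES = 1024   # illustrative DRAM page (row) size

def reorder_for_memory(requests):
    """Reorder buffered stream requests so accesses to the same DRAM row issue together.

    'requests' are byte addresses in the processor's natural (loop) order.
    """
    by_row = defaultdict(list)
    for addr in requests:
        by_row[addr // ROW_BYTES].append(addr)
    issued = []
    for row in sorted(by_row):     # drain one row at a time (fast-page-mode friendly)
        issued.extend(by_row[row])
    return issued

# Two interleaved streams (e.g. x[i] and y[i]) alternate between rows in program order;
# the decoupled ordering batches each row's accesses, as the access-ordering hardware would.
x = [8 * i for i in range(4)]            # falls in row 0
y = [1024 + 8 * i for i in range(4)]     # falls in row 1
program_order = [a for pair in zip(x, y) for a in pair]
print(reorder_for_memory(program_order))  # [0, 8, 16, 24, 1024, 1032, 1040, 1048]
```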

16 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
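A back-of-the-envelope illustration of the claimed rough equivalence, using hypothetical numbers: an n-way superscalar machine and a degree-n superpipelined machine expose the same ideal peak throughput.

```python
def instructions_per_second(issue_width, cycle_time_ns):
    """Ideal peak throughput, ignoring stalls and dependences."""
    return issue_width / (cycle_time_ns * 1e-9)

base_cycle = 10.0  # ns, hypothetical base machine

superscalar    = instructions_per_second(issue_width=2, cycle_time_ns=base_cycle)      # 2 issues per cycle
superpipelined = instructions_per_second(issue_width=1, cycle_time_ns=base_cycle / 2)  # half the cycle time
print(superscalar, superpipelined)  # both 2e8 instructions/s: the same ILP is exploited either way
```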

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
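A small sketch of the inclusion invariant and the back-invalidation needed to maintain it; sets of block addresses stand in for the caches, and this ignores the block-size and associativity conditions the paper analyzes:

```python
def inclusion_holds(l1_blocks, l2_blocks):
    """Multilevel inclusion invariant: every block cached in L1 is also present in L2."""
    return l1_blocks <= l2_blocks

def evict_from_l2(block, l1_blocks, l2_blocks):
    """Back-invalidate L1 when L2 drops a block, so inclusion keeps holding.

    Without this step, coherence traffic filtered at L2 could miss state still in L1.
    """
    l2_blocks.discard(block)
    l1_blocks.discard(block)

l1, l2 = {1, 2}, {1, 2, 3, 4}
evict_from_l2(2, l1, l2)
print(inclusion_holds(l1, l2))   # True
```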

236 citations