Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path (miss and victim caching), and stream buffers place prefetched data in a buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
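The victim-caching mechanism is simple enough to capture in a few lines of simulation. Below is a minimal sketch, assuming single-word lines, a toy address trace, and FIFO replacement in the victim cache; the class and parameter names are illustrative, not taken from the paper.

```python
# Minimal sketch of a direct-mapped cache backed by a small fully-associative
# victim cache. Names and sizes are hypothetical, chosen for illustration.
from collections import deque

class VictimCachedL1:
    def __init__(self, num_sets, victim_entries=4):
        self.num_sets = num_sets
        self.sets = [None] * num_sets                 # direct-mapped: one tag per set
        self.victims = deque(maxlen=victim_entries)   # small fully-associative, FIFO
        self.hits = self.victim_hits = self.misses = 0

    def access(self, line_addr):
        index = line_addr % self.num_sets
        if self.sets[index] == line_addr:
            self.hits += 1                            # ordinary L1 hit
        elif line_addr in self.victims:
            # The one-cycle-penalty case: swap the line back into L1.
            self.victim_hits += 1
            self.victims.remove(line_addr)
            if self.sets[index] is not None:
                self.victims.append(self.sets[index])
            self.sets[index] = line_addr
        else:
            # Full miss: the displaced line (the victim) goes to the victim cache.
            self.misses += 1
            if self.sets[index] is not None:
                self.victims.append(self.sets[index])
            self.sets[index] = line_addr

# Two lines that conflict in a direct-mapped cache but fit in the victim cache:
cache = VictimCachedL1(num_sets=8, victim_entries=2)
for _ in range(100):
    cache.access(0)   # maps to set 0
    cache.access(8)   # also maps to set 0: a classic conflict pair
print(cache.hits, cache.victim_hits, cache.misses)
```

After the first two compulsory misses, every reference that would have been a conflict miss in the plain direct-mapped cache is serviced by the victim cache instead, which is the effect the abstract describes.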


Citations
Proceedings ArticleDOI
19 Jul 1998
TL;DR: This work proposes a different approach to cache analysis, viewing caches as filters, and presents two new metrics for analyzing cache behavior: instantaneous hit rate and instantaneous locality.
Abstract: As the processor-memory performance gap continues to grow, so does the need for effective tools and metrics to guide the design of efficient memory hierarchies to bridge that gap. Aggregate statistics of cache performance can be useful for comparison, but they give us little insight into how to improve the design of a particular component. We propose a different approach to cache analysis, viewing caches as filters, and present two new metrics for analyzing cache behavior: instantaneous hit rate and instantaneous locality. We demonstrate how these measures can give us insight into the reference pattern of an executing program, and show an application of these measures in analyzing the effectiveness of the second level cache of a particular memory hierarchy.
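The paper's exact definitions are not reproduced here, but one plausible reading of instantaneous hit rate, as a windowed rather than aggregate statistic, can be sketched as follows; the window size and the boolean-trace representation are assumptions.

```python
# Hedged sketch: instantaneous hit rate read as the hit rate over a sliding
# window of the reference stream, rather than an aggregate over the whole run.
from collections import deque

def instantaneous_hit_rate(hit_flags, window=1000):
    """hit_flags: iterable of booleans, one per reference (True = cache hit).
    Returns one rate per reference, computed over the trailing window."""
    recent = deque(maxlen=window)
    rates = []
    for hit in hit_flags:
        recent.append(hit)
        rates.append(sum(recent) / len(recent))
    return rates
```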

29 citations

Proceedings ArticleDOI
15 Sep 2007
TL;DR: A simple probabilistic filtering mechanism based on random sampling is suggested to identify and select the frequently used blocks, and it is shown that a 16K direct-mapped L1 cache, augmented with a fully-associative 2K filter, achieves on average over 10% more instructions per cycle than a regular 16K, 4-way set-associative cache, and even ~5% more IPC than a 32K, 4-way cache.
Abstract: Distinguishing transient blocks from frequently used blocks enables servicing references to transient blocks from a small fully-associative auxiliary cache structure. By inserting only frequently used blocks into the main cache structure, we can reduce the number of conflict misses, thus achieving higher performance and allowing the use of direct-mapped caches, which offer lower power consumption and lower access latencies. We suggest using a simple probabilistic filtering mechanism based on random sampling to identify and select the frequently used blocks. Furthermore, by using a small direct-mapped lookup table to cache the most recently accessed blocks in the auxiliary cache, we eliminate the vast majority of the costly fully-associative lookups. Finally, we show that a 16K direct-mapped L1 cache, augmented with a fully-associative 2K filter, achieves on average over 10% more instructions per cycle than a regular 16K, 4-way set-associative cache, and even ~5% more IPC than a 32K, 4-way cache, while consuming 70%-80% less dynamic power than either of them.
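A hedged sketch of the filtering idea described above: on a reference that misses in the main cache, the block is promoted into the direct-mapped main cache only with a small sampling probability, so blocks referenced many times are eventually promoted while transient blocks tend to live and die in the small fully-associative filter. The structure sizes and the probability are illustrative, not the paper's configuration.

```python
# Sketch of probabilistic insertion filtering via random sampling.
import random
from collections import OrderedDict

class FilteredCache:
    def __init__(self, main_sets=512, filter_entries=64, p_insert=1 / 32):
        self.main = [None] * main_sets
        self.filter = OrderedDict()             # small fully-associative, LRU order
        self.filter_entries = filter_entries
        self.p_insert = p_insert

    def access(self, line_addr):
        index = line_addr % len(self.main)
        if self.main[index] == line_addr:
            return "main hit"
        if line_addr in self.filter:
            self.filter.move_to_end(line_addr)  # refresh LRU position
            outcome = "filter hit"
        else:
            outcome = "miss"
        if random.random() < self.p_insert:
            self.main[index] = line_addr        # sampled: promote to main cache
            self.filter.pop(line_addr, None)
        else:
            self.filter[line_addr] = None       # keep it in the filter
            if len(self.filter) > self.filter_entries:
                self.filter.popitem(last=False) # evict the LRU filter entry
        return outcome
```

Frequently used blocks are sampled eventually (the chance of never being promoted shrinks geometrically with reference count), which is the statistical property the mechanism relies on.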

29 citations

Journal ArticleDOI
27 Sep 2003
TL;DR: The case is made for a shared I-cache organization in a CMP, instead of the traditional approach of using a dedicated I-cache per processor; this results in an improvement in miss rate over a dedicated cache organization for the same total capacity.
Abstract: Due to their large code footprint, OLTP workloads suffer from significant I-cache miss rates on contemporary microprocessors. This paper analyzes the I-stream behavior of an OLTP workload, called the Oracle Database Benchmark (ODB), on Chip-Multiprocessors (CMP). Our results show that, although the overall code footprint of ODB is large, multiple ODB threads running concurrently on multiple processors tend to access common code segments frequently, thus exhibiting significant constructive sharing. In fact, in a CMP system, an I-cache shared between multiple processors incurs a similar miss rate as a dedicated I-cache per processor where the per-processor I-cache has the same capacity as the shared I-cache. Based on these observations, this paper makes the case for a shared I-cache organization in a CMP, instead of the traditional approach of using a dedicated I-cache per processor. Furthermore, this paper shows that the OLTP code stream exhibits good spatial locality. Adding a simple dedicated Line Buffer per processor can exploit this spatial locality effectively, to reduce latency and bandwidth requirements on the shared cache. The proposed shared I-cache organization results in an improvement of at least 5X in miss rate over a dedicated cache organization, for the same total capacity.
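The Line Buffer mentioned above exploits sequential fetch within a cache line. A minimal sketch, assuming one buffered line per processor and purely address-based behavior; names and sizes are illustrative, not the paper's design.

```python
# Sketch of a per-processor line buffer in front of a shared I-cache: the
# buffer holds the single most recently fetched line, so sequential fetches
# within that line never generate traffic to the shared cache.
class LineBuffer:
    def __init__(self, line_size=64):
        self.line_size = line_size
        self.current_line = None

    def fetch(self, pc):
        line = pc // self.line_size
        if line == self.current_line:
            return True                # served locally: no shared-cache access
        self.current_line = line       # miss: fetch this line from the shared I-cache
        return False
```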

29 citations

Proceedings ArticleDOI
09 Jun 2007
TL;DR: An analytical model, based on random population of an ownership table by concurrently executing transactions, is presented that correctly predicts the trends in measured data; the results call into question the viability of an optimization that can undermine the scalability and concurrency claims of software transactional memory.
Abstract: Many word-based Software Transactional Memory systems (STMs) have been proposed using tagless ownership tables, where read and write permissions are granted at the granularity of all addresses that map to a given ownership table entry. This optimization to reduce overhead potentially results in false conflicts. Using address traces from a multithreaded program, we demonstrate that the frequency of these false conflicts grows superlinearly with both the TM data footprint and concurrency and that increasing the size of the ownership table results in only a sub-linear reduction in conflict rate. These somewhat surprising relationships have a theoretical foundation that is also responsible for the (naively) unintuitive statistical result generally referred to as the "Birthday Paradox." We present an analytical model based on random population of an ownership table by concurrently executing transactions that correctly predicts the trends in measured data. These results call into question the viability of such an optimization that can undermine the scalability and concurrency claims of software transactional memory.
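The birthday-paradox connection can be made concrete with the standard collision approximation: if k distinct addresses hash uniformly into an ownership table of n entries, the probability of at least one collision is roughly 1 - exp(-k(k-1)/2n). The sketch below uses that textbook formula, not a model taken verbatim from the paper.

```python
# Birthday-paradox approximation for false conflicts in a tagless ownership
# table: k distinct addresses hashed uniformly into n entries.
import math

def collision_probability(k, n):
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * n))

for n in (2**14, 2**16, 2**18):       # ownership table sizes
    for k in (100, 200, 400):          # concurrent transactional footprint
        print(f"n={n:6d} k={k:3d}  P={collision_probability(k, n):.3f}")
# Doubling k roughly quadruples the exponent (superlinear growth in conflicts),
# while growing n only shrinks it proportionally (sub-linear relief), matching
# the trends the abstract reports.
```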

29 citations

Proceedings ArticleDOI
28 Sep 2003
TL;DR: This paper investigates a range of memory architectures that can be used to implement a wide range of packet classification caches and shows that small levels of associativity can result in enormous performance gains, and that replacement policies can give modest performance improvements for under-provisioned caches.
Abstract: Emerging network applications require packet classification at line speed on multiple header fields. Fast packet classification requires a careful attention to memory resources due to the size and speed limitations in SRAM and DRAM memory used to implement the function. In this paper, we investigate a range of memory architectures that can be used to implement a wide range of packet classification caches. In particular, we examine their performance under real network traces in order to identify features that have the greatest impact. Through experiments, we show that a cache's associativity, replacement policy, and hash function all contribute in varying magnitudes to the cache's overall performance. Specifically, we show that small levels of associativity can result in enormous performance gains, that replacement policies can give modest performance improvements for under-provisioned caches, and that faster, less complex hashes can improve overall cache performance.
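A hedged sketch of the kind of structure being evaluated: a set-associative flow cache keyed by a hash of the packet's header fields, with LRU replacement within each set. The hash, sizes, and policy here are illustrative stand-ins for the parameters the paper varies.

```python
# Sketch of a set-associative packet-classification cache with per-set LRU.
from collections import OrderedDict

class FlowCache:
    def __init__(self, num_sets=256, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order

    def lookup(self, flow_key):
        # flow_key: e.g. a 5-tuple (src, dst, sport, dport, proto)
        s = self.sets[hash(flow_key) % self.num_sets]
        if flow_key in s:
            s.move_to_end(flow_key)   # LRU update on hit
            return s[flow_key]
        return None                   # miss: fall back to full classification

    def insert(self, flow_key, decision):
        s = self.sets[hash(flow_key) % self.num_sets]
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict the LRU entry from this set
        s[flow_key] = decision
```

Raising `ways` above 1 is exactly the "small levels of associativity" knob the abstract credits with the largest gains.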

29 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
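The inclusion property itself is easy to state in code: every block resident in the L1 cache must also be resident in the L2 cache, so coherence traffic need only probe L2. A toy checker and enforcement hook, assuming equal block sizes at both levels (the paper also treats differing block sizes); names are illustrative.

```python
# Inclusion property over sets of resident line addresses: L1 contents must be
# a subset of L2 contents at all times.
def inclusion_holds(l1_lines, l2_lines):
    return set(l1_lines) <= set(l2_lines)

# Enforcing it in a simulation: when L2 evicts a line, L1 must invalidate its
# copy too (back-invalidation), otherwise the subset relation is broken.
def on_l2_evict(line, l1_lines):
    l1_lines.discard(line)
```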

236 citations