Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in stream buffers rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
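To make the victim-caching idea above concrete, here is a minimal Python sketch of a direct-mapped cache backed by a small fully-associative victim cache. The line size, set count, entry count, and FIFO replacement are illustrative assumptions, not the paper's exact parameters.

    from collections import deque

    LINE_BYTES = 32        # assumed line size
    NUM_SETS = 256         # assumed number of direct-mapped sets
    VICTIM_ENTRIES = 4     # small fully-associative victim cache

    cache = [None] * NUM_SETS                 # one tag per direct-mapped set
    victim = deque(maxlen=VICTIM_ENTRIES)     # holds (tag, index) pairs, FIFO (assumed)

    def access(addr):
        """Classify an access as 'hit', 'victim_hit', or 'miss'."""
        block = addr // LINE_BYTES
        index = block % NUM_SETS
        tag = block // NUM_SETS
        if cache[index] == tag:
            return "hit"
        if (tag, index) in victim:
            # Swap the line back in from the victim cache (roughly a one-cycle
            # penalty in the paper, versus a many-cycle refill from the next level).
            victim.remove((tag, index))
            if cache[index] is not None:
                victim.append((cache[index], index))
            cache[index] = tag
            return "victim_hit"
        # True miss: the evicted victim line, not the requested line, is what
        # gets placed in the small fully-associative cache.
        if cache[index] is not None:
            victim.append((cache[index], index))
        cache[index] = tag
        return "miss"

Stream buffers would sit alongside such a structure, prefetching successive lines after a miss into their own FIFO rather than into the cache itself.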


Citations
Patent
07 Jun 2006
TL;DR: In this paper, a device and method are illustrated that prefetch information based on the location of the instruction whose execution caused a cache miss; prefetch information recorded from previous cache misses is used to generate prefetch requests for the current cache miss.
Abstract: A device and method is illustrated to prefetch information based on a location of an instruction that resulted in a cache miss during its execution. The prefetch information to be accessed is determined based on previous and current cache miss information. For example, information based on previous cache misses is stored at data records as prefetch information. This prefetch information includes location information based on an instruction that caused a previous cache miss, and is accessed to generate prefetch requests for a current cache miss. The prefetch information is updated based on current cache miss information.
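The patent abstract does not spell out a data structure, so the following is only a speculative Python sketch of one way such prefetch records could work: a table keyed by the program counter of the missing instruction remembers the miss address that followed it last time, and that address is prefetched on the next miss by the same instruction. The table organization and the single-successor policy are assumptions.

    # Hypothetical sketch: correlate cache misses with the location (PC) of the
    # instruction that missed, and prefetch the address recorded for that PC.
    prefetch_table = {}   # miss PC -> address of the miss that followed it last time
    last_miss_pc = None

    def on_cache_miss(pc, miss_addr, issue_prefetch):
        global last_miss_pc
        # Learn: record which address followed the previously missing instruction.
        if last_miss_pc is not None:
            prefetch_table[last_miss_pc] = miss_addr
        # Predict: if this instruction missed before, prefetch what followed it then.
        predicted = prefetch_table.get(pc)
        if predicted is not None:
            issue_prefetch(predicted)
        last_miss_pc = pc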

8 citations

Journal ArticleDOI
TL;DR: The underutilised NoC router buffers and the unused trace buffers are exploited to store recently evicted cache blocks to reduce the miss penalty and achieve improvement in overall system performance.
Abstract: Due to limited on-chip caching, data-driven applications with a large memory footprint encounter frequent cache misses. Such applications suffer a recurring miss penalty when they re-reference recently evicted cache blocks. To meet worst-case performance requirements, Network-on-Chip (NoC) routers are provisioned with input port buffers. However, recent studies reveal that these buffers remain underutilised except during network congestion. Trace buffers are Design-for-Debug (DfD) hardware employed in NoC routers for post-silicon debug and validation. Nevertheless, they become non-functional once a design goes into production and remain unused in the routers. In this article, we exploit the underutilised NoC router buffers and the unused trace buffers to store recently evicted cache blocks. While these blocks are stored in the buffers, future re-references to them can be answered from the NoC router. Such opportunistic caching of evicted blocks in NoC routers significantly reduces the miss penalty. Experimental analysis shows that the proposed architectures can achieve up to a 21 percent (16 percent on average) reduction in miss penalty and a 19 percent (14 percent on average) improvement in overall system performance. While we incur a negligible area and leakage power overhead of 2.58 and 3.94 percent, respectively, dynamic power reduces by 6.12 percent due to the improvement in performance.
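A rough Python sketch of the opportunistic eviction caching described above, with the idle router and trace buffer slots modelled as a single small LRU store; the slot count and replacement policy are assumptions for illustration.

    from collections import OrderedDict

    ROUTER_SLOTS = 8                 # assumed number of idle router/trace buffer slots
    router_store = OrderedDict()     # block address -> data, in LRU order (assumed)

    def on_eviction(block_addr, data):
        """Park an evicted cache block in an otherwise idle router buffer."""
        if block_addr in router_store:
            router_store.move_to_end(block_addr)
        elif len(router_store) >= ROUTER_SLOTS:
            router_store.popitem(last=False)   # drop the least recently parked block
        router_store[block_addr] = data

    def on_miss(block_addr):
        """Reply from the router buffer if the block is still parked there."""
        if block_addr in router_store:
            return router_store.pop(block_addr)   # much cheaper than going off-chip
        return None                               # fall through to the next level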

8 citations

01 Jan 2000
TL;DR: Results in this paper show that RAMpage scales better than a standard second-level cache, because the number of DRAM references is lower, and allows the possibility of taking a context switch on a miss, which is shown to further improve scalability.
Abstract: The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. As the CPU-DRAM speed gap grows, it is expected that the RAMpage approach should become more viable. Results in this paper show that RAMpage scales better than a standard second-level cache, because the number of DRAM references is lower. Further, RAMpage allows the possibility of taking a context switch on a miss, which is shown to further improve scalability. The paper also suggests that memory wall work ought to include the TLB, which can represent a significant fraction of execution time. With context switches on misses, the speed improvement at an 8 GHz instruction issue rate is 62% over a standard 2-level cache hierarchy.
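As a toy illustration of the context-switch-on-miss idea, here is a Python sketch of the decision a RAMpage-style hierarchy might make when its SRAM main memory misses and a page must be fetched from DRAM; the latency and switch-cost numbers are assumptions, not figures from the paper.

    DRAM_PAGE_FETCH = 200   # assumed cycles to bring a page in from DRAM
    SWITCH_COST = 50        # assumed cycles to switch to another runnable thread

    def handle_main_memory_miss(runnable_threads):
        """Decide whether to overlap the DRAM page fetch with other work."""
        if runnable_threads and SWITCH_COST < DRAM_PAGE_FETCH:
            return "context_switch"   # hide most of the DRAM latency behind useful work
        return "stall"                # nothing else to run: pay the full fetch latency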

8 citations

Proceedings ArticleDOI
03 Oct 2016
TL;DR: This paper conjectures that tuning a well-known prefetch mechanism for hybrid memories can achieve substantial performance improvement, and tests this by comparing the state-of-the-art CAMEO migration policy with a Markov-like prefetcher for a hybrid memory consisting of HBM (3D-stacked DRAM) and Phase Change Memory (PCM), using a set of SPEC CPU2006 and several HPC benchmarks.
Abstract: The promise of 3D-stacked memory solving the memory wall has led to many emerging architectures that integrate 3D-stacked memory into processor memory in a variety of ways, including systems that utilize different memory technologies, with different performance and power characteristics, to comprise the system memory. It then becomes necessary to manage these memories such that we get the performance of the fastest memory while having the capacity of the slower but larger memories. Some research in industry and academia has proposed using 3D-stacked DRAM as a hardware-managed cache. More recently, particularly pushed by the demands for ever larger capacities, researchers are exploring the use of multiple memory technologies as a single main memory. The main challenge for such flat-address-space memories is the placement and migration of memory pages to increase the number of requests serviced from faster memory, as well as managing the overhead due to page migrations. In this paper we ask a different question: can traditional prefetching be a viable solution for effective management of hybrid memories? We conjecture that by tuning a well-known prefetch mechanism for hybrid memories we can achieve substantial performance improvement. To test our conjecture, we compared the state-of-the-art CAMEO migration policy with a Markov-like prefetcher for a hybrid memory consisting of HBM (3D-stacked DRAM) and Phase Change Memory (PCM) using a set of SPEC CPU2006 and several HPC benchmarks. We find that CAMEO provides better performance improvement than prefetching for two-thirds of the workloads (by 59%) and prefetching is better than CAMEO for the remaining one-third (by 19%). The EDP analysis shows that the prefetching solution improves EDP over the no-prefetching baseline whereas CAMEO does worse in terms of average EDP. These results indicate that prefetching should be reconsidered as a supplementary technique to data migration.
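The paper describes its prefetcher only as Markov-like, so the sketch below is a generic Markov prefetcher in Python rather than the authors' exact design: each miss address records the misses that tended to follow it, and the most frequent successors are prefetched into the fast memory. The prefetch degree and table structure are assumptions.

    from collections import defaultdict, Counter

    successors = defaultdict(Counter)   # miss address -> histogram of following misses
    prev_miss = None
    PREFETCH_DEGREE = 2                 # assumed number of candidates issued per miss

    def on_miss(addr, issue_prefetch):
        global prev_miss
        if prev_miss is not None:
            successors[prev_miss][addr] += 1          # learn the observed transition
        for candidate, _ in successors[addr].most_common(PREFETCH_DEGREE):
            issue_prefetch(candidate)                 # e.g. stage the block into HBM
        prev_miss = addr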

8 citations

Proceedings ArticleDOI
20 Apr 2009
TL;DR: A cache organization called the clean/dirty cache (CD-cache) is proposed that combines the properties of write-back and write-through, which avoids unnecessary transfers for recurring writes, while restricting the number of dirty lines to a hard limit.
Abstract: Caches often employ write-back instead of write-through, since write-back avoids unnecessary transfers for multiple writes to the same block. For several reasons, however, it is undesirable that a significant number of cache lines will be marked "dirty". Energy-efficient cache organizations, for example, often apply techniques that resize, reconfigure, or turn off (parts of) the cache. In such cache organizations, dirty lines have to be written back before the cache is reconfigured. The delay imposed by these write-backs or the required additional logic and buffers can significantly reduce the attained energy savings. A cache organization called the clean/dirty cache (CD-cache) is proposed that combines the properties of write-back and write-through. It avoids unnecessary transfers for recurring writes, while restricting the number of dirty lines to a hard limit. Detailed experimental results show that the CD-cache reduces the number of dirty lines significantly, while achieving similar or better performance. We also use the CD-cache to implement cache decay. Experimental results show that the CD-cache attains similar or higher performance than a normal decay cache, while using a significantly less complex design.
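One way to read the write-back/write-through combination described above is sketched below in Python: writes are absorbed as dirty lines only up to a hard limit, and further writes are propagated immediately. This is an illustrative reading under stated assumptions, not the CD-cache's actual design, and eviction and write-back handling are omitted.

    MAX_DIRTY = 16        # assumed hard limit on the number of dirty lines
    dirty_lines = set()

    def on_write(line_addr, write_to_next_level):
        if line_addr in dirty_lines:
            return                          # recurring write: absorbed, no transfer
        if len(dirty_lines) < MAX_DIRTY:
            dirty_lines.add(line_addr)      # write-back behaviour below the limit
        else:
            write_to_next_level(line_addr)  # limit reached: behave like write-through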

8 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
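As a small aside on two of the design parameters listed above (cache size and line size), the sketch below shows how they determine the offset, index, and tag fields of an address in a direct-mapped cache; the specific sizes are arbitrary assumptions.

    CACHE_BYTES = 16 * 1024   # assumed cache size
    LINE_BYTES = 32           # assumed line size
    NUM_LINES = CACHE_BYTES // LINE_BYTES

    def split_address(addr):
        """Decompose an address into (tag, index, offset) for a direct-mapped cache."""
        offset = addr % LINE_BYTES
        index = (addr // LINE_BYTES) % NUM_LINES
        tag = addr // (LINE_BYTES * NUM_LINES)
        return tag, index, offset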

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
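One plausible formalization of the average degree of superpipelining mentioned above, stated here as our reading rather than as the paper's own definition, is the dynamic-frequency-weighted average operation latency:

    % Assumption: i ranges over operation classes, f_i is the fraction of dynamic
    % operations in class i, and \ell_i is that class's latency in cycles.
    \bar{S} = \sum_{i} f_i \, \ell_i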

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
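A worked example of how a claim like this can arise, with every number below an assumption rather than a figure from the cited article: if prefetching hides half of a miss penalty that otherwise adds 0.5 cycles per instruction, the effective speed-up is about 20 percent.

    cpi_base = 1.0       # assumed CPI with a perfect memory system
    miss_rate = 0.05     # assumed misses per instruction
    miss_penalty = 10    # assumed cycles per miss
    coverage = 0.5       # assumed fraction of the miss penalty hidden by prefetching

    cpi_without = cpi_base + miss_rate * miss_penalty
    cpi_with = cpi_base + miss_rate * miss_penalty * (1 - coverage)
    print(f"effective speed-up: {cpi_without / cpi_with:.2f}x")   # 1.20x here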

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
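A minimal Python sketch of the inclusion property in action, assuming the common back-invalidation approach rather than the paper's formal conditions: every block in L1 must also be present in L2, so an L2 eviction forces the block out of L1 as well.

    l1 = set()   # blocks resident in the first-level cache
    l2 = set()   # blocks resident in the second-level cache

    def fill(block):
        l2.add(block)      # fill L2 first so L1 always stays a subset of L2
        l1.add(block)

    def l2_evict(block):
        l2.discard(block)
        l1.discard(block)  # back-invalidate: inclusion would otherwise be violated

    def check_inclusion():
        return l1 <= l2    # the inclusion property: L1 contents are a subset of L2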

236 citations