Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: This paper presents hardware techniques for improving cache performance: a small fully-associative cache placed between a direct-mapped cache and its refill path (miss and victim caching), and stream buffers that hold prefetched lines outside the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss rather than the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
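
As an illustration of the victim-caching idea described above, here is a minimal Python sketch (not from the paper or its tooling) of a direct-mapped cache backed by a small fully-associative victim cache; class names, sizes, and the replacement details are invented for the example.

```python
from collections import OrderedDict

class VictimCachedL1:
    """Toy model of a direct-mapped L1 plus a small fully-associative victim cache."""

    def __init__(self, num_sets=256, victim_entries=4):
        self.num_sets = num_sets
        self.lines = [None] * num_sets        # line address held by each direct-mapped set
        self.victim = OrderedDict()           # line address -> None, kept in LRU order
        self.victim_entries = victim_entries
        self.hits = self.victim_hits = self.misses = 0

    def _victim_insert(self, line_addr):
        if line_addr is None:
            return
        self.victim[line_addr] = None
        self.victim.move_to_end(line_addr)
        while len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)   # drop the least recently used victim entry

    def access(self, line_addr):
        index = line_addr % self.num_sets
        if self.lines[index] == line_addr:
            self.hits += 1                    # ordinary L1 hit
            return "hit"
        if line_addr in self.victim:
            # Short-penalty case: swap the victim-cache line with the conflicting L1 line.
            del self.victim[line_addr]
            self._victim_insert(self.lines[index])
            self.lines[index] = line_addr
            self.victim_hits += 1
            return "victim_hit"
        # Full miss: fetch from the next level; the displaced L1 line becomes the new victim.
        self.misses += 1
        self._victim_insert(self.lines[index])
        self.lines[index] = line_addr
        return "miss"

# Conflicting addresses that map to the same set no longer thrash:
cache = VictimCachedL1(num_sets=256, victim_entries=4)
for addr in [0, 256, 0, 256, 0]:
    cache.access(addr)
print(cache.hits, cache.victim_hits, cache.misses)   # 0 3 2
```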


Citations
Journal ArticleDOI
TL;DR: A hybrid of two AI approaches, case-based reasoning (CBR) and artificial neural networks (ANN), is applied to improve the predictive performance of cache prefetching, yielding higher prediction accuracy at a reduced level of associated costs.
Abstract: The cache, being the fastest medium in the memory hierarchy, has a vital role to play in fully exploiting available resources, concealing latencies in I/O operations, mitigating the impact of these latencies, and hence improving system response time. Despite plenty of effort, caches alone cannot accommodate larger storage requirements without prefetching. Cache prefetching is speculatively fetching data to hide these delays. However, effective prefetching requires a strong prediction mechanism to load relevant data with a high degree of accuracy. In order to improve the predictive performance of cache prefetching, we applied a hybrid of two AI approaches, case-based reasoning (CBR) and artificial neural networks (ANN). CBR maintains the past experience, and ANNs are used in the adaptation phase of CBR instead of employing a static rule base. The novelty of the technique in this domain lies in the hybrid of the two approaches as well as the use of a suffix tree in populating the CBR's case base. Suffix trees provide rich data patterns for populating the case base and greatly enhance overall performance. A number of evaluations from different aspects with varying parameters are presented (along with some findings), where the efficacy of our technique is affirmed with improved predictive accuracy and a reduced level of associated costs.
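
The paper's CBR/ANN pipeline is not reproduced here, but the role its suffix tree plays, mining repeated access patterns from a trace to populate the case base, can be approximated by a much simpler pattern table. The Python sketch below uses invented names and a toy trace, purely to show the flavor of pattern-driven prefetch prediction.

```python
from collections import defaultdict

def build_pattern_table(trace, context=2):
    """Map each length-`context` window of block addresses to the blocks that followed it."""
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(trace) - context):
        key = tuple(trace[i:i + context])
        table[key][trace[i + context]] += 1
    return table

def predict_next(table, recent, context=2):
    """Prefetch candidate = most frequent successor of the most recent window, if any."""
    key = tuple(recent[-context:])
    followers = table.get(key)
    if not followers:
        return None
    return max(followers, key=followers.get)

trace = [1, 2, 3, 1, 2, 3, 1, 2, 4, 1, 2, 3]
table = build_pattern_table(trace)
print(predict_next(table, [1, 2]))   # 3: seen after (1, 2) more often than 4
```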

7 citations

Proceedings ArticleDOI
26 Mar 2019
TL;DR: FUSE as mentioned in this paper integrates spin-transfer torque magnetic random access memory (STT-MRAM) into the on-chip L1D cache to minimize the number of outgoing memory accesses over the interconnection network.
Abstract: In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU's multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts a read-level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks over a small portion of SRAM in the L1D cache. To further reduce the off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating the associativity with the limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references by 32% across the interconnection network, thereby improving the overall performance by 217% and reducing energy cost by 53%.
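
A rough sketch of the placement idea described above: read-mostly (WORM) blocks live in a large STT-MRAM partition, write-heavy blocks in a small SRAM partition. FUSE's actual predictor uses GPU runtime information and approximate associativity, which this toy Python model does not attempt to reproduce; all names, sizes, and the eviction rule are illustrative.

```python
class HybridL1D:
    """Toy two-partition L1D: STT-MRAM region for read-mostly blocks, small SRAM region
    for write-heavy blocks. A simple write counter stands in for FUSE's read-level predictor."""

    def __init__(self, sram_entries=8, sttmram_entries=64):
        self.sram = {}                        # block -> writes observed
        self.sttmram = {}
        self.sram_entries = sram_entries
        self.sttmram_entries = sttmram_entries

    def _evict_if_full(self, part, limit):
        if len(part) >= limit:
            part.pop(next(iter(part)))        # FIFO-ish eviction, purely illustrative

    def fill(self, block, predicted_worm):
        if predicted_worm:
            self._evict_if_full(self.sttmram, self.sttmram_entries)
            self.sttmram[block] = 0
        else:
            self._evict_if_full(self.sram, self.sram_entries)
            self.sram[block] = 0

    def write(self, block):
        # A block that turns out to be written repeatedly is migrated out of STT-MRAM,
        # since STT-MRAM writes are comparatively slow and energy-hungry.
        if block in self.sttmram:
            self.sttmram.pop(block)
            self._evict_if_full(self.sram, self.sram_entries)
            self.sram[block] = 1
        elif block in self.sram:
            self.sram[block] += 1
```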

7 citations

Proceedings ArticleDOI
17 Dec 1998
TL;DR: Three cooperative cache techniques to assist data prefetching are proposed: default prefetching to increase the overall prefetch coverage; a block concept to perform variable-distance lookahead prefetching; and a spatial data buffer with load balancing to reduce the interference between spatial data and temporal data.
Abstract: Recent research in data cache prefetching is found to be selective in nature: achieving high prediction accuracy over a set of selected references such as array accesses with constant strides. As a result, for applications where the memory latency is mainly due to data accesses in the set of non-selected references of a program, these schemes lose their effectiveness. In fact, their performance might be worse than that of the traditional, less accurate prefetch-on-miss scheme. To overcome this situation, we propose three cooperative cache techniques to assist data prefetching: (1) default prefetching to increase the overall prefetch coverage; (2) a block concept to perform variable-distance lookahead prefetching; and (3) a spatial data buffer with load balancing to reduce the interference between spatial data and temporal data. To illustrate the potential of these techniques, they were implemented on top of our previously proposed Instruction Opcode-Based Prefetching (IOBP) scheme (T.F. Chen, 1993). Trace-driven simulation on SPEC92 showed that an 8 Kbyte data cache with a 512 byte spatial buffer can achieve performance similar to a 32 Kbyte data cache through these techniques.
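
A hedged sketch of the "default prefetching" fallback described above: when the selective predictor has no hint for a reference, the prefetcher still issues sequential next-line prefetches so that coverage does not collapse on references the predictor cannot classify. Function and parameter names are illustrative, not from the paper.

```python
def prefetch_candidates(miss_addr, predictor_hint=None, lookahead=1, line_size=64):
    """Return prefetch addresses for a miss.

    If the selective predictor (e.g. a constant-stride detector) supplies a stride hint,
    follow it; otherwise fall back to default next-line prefetching.
    """
    if predictor_hint is not None:
        return [miss_addr + predictor_hint * i for i in range(1, lookahead + 1)]
    return [miss_addr + line_size * i for i in range(1, lookahead + 1)]

print(prefetch_candidates(0x1000))                      # default: next line, 0x1040
print(prefetch_candidates(0x1000, predictor_hint=256))  # stride hint: 0x1100
```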

7 citations

Book ChapterDOI
17 Sep 2002
TL;DR: It is shown that the deterministic competitive ratio for this problem is (n+1)(k+1) - 1, and the randomized competitive ratio is O(log n log k) and Ω(log n + log k).
Abstract: This paper is concerned with online caching algorithms for the (n, k)-companion cache, defined by Brehob et al. [3]. In this model the cache is composed of two components: a k-way set-associative cache and a companion fully-associative cache of size n. We show that the deterministic competitive ratio for this problem is (n+1)(k+1) - 1, and the randomized competitive ratio is O(log n log k) and Ω(log n + log k).
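
A minimal Python model of the (n, k)-companion cache structure being analyzed: a k-way set-associative part plus an n-entry fully-associative companion, with a request hitting if the line is in either part. Plain LRU is used here as one concrete replacement policy; the paper's competitive analysis concerns online policies in general, which this sketch does not capture.

```python
class CompanionCache:
    """Toy (n, k)-companion cache: k-way set-associative main part + n-entry companion."""

    def __init__(self, num_sets, k, n):
        self.sets = [[] for _ in range(num_sets)]   # each set: line addresses in LRU order
        self.companion = []                          # line addresses in LRU order
        self.k, self.n = k, n

    def access(self, line_addr):
        s = self.sets[line_addr % len(self.sets)]
        if line_addr in s:
            s.remove(line_addr); s.append(line_addr)           # hit in set-associative part
            return True
        if line_addr in self.companion:
            self.companion.remove(line_addr); self.companion.append(line_addr)
            return True                                         # hit in companion part
        # Miss: fill the set-associative part, spilling its LRU line into the companion.
        if len(s) >= self.k:
            self.companion.append(s.pop(0))
            if len(self.companion) > self.n:
                self.companion.pop(0)
        s.append(line_addr)
        return False
```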

7 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: … design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
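
As a worked illustration of one of the surveyed trade-offs, store-through versus copy-back updating of main memory, the toy calculation below counts words transferred to memory under each policy; the figures are invented, not taken from the survey.

```python
def write_traffic(writes_to_cached_lines, dirty_line_evictions, line_words=8):
    """Store-through sends every CPU write to memory (one word per write);
    copy-back writes a full line back only when a dirty line is evicted."""
    store_through = writes_to_cached_lines
    copy_back = dirty_line_evictions * line_words
    return store_through, copy_back

print(write_traffic(writes_to_cached_lines=10_000, dirty_line_evictions=300))
# (10000, 2400): copy-back moves far less data when writes hit repeatedly in the cache
```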

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
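
Assuming the average degree of superpipelining is, roughly, the dynamic-frequency-weighted average operation latency in cycles (an interpretation of the metric, not a quote from the paper), a small worked example:

```python
def avg_degree_of_superpipelining(op_mix):
    """Frequency-weighted average operation latency in cycles (assumed reading of the metric)."""
    return sum(freq * latency for freq, latency in op_mix)

# Hypothetical dynamic instruction mix: (fraction of instructions, latency in cycles)
mix = [(0.4, 1),    # ALU ops
       (0.3, 2),    # loads
       (0.2, 1),    # branches
       (0.1, 3)]    # multiplies / FP
print(avg_degree_of_superpipelining(mix))   # ~1.5: the machine already behaves as ~1.5-way superpipelined
```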

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
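
A back-of-the-envelope model (assumed here, not taken from the paper) of how hiding miss latency through prefetching translates into effective CPU speedup:

```python
def effective_speedup(cpi_base, miss_rate, miss_penalty, miss_coverage):
    """Simple CPI model: prefetching hides a fraction `miss_coverage` of all miss penalties."""
    cpi_before = cpi_base + miss_rate * miss_penalty
    cpi_after = cpi_base + miss_rate * miss_penalty * (1 - miss_coverage)
    return cpi_before / cpi_after

print(effective_speedup(cpi_base=1.0, miss_rate=0.05, miss_penalty=10, miss_coverage=0.6))
# ~1.25: with these made-up numbers, hiding 60% of miss latency buys about 25% effective speed
```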

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
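
The basic mechanism that keeps a two-level hierarchy inclusive, back-invalidating the first-level cache whenever the second level evicts a line, can be sketched as follows. This shows only the mechanism, not the paper's necessary and sufficient conditions on associativity and block sizes; the structure and sizes are invented.

```python
class InclusiveTwoLevel:
    """Toy two-level hierarchy that preserves inclusion (every L1 line is also in L2)
    by back-invalidating L1 when L2 evicts, so coherence traffic need only probe L2."""

    def __init__(self, l1_entries=4, l2_entries=16):
        self.l1, self.l2 = [], []               # line addresses in LRU order
        self.l1_entries, self.l2_entries = l1_entries, l2_entries

    def access(self, line_addr):
        if line_addr in self.l1:
            self.l1.remove(line_addr); self.l1.append(line_addr)
            return
        if line_addr in self.l2:
            self.l2.remove(line_addr); self.l2.append(line_addr)
        else:
            if len(self.l2) >= self.l2_entries:
                evicted = self.l2.pop(0)
                if evicted in self.l1:          # back-invalidate to preserve inclusion
                    self.l1.remove(evicted)
            self.l2.append(line_addr)
        if len(self.l1) >= self.l1_entries:
            self.l1.pop(0)                      # the line stays in L2, so inclusion still holds
        self.l1.append(line_addr)
```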

236 citations