Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache placed between a cache and its refill path (miss and victim caching), and stream buffers that hold prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
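As a rough illustration of the victim-caching idea described in the abstract, the sketch below models a direct-mapped cache backed by a small fully-associative victim cache. The line size, cache size, and victim-cache depth are illustrative assumptions, not figures from the paper.

```python
# Minimal trace-driven sketch of a direct-mapped cache backed by a small
# victim cache. Parameter values are assumptions chosen for illustration.
from collections import deque

LINE = 32           # bytes per cache line (assumed)
SETS = 256          # direct-mapped L1 with 256 lines (assumed)
VICTIM_ENTRIES = 4  # small fully-associative victim cache

l1 = [None] * SETS                      # tag stored per direct-mapped set
victims = deque(maxlen=VICTIM_ENTRIES)  # FIFO of recently evicted lines

def access(addr):
    """Return 'hit', 'victim_hit', or 'miss' for a byte address."""
    line = addr // LINE
    index = line % SETS
    tag = line // SETS
    if l1[index] == tag:
        return "hit"
    if (tag, index) in victims:
        # Victim hit: promote the line into L1 and demote the current occupant.
        victims.remove((tag, index))
        if l1[index] is not None:
            victims.append((l1[index], index))
        l1[index] = tag
        return "victim_hit"             # only a one-cycle penalty in the paper's model
    # True miss: fetch from the next level and save the displaced line.
    if l1[index] is not None:
        victims.append((l1[index], index))
    l1[index] = tag
    return "miss"

# Two addresses that conflict in the direct-mapped cache but coexist via the victim cache.
trace = [0x0000, 0x2000, 0x0000, 0x2000]
print([access(a) for a in trace])       # ['miss', 'miss', 'victim_hit', 'victim_hit']
```

On a victim hit the requested line and the displaced line swap places, which is why the conflicting pair in the example trace stops missing after its first round trip.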


Citations
Proceedings ArticleDOI
12 Oct 1997
TL;DR: This work proposes two schemes for implementing higher associativity: the sequential multi-column cache, which is an extension of the column-associative cache, and the parallel multi-column cache, both of which achieve the low miss rate of a 4-way set-associative cache.
Abstract: We propose two schemes for implementing higher associativity: the sequential multi-column cache, which is an extension of the column associative cache, and the parallel multi-column cache. In order to achieve the same access cycle time as that of a direct-mapped cache, data memory in the cache is organized into one bank in both schemes. We use the multiple MRU block technique to increase the first hit ratio, thus reducing the average access time. While the parallel multi-column cache performs the tag checking in parallel, the sequential multi-column cache sequentially searches through places in a set, and uses index information to filter out unnecessary probes. In the case of an associativity of 4, they both achieve the low miss rate of a 4-way set-associative cache. Our simulation results using ATUM traces show that both schemes can effectively reduce the average access time.
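A highly simplified model of the sequential-probing idea (the MRU way of a set is checked first, the remaining ways are searched in order) is sketched below. The cache geometry and the probe-count accounting are assumptions for illustration, and the index-based filtering described in the abstract is omitted.

```python
# Simplified model of sequential probing with a per-set MRU hint.
WAYS = 4
SETS = 64
LINE = 32

sets = [{"tags": [None] * WAYS, "mru": 0} for _ in range(SETS)]

def access(addr):
    """Return the number of ways probed before a hit, or None on a miss."""
    line = addr // LINE
    s = sets[line % SETS]
    tag = line // SETS
    order = [s["mru"]] + [w for w in range(WAYS) if w != s["mru"]]
    for probes, way in enumerate(order, start=1):
        if s["tags"][way] == tag:
            s["mru"] = way        # a hit updates the MRU way for the set
            return probes
    # Miss: place the fetched line in the MRU position (simple placement policy).
    s["tags"][s["mru"]] = tag
    return None

print(access(0x0000), access(0x0000))   # None (miss), then 1 (first-probe hit in the MRU way)
```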

8 citations

Proceedings ArticleDOI
01 Dec 1997
TL;DR: A new kind of prediction cache is introduced, which combines the features of prefetching and victim caching, and is shown to be more effective at reducing miss rate and improving performance than existing prediction caches.
Abstract: Processor cycle times are currently much faster than memory cycle times, and this gap continues to increase. Adding a high speed cache memory allows the processor to run at full speed, as long as the data it needs is present in the cache. However, memory latency still affects performance in the case of a cache miss. Prediction caches use a history of recent cache misses to predict future misses and to reduce the overall cache miss rate. This paper describes several prediction caches, and introduces a new kind of prediction cache, which combines the features of prefetching and victim caching. This new cache is shown to be more effective at reducing miss rate and improving performance than existing prediction caches.
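The abstract does not detail the predictor's organization, so the following is only a generic sketch of the underlying idea of predicting future misses from recent miss history: remember consecutive miss pairs and, when a miss repeats, prefetch the line that followed it last time into a small buffer held outside the main cache. Table and buffer sizes are arbitrary assumptions.

```python
# Generic miss-history predictor sketch (not the paper's specific design).
from collections import OrderedDict

PREDICTOR_ENTRIES = 64
BUFFER_ENTRIES = 8

next_miss = OrderedDict()   # last missed line -> line that missed after it
prefetch_buf = OrderedDict()
last_miss = None

def on_miss(line):
    """Record miss history and return the line predicted to miss next, if any."""
    global last_miss
    if last_miss is not None:
        next_miss[last_miss] = line
        if len(next_miss) > PREDICTOR_ENTRIES:
            next_miss.popitem(last=False)        # evict the oldest history entry
    last_miss = line
    predicted = next_miss.get(line)
    if predicted is not None:
        prefetch_buf[predicted] = True           # model a prefetch into the side buffer
        if len(prefetch_buf) > BUFFER_ENTRIES:
            prefetch_buf.popitem(last=False)
    return predicted

print([on_miss(x) for x in (0x10, 0x54, 0x10, 0x54)])   # [None, None, 84, 16]: repeats become predictable
```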

8 citations

Proceedings ArticleDOI
17 Jun 2006
TL;DR: Spim-cache, a simple execution-driven cache simulator for carrying out such experiments, is introduced, intended for use in undergraduate courses; the tool allows, in an intuitive and easy way, selecting a given cache organization and running the proposed code step by step while visualizing dynamic changes in the cache's state.
Abstract: Cache memories are the most ubiquitous mechanisms devoted to hiding memory latencies in current microprocessors. Because of this importance, they are a core topic in computer architecture curricula, in both graduate and undergraduate courses. As a consequence, traditional literature and current educational proposals devote substantial effort to this topic. In this context, exercises dealing with simple algorithms, also known as code-based exercises, have good acceptance among instructors because they let students see how the accesses generated by programs affect the cache's state. For about a decade, simulators have been extensively employed as a valuable pedagogical tool, as they enable students to visualize how computer units work and interact with each other. Unfortunately, there is no simple simulator that allows code-based exercises to be performed for cache memories, so students work through these exercises with the classic "paper and pencil" methodology. In this paper we introduce Spim-cache, a simple execution-driven cache simulator for carrying out such experiments, intended for use in undergraduate courses. The tool allows, in an intuitive and easy way, selecting a given cache organization and running the proposed code step by step while visualizing dynamic changes in the cache's state.

8 citations

Proceedings ArticleDOI
20 May 2014
TL;DR: A new prefetching technique, Multiple Stream Tracker (MST), is proposed that improves on the state of the art by identifying strided accesses in a cache miss stream and has lower average memory bandwidth requirements than prior techniques.
Abstract: Data prefetching is a very important technique for hiding memory latency and improving performance in modern computer processors. Existing techniques are not able to find all or the best data streams to prefetch. This paper proposes a new prefetching technique, Multiple Stream Tracker (MST), that improves on the state of the art by identifying strided accesses in a cache miss stream. Targeting the lower levels of the cache hierarchy, it searches for the best among all possible strided streams to prefetch. A technique to efficiently search and rank multiple strided streams is proposed. The proposed technique can identify streams that subsume the streams generated by both delta-correlated and standard stride prefetchers. The MST prefetcher can also significantly improve performance in parallel programs. The Multiple Stream Tracker applied at the L3 cache improves the IPC by up to 173% (14% on average) over stride prefetching for SPEC CPU2006 benchmarks. The improvement is up to 92% over delta correlation (5% on average). The speedup for SPEComp programs is up to 300% over delta correlation (22% on average). MST also has lower average memory bandwidth requirements than prior techniques.
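A hedged sketch of the core idea, detecting a dominant stride in the recent miss stream and prefetching along it, is shown below. The actual MST search and ranking mechanism is more elaborate; the window size and prefetch degree here are illustrative assumptions.

```python
# Toy stride detector over a window of recent miss addresses.
from collections import Counter, deque

WINDOW = 16   # recent miss addresses considered (assumed)
DEGREE = 4    # lines prefetched along the chosen stride (assumed)

recent = deque(maxlen=WINDOW)

def on_miss(addr):
    """Return the addresses to prefetch for this miss (possibly empty)."""
    history = list(recent) + [addr]
    recent.append(addr)
    strides = Counter(b - a for a, b in zip(history, history[1:]) if b != a)
    if not strides:
        return []
    stride, _ = strides.most_common(1)[0]    # highest-scoring stride wins
    return [addr + stride * i for i in range(1, DEGREE + 1)]

for a in (0, 64, 128, 192):
    print(on_miss(a))    # once the 64-byte stride dominates, prefetches follow it
```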

8 citations

Journal ArticleDOI
TL;DR: Two approaches are investigated: Decoded Loop Instruction Cache based Prefetching (DLICP), which is most effective for loop-intensive applications, and DLICP enhanced with the popular existing Next Line Prefetching (NLP) for applications with a moderate number of loops.
Abstract: Instruction prefetching is an effective way to improve the performance of pipelined processors. However, existing instruction prefetching schemes increase performance at a significant energy cost, making them unsuitable for embedded and ubiquitous systems where both high performance and low energy consumption are demanded. This paper proposes reducing the energy overhead of instruction prefetching by using a simple hardware/software design and an efficient prefetching operation scheme. Two approaches are investigated: Decoded Loop Instruction Cache based Prefetching (DLICP), which is most effective for loop-intensive applications, and DLICP enhanced with the popular existing Next Line Prefetching (NLP) for applications with a moderate number of loops. The experimental results show that both DLICP and the enhanced DLICP deliver improved performance at a much reduced energy overhead.
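Next Line Prefetching (NLP) is the simpler of the two schemes and is easy to model: whenever instruction line N is fetched, line N+1 is brought into a small prefetch buffer so that sequential fall-through does not miss. The sketch below models only this NLP component (the buffer size is an assumption); the decoded loop instruction cache itself is hardware-specific and not modelled here.

```python
# Minimal next-line instruction prefetch sketch.
from collections import OrderedDict

BUF_ENTRIES = 4                # prefetch buffer depth (assumed)
prefetch_buf = OrderedDict()   # instruction lines fetched ahead of demand

def fetch_line(line):
    """Return 'prefetch_hit' or 'miss', then queue the next sequential line."""
    if line in prefetch_buf:
        result = "prefetch_hit"
        del prefetch_buf[line]
    else:
        result = "miss"        # demand fetch from the instruction cache/memory
    prefetch_buf[line + 1] = True
    if len(prefetch_buf) > BUF_ENTRIES:
        prefetch_buf.popitem(last=False)
    return result

print([fetch_line(n) for n in (100, 101, 102, 200)])
# ['miss', 'prefetch_hit', 'prefetch_hit', 'miss']
```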

7 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
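Several of the surveyed parameters (cache size, line size, associativity) jointly determine how an address is split into tag, set index, and line offset. A small helper with arbitrary example sizes makes that relationship concrete:

```python
# Illustrative address decomposition; the sizes are arbitrary examples, not
# values taken from the survey.
def split_address(addr, cache_bytes=8192, line_bytes=32, ways=2):
    sets = cache_bytes // (line_bytes * ways)
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset

print(split_address(0x1A2B4))   # (tag, set index, byte offset within the line)
```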

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
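The inclusion property itself is easy to state in code: every line resident in a cache level closer to the CPU must also be resident in the level behind it. The toy check below illustrates the property, not the paper's conditions for maintaining it; the line addresses are made-up examples.

```python
# Toy multilevel-inclusion check over sets of resident line addresses.
def inclusion_holds(l1_lines, l2_lines):
    """True if every L1-resident line is also resident in L2."""
    return set(l1_lines) <= set(l2_lines)

print(inclusion_holds({0x100, 0x240}, {0x100, 0x240, 0x380}))   # True
print(inclusion_holds({0x100, 0x500}, {0x100, 0x240, 0x380}))   # False: 0x500 missing from L2
```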

236 citations