Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache placed between a cache and its refill path (miss and victim caching), and stream buffers that hold prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
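To make the two mechanisms concrete, here is a minimal Python sketch of a victim cache and a single stream buffer, assuming a simple line-addressed model; the class and method names, line size, and fetch callback are illustrative, not taken from the paper.

    from collections import OrderedDict, deque

    LINE = 32  # assumed cache line size in bytes

    class VictimCache:
        """Tiny fully-associative cache holding lines evicted from the main cache."""
        def __init__(self, entries=4):
            self.entries = entries
            self.lines = OrderedDict()        # tag -> data, kept in LRU order

        def swap_in(self, tag):
            """On a main-cache miss that hits here, remove and return the line."""
            return self.lines.pop(tag, None)

        def insert(self, tag, data):
            """Store the victim of a main-cache miss, evicting the LRU entry if full."""
            self.lines[tag] = data
            if len(self.lines) > self.entries:
                self.lines.popitem(last=False)

    class StreamBuffer:
        """FIFO of sequentially prefetched lines, filled starting at a miss address."""
        def __init__(self, depth=4):
            self.depth = depth
            self.fifo = deque()

        def allocate(self, miss_addr, fetch):
            """On a cache miss, prefetch the lines following the miss address."""
            self.fifo.clear()
            for i in range(1, self.depth + 1):
                addr = miss_addr + i * LINE
                self.fifo.append((addr, fetch(addr)))

        def lookup(self, addr, fetch):
            """Only the head entry is checked; a hit moves the line toward the cache."""
            if self.fifo and self.fifo[0][0] == addr:
                _, data = self.fifo.popleft()
                next_addr = (self.fifo[-1][0] if self.fifo else addr) + LINE
                self.fifo.append((next_addr, fetch(next_addr)))   # keep the buffer full
                return data
            return None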


Citations
Proceedings ArticleDOI
01 Dec 2010
TL;DR: A coordinated strategy, called Bimodal Set Balancing Cache, reduces both capacity and conflict misses by changing the placement and insertion policies of the cache; it reduced the average miss rate of a baseline 2MB 8-way second-level cache by 16%, which translated into an average IPC improvement of 4.8%.
Abstract: The well-known memory wall problem has motivated wide research in the design of caches. Last-level caches, whose misses can stall the processor for hundreds of cycles, have received particular attention. Strategies to adaptively modify the cache insertion, promotion, eviction and even placement policies have been proposed, some techniques being better at reducing different kinds of misses. For example, changes in the placement policy of a cache, which are a natural option to reduce conflict misses, can do little to fight capacity misses, which depend on the relation between the working set of the application and the cache size. Nevertheless, other techniques such as the recently proposed dynamic insertion policy (DIP), whose aim is to retain a fraction of the working set in the cache when it is larger than the cache size, attack primarily capacity misses. In this paper we present a coordinated strategy to reduce both capacity and conflict misses by changing the placement and insertion policies of the cache. Our strategy takes its decisions based on the concept of the Set Saturation Level (SSL), which tries to measure to which degree a set can hold its working set. Despite requiring less than 1% storage overhead, our proposal, called Bimodal Set Balancing Cache, reduced the average miss rate of a baseline 2MB 8-way second-level cache by 16%, which translated into an average IPC improvement of 4.8% in our experiments.
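A rough Python sketch of the saturation-tracking idea described in the abstract; the counter width, threshold, and the way the insertion/placement policy reacts are assumptions for illustration, not the mechanism defined in the paper.

    class SetSaturationLevel:
        """Per-set saturating counters estimating whether a set can hold its working set."""
        def __init__(self, num_sets, max_level=15, threshold=12):
            self.level = [0] * num_sets
            self.max_level = max_level
            self.threshold = threshold          # assumed cut-off for "saturated"

        def on_access(self, set_idx, hit):
            if hit:                             # hits suggest the set holds its working set
                self.level[set_idx] = max(0, self.level[set_idx] - 1)
            else:                               # misses suggest the set is over-subscribed
                self.level[set_idx] = min(self.max_level, self.level[set_idx] + 1)

        def saturated(self, set_idx):
            return self.level[set_idx] >= self.threshold

    # A saturated set could then borrow capacity from a lightly used partner set
    # (placement change) or insert new lines with low retention priority
    # (insertion change); both reactions are illustrative here.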

8 citations

Proceedings ArticleDOI
07 Nov 2002
TL;DR: A two-level cache system that exploits both temporal and spatial localities effectively is proposed as the cache structure for a RAID system and, according to the results of simulation, the hit ratio and hit times can be improved.
Abstract: In a RAID system, the cache is one of the important factors that can affect general system performance. As a two-level cache usually brings better performance than a one-level cache in the processors of personal computers and embedded systems, a two-level cache system that exploits both temporal and spatial localities effectively is proposed as the cache structure for a RAID system. The proposed cache system consists of two layers of caches, i.e., a set associative cache with small block size and a fully associative spatial cache with large block size. According to the results of simulation, the hit ratio and hit times can be improved with the two-level cache structure.
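A hedged sketch of how a lookup might flow through such a two-level structure, with a small-block cache for temporal locality backed by a large-block fully-associative spatial cache; the block sizes, dictionary-based caches, and disk_read callback are assumptions.

    SMALL_BLOCK = 4 * 1024      # assumed small block size for the set-associative level
    LARGE_BLOCK = 64 * 1024     # assumed large block size for the spatial cache

    def lookup(addr, temporal_cache, spatial_cache, disk_read):
        """temporal_cache / spatial_cache: dict-like tag -> bytes; disk_read(offset, size) -> bytes."""
        small_tag = addr // SMALL_BLOCK
        large_tag = addr // LARGE_BLOCK
        if small_tag in temporal_cache:                     # hit in the small-block cache
            return temporal_cache[small_tag]
        if large_tag in spatial_cache:                      # hit in the large-block spatial cache
            block = spatial_cache[large_tag]
            offset = small_tag * SMALL_BLOCK - large_tag * LARGE_BLOCK
            small = block[offset:offset + SMALL_BLOCK]
            temporal_cache[small_tag] = small               # promote for temporal reuse
            return small
        block = disk_read(large_tag * LARGE_BLOCK, LARGE_BLOCK)   # miss in both levels
        spatial_cache[large_tag] = block                    # large block captures spatial locality
        return lookup(addr, temporal_cache, spatial_cache, disk_read)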

8 citations

Proceedings ArticleDOI
04 Oct 2009
TL;DR: This paper investigates quantitatively the performance impact of faults using a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor on which the authors execute SPEC2000 integer benchmarks, and provides extensive fault-simulation-based experimental results.
Abstract: Towards improving performance, modern microprocessors incorporate a variety of architectural features, such as branch prediction and speculative execution, which are not critical to the correctness of their operation. While faults in the corresponding hardware may not necessarily affect functional correctness, they may, nevertheless, adversely impact performance. In this paper, we investigate quantitatively the performance impact of such faults using a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. We provide extensive fault simulation-based experimental results and we discuss how this information may guide the inclusion of additional hardware for performance loss recovery and yield enhancement.

8 citations

Proceedings ArticleDOI
20 Apr 2009
TL;DR: Two complementary techniques are proposed to address the problem of harmful prefetches in the context of shared-L2-based CMPs; evaluated using two embedded application codes, they extract significant benefits from software prefetching even with large core counts.
Abstract: Chip multiprocessors (CMPs) present a unique scenario for software data prefetching with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip pin bandwidth. Purely software based prefetching techniques tend to increase this contention, leading to degradation in performance. In some cases, prefetches can become harmful by kicking out useful data from the shared cache whose next usage is earlier than the prefetched data, and the fraction of such harmful prefetches usually increases when we increase the number of cores used for executing a multi-threaded application code. In this paper, we propose two complementary techniques to address the problem of harmful prefetches in the context of shared L2 based CMPs. These techniques, namely, suppressing select data prefetches (if they are found to be harmful) and pinning select data in the L2 cache (if they are found to be frequent victims of harmful prefetches), are evaluated in this paper using two embedded application codes. Our experiments demonstrate that these two techniques are very effective in mitigating the impact of harmful prefetches, and as a result, we extract significant benefits from software prefetching even with large core counts.
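A small illustrative sketch of the two ideas (suppressing prefetch targets identified as harmful, and pinning their frequent victims in the shared L2); the counters and thresholds are assumptions, not the paper's mechanism.

    from collections import defaultdict

    SUPPRESS_AFTER = 3   # assumed: suppress a prefetch target after this many harmful uses
    PIN_AFTER = 3        # assumed: pin a line after it is victimized this many times

    harmful_count = defaultdict(int)   # prefetch target -> times it evicted sooner-needed data
    victim_count = defaultdict(int)    # cache line -> times it was evicted by a harmful prefetch
    pinned = set()                     # lines the L2 replacement policy must not evict

    def on_harmful_prefetch(prefetch_line, evicted_line):
        """Record that a prefetched line displaced data whose next use came first."""
        harmful_count[prefetch_line] += 1
        victim_count[evicted_line] += 1
        if victim_count[evicted_line] >= PIN_AFTER:
            pinned.add(evicted_line)           # keep frequent victims resident in L2

    def should_issue_prefetch(prefetch_line):
        """Suppress prefetches that have repeatedly proved harmful."""
        return harmful_count[prefetch_line] < SUPPRESS_AFTER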

8 citations

01 Jan 2009
TL;DR: This paper analyzes the performance of several alternatives that can be considered for a NUCA model according to the four policies that determine its behavior: bank placement, bank access, bank migration and bank replacement.
Abstract: Non-Uniform Cache Architectures (NUCA) have been proposed as a solution to overcome wire delays that will dominate on-chip latencies in Chip Multiprocessor designs in the near future. This novel means of organization divides the total memory area into a set of banks that provides non-uniform access latencies and thus faster access to those banks that are close to the processor. A NUCA model can be characterized according to the four policies that determine its behavior: bank placement, bank access, bank migration and bank replacement. Placement determines the first location of data, access defines the searching algorithm across the banks, migration decides data movements inside the memory and replacement deals with the evicted data. This paper analyzes the performance of several alternatives that can be considered for each of these four policies. Moreover, the Parsec benchmark suite has been used to handle this evaluation because it is a representative group of upcoming shared-memory programs for Chip Multiprocessors. The results may help researchers to identify key features of NUCA organizations and to open up new areas of investigation.
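A Python skeleton that names the four policies as hooks; the concrete choices below (static hash placement, nearest-first search, one-bank-closer migration, arbitrary eviction) are illustrative placeholders rather than the alternatives evaluated in the paper.

    class NUCABanks:
        def __init__(self, bank_latencies, bank_capacity=256):
            self.banks = [dict() for _ in bank_latencies]   # each bank: tag -> data
            self.latency = list(bank_latencies)             # per-bank access latency (non-uniform)
            self.capacity = bank_capacity

        def place(self, tag, data):
            """Placement policy: new data initially lands in a statically mapped bank."""
            bank_id = hash(tag) % len(self.banks)
            if len(self.banks[bank_id]) >= self.capacity:
                self.replace(bank_id)
            self.banks[bank_id][tag] = data
            return bank_id

        def access(self, tag):
            """Access policy: search banks nearest-first; return (data, latency) or None."""
            for bank_id, bank in enumerate(self.banks):
                if tag in bank:
                    data = bank[tag]
                    self.migrate(tag, bank_id)              # migration: pull hot data closer
                    return data, self.latency[bank_id]
            return None                                     # miss in all banks

        def migrate(self, tag, bank_id):
            """Migration policy: move a hit line one bank closer (capacity checks omitted)."""
            if bank_id > 0:
                self.banks[bank_id - 1][tag] = self.banks[bank_id].pop(tag)

        def replace(self, bank_id):
            """Replacement policy: evict an arbitrary line from a full bank."""
            victim = next(iter(self.banks[bank_id]))
            self.banks[bank_id].pop(victim)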

8 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
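Several of the listed parameters (fetch policy, placement and replacement, line size) are exactly what a small trace-driven simulator varies; a minimal sketch, assuming demand fetch and LRU replacement, with all names illustrative:

    from collections import OrderedDict

    def miss_ratio(trace, cache_size, line_size, assoc):
        """Trace-driven miss ratio for a set-associative cache with LRU replacement."""
        num_sets = cache_size // (line_size * assoc)
        sets = [OrderedDict() for _ in range(num_sets)]     # one LRU-ordered dict per set
        misses = 0
        for addr in trace:
            line = addr // line_size
            s = sets[line % num_sets]                       # placement: index by line address
            if line in s:
                s.move_to_end(line)                         # hit: update LRU order
            else:
                misses += 1                                  # demand fetch on a miss
                s[line] = True
                if len(s) > assoc:
                    s.popitem(last=False)                    # replacement: evict the LRU line
        return misses / len(trace)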

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism, and the average degree of superpipelining metric is introduced; simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
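As a hedged illustration of a frequency-weighted latency metric in the spirit of the average degree of superpipelining (the exact definition is given in the cited paper), with made-up latencies and instruction mix:

    # Illustrative only: operation latencies (cycles) weighted by dynamic frequency.
    op_latency = {"alu": 1, "load": 2, "branch": 2, "fp": 3}
    op_mix     = {"alu": 0.5, "load": 0.25, "branch": 0.15, "fp": 0.1}

    avg_degree = sum(op_latency[op] * op_mix[op] for op in op_mix)
    print(avg_degree)   # 1.6 for this example mix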

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
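A minimal sketch of the prefetch-on-every-reference (one-block-lookahead) idea, assuming a dictionary-based cache and a fetch callback; both are illustrative.

    LINE = 64   # assumed line size in bytes

    def reference(addr, cache, fetch):
        """Service a reference and always prefetch the sequentially next line."""
        line = addr // LINE
        if line not in cache:
            cache[line] = fetch(line)      # demand fetch on a miss
        nxt = line + 1
        if nxt not in cache:
            cache[nxt] = fetch(nxt)        # prefetch the next line on every reference
        return cache[line]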

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
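A short sketch of enforcing inclusion when block sizes differ between levels: every line present in L1 must also be present in L2, so an L2 eviction back-invalidates all L1 blocks it covers. The block sizes and dictionary-based caches are assumptions.

    L1_BLOCK, L2_BLOCK = 32, 128            # assumed block sizes, L2 blocks larger than L1

    def evict_from_l2(l2_tag, l1, l2):
        """Evict an L2 block and back-invalidate every L1 block it contains."""
        l2.pop(l2_tag, None)
        ratio = L2_BLOCK // L1_BLOCK
        for i in range(ratio):
            l1.pop(l2_tag * ratio + i, None)

    def fill_l1(addr, l1, l2, fetch):
        """Fill L2 before (or together with) L1 so inclusion always holds."""
        l1_tag, l2_tag = addr // L1_BLOCK, addr // L2_BLOCK
        if l2_tag not in l2:
            l2[l2_tag] = fetch(l2_tag)
        l1[l1_tag] = l2[l2_tag]             # the relevant sub-block would be extracted here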

236 citations