Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in a dedicated buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
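As a concrete illustration of the victim-caching idea, here is a minimal behavioral sketch in Python (stream buffers are omitted); the line size, cache sizes, victim capacity, and the conflict-heavy trace are illustrative assumptions, not parameters from the paper.

```python
# Minimal behavioral sketch of a direct-mapped L1 backed by a small
# fully-associative victim cache. All sizes below are assumptions.
from collections import OrderedDict

LINE = 16          # bytes per line (assumed)
L1_LINES = 64      # direct-mapped L1 of 64 lines (assumed)
VICTIM_LINES = 4   # small capacities (1-5) are the regime studied in the paper

l1 = [None] * L1_LINES   # one tag per direct-mapped set
victim = OrderedDict()   # tag -> True, kept in LRU order

def access(addr):
    """Return 'hit', 'victim_hit', or 'miss' for one load."""
    tag = addr // LINE
    idx = tag % L1_LINES
    if l1[idx] == tag:
        return "hit"
    if tag in victim:                  # the short-penalty case: swap lines
        victim.pop(tag)
        if l1[idx] is not None:
            victim[l1[idx]] = True
        l1[idx] = tag
        return "victim_hit"
    # true miss: refill L1; the displaced line becomes the new victim
    if l1[idx] is not None:
        victim[l1[idx]] = True
        if len(victim) > VICTIM_LINES:
            victim.popitem(last=False)  # evict the LRU victim
    l1[idx] = tag
    return "miss"

# Two address streams whose lines alias to the same L1 sets: a classic
# ping-pong conflict pattern that would thrash a plain direct-mapped cache.
trace = []
for i in range(0, 1024, 4):
    trace += [i, i + L1_LINES * LINE]
stats = {"hit": 0, "victim_hit": 0, "miss": 0}
for a in trace:
    stats[access(a)] += 1
print(stats)
```

In this run the compulsory misses remain, but the conflict misses from the ping-pong pattern are absorbed almost entirely by the four-entry victim cache.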


Citations
Book ChapterDOI
TL;DR: Modern graphics processing units (GPUs) are well provisioned to support the concurrent execution of thousands of threads, but different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores.
Abstract: Modern graphics processing units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

5 citations

Journal ArticleDOI
TL;DR: The proposed Band-pass prefetching mechanism is a simple, low-overhead way to manage prefetchers in multicore systems; its local and global prefetcher aggressiveness control components together keep the flow of prefetch requests within a range of prefetch-to-demand request ratios.
Abstract: In multi-core systems, an application's prefetcher can interfere with the memory requests of other applications through shared resources such as the last-level cache and memory bandwidth. To minimize prefetcher-caused interference, prior mechanisms dynamically control prefetcher aggressiveness at runtime, using several parameters to capture prefetch usefulness as well as prefetcher-caused interference and to drive aggressiveness control decisions. However, these mechanisms do not capture the actual interference at the shared resources and often make incorrect aggressiveness control decisions, leaving scope for performance improvement. Toward this end, we propose a solution to manage prefetching in multicore systems. In particular, we make two fundamental observations: first, a positive correlation exists between the accuracy of a prefetcher and the amount of prefetch requests it generates relative to an application's total (demand and prefetch) requests; second, a strong positive correlation exists between the ratio of total prefetch to demand requests and the ratio of average last-level cache miss service times of demand to prefetch requests. In this article, we propose Band-pass prefetching, a simple and low-overhead mechanism that builds on those two observations to effectively manage prefetchers in multicore systems. Our solution consists of local and global prefetcher aggressiveness control components, which together control the flow of prefetch requests within a range of prefetch-to-demand request ratios. From our experiments on 16-core multi-programmed workloads, on systems using stream prefetching, we observe that Band-pass prefetching achieves a 12.4% (geometric-mean) improvement in harmonic speedup over a baseline with no prefetching, while aggressive prefetching without aggressiveness control and the state-of-the-art HPAC, P-FST, and CAFFEINE achieve 8.2%, 8.4%, 1.4%, and 9.7%, respectively. Further evaluation of Band-pass prefetching on systems using the AMPM prefetcher shows similar performance trends. For a 16-core system, Band-pass prefetching requires a modest hardware cost of only 239 bytes.
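As a rough sketch of the ratio-based idea (not the authors' mechanism), interval-based throttling that keeps the prefetch-to-total request ratio inside a band might look like the following; the band limits, degree bounds, and interval handling are assumptions for illustration.

```python
# Schematic ratio-based prefetch throttling in the spirit of band-pass
# control. Thresholds, bounds, and interval policy are invented values.
LOW, HIGH = 0.15, 0.60  # assumed band of prefetch/(demand+prefetch) ratios

class BandPassController:
    def __init__(self):
        self.demands = 0
        self.prefetches = 0
        self.degree = 2        # current prefetch degree (lines per trigger)

    def record(self, is_prefetch):
        """Count one request observed at the shared resource."""
        if is_prefetch:
            self.prefetches += 1
        else:
            self.demands += 1

    def adjust(self):
        """Once per interval: steer the ratio back inside the band."""
        total = self.demands + self.prefetches
        if total == 0:
            return self.degree
        ratio = self.prefetches / total
        if ratio > HIGH and self.degree > 1:
            self.degree -= 1   # too much prefetch traffic: throttle down
        elif ratio < LOW and self.degree < 8:
            self.degree += 1   # little prefetch traffic: ramp up
        self.demands = self.prefetches = 0   # start a new interval
        return self.degree
```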

5 citations

Patent
15 Feb 2016
TL;DR: In this article, a method for modeling memory access behavior and memory traffic timing behavior is described, which includes receiving data indicative of memory access behavior resulting from instructions executed on a processor.
Abstract: Systems and methods for modeling memory access behavior and memory traffic timing behavior are disclosed. According to an aspect, a method includes receiving data indicative of memory access behavior resulting from instructions executed on a processor. The method also includes determining a statistical profile of the memory access behavior, the profile including tuple statistics of memory access behavior. Further, the method includes generating a clone of the executed instructions based on the statistical profile for use in simulating the memory access behavior.
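Under one reading of the patent, assuming the "tuple statistics" are stride-transition counts (the representation is not specified in this abstract), a toy clone generator could look like:

```python
# Toy statistical cloning of an address trace: profile stride transitions,
# then sample a synthetic trace with matching statistics. The choice of
# (stride, next_stride) tuples is an assumption made for illustration.
import random
from collections import Counter

def profile(trace):
    """Tuple statistics (assumed form): counts of (stride, next_stride) pairs."""
    strides = [b - a for a, b in zip(trace, trace[1:])]
    return Counter(zip(strides, strides[1:]))

def clone(stats, start, length, seed=0):
    """Sample a synthetic trace whose stride transitions follow the profile."""
    rng = random.Random(seed)
    items = list(stats.items())
    prev = rng.choices([t for t, _ in items], [w for _, w in items])[0][0]
    out = [start, start + prev]
    while len(out) < length:
        # prefer tuples whose first stride matches the last emitted stride
        cands = [(t, w) for t, w in items if t[0] == prev] or items
        nxt = rng.choices([t for t, _ in cands], [w for _, w in cands])[0][1]
        out.append(out[-1] + nxt)
        prev = nxt
    return out

# e.g. a trace mixing a 64-byte forward walk and an 8-byte backward walk
mixed = list(range(0, 4096, 64)) + list(range(4096, 0, -8))
print(clone(profile(mixed), start=0, length=12))
```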

5 citations

Proceedings ArticleDOI
17 Jun 2001
TL;DR: A new compiler-directed renaming mechanism focused on prefetch instructions is presented; it can be implemented at very low area cost and should not impact cycle time. Results for a set of numerical applications show speedups of 5% to 22%, and in no case was performance degradation observed.
Abstract: The detection and correct handling of data and control dependencies constitutes one of the biggest issues in exposing ILP in current architectures. Ever-increasing memory latencies and working spaces of programs are making prefetching techniques crucial for attaining sustained high performance. Software prefetching allows the compiler to use information discovered at compile time to bring needed data in before it is used, thus hiding all or part of the latency to main memory. Renaming, on the other hand, is a technique that allows the hardware to break register naming dependencies, thus exposing more parallelism to the hardware. In this paper we present a new compiler-directed renaming mechanism focused on prefetch instructions. The compiler informs the hardware of the association between prefetch and load instructions, making it possible for the hardware to convert non-binding prefetches into binding prefetches, without the compile-time limitations that binding prefetching may have. The mechanism can be implemented at a very low cost in terms of area, and we believe it will not impact cycle time. The research presented in this paper is at an early stage; nevertheless, our results for a set of numerical applications show speedups of 5% to 22%, and in no case was performance degradation observed.

5 citations

Journal Article
TL;DR: While hybrid prefetching improves performance over a single prefetcher, it is demonstrated that the main benefit of adaptivity is the reduction in the number of prefetches without hurting performance.
Abstract: A set of hybrid and adaptive prefetching schemes are considered in this paper. The prefetchers are hybrid in that they use combinations of Stride, Sequential and C/DC prefetching mechanisms. The schemes are adaptive in that their aggressiveness is adjusted based on feedback metrics collected dynamically during program execution. Metrics such as prefetching accuracy and prefetching timeliness are used to vary aggressiveness in terms of prefetch distance (how far ahead of the current miss it fetches) and prefetch degree (the number of prefetches issued). The scheme is applied separately at both the L1 and L2 cache levels. We previously proposed a Hybrid Adaptive prefetcher for the Data Prefetching Competition (DPC-1) which uses a hybrid PC-based Stride/Sequential prefetcher. In this work, we also consider hybrid schemes that use the CZone with Delta Correlations (C/DC) prefetcher. Further, we break down the components of hybrid adaptive prefetching by evaluating the individual benefits of using hybrid schemes and using adaptivity. While hybrid prefetching improves performance over a single prefetcher, we demonstrate that the main benefit of adaptivity is the reduction in the number of prefetches without hurting performance. For a set of 18 SPEC CPU 2006 benchmarks, the best hybrid adaptive scheme achieves an average (geometric mean) CPI improvement of 35% compared to no prefetching and 19% compared to a basic Stride prefetcher. Adaptivity makes it possible to reduce the number of prefetches issued by as much as 50% in some benchmarks without sacrificing performance.
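A compact Python sketch of the feedback loop described above; the thresholds, step sizes, and bounds are invented for illustration rather than the authors' tuned values.

```python
# Feedback-directed adaptivity: accuracy steers prefetch degree,
# timeliness steers prefetch distance. All constants are assumed.
class AdaptivePrefetcher:
    def __init__(self):
        self.distance = 4      # how far ahead of the current miss to fetch
        self.degree = 2        # how many prefetches to issue per trigger
        self.issued = self.useful = self.late = 0

    def feedback(self, used, arrived_late):
        """Called when a previously issued prefetch is resolved."""
        self.issued += 1
        self.useful += int(used)
        self.late += int(arrived_late)

    def adapt(self):
        """At each interval boundary, adjust aggressiveness from the metrics."""
        if self.issued == 0:
            return
        accuracy = self.useful / self.issued
        lateness = self.late / self.issued
        if accuracy < 0.4:                 # mostly wasted bandwidth: back off
            self.degree = max(1, self.degree - 1)
        elif accuracy > 0.75:              # mostly useful: issue more
            self.degree = min(8, self.degree + 1)
        if lateness > 0.25:                # data arrives late: run further ahead
            self.distance = min(32, self.distance * 2)
        self.issued = self.useful = self.late = 0
```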

5 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
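One plausible formalization of the metric, assuming it weights each operation class's result latency by its dynamic frequency (the paper itself should be consulted for the exact definition):

```latex
% Assumed reading: f_i is the fraction of executed operations that fall in
% class i, and \ell_i is that class's result latency in machine cycles.
\[
  \text{average degree of superpipelining} \;=\; \sum_{i} f_i \,\ell_i
\]
```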

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
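A back-of-the-envelope model consistent with the quoted range; the miss rate, miss penalty, and fraction of stall cycles hidden are assumed values, not figures from the paper.

```python
# Effective-speed model: base CPI of 1 plus memory stalls, with prefetching
# hiding some fraction ("coverage") of the stall cycles. Inputs are assumed.
def effective_speedup(miss_rate, penalty, coverage):
    """Speedup from hiding `coverage` of the miss-stall cycles."""
    base = 1 + miss_rate * penalty
    with_prefetch = 1 + miss_rate * penalty * (1 - coverage)
    return base / with_prefetch

# e.g. 5% miss rate, 10-cycle penalty, 40% of stall cycles hidden:
print(f"{effective_speedup(0.05, 10, 0.4):.2f}x")  # ~1.15x, i.e. ~15% faster
```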

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
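To make the inclusion property concrete, here is a minimal two-level sketch in which an L2 eviction back-invalidates L1; the fully-associative LRU structures and sizes are simplifying assumptions.

```python
# Two-level hierarchy enforcing inclusion: every line in L1 is also in L2,
# so evicting a line from L2 must also remove it from L1.
from collections import OrderedDict

class InclusiveHierarchy:
    def __init__(self, l1_lines=4, l2_lines=16):
        self.l1 = OrderedDict()   # fully-associative LRU, for brevity
        self.l2 = OrderedDict()
        self.l1_lines, self.l2_lines = l1_lines, l2_lines

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        level = "L2 hit" if line in self.l2 else "miss"
        self.l2[line] = True
        self.l2.move_to_end(line)
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # L2 eviction...
            self.l1.pop(victim, None)   # ...back-invalidates L1 (inclusion)
        self.l1[line] = True
        if len(self.l1) > self.l1_lines:
            self.l1.popitem(last=False)  # plain L1 eviction keeps inclusion
        return level

h = InclusiveHierarchy()
for line in [0, 1, 2, 3, 4, 0]:
    print(line, h.access(line))  # the final access to 0 hits only in L2
```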

236 citations