Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve the performance of caches are presented: a small fully-associative cache (a miss or victim cache) is placed between a cache and its refill path, and stream buffers hold prefetched data in a buffer rather than in the cache itself.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.

Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.

Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.

Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.

Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
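As a concrete illustration of the victim-caching mechanism described above, here is a minimal sketch in Python of a direct-mapped cache backed by a small fully-associative victim cache. The line size, set count, and replacement details are illustrative assumptions, not the paper's hardware.

```python
# Minimal sketch of victim caching (illustrative parameters, not the
# paper's design): a direct-mapped cache whose evicted lines spill into
# a small fully-associative, LRU-ordered victim cache.
from collections import OrderedDict

LINE = 16           # bytes per line (assumed)
SETS = 64           # direct-mapped: one line per set (assumed)
VICTIM_ENTRIES = 4  # the paper finds 1 to 5 entries effective

cache = {}              # set index -> tag
victim = OrderedDict()  # (set index, tag) -> True, oldest first

def access(addr):
    """Classify one access as 'hit', 'victim-hit', or 'miss'."""
    line = addr // LINE
    idx, tag = line % SETS, line // SETS
    if cache.get(idx) == tag:
        return "hit"
    if (idx, tag) in victim:
        # One-cycle penalty: swap the victim-cache line with the line
        # currently occupying the conflicting direct-mapped set.
        victim.pop((idx, tag))
        if idx in cache:
            victim[(idx, cache[idx])] = True
        cache[idx] = tag
        return "victim-hit"
    # True miss: the displaced line (the victim) moves to the victim cache.
    if idx in cache:
        victim[(idx, cache[idx])] = True
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)  # evict the oldest victim
    cache[idx] = tag
    return "miss"

# Two addresses that map to the same direct-mapped set would normally
# ping-pong; with a victim cache the conflict misses disappear:
for a in [0x0000, 0x4000, 0x0000, 0x4000]:
    print(hex(a), access(a))   # miss, miss, victim-hit, victim-hit
```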


Citations
Proceedings ArticleDOI
01 Dec 2000
TL;DR: A hardware technique is proposed, the DLL BTB, that adds a small second buffer to the BTB and dedicates it to storing DLL target addresses; the DLL BTB's performance is shown to be similar to that of a BTB with a victim buffer, but the DLL BTB requires no parallel lookups or datapaths.
Abstract: Dynamically Linked Libraries (DLLs) promote software modularity, portability, and flexibility, and their use has become widespread. We characterize the behavior of five applications that make heavy use of DLLs, with a particular focus on the effects of DLLs on Branch Target Buffer (BTB) performance. DLLs aggravate hot set contention in the BTB. Standard software remedies are ineffective because the DLLs are shared, compiled separately, and dynamically linked to applications. We propose a hardware technique, the DLL BTB, that adds a small second buffer to the BTB and dedicates it to storing DLL target addresses. We show that the DLL BTB's performance is similar to that of a BTB with a victim buffer, but the DLL BTB requires no parallel lookups or datapaths between the original BTB and the added buffer.
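A rough sketch of how such a dedicated second buffer might behave, assuming entries are routed by whether the branch target falls in a DLL address range; the table sizes, the range test, and the probe order are assumptions for illustration, not the paper's design.

```python
# Hypothetical sketch of the DLL-BTB idea: DLL branch targets live in a
# small dedicated buffer so they stop contending for main-BTB sets.
DLL_BTB_ENTRIES = 32

main_btb = {}  # branch PC -> predicted target (capacity ignored here)
dll_btb = {}   # branch PC -> predicted target, DLL targets only

def in_dll_range(target):
    # Assumed placeholder: treat addresses above 0x70000000 as DLL code.
    return target >= 0x70000000

def predict(pc):
    # The structures are probed one after the other, so no parallel
    # lookup datapath is needed between them.
    if pc in main_btb:
        return main_btb[pc]
    return dll_btb.get(pc)

def update(pc, target):
    # Route by target address so DLL branches cannot evict
    # application branches from the main BTB.
    if in_dll_range(target):
        dll_btb[pc] = target
        if len(dll_btb) > DLL_BTB_ENTRIES:
            dll_btb.pop(next(iter(dll_btb)))  # crude FIFO eviction
    else:
        main_btb[pc] = target

update(0x1000, 0x2000)       # ordinary application branch -> main BTB
update(0x1008, 0x70001234)   # call into a DLL -> DLL BTB
print(hex(predict(0x1008)))  # 0x70001234, served by the DLL BTB
```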

12 citations

Posted Content
TL;DR: The evaluations show that the GPU-aware cache and memory management techniques proposed in this dissertation are effective at mitigating the interference caused by GPUs on current and future GPU-based systems.
Abstract: The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data parallel (GPGPU) applications. However, this success has been accompanied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. We identify and eliminate performance bottlenecks caused by major sources of interference throughout the memory hierarchy. We introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications' characteristics. We introduce mechanisms to dynamically analyze different applications' characteristics and propose four major changes throughout the memory hierarchy. We propose changes to the cache management and memory scheduling mechanisms to mitigate intra-application interference in GPGPU applications. We propose changes to the memory controller design and its scheduling policy to mitigate inter-application interference in heterogeneous CPU-GPU systems. We redesign the MMU and the memory hierarchy in GPUs to be aware of address-translation data in order to mitigate the inter-address-space interference. We introduce a hardware-software cooperative technique that modifies the memory allocation policy to enable large page support in order to further reduce the inter-address-space interference at the shared Translation Lookaside Buffer (TLB). Our evaluations show that the GPU-aware cache and memory management techniques proposed in this dissertation are effective at mitigating the interference caused by GPUs on current and future GPU-based systems.

12 citations

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The broad thesis presented suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem.
Abstract: The broad thesis presented suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, the validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on "predictability of reference" in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curricula, irrespective of whether (or when) parallel processing becomes ubiquitous in the general-purpose computing world. (3) A strategic agenda for high-performance parallel computing: a multistage agenda which at no stage compromises the user-friendliness of the programmer's model, and thereby potentially alleviates the so-called "parallel software crisis". Stimulating a debate is one goal of our presentation.

12 citations

Proceedings ArticleDOI
01 May 2000
TL;DR: A framework for studying spatial locality is presented; it characterizes spatial locality in terms of closeness in time and space, quantifies the amount of sequential data accessed and the potential for cache hits, and is used to investigate where potential bottlenecks are located.
Abstract: The performance gap between processors and memory is increasing, making the cache hit-rate paramount for performance. Studies show room for improvement, especially in data caches. The cache's effectiveness is dictated by software locality; hence the software's behavior directs the cache's performance. This paper presents a framework for studying spatial locality. It focuses on the characteristics of spatial locality in terms of closeness in time and space, to determine the amount of sequential data accessed and the potential for cache hits. By using the framework we gain knowledge for improving cache performance. Our experiment consists of a program-driven simulator and 11 important applications. We show a large performance potential in the data cache, with up to 75% lower miss rate, by exploiting spatial locality. In order to investigate where potential bottlenecks are located, we make a simple implementation of a scheme to exploit this spatial locality.
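One way to operationalize "closeness in time and space" is to ask, for each reference, how near it lands to any of the last few references. The window and distance thresholds below are assumptions, not the paper's definitions.

```python
# Rough sketch of a spatial-locality profile over an address trace:
# count references that fall within `near` bytes of one of the previous
# `window` references (exact repeats excluded, as those are temporal).
from collections import deque
import random

def spatial_profile(trace, window=32, near=64):
    recent = deque(maxlen=window)
    close = total = 0
    for addr in trace:
        dists = [abs(addr - a) for a in recent if a != addr]
        if dists and min(dists) <= near:
            close += 1
        total += 1
        recent.append(addr)
    return close / total if total else 0.0

# A sequential scan shows near-perfect spatial locality; a random walk
# over a large region shows almost none.
print(spatial_profile(range(0, 4096, 4)))                      # ~1.0
print(spatial_profile(random.randrange(1 << 20) for _ in range(1024)))
```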

12 citations

Proceedings Article
01 Jan 2004
TL;DR: This paper explores a similar cache organization providing architectural support for distinguishing between memory references that exhibit spatial and temporal locality and mapping them to separate caches, leading to substantial improvements in terms of cache misses.
Abstract: Earlier work showed that distinguishing between memory references that exhibit spatial and temporal locality, and mapping them to separate caches, leads to substantial improvements in terms of cache misses. In addition, such a separation allowed for the design of caches that could be tailored to meet the properties exhibited by different data items. In this paper we explore a similar cache organization providing architectural support for distinguishing between memory references that exhibit spatial and temporal locality and mapping them to separate caches. Since significant amounts of compulsory and conflict misses are avoided, the size of each cache (i.e., array and scalar), as well as the combined cache capacity, can be reduced. According to the results of our simulations, a partitioned 4k scalar cache with the streams (or arrays) mapped to a 2k array cache can be more efficient than a 16k unified data cache.
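A minimal sketch of the split-cache organization, using the 4k scalar / 2k array sizes from the abstract. The per-reference array/scalar tag is taken as given (the paper assumes architectural support for it), so the classifier here is just a boolean argument.

```python
# Illustrative split data cache: scalar references and array (stream)
# references go to separate direct-mapped caches, so streaming data
# cannot evict scalars. Line size is an assumption.
def make_cache(size_bytes, line=32):
    sets = size_bytes // line
    store = {}  # set index -> tag
    def access(addr):
        ln = addr // line
        idx, tag = ln % sets, ln // sets
        hit = store.get(idx) == tag
        store[idx] = tag  # fill on miss
        return hit
    return access

scalar_cache = make_cache(4 * 1024)  # 4k scalar cache (per the abstract)
array_cache = make_cache(2 * 1024)   # 2k array cache

def access(addr, is_array):
    # `is_array` stands in for the architectural spatial/temporal tag.
    return (array_cache if is_array else scalar_cache)(addr)

access(0x10, is_array=False)          # warm one scalar line
for a in range(0, 64 * 1024, 4):      # long streaming array walk
    access(a, is_array=True)
print(access(0x10, is_array=False))   # True: the scalar line survived
```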

12 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This paper surveys cache memory design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
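The survey's design axes lend themselves to small experiments; for instance, a toy trace-driven model (arbitrary sizes and a synthetic sequential trace, assumed here purely for illustration) shows how line size trades off against miss ratio under demand fetching.

```python
# Toy demand-fetch, direct-mapped cache model: on a sequential trace the
# miss ratio falls in proportion to line size, one face of the
# line-size design question the survey examines.
def miss_ratio(trace, cache_bytes=8192, line=32):
    sets = cache_bytes // line
    store, misses = {}, 0
    for addr in trace:
        ln = addr // line
        idx, tag = ln % sets, ln // sets
        if store.get(idx) != tag:
            misses += 1
            store[idx] = tag
    return misses / len(trace)

trace = [i * 4 for i in range(65536)]  # sequential 4-byte accesses
for line in (16, 32, 64, 128):
    print(line, miss_ratio(trace, line=line))  # 0.25, 0.125, 0.0625, ...
```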

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
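One plausible reading of the average degree of superpipelining is a dynamic-frequency-weighted operation latency; the sketch below uses that reading with a made-up instruction mix, so both the definition and the numbers are assumptions rather than the paper's.

```python
# Hedged sketch: average operation latency in cycles, weighted by
# dynamic instruction frequency. The mix and latencies are invented.
mix = {        # class -> (dynamic frequency, latency in cycles)
    "alu":    (0.50, 1),
    "load":   (0.25, 2),
    "branch": (0.15, 2),
    "fp":     (0.10, 3),
}
degree = sum(freq * lat for freq, lat in mix.values())
print(degree)  # 1.6 under this made-up mix
```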

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
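Back-of-envelope arithmetic shows where a 10 to 25 percent figure can come from: stall cycles removed by prefetching translate directly into effective speed. All inputs below are illustrative assumptions, not figures from the paper.

```python
# Effective speed via cycles per instruction (CPI): prefetching that
# hides half the miss penalty shrinks the memory-stall component.
cpi_base = 1.0        # CPI with a perfect cache (assumed)
refs_per_instr = 1.3  # memory references per instruction (assumed)
miss_rate = 0.02      # misses per reference (assumed)
miss_penalty = 10     # cycles per miss (assumed)

cpi = cpi_base + refs_per_instr * miss_rate * miss_penalty
cpi_prefetch = cpi_base + refs_per_instr * miss_rate * miss_penalty * 0.5
print(f"speedup: {cpi / cpi_prefetch:.2f}x")  # ~1.12x with these inputs
```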

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
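A small sketch of the inclusion property itself: every line resident in L1 must also be resident in L2. For the simple direct-mapped pair below (same line size, L1 sets dividing L2 sets) inclusion largely falls out of the mapping, but the back-invalidation step shows the general enforcement mechanism; all sizes are assumptions.

```python
# Two-level direct-mapped hierarchy that maintains inclusion: when L2
# evicts a line, it back-invalidates any copy in L1, and an assertion
# checks the invariant (L1 contents are a subset of L2) on every access.
import random

def simulate(trace, l1_sets=64, l2_sets=512, line=32):
    l1, l2 = {}, {}  # set index -> tag
    for addr in trace:
        ln = addr // line
        # Fill L2; back-invalidate L1 if L2 evicts a line.
        i2, t2 = ln % l2_sets, ln // l2_sets
        old = l2.get(i2)
        if old is not None and old != t2:
            old_line = old * l2_sets + i2
            if l1.get(old_line % l1_sets) == old_line // l1_sets:
                del l1[old_line % l1_sets]
        l2[i2] = t2
        # Fill L1.
        l1[ln % l1_sets] = ln // l1_sets
        # Invariant: every line in L1 is also in L2.
        for i1, t1 in l1.items():
            full = t1 * l1_sets + i1
            assert l2.get(full % l2_sets) == full // l2_sets
    return "inclusion held over the whole trace"

print(simulate(random.randrange(1 << 24) for _ in range(10000)))
```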

236 citations