scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, a hardware technique to improve the performance of caches is presented, where a small fully-associative cache between a cache and its refill path is used to place prefetched data and not in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI
12 Oct 1997
TL;DR: It is demonstrated that application-specific hardware and policies can provide substantial improvements in performance on a per application basis and that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility.
Abstract: We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. This approach presents an improvement over the traditional approach of exploiting programmable logic as a separate co-processor by pre-serving machine usability through software and on a traditional computer architecture by providing application-specific hardware. We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations. We demonstrate that application-specific hardware and policies can provide substantial improvements in performance on a per application basis. Based on these preliminary results, we propose that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility.

18 citations

Proceedings ArticleDOI
12 Oct 1999
TL;DR: This work analyzes the memory performance of three image processing algorithms on both a conventional memory system and on the Impulse memory system, which allows application software to control how, when, and where data are loaded into a conventional processor cache.
Abstract: Image processing applications tend to access their data non-sequentially and reuse that data infrequently. As a result, they tend to perform poorly on conventional memory systems due to high cache and TLB miss rates and are particularly sensitive to the growing latency of main memory. In this paper, we analyze the memory performance of three image processing algorithms (volume rendering, image warping, and image filtering) on both a conventional memory system and on the Impulse memory system. The Impulse memory system allows application software to control how, when, and where data are loaded into a conventional processor cache. It does this by letting software configure how the memory controller interprets the physical addresses exported by the processor, which enables an application to dynamically change how data are fetched. Sparse data can be accessed densely, which improves both cache and TLB utilization, and memory latency is hidden by prefetching data within the memory controller. We find that for these image processing codes, using an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system.

18 citations

Proceedings ArticleDOI
30 Jul 2012
TL;DR: This work addresses the problem of high number of conflict misses in low-associative caches, by proposing an indexing policy that adaptively selects the bits from the block address used to index the cache by selecting at run time those bits that disperse the working set more evenly across the available sets.
Abstract: The design of cache memories is a crucial part of the design cycle of a modern processor. Unfortunately, caches with low degrees of associativity suffer a large amount of conflict misses, while high-associative caches consume more power per access. We propose ASCIB, a simple technique able to dynamically adjust the bits used for cache indexing so as to minimize conflict misses. By selecting at run time the bits that disperse the working set more evenly across the available sets, ASCIB removes 73% of the conflict misses on average. This results in an improvement in energy efficiency by 17% on average.

18 citations

Proceedings ArticleDOI
23 Sep 2001
TL;DR: A new TLB structure supporting two page sizes dynamically and selectively for high performance and low cost design without any operating system support is proposed.
Abstract: This research is to design a simple but high performance TLB (translation lookaside buffer) system with low power consumption. Thus, we propose a new TLB structure supporting two page sizes dynamically and selectively for high performance and low cost design without any operating system support. For high performance, a promotion-TLB is designed by supporting two page sizes. Also in order to attain low power consumption, a banked-TLB is constructed by dividing one fully associative TLB space into two sub fully associative TLBs. These two structures are integrated to form a banked-promotion TLB as a low power and high performance TLB structure for embedded processors. According to the results of comparison and analysis, a similar performance can be achieved by using fewer TLB entries and also energy dissipation can be reduced by around 50% compared with the fully associative TLB.

18 citations

Proceedings ArticleDOI
05 May 2008
TL;DR: Using ideas previously developed for criticality-based allocation of resources, prefetching, and prefetch buffering, this work allows design engineers to aggressively set the clock frequency without worrying about the subset of components that cannot meet this frequency.
Abstract: We develop architectural techniques for mitigating the impact of process variability. Our techniques hide the performance effects of slow components-including registers, functional units, and L1I and L1D cache frames-without slowing the clock frequency or pessimistically assuming that all components are slow. Using ideas previously developed for other purposes-criticality-based allocation of resources, prefetching, and prefetch buffering-we allow design engineers to aggressively set the clock frequency without worrying about the subset of components that cannot meet this frequency. Our techniques outperform speed binning, because clock frequency benefits outweigh slight losses in IPC.

18 citations

References
More filters
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, mulhcache consistency, the effect of input /output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: design issues. Specific aspects of cache memories tha t are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, mulhcache consistency, the effect of input /output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware. Copyright  1989 Digital Equipment Corporation d i g i t a l Western Research Laboratory 100 Hamilton Avenue Palo Alto, California 94301 USA

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism and the average degree of superpipelining metric is introduced, suggesting that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.

236 citations