Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques for improving cache performance are presented: a small fully-associative miss (or victim) cache placed between a direct-mapped cache and its refill path, and stream buffers that hold prefetched lines outside the cache until they are referenced.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
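To make the victim-caching idea concrete, the sketch below models a direct-mapped first-level cache backed by a small fully-associative victim cache. The line size, set count, victim-cache depth, and the simple FIFO replacement of victims are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch (assumed parameters): a direct-mapped cache whose evicted
# lines ("victims") are kept in a small fully-associative buffer, so that
# conflict misses can be serviced with roughly a one-cycle swap instead of
# a full refill from the next level of the hierarchy.
from collections import OrderedDict

LINE = 32           # assumed line size in bytes
SETS = 256          # assumed number of direct-mapped sets
VICTIM_ENTRIES = 4  # the paper studies victim caches of 1 to 5 entries

class VictimCachedL1:
    def __init__(self):
        self.sets = [None] * SETS     # one tag per direct-mapped set
        self.victims = OrderedDict()  # tag -> True, oldest first

    def access(self, addr):
        """Classify a byte-address access as 'hit', 'victim_hit', or 'miss'."""
        tag = addr // LINE
        index = tag % SETS
        if self.sets[index] == tag:
            return "hit"
        if tag in self.victims:
            # Conflict miss caught by the victim cache: swap the two lines.
            self.victims.pop(tag)
            displaced = self.sets[index]
            self.sets[index] = tag
            if displaced is not None:
                self.victims[displaced] = True
            return "victim_hit"
        # Full miss: refill from the next level and save the displaced victim.
        displaced = self.sets[index]
        self.sets[index] = tag
        if displaced is not None:
            self.victims[displaced] = True
            if len(self.victims) > VICTIM_ENTRIES:
                self.victims.popitem(last=False)  # drop the oldest victim
        return "miss"
```

Run against an address trace, the fraction of accesses classified as victim_hit gives a rough picture of how many conflict misses a buffer of only a few entries can absorb.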


Citations
Dissertation
01 Jan 2006
TL;DR: A compressed L2 cache design based on a simple compression algorithm with low decompression overhead is proposed, together with an adaptive compression scheme that dynamically weighs the costs and benefits of cache compression and applies compression only when it helps performance.
Abstract: Chip multiprocessors (CMPs) combine multiple processors on a single die, typically with private level-one caches and a shared level-two cache. The increasing number of processor cores in a CMP increases the demand on two critical resources: the shared L2 cache capacity and the off-chip pin bandwidth. Such demand is further exacerbated by latency-hiding techniques such as hardware prefetching. In this dissertation, we explore using compression to effectively increase cache and pin bandwidth resources and ultimately CMP performance. We identify two distinct and complementary designs where compression can help improve CMP performance: Cache Compression and Link Compression. Cache compression stores compressed lines in the cache, potentially increasing the effective cache size, reducing off-chip misses and improving performance. Unfortunately, decompression overhead can slow down cache hit latencies, possibly degrading performance. Link (i.e., off-chip interconnect) compression compresses communication messages before sending to or receiving from off-chip system components, thereby increasing the effective pin bandwidth and improving performance for bandwidth-limited configurations. While compression can have a positive impact on CMP performance, practical implementations of compression raise a few concerns. In this dissertation, we make five contributions that address these concerns. We propose a compressed L2 cache design based on a simple compression algorithm with a low decompression overhead. We develop an adaptive compression scheme that dynamically adapts to the costs and benefits of cache compression, and employs compression only when it helps performance. We show that cache and link compression both combine to improve CMP performance for commercial and (some) scientific workloads. We show that compression interacts in a strong positive way with hardware prefetching, whereby a system that implements both compression and hardware prefetching can have a higher speedup than the product of speedups of each scheme alone. We provide a simple analytical model that helps provide qualitative intuition into the trade-off between cores, caches, communication and compression, and use full-system simulation to quantify this trade-off for a set of commercial workloads.
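The adaptive scheme described above can be pictured as a cost/benefit counter that is charged when compression hurts (decompression latency on a hit) and credited when it helps (a miss avoided because compression freed space). The sketch below is a minimal illustration under assumed penalties; the constants, names, and saturation rule are not taken from the dissertation.

```python
# Hedged sketch of an adaptive compression policy: a single saturating counter
# accumulates the estimated benefit of compression (avoided miss cycles) minus
# its cost (extra decompression cycles), and new lines are stored compressed
# only while the running balance is positive. All constants are assumptions.
DECOMPRESSION_PENALTY = 5   # assumed cycles added to a hit on a compressed line
MISS_PENALTY = 400          # assumed cycles for an off-chip L2 miss

class AdaptiveCompressionPolicy:
    def __init__(self, limit=1 << 20):
        self.balance = 0
        self.limit = limit  # saturate so stale history can be unlearned

    def charge_hit_to_compressed_line(self):
        # Compression hurt this access: decompression latency was paid.
        self.balance = max(-self.limit, self.balance - DECOMPRESSION_PENALTY)

    def credit_avoided_miss(self):
        # The line was resident only because compression freed space.
        self.balance = min(self.limit, self.balance + MISS_PENALTY)

    def compress_on_fill(self):
        # Store incoming lines compressed only while benefit outweighs cost.
        return self.balance > 0
```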

40 citations

Proceedings ArticleDOI
01 Jan 1996
TL;DR: This work describes and evaluates a Stream Memory Controller system that combines compile-time detection of streams with execution-time selection of the access order and issue, and reports performance improvements by factors of 13 over normal caching.
Abstract: Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes caches effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe and evaluate a Stream Memory Controller system that combines compile-time detection of streams with execution-time selection of the access order and issue. The technique is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. With our prototype system, we have observed performance improvements by factors of 13 over normal caching.
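The key mechanism is reordering: because streamed accesses are predictable, the controller can batch requests that fall in the same DRAM row so they issue as cheaper row hits. The sketch below shows that idea as a greedy software reordering over already-detected streams; the row size and the policy are assumptions, not the hardware described in the paper.

```python
# Hedged sketch of stream-access reordering: pending stream addresses are
# issued so that accesses to the currently open DRAM row go back to back,
# approximating the row-hit batching a stream memory controller performs.
ROW_SIZE = 2048  # assumed DRAM row (page) size in bytes

def reorder_stream_accesses(streams):
    """streams: list of address lists, one per compile-time-detected stream.
    Returns one issue order that groups same-row accesses together."""
    pending = [addr for stream in streams for addr in stream]
    order = []
    open_row = None
    while pending:
        same_row = [a for a in pending if a // ROW_SIZE == open_row]
        nxt = min(same_row) if same_row else min(pending)  # else open a new row
        open_row = nxt // ROW_SIZE
        pending.remove(nxt)
        order.append(nxt)
    return order
```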

40 citations

Dissertation
03 Oct 1996
TL;DR: This dissertation examines how variable-bit-rate digital video affects the ability to efficiently transport and decompress these videos, and introduces two new bandwidth smoothing techniques that reduce the resource requirements for the transportation of compressed video across networks.
Abstract: Digital video compression techniques, such as the Motion-JPEG and MPEG compression standards, greatly reduce the network and storage requirements for digital video. These techniques, however, result in variable-bit-rate video, which makes the efficient handling and decompression of the video difficult. In this dissertation, we examine how variable-bit-rate digital video affects the ability to efficiently transport and decompress these videos. We introduce two new bandwidth smoothing techniques that reduce the resource requirements for the transportation of compressed video across networks: the critical bandwidth allocation algorithm, which, given a fixed buffer size, creates a bandwidth plan for the continuous playback of video that (1) requires no prefetching of data, (2) has the minimum number of bandwidth increases, and (3) minimizes the peak bandwidth requirements; and the optimal bandwidth allocation algorithm, which minimizes the total number of bandwidth changes required for continuous video playback. The use of bandwidth smoothing techniques in video-on-demand services results in plans that are somewhat inflexible. To allow users to have VCR capabilities in bandwidth smoothing environments, we analyze the utility of buffered video for decreasing the required interactions with the server. This buffered video, the VCR-window, allows users to have VCR capabilities within a limited window without changing the bandwidth allocation levels. For accesses that cannot be serviced through the smoothing buffer, we show how contingency channels can be used to quickly return users to their original bandwidth allocation plans. Software video decompression algorithms have poor cache utilization because of the long time between accesses to temporal data. We examine two techniques for reducing cache misses for software video decompression: reducing the working-set size and using software prefetching. To reduce the working-set size, we introduce several techniques that re-order the decompression algorithm to exploit the temporal accesses to data. These techniques reduce the cache miss rates by over 50% and can result in better performance. In addition, we examine the impact that software-controlled prefetching has on MPEG video decompression. Our results show that sufficient memory bandwidth exists for prefetching to be beneficial and that miss rates can be reduced by as much as 80%.
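The bandwidth-smoothing algorithms above all rest on one feasibility condition: a transmission plan is acceptable only if the client buffer never underflows (playback stalls) or overflows. The sketch below checks that condition for the simplest possible plan, a single constant rate; the per-frame timing model is an assumption used only for illustration.

```python
# Hedged sketch of the feasibility test behind bandwidth smoothing: deliver a
# fixed number of bytes per frame time (never exceeding the buffer) and check
# that the decoder always finds the next frame fully buffered.
def constant_rate_plan_is_feasible(frame_sizes, bytes_per_frame_time, buffer_size):
    sent = 0      # cumulative bytes delivered to the client
    consumed = 0  # cumulative bytes removed by the decoder
    for frame in frame_sizes:
        # Delivery is capped so the smoothing buffer never overflows.
        sent = min(sent + bytes_per_frame_time, consumed + buffer_size)
        consumed += frame
        if sent < consumed:
            return False  # underflow: playback would stall on this frame
    return True
```

The critical bandwidth allocation algorithm can be viewed as searching, under this same test, for the piecewise-constant plan with the fewest bandwidth increases and the smallest peak rate.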

40 citations

Journal ArticleDOI
TL;DR: This work shows that evicting more than the minimum number of code blocks from the code cache results in less run-time overhead than the existing alternatives, and describes and evaluates a generational approach to code cache management that makes it easy to identify long-lived code blocks while avoiding fragmentation caused by the eviction of short-lived blocks.
Abstract: Dynamic binary optimizers store altered copies of original program instructions in software-managed code caches in order to maximize reuse of transformed code. Code caches store code blocks that may vary in size, reference other code blocks, and carry a high replacement overhead. These unique constraints reduce the effectiveness of conventional cache management policies. Our work directly addresses these unique constraints and presents several contributions to the code cache management problem. First, we show that evicting more than the minimum number of code blocks from the code cache results in less run-time overhead than the existing alternatives. Such granular evictions reduce overall execution time, as the fixed costs of invoking the eviction mechanism are amortized across multiple cache insertions. Second, a study of the ideal lifetimes of dynamically generated code blocks illustrates the benefit of a replacement algorithm based on a generational heuristic. We describe and evaluate a generational approach to code cache management that makes it easy to identify long-lived code blocks and simultaneously avoid fragmentation caused by the eviction of short-lived blocks. Finally, we present results from an implementation of our generational approach in the DynamoRIO framework and illustrate that, as dynamic optimization systems become more prevalent, effective code cache management policies will be essential for reliable, scalable performance of modern applications.
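A generational replacement policy of the kind described above can be sketched in a few lines: newly translated blocks enter a small nursery, blocks that are still being executed when the nursery fills are promoted to a long-lived region, and everything else is evicted in bulk. The capacities, promotion rule, and bulk-clearing behaviour below are illustrative assumptions, not the policy implemented in DynamoRIO.

```python
# Hedged sketch of a generational code cache: granular (bulk) evictions of the
# nursery amortize the fixed cost of the eviction mechanism, while surviving
# blocks are promoted so long-lived code is rarely disturbed.
class GenerationalCodeCache:
    def __init__(self, nursery_capacity=64, persistent_capacity=256):
        self.nursery = {}        # block tag -> executed-recently flag
        self.persistent = set()  # long-lived generation
        self.nursery_capacity = nursery_capacity
        self.persistent_capacity = persistent_capacity

    def insert(self, tag):
        if len(self.nursery) >= self.nursery_capacity:
            self._flush_nursery()  # one bulk eviction instead of many single ones
        self.nursery[tag] = False

    def executed(self, tag):
        if tag in self.nursery:
            self.nursery[tag] = True  # mark as live before the next flush

    def _flush_nursery(self):
        survivors = [t for t, live in self.nursery.items() if live]
        self.nursery.clear()
        for t in survivors:
            if len(self.persistent) >= self.persistent_capacity:
                self.persistent.clear()  # rare bulk eviction of the old generation
            self.persistent.add(t)
```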

40 citations

Proceedings ArticleDOI
31 Jan 1998
TL;DR: Small and fast SRAM network caches are explored as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters, and a novel and scalable method to control the page cache by integrating page relocation mechanisms into the network victim cache is proposed.
Abstract: The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off: hit ratio versus speed. Some recent commercial systems have opted for large and slow (S)DRAM network caches (NC), but others completely avoid them because of their damaging effects on the remote/local latency ratio. In this paper we explore small and fast SRAM network caches as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters. The major appeal of SRAM NCs is that they add less penalty on the latency of NC hits and remote accesses. Their small capacity can handle conflict misses and a limited amount of capacity misses. However, they can be coupled with main memory page caches, which satisfy the bulk of capacity misses. To maximize performance for a large spectrum of applications, we propose to organize the NC as a victim cache for remote data. We also propose a novel and scalable method to control the page cache by integrating page relocation mechanisms into the network victim cache.
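The proposed hierarchy can be summarized as a lookup order: on a miss to a remote address, try the small SRAM network victim cache, then the main-memory page cache, and only then go to the home node. The sketch below illustrates that order together with the victim-cache fill path; the page size, capacities, and replacement choices are assumptions, not the paper's hardware.

```python
# Hedged sketch of one node's remote-data path: remote lines evicted from the
# processor cache are installed in a small SRAM network cache (a victim cache
# for remote data), while a large DRAM page cache absorbs most capacity misses.
PAGE = 4096  # assumed page size in bytes

class NodeRemoteHierarchy:
    def __init__(self, nc_entries=1024):
        self.network_cache = {}  # line address -> data (small, fast SRAM)
        self.nc_entries = nc_entries
        self.page_cache = {}     # page address -> page data (large, slower DRAM)

    def on_processor_cache_eviction(self, line_addr, data):
        # Victim fill: keep recently displaced remote lines close by.
        if len(self.network_cache) >= self.nc_entries:
            self.network_cache.pop(next(iter(self.network_cache)))  # crude FIFO
        self.network_cache[line_addr] = data

    def lookup(self, line_addr, fetch_from_home):
        if line_addr in self.network_cache:
            return self.network_cache.pop(line_addr), "nc_hit"       # small added latency
        page_addr = line_addr - (line_addr % PAGE)
        if page_addr in self.page_cache:
            return self.page_cache[page_addr], "page_cache_hit"      # bulk of capacity misses
        return fetch_from_home(line_addr), "remote_access"           # full remote latency
```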

40 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism, and the average degree of superpipelining metric is introduced; simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
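As a concrete illustration of how inclusion is typically maintained, the sketch below back-invalidates first-level copies whenever the second-level cache replaces a block, so the first-level contents remain a subset of the second level. The FIFO replacement and the data structures are assumptions made for brevity; the paper's contribution is the conditions under which such inclusion can be imposed at all.

```python
# Hedged sketch of inclusion maintenance: every L2 replacement triggers a
# back-invalidation of that block in all first-level caches above it.
class InclusiveL2:
    def __init__(self, capacity_blocks, l1_caches):
        self.capacity = capacity_blocks
        self.blocks = []            # resident block addresses, oldest first
        self.l1_caches = l1_caches  # list of sets of block addresses held by each L1

    def fill(self, block):
        if block in self.blocks:
            return
        if len(self.blocks) >= self.capacity:
            victim = self.blocks.pop(0)      # assumed FIFO replacement
            for l1 in self.l1_caches:
                l1.discard(victim)           # back-invalidate to preserve inclusion
        self.blocks.append(block)
```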

236 citations