Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990, Vol. 18, pp. 364-373
TL;DR: In this paper, hardware techniques to improve cache performance are presented: a small fully-associative miss (or victim) cache is placed between a cache and its refill path, and stream buffers place prefetched data in a buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
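
To make the victim-cache idea concrete, here is a minimal Python sketch (not taken from the paper or its tooling) of a direct-mapped cache backed by a small fully-associative victim cache; the 32-byte lines, 256 sets, and 4-entry victim cache are illustrative assumptions. On a miss that hits in the victim cache, the two lines are swapped, which stands in for the one-cycle penalty the abstract describes.

```python
from collections import OrderedDict

LINE = 32            # bytes per cache line (assumed)
SETS = 256           # direct-mapped first-level cache lines (assumed)
VICTIM_ENTRIES = 4   # small fully-associative victim cache

cache = [None] * SETS      # tag stored per set, or None if the set is empty
victim = OrderedDict()     # full line address -> True, kept in LRU order

def access(addr):
    """Classify one reference as 'hit', 'victim_hit', or 'miss'."""
    line = addr // LINE
    idx, tag = line % SETS, line // SETS

    if cache[idx] == tag:
        return "hit"

    if line in victim:
        # Victim-cache hit: swap the entry with the conflicting cached line,
        # paying roughly a one-cycle penalty instead of a full miss.
        victim.pop(line)
        if cache[idx] is not None:
            victim[cache[idx] * SETS + idx] = True
        cache[idx] = tag
        return "victim_hit"

    # True miss: refill from the next level; the *victim* (not the requested
    # line) is placed into the small fully-associative cache.
    if cache[idx] is not None:
        victim[cache[idx] * SETS + idx] = True
        while len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)    # evict the LRU victim entry
    cache[idx] = tag
    return "miss"

# Two lines that conflict in the direct-mapped cache ping-pong without the
# victim cache; with it, the later references become victim hits.
print([access(a) for a in (0x0000, 0x2000, 0x0000, 0x2000)])
```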


Citations
Proceedings ArticleDOI
01 Dec 1997
TL;DR: This work investigates the design and performance of conflict-avoiding cache architectures based on polynomial modulus functions, which earlier research has shown to be highly effective at reducing conflict miss ratios.
Abstract: High performance architectures depend heavily on efficient multi-level memory hierarchies to minimize the cost of accessing data. This dependence will increase with the expected increases in relative distance to main memory. There have been a number of published proposals for cache conflict-avoidance schemes. We investigate the design and performance of conflict-avoiding cache architectures based on polynomial modulus functions, which earlier research has shown to be highly effective at reducing conflict miss ratios. We examine a number of practical implementation issues and present experimental evidence to support the claim that pseudo-randomly indexed caches are both effective in performance terms and practical from an implementation viewpoint.
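
As a rough illustration of the indexing scheme the abstract refers to, the sketch below computes a cache set index as the remainder of the line address, viewed as a polynomial over GF(2), modulo an irreducible polynomial; the particular polynomial (x^6 + x + 1) and the 6-bit index width are assumptions for the example, not values from the cited paper. In hardware this reduction collapses to a small XOR tree.

```python
IRREDUCIBLE = 0b1000011   # x^6 + x + 1, irreducible over GF(2) (an assumed choice)
INDEX_BITS = 6            # yields a 64-set cache index

def poly_mod_index(line_addr: int) -> int:
    """Treat the line address as a polynomial over GF(2) and return its remainder
    modulo the irreducible polynomial; the remainder is the set index."""
    r = line_addr
    while r.bit_length() > INDEX_BITS:
        # XOR-subtract the divisor aligned under the current leading 1 bit
        r ^= IRREDUCIBLE << (r.bit_length() - IRREDUCIBLE.bit_length())
    return r

# Line addresses that all collide in set 0 under plain bit selection (mod 64)
# spread across different sets under the polynomial mapping.
for a in (0x100, 0x140, 0x180, 0x1C0):
    print(hex(a), "bit-select:", a % 64, "poly-mod:", poly_mod_index(a))
```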

41 citations

Posted Content
TL;DR: The potential of deep learning to address the von Neumann bottleneck of memory performance is demonstrated; contemporary prefetching strategies are related to n-gram models in natural language processing, and recurrent neural networks are shown to serve as a drop-in replacement.
Abstract: The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.
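
The n-gram analogy can be sketched with a small table-based model over address deltas, which is the kind of baseline a recurrent network would replace; everything below (the order-2 history, the class name, the toy trace) is illustrative and not taken from the cited work.

```python
from collections import Counter, defaultdict

class DeltaNGramPrefetcher:
    def __init__(self):
        self.table = defaultdict(Counter)  # (prev_delta, delta) -> next-delta counts
        self.history = []                  # last two observed deltas
        self.last_addr = None

    def observe(self, addr):
        """Record one miss address and return a predicted prefetch address, if any."""
        prediction = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if len(self.history) == 2:
                self.table[tuple(self.history)][delta] += 1
            self.history = (self.history + [delta])[-2:]
            key = tuple(self.history)
            if len(key) == 2 and self.table[key]:
                # pick the most frequent continuation, as an n-gram model would
                next_delta = self.table[key].most_common(1)[0][0]
                prediction = addr + next_delta
        self.last_addr = addr
        return prediction

pf = DeltaNGramPrefetcher()
for a in [0, 64, 128, 192, 256, 320]:   # a simple streaming pattern
    print(a, "->", pf.observe(a))
```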

41 citations

Dissertation
01 Jan 2000
TL;DR: This dissertation analyzes IBM's CodePack decompression algorithm and proposes optimizations for it, and explores two software optimizations, hybrid programs and memoization, that improve the execution time of compressed programs by reducing decompression overhead.
Abstract: Code compression is the technique of using data compression to reduce the program memory size for memory-limited, embedded computers. For system-on-a-chip designs, this reduces the system die area which lowers die cost. After compilation, the binary (native code) program is compressed and stored in the embedded system. At run-time, the compressed program is incrementally decompressed and executed. While compressed programs have better code density, their performance is typically lower because additional effort is required to decompress the instruction stream. This dissertation presents methods to improve the performance of compressed programs. Decompression overhead can be minimized by using special-purpose hardware. This dissertation analyzes IBM's CodePack decompression algorithm and proposes optimizations for it. The optimized decompressor can often execute compressed programs faster than the original native program. The performance benefit of using fewer memory transactions to fetch compressed instructions surpasses the small decompression overhead. Therefore, code compression improves performance as well as code density. The decompression hardware can be largely replaced with software. The benefits of software decompression are greater design flexibility, reduced hardware complexity, reduced die area, and reduced cost. However, software decompression is much slower than hardware decompression. On a 5-stage pipelined embedded processor with a 4KB instruction cache, CodePack programs execute 1.3 to 27.0 times slower than native programs and reduce program memory die area (instruction cache and main memory) by 26% to 41%. This dissertation proposes instruction set support to enable efficient software-managed decompression. In addition, it explores two software optimizations, hybrid programs and memoization, to improve the execution time of compressed programs by reducing decompression overhead. Hybrid programs contain both native and compressed code to reduce the number of times the decompressor is invoked. Memoization is a dynamic optimization that caches recent decompression results to also avoid invoking the decompressor. Optimized compressed programs that reduce die area 10% to 33% execute only 1.00 to 1.22 times slower than native code. In addition, loop-oriented (multimedia) programs are nearly as fast as native code.
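
A hedged sketch of the memoization idea: cache recent decompression results so that re-executing the same compressed region skips the software decompressor. zlib stands in for the decompressor here purely for illustration; it is not CodePack, and the block size and table capacity are arbitrary.

```python
import zlib
from collections import OrderedDict

class DecompressionMemo:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.memo = OrderedDict()     # compressed block id -> decompressed bytes

    def fetch(self, block_id, compressed_bytes):
        if block_id in self.memo:
            self.memo.move_to_end(block_id)       # LRU update, decompressor not invoked
            return self.memo[block_id]
        data = zlib.decompress(compressed_bytes)  # slow path: run the software decompressor
        self.memo[block_id] = data
        if len(self.memo) > self.capacity:
            self.memo.popitem(last=False)         # evict the least-recently used block
        return data

memo = DecompressionMemo()
block = zlib.compress(b"\x90" * 64)               # stand-in for a compressed code block
memo.fetch(0, block)                              # decompresses
memo.fetch(0, block)                              # served from the memo table
```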

41 citations

Journal ArticleDOI
TL;DR: It is demonstrated that an LHCA design can provide a geometric mean 6% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 30 workloads, and two types of hybrid cache architectures are discussed and evaluated.
Abstract: Traditional multilevel SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core--to--cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D chips or 3D stacked chips. Caches fabricated in these technologies offer dramatically different power-performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this article, we propose to take advantage of the best characteristics that each technology has to offer through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: intercache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intracache-level or cache-region-based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intracache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 6% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 30 workloads. A more aggressive RHCA-based design provides 10% IPC improvement over the baseline. A 2-layer 3D cache stack (3DHCA) of high density memory technology within the same chip footprint gives 16% IPC improvement over the baseline. We also achieve up to a 72% reduction in power consumption over a baseline SRAM-only design. Energy-delay and thermal evaluation for 3DHCA are also presented. In addition to the fast-slow region based RHCA, we further evaluate read-write region based RHCA designs.
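
The inter-cache-level HCA argument can be illustrated with a back-of-the-envelope average-memory-access-time model in which each level is characterized by the latency of its technology; all latencies and local miss rates below are invented for the example (and the model assumes levels are probed serially), not numbers from the article.

```python
def amat(levels, memory_latency):
    """levels: list of (name, hit_latency_cycles, local_miss_rate), probed in order."""
    total, reach = 0.0, 1.0
    for _name, latency, miss_rate in levels:
        total += reach * latency   # every access that reaches this level pays its hit latency
        reach *= miss_rate         # fraction of accesses that continue to the next level
    return total + reach * memory_latency

# Same area budget, but the eDRAM L3 is denser, so it can be larger and miss less
# often even though each access to it is slower (all numbers are assumptions).
sram_only = [("L1 SRAM", 2, 0.05), ("L2 SRAM", 10, 0.40), ("L3 SRAM", 30, 0.50)]
lhca      = [("L1 SRAM", 2, 0.05), ("L2 SRAM", 10, 0.40), ("L3 eDRAM", 45, 0.30)]

print("SRAM-only AMAT:", amat(sram_only, memory_latency=200))
print("LHCA AMAT:     ", amat(lhca, memory_latency=200))
```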

41 citations

Patent
Per Holmberg
08 Nov 2001
TL;DR: A data processing system and method involving a data requesting element and a first memory element from which the data requesting element requests data is described, where a second memory element placed between them stores data units read out of the first memory element and performs a prefetch procedure containing both a sequential sub-procedure and a sub-procedure based on prefetch data identifiers associated with some of the data units.
Abstract: A data processing system and method involving a data requesting element and a first memory element from which said data requesting element requests data is described. An example of such a system is a processor and a first level cache memory, or two memories arranged in a hierarchy. A second memory element is provided between the first memory element and the requesting element. The second memory element stores data units read out of said first memory element, and performs a prefetch procedure, where said prefetch procedure contains both a sequential sub-procedure and a sub-procedure based on prefetch data identifiers associated with some of the data units.
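
A hedged sketch of the two-part prefetch procedure the abstract describes: a second memory element sits in front of a backing store, prefetching both the next sequential data unit and any unit named by a prefetch identifier carried in the data. The dictionary-based structures and the field name prefetch_id are invented for illustration.

```python
class PrefetchBuffer:
    def __init__(self, backing_store):
        self.backing = backing_store      # the "first memory element"
        self.buffer = {}                  # the "second memory element": addr -> data unit

    def read(self, addr):
        unit = self.buffer.pop(addr, None)
        if unit is None:
            unit = self.backing[addr]     # demand fetch from the first memory element
        self._prefetch(addr, unit)
        return unit

    def _prefetch(self, addr, unit):
        # Sequential sub-procedure: also fetch the next consecutive data unit.
        if addr + 1 in self.backing:
            self.buffer[addr + 1] = self.backing[addr + 1]
        # Identifier-based sub-procedure: a data unit may carry an explicit
        # prefetch identifier naming another unit likely to be needed soon.
        hint = unit.get("prefetch_id")
        if hint is not None and hint in self.backing:
            self.buffer[hint] = self.backing[hint]

backing = {i: {"value": i * i} for i in range(16)}
backing[3]["prefetch_id"] = 12           # unit 3 hints that unit 12 follows
pf = PrefetchBuffer(backing)
pf.read(3)                               # fills the buffer with units 4 and 12
print(sorted(pf.buffer))                 # -> [4, 12]
```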

41 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
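
One of the distinctions listed above, cold-start versus warm-start miss ratios, is easy to show with a toy direct-mapped cache simulation; the geometry, trace, and warm-up length below are arbitrary choices for illustration.

```python
def miss_ratio(trace, sets=64, line=16, warm_up=0):
    """Direct-mapped cache; warm_up references are simulated but not counted."""
    cache = [None] * sets
    misses = counted = 0
    for i, addr in enumerate(trace):
        ln = addr // line
        idx, tag = ln % sets, ln // sets
        hit = cache[idx] == tag
        if not hit:
            cache[idx] = tag
        if i >= warm_up:                 # warm-start: skip the initial references
            counted += 1
            misses += (not hit)
    return misses / counted

trace = [(i * 16) % 1024 for i in range(4000)]   # looping reference pattern that fits the cache
print("cold-start miss ratio:", miss_ratio(trace))
print("warm-start miss ratio:", miss_ratio(trace, warm_up=1000))
```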

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
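
As a rough reading of the metric (hedged, since the precise formulation is given in the paper itself), the average degree of superpipelining can be summarized as a dynamic-frequency-weighted average of operation latencies; the operation mix and latencies below are invented for the example, not the paper's measurements.

```python
op_mix = {
    # operation class: (dynamic frequency, operation latency in cycles) -- assumed values
    "integer ALU": (0.40, 1),
    "load":        (0.25, 2),
    "store":       (0.15, 1),
    "branch":      (0.15, 2),
    "floating pt": (0.05, 3),
}

avg_degree = sum(freq * lat for freq, lat in op_mix.values())
print(round(avg_degree, 2))   # values above 1.0 indicate latent superpipelining
```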

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
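
A minimal sketch of the "prefetch on every reference" policy the note refers to: on each access, the next sequential line is also brought into the cache. The cache geometry and the trace are illustrative assumptions.

```python
def run(trace, prefetch_always, sets=128, line=32):
    """Count misses for a direct-mapped cache, with optional one-block-lookahead prefetch."""
    cache, misses = [None] * sets, 0

    def touch(ln):
        idx, tag = ln % sets, ln // sets
        present = cache[idx] == tag
        cache[idx] = tag
        return present

    for addr in trace:
        ln = addr // line
        if not touch(ln):
            misses += 1
        if prefetch_always:
            touch(ln + 1)         # prefetch the next sequential line on every reference
    return misses

trace = list(range(0, 32 * 2000, 4))          # sequential scan, 4-byte accesses
print("demand only:    ", run(trace, False))
print("prefetch always:", run(trace, True))
```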

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
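
A small sketch of how inclusion is typically enforced in a two-level hierarchy: whenever the second-level cache evicts a block, it back-invalidates any copy in the first level. The fully-associative LRU levels, equal block sizes, and capacities are simplifying assumptions for the example, not the paper's conditions.

```python
from collections import OrderedDict

class Level:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()           # block address -> True, kept in LRU order

    def insert(self, block):
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)[0]   # evicted block address
        return None

l1, l2 = Level(capacity=4), Level(capacity=16)

def access(block):
    l1.insert(block)
    l2_victim = l2.insert(block)              # inclusion: L2 holds everything the L1 holds
    if l2_victim is not None:
        # Back-invalidate: a block leaving the L2 may not remain in the L1.
        l1.blocks.pop(l2_victim, None)

for b in range(20):
    access(b)
assert all(b in l2.blocks for b in l1.blocks)  # the inclusion property holds
```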

236 citations