Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.

Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.

Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.

Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.

Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
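The victim-caching mechanism is simple enough to sketch in a few lines. The following Python model is a minimal illustration under assumed parameters (line-address granularity, cache geometry), not the paper's simulator:

    from collections import OrderedDict

    class VictimCache:
        """Direct-mapped L1 backed by a small fully-associative victim cache.
        On an L1 miss that hits in the victim cache, the two lines swap
        (the one-cycle penalty case in the paper's model); on a full miss,
        the line displaced from L1 (the victim) enters the victim cache."""

        def __init__(self, l1_lines=1024, victim_entries=4):
            self.l1 = [None] * l1_lines          # one tag per direct-mapped set
            self.l1_lines = l1_lines
            self.victims = OrderedDict()         # small fully-associative store
            self.victim_entries = victim_entries

        def access(self, line_addr):
            """Return 'hit', 'victim_hit', or 'miss' for one line address."""
            index = line_addr % self.l1_lines
            tag = line_addr // self.l1_lines
            if self.l1[index] == tag:
                return 'hit'
            if (index, tag) in self.victims:     # swap the L1 line and the victim
                del self.victims[(index, tag)]
                if self.l1[index] is not None:
                    self._insert_victim(index, self.l1[index])
                self.l1[index] = tag
                return 'victim_hit'
            if self.l1[index] is not None:       # full miss: displaced line becomes the victim
                self._insert_victim(index, self.l1[index])
            self.l1[index] = tag
            return 'miss'

        def _insert_victim(self, index, tag):
            self.victims[(index, tag)] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False) # drop the oldest victim

Two lines that conflict in the same direct-mapped set then ping-pong between L1 and the victim cache instead of going to the next level, which is precisely the conflict-miss traffic the paper reports small victim caches removing.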


Citations
Proceedings ArticleDOI
07 Jun 2004
TL;DR: A virtualisation layer is introduced that allows reconfigurable application-specific coprocessors to access user-space virtual memory and share the memory address space with user applications, and that performs runtime optimizations.
Abstract: Reconfigurable Systems-on-Chip (SoCs) on the market consist of full-fledged processors and large Field-Programmable Gate Arrays (FPGAs). The latter can be used to implement the system glue logic, various peripherals, and application-specific coprocessors. Using FPGAs for application-specific coprocessors has clear speedup potential, but it is less common in practice because of the complexity of interfacing the software application with the coprocessor. Another obstacle is the lack of portability across different systems. In this work, we present a virtualisation layer consisting of an operating-system extension and a hardware component. It lowers the complexity of interfacing, increases the potential for portability, and allows the coprocessor to access user virtual memory through a virtual memory window. The burden of moving data between processor and coprocessor is shifted from the programmer to the operating system. Since the virtualisation-layer components hide the physical details of the system, user-designed hardware and software become fully portable. A reconfigurable SoC running Linux is used to prove the viability of the concept. Two applications are ported to the system to test the approach, with their critical functions mapped to the specific coprocessors. We show a significant speedup compared to the software versions, while only a limited penalty is paid for virtualisation.

37 citations

Patent
Stephen J. Walsh
22 Dec 1992
TL;DR: In this paper, a memory cache system for high-speed computers provides adaptive set associativity, by means of which the degree of associativity of the cache is temporarily and dynamically increased in response to the behavior of the system.
Abstract: A memory cache system for high speed computers provides adaptive set associativity by means of which the degree of associativity of the cache is temporarily and dynamically increased in response to the behavior of the system. More particularly, one or more microcaches are temporarily assigned to a particular location in the main cache memory in response to frequent probes and misses at that location. The temporarily assigned microcache increases the set associativity of that particular location until the microcache is assigned to another location. The reassignability of the microcache reduces both the total cache size required and the complexity of the control circuitry needed to manage the cache. A cache miss threshold is established which must be exceeded before a location is assigned the microcache, and the microcache remains assigned for only a fixed number of cache accesses, called the window size. A least-recently-used replacement algorithm is used to overwrite entries in the cache.
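A rough sketch of the assignment policy described above; the threshold, window size, and structure names here are illustrative stand-ins, not values or terms taken from the patent:

    class AdaptiveSetAssoc:
        """Toy model: a single microcache is temporarily assigned to the
        direct-mapped location whose miss count exceeds a threshold, and
        it stays assigned for a fixed window of accesses."""

        def __init__(self, lines=256, miss_threshold=4, window=1000):
            self.tags = [None] * lines
            self.miss_count = [0] * lines
            self.miss_threshold = miss_threshold
            self.window = window
            self.micro_index = None      # location the microcache is assigned to
            self.micro_tag = None
            self.accesses_left = 0

        def access(self, line_addr):
            index = line_addr % len(self.tags)
            tag = line_addr // len(self.tags)
            if self.accesses_left > 0:
                self.accesses_left -= 1
            else:
                self.micro_index = None  # window expired: microcache is free again
            if self.tags[index] == tag:
                return True              # main-cache hit
            if index == self.micro_index and self.micro_tag == tag:
                return True              # microcache hit: associativity of two here
            self.miss_count[index] += 1  # miss: maybe assign the microcache
            if self.micro_index is None and self.miss_count[index] >= self.miss_threshold:
                self.micro_index = index
                self.accesses_left = self.window
                self.miss_count[index] = 0
            if index == self.micro_index:
                self.micro_tag = self.tags[index]  # displaced line moves to the microcache
            self.tags[index] = tag
            return False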

37 citations

Proceedings ArticleDOI
Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng
19 Sep 2012
TL;DR: In the last-level cache, large numbers of blocks have reuse distances greater than the available cache capacity, so performance can be improved if some subset of these distant-reuse blocks reside in the cache longer; bypassing the insertion of harmful blocks is an effective way to achieve this.
Abstract: In the last-level cache, large numbers of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant reuse blocks can reside in the cache longer. The bypass technique is an effective and attractive solution that prevents the insertion of harmful blocks.
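The fill-path decision reduces to a few lines. A minimal sketch follows, where the predictor that identifies distant-reuse blocks is left abstract because the paper's actual policy is not described here:

    ASSOC = 16  # illustrative last-level-cache associativity

    def insert_with_bypass(cache_set, tag, is_distant_reuse):
        """Fill path with bypass: blocks predicted to have a reuse distance
        larger than the cache capacity are never inserted, so resident
        blocks with nearer reuse are not evicted to make room for them.
        is_distant_reuse is any predictor callable; real policies differ."""
        if is_distant_reuse(tag):
            return                 # bypass: the block goes straight to the requester
        if len(cache_set) >= ASSOC:
            cache_set.pop(0)       # evict this set's LRU block
        cache_set.append(tag)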

37 citations

Patent
08 Nov 1994
TL;DR: In this article, an address translation unit employing a direct-mapped translation lookaside buffer and a relatively small, associative victim translation lookaside buffer for translating linear addresses to physical addresses is described.
Abstract: An address translation unit is disclosed employing a direct-mapped translation lookaside buffer and a relatively small, associative victim translation lookaside buffer for translating linear addresses to physical addresses expediently and avoiding thrashing, without requiring large amounts of hardware and space.
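The structure is victim caching applied to address translation; this sketch's sizes and swap behavior are assumptions for illustration, not claims from the patent:

    from collections import OrderedDict

    class VictimTLB:
        """Direct-mapped TLB backed by a small fully-associative victim TLB.
        Translations evicted from the main array are kept in the victim TLB,
        so two pages whose virtual page numbers conflict in the direct-mapped
        array stop thrashing."""

        def __init__(self, entries=64, victim_entries=8):
            self.tlb = {}                 # index -> (vpn, pfn)
            self.entries = entries
            self.victims = OrderedDict()  # vpn -> pfn
            self.victim_entries = victim_entries

        def translate(self, vpn):
            index = vpn % self.entries
            slot = self.tlb.get(index)
            if slot and slot[0] == vpn:
                return slot[1]            # main TLB hit
            if vpn in self.victims:       # victim hit: swap back into the main TLB
                pfn = self.victims.pop(vpn)
                if slot:
                    self._insert_victim(*slot)
                self.tlb[index] = (vpn, pfn)
                return pfn
            return None                   # full miss: fall back to a page-table walk

        def _insert_victim(self, vpn, pfn):
            self.victims[vpn] = pfn
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)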

36 citations

Book
07 Dec 2009
TL;DR: This book gives a comprehensive description of the architecture of microprocessors from simple in-order short pipeline designs to out-of-order superscalars and demonstrates how design features enhance performance as well as complexity.
Abstract: This book gives a comprehensive description of the architecture of microprocessors, from simple in-order short-pipeline designs to out-of-order superscalars. It discusses topics such as:
- the policies and mechanisms needed for out-of-order processing, such as register renaming, reservation stations, and reorder buffers
- optimizations for high performance, such as branch predictors, instruction scheduling, and load-store speculation
- design choices and enhancements to tolerate latency in the cache hierarchy of single and multiple processors
- state-of-the-art multithreading and multiprocessing, emphasizing single-chip implementations
Topics are presented as conceptual ideas, with metrics to assess the performance impact where appropriate, and examples of realization. The emphasis is on how things work at a black-box and algorithmic level. The author also provides sufficient detail at the register-transfer level so that readers can appreciate how design features enhance performance as well as complexity.

36 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
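Every parameter the survey studies acts through one standard figure of merit, the effective memory access time; as a reminder of the textbook form (not a formula quoted from this survey):

\[ t_{\mathrm{eff}} = t_{\mathrm{hit}} + m \cdot t_{\mathrm{penalty}} \]

With illustrative numbers, t_hit = 1 cycle, miss ratio m = 0.05, and t_penalty = 20 cycles give t_eff = 2 cycles; fetch algorithm, placement and replacement, line size, and write policy all matter exactly insofar as they move m, t_hit, or t_penalty.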

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism, and the average degree of superpipelining metric is introduced; simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
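The rough equivalence can be seen at the level of peak issue rates (a simplified reading, not the paper's full model): a degree-n superscalar issues n instructions per cycle of length T, while a degree-m superpipelined machine issues one instruction per minor cycle of length T/m, so

\[ R_{\mathrm{superscalar}} = \frac{n}{T}, \qquad R_{\mathrm{superpipelined}} = \frac{m}{T}, \]

and with n = m both demand the same amount of instruction-level parallelism from the program to stay busy.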

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
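The quoted range follows from simple stall arithmetic (an illustrative calculation, not the paper's model): if prefetching hides a fraction f of execution time previously spent waiting on misses, the effective speedup is

\[ S = \frac{1}{1 - f}, \]

so hiding f = 0.09 of execution time gives S ≈ 1.10 and f = 0.20 gives S = 1.25, bracketing the 10 to 25 percent figure.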

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
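Enforcing inclusion is mostly an eviction-side rule; a minimal sketch of the back-invalidation approach, assuming equal block sizes at both levels (the paper also treats unequal block sizes):

    def evict_from_l2(l2_lines, l1_lines, victim):
        """Inclusion rule: when a block leaves L2, back-invalidate it in L1,
        so that 'not present in L2' implies 'not present in L1' and the L2
        tags alone can filter coherence probes arriving from the bus.
        l2_lines and l1_lines are sets of resident line addresses."""
        l2_lines.discard(victim)
        l1_lines.discard(victim)  # back-invalidation keeps L1 a subset of L2

This is why inclusion pays off in bus-based designs: the first-level cache never has to snoop traffic that the second-level tags have already filtered.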

236 citations