Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache placed between a cache and its refill path catches conflict misses, and stream buffers hold prefetched data outside the cache itself.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
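The victim-caching scheme is simple enough to model directly. Below is a minimal simulation sketch of a direct-mapped L1 backed by a small fully-associative victim cache; the sizes and names are illustrative assumptions, not taken from the paper.

```python
# Direct-mapped L1 plus a small fully-associative victim cache.
# All geometry below is an assumption for illustration.
from collections import OrderedDict

LINE_BYTES = 32       # bytes per cache line (assumed)
L1_LINES = 256        # L1 capacity in lines (assumed)
VICTIM_ENTRIES = 4    # the paper finds 1 to 5 entries effective

l1 = [None] * L1_LINES        # one tag per direct-mapped set
victim = OrderedDict()        # tag -> True, kept in LRU order

def access(addr):
    """Classify one access as 'hit', 'victim_hit', or 'miss'."""
    tag = addr // LINE_BYTES
    idx = tag % L1_LINES
    if l1[idx] == tag:
        return "hit"
    if tag in victim:
        # Victim-cache hit: swap the conflicting lines. This is the
        # one-cycle-penalty case described in the abstract.
        victim.pop(tag)
        evicted, l1[idx] = l1[idx], tag
        if evicted is not None:
            victim[evicted] = True
        return "victim_hit"
    # True miss: the line comes from the next level, and the victim
    # (the displaced line), not the requested line, enters the buffer.
    evicted, l1[idx] = l1[idx], tag
    if evicted is not None:
        victim[evicted] = True
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)   # drop the LRU victim entry
    return "miss"
```

With two hot addresses that map to the same L1 set, this model returns 'victim_hit' instead of 'miss' from the second round of the conflict onward, which is precisely the conflict-miss removal the abstract reports.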


Citations
Patent
16 Jun 1994
TL;DR: In this paper, a cache line replacing apparatus is proposed for use in a computer system having a central processing unit (CPU), a main memory and a cache memory, in which cache line information on the CPU/cache bus is written back into the main memory, and desired cache line information is read from the main memory onto a memory bus for transmission to the CPU/cache bus.
Abstract: A cache line replacing apparatus for use in a computer system having a central processing unit (CPU), a main memory and a cache memory, in which cache line information on the CPU/cache bus is written back into the main memory, and desired cache line information is read from the main memory onto a memory bus for transmission to the CPU/cache bus. The cache line replacing apparatus includes a first storage unit in which write-back data from the CPU/cache bus is stored; a second storage unit in which data read from the main memory onto the memory bus is stored; a register whose count value is increased when data is stored in the second storage unit and decreased when data is read from it; and a multiplexer that, according to the count value of the register, selectively transmits either the data stored in the second storage unit or the data on the memory bus to the CPU/cache bus. Accordingly, the time delay due to write-back buffering is avoided and the CPU can read data without loss of memory bandwidth.
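As a rough illustration of the counting scheme this abstract describes, the sketch below models the two storage units, the counter register, and the mux-style selection between the buffer and the memory bus; every name in it is hypothetical.

```python
# Toy model of the patent's counting scheme; names are hypothetical.
from collections import deque

write_back_buf = deque()  # first storage unit: dirty lines awaiting write-back
read_buf = deque()        # second storage unit: lines fetched from memory
count = 0                 # the register: number of valid read_buf entries

def cpu_write_back(line):
    """The CPU/cache bus posts a dirty line; memory is updated later."""
    write_back_buf.append(line)

def memory_fetch(line):
    """A requested line arrives on the memory bus; the count increases."""
    global count
    read_buf.append(line)
    count += 1

def cpu_read():
    """Mux-style selection: feed the CPU from the buffer when count > 0."""
    global count
    if count > 0:
        count -= 1
        return read_buf.popleft()
    return None  # count == 0: data would come straight off the memory bus
```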

26 citations

Proceedings ArticleDOI
22 Sep 2005
TL;DR: Results show that the proposed triangular D-NUCA cache approach is particularly useful in the embedded applications domain, as it permits the use of half-sized NUCA caches with performance improvements.
Abstract: Embedded applications require high-performance processors integrating fast and low-power caches. Dynamic non-uniform cache architectures (D-NUCA) have been proposed to overcome the performance limit introduced by wire delays when designing large caches. In this paper, we propose an alternative design of the D-NUCA cache, namely the triangular D-NUCA cache, to reduce the power consumption and silicon area occupancy of D-NUCA caches. We compare the performance of triangular D-NUCA caches with the conventional rectangular organization. Results show that our approach is particularly useful in the embedded applications domain, as it permits the use of half-sized NUCA caches with performance improvements.

26 citations

Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper presents the code block working set (CBWS) prefetcher, which captures the working set of complete loop iterations using a single context and improves the performance of existing prefetchers when dealing with tight loops.
Abstract: Memory prefetchers predict streams of memory addresses that are likely to be accessed by recurring invocations of a static instruction. They identify an access pattern and prefetch the data that is expected to be accessed by pending invocations of the said instruction. A stream, or a prefetch context, is thus typically composed of a trigger instruction and an access pattern. Recurring code blocks, such as loop iterations may, however, include multiple memory instructions. Accurate data prefetching for recurring code blocks thus requires tight coordination across multiple prefetch contexts. This paper presents the code block working set (CBWS) prefetcher, which captures the working set of complete loop iterations using a single context. The prefetcher is based on the observation that code block working sets are highly interdependent across tight loop iterations. Using automated annotation of tight loops, the prefetcher tracks and predicts the working sets of complete loop iterations. The proposed CBWS prefetcher is evaluated using a set of benchmarks from the SPEC CPU2006, PARSEC, SPLASH and Parboil suites. Our evaluation shows that the CBWS prefetcher improves the performance of existing prefetchers when dealing with tight loops. For example, we show that the integration of the CBWS prefetcher with the state-of-the-art spatial memory streaming (SMS) prefetcher achieves an average speedup of 1.16× (up to 4× ), compared to the standalone SMS prefetcher.
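To make the code-block idea concrete, here is a toy sketch of a prefetcher that treats the addresses touched by one loop iteration as a single working set and prefetches the same pattern shifted by the observed per-instruction stride. This is a simplification of CBWS; the class name and heuristics are assumptions, not the paper's design.

```python
# Toy loop-iteration working-set prefetcher (simplified CBWS idea).
class LoopWorkingSetPrefetcher:
    def __init__(self):
        self.prev_set = None   # addresses touched in the previous iteration
        self.cur_set = []      # addresses touched so far in this iteration

    def record(self, addr):
        """Called on every memory access inside the annotated loop."""
        self.cur_set.append(addr)

    def iteration_end(self):
        """Called at the loop back-edge; returns addresses to prefetch."""
        prefetches = []
        if self.prev_set is not None and len(self.prev_set) == len(self.cur_set):
            # One stride per memory instruction in the code block:
            strides = [c - p for p, c in zip(self.prev_set, self.cur_set)]
            prefetches = [c + s for c, s in zip(self.cur_set, strides)]
        self.prev_set, self.cur_set = self.cur_set, []
        return prefetches
```

A simulator would call record() on each access in the loop body and issue the addresses returned by iteration_end() at every back-edge, so a single context covers all of the block's memory instructions at once.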

26 citations

01 Jan 1997
TL;DR: This paper presents a novel design of the Selective and Dual Data Caches in which the locality estimation is based on a locality analysis performed at compile time, and extends this locality analysis by introducing a compile-time interference analysis that significantly improves the performance of the system.
Abstract: The performance of cache memories relies on the locality exhibited by programs. Traditionally this locality has been divided into temporal and spatial, and conventional caches exploit both types of locality. For the sake of simplicity, cache memories manage all memory references in the same way regardless of their locality. This homogeneous treatment has an important drawback: references with only temporal locality, or with no locality at all, bring useless data into the cache. This results in an unnecessary increase in memory access time and memory bandwidth, in addition to the penalties introduced by potentially removing useful data from the cache. In a previous paper, we proposed two novel organizations, called the Selective Cache and the Dual Data Cache, as an attempt to avoid this loss of performance by managing each memory reference differently depending on its type of locality. In those proposals, the locality of each reference was dynamically estimated at run time based on its history. In this paper, we present a novel design of the Selective and Dual Data Caches in which the locality estimation is based on a locality analysis performed at compile time. We show that previous proposals on locality analysis are not appropriate when programs have a high conflict miss ratio. This paper extends that locality analysis by introducing a compile-time interference analysis that significantly improves the performance of the system.
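A minimal sketch of the routing idea behind a selective/dual data cache follows, assuming the compiler attaches a per-reference locality hint; the structure and names are illustrative, and the replacement policy is omitted for brevity.

```python
# Route each reference by its compile-time locality hint (illustrative).
TEMPORAL, SPATIAL, BYPASS = "temporal", "spatial", "bypass"

temporal_cache = set()   # word-grained: exploits reuse of single words
spatial_cache = set()    # line-grained: exploits neighbouring accesses

def access(addr, hint, line_bytes=32):
    """Service one reference according to its locality hint."""
    if hint == BYPASS:
        return "uncached"           # no locality: avoid polluting either cache
    if hint == TEMPORAL:
        hit = addr in temporal_cache
        temporal_cache.add(addr)    # cache only the referenced word
        return "hit" if hit else "miss"
    block = addr // line_bytes      # SPATIAL: bring in the whole line
    hit = block in spatial_cache
    spatial_cache.add(block)
    return "hit" if hit else "miss"
```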

26 citations

01 Jan 2011
TL;DR: The goal of this book is to present an overview of the current state-of-the-art in computer architecture performance evaluation, with a special emphasis on methods for exploring processor architectures.
Abstract: Performance evaluation is at the foundation of computer architecture research and development. Contemporary microprocessors are so complex that architects cannot design systems based on intuition and simple models only. Adequate performance evaluation methods are absolutely crucial to steer the research and development process in the right direction. However, rigorous performance evaluation is non-trivial as there are multiple aspects to performance evaluation, such as picking workloads, selecting an appropriate modeling or simulation approach, running the model and interpreting the results using meaningful metrics. Each of these aspects is equally important and a performance evaluation method that lacks rigor in any of these crucial aspects may lead to inaccurate performance data and may drive research and development in a wrong direction. The goal of this book is to present an overview of the current state-of-the-art in computer architecture performance evaluation, with a special emphasis on methods for exploring processor architectures. The book focuses on fundamental concepts and ideas for obtaining accurate performance data. The book covers various topics in performance evaluation, ranging from performance metrics, to workload selection, to various modeling approaches including mechanistic and empirical modeling. And because simulation is by far the most prevalent modeling technique, more than half the book’s content is devoted to simulation. The book provides an overview of the simulation techniques in the computer designer’s toolbox, followed by various simulation acceleration techniques including sampled simulation, statistical simulation, parallel simulation and hardware-accelerated simulation.

26 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This survey examines cache design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
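Many of the design aspects listed here can be explored with a trace-driven simulator of the kind this survey draws on. Below is a toy example measuring how line size changes the miss ratio of a direct-mapped cache; the geometry and trace are assumptions for illustration.

```python
# Trace-driven miss-ratio measurement for a direct-mapped cache.
def miss_ratio(trace, cache_bytes=8192, line_bytes=32):
    sets = cache_bytes // line_bytes
    tags = [None] * sets
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        idx = block % sets
        if tags[idx] != block:
            misses += 1
            tags[idx] = block
    return misses / len(trace)

trace = [i * 4 for i in range(4096)]      # sequential word accesses (assumed)
print(miss_ratio(trace, line_bytes=16))   # 0.25: one miss per 16-byte line
print(miss_ratio(trace, line_bytes=64))   # 0.0625: longer lines cut misses
```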

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
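As a back-of-the-envelope illustration, one plausible reading of the average degree of superpipelining is instruction latency weighted by dynamic frequency; the instruction mix below is invented purely for illustration and is not from the paper.

```python
# Frequency-weighted latency as a stand-in for the superpipelining
# metric (our reading of it, with an invented instruction mix).
mix = {               # name: (dynamic frequency, latency in cycles)
    "alu":    (0.45, 1),
    "load":   (0.25, 2),
    "branch": (0.20, 2),
    "fp":     (0.10, 3),
}
degree = sum(freq * lat for freq, lat in mix.values())
print(f"average degree of superpipelining = {degree:.2f}")  # 1.65 here
```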

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
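Operationally, the inclusion property says every block resident in the first-level cache must lie within a block resident in the second level, even when the levels use different block sizes. A small check over a snapshot of the hierarchy, with names and line sizes assumed:

```python
# Check multilevel inclusion on a snapshot of resident line addresses.
def inclusion_holds(l1_lines, l2_lines, l2_line_bytes=128):
    """True iff every resident L1 line falls inside a resident L2 line."""
    l2_resident = {addr // l2_line_bytes for addr in l2_lines}
    return all(addr // l2_line_bytes in l2_resident for addr in l1_lines)

print(inclusion_holds([0, 4096], [0, 4096]))  # True: both lines covered
print(inclusion_holds([8192], [0]))           # False: 8192 not in L2
```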

236 citations