Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path (miss and victim caching), and prefetch buffers hold prefetched data outside the cache (stream buffers).
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.

Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.

Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.

Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.

Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
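
The victim-caching idea above is concrete enough to sketch in code. Below is a minimal, illustrative Python simulation, not Jouppi's hardware design: a direct-mapped cache backed by a tiny fully-associative LRU victim cache, exercised on two addresses that map to the same set. The cache geometry and the access trace are assumptions chosen to make the conflict-miss behavior visible.

```python
from collections import OrderedDict

LINE = 16   # bytes per cache line (assumed)
SETS = 64   # direct-mapped: one line per set (assumed)

class VictimCached:
    def __init__(self, victim_entries=4):
        self.sets = [None] * SETS        # one tag per direct-mapped set
        self.victim = OrderedDict()      # fully-associative, LRU-ordered
        self.victim_entries = victim_entries
        self.misses = 0                  # misses that go to the next level
        self.victim_hits = 0             # misses caught by the victim cache

    def access(self, addr):
        line = addr // LINE
        idx, tag = line % SETS, line // SETS
        if self.sets[idx] == tag:
            return                       # hit in the direct-mapped cache
        if line in self.victim:          # one-cycle victim-cache hit
            self.victim.pop(line)
            self.victim_hits += 1
        else:
            self.misses += 1             # full miss: fetch from next level
        evicted = self.sets[idx]
        if evicted is not None:          # the victim of the miss, not the
            self.victim[evicted * SETS + idx] = True   # requested line,
            if len(self.victim) > self.victim_entries: # goes to the
                self.victim.popitem(last=False)        # victim cache
        self.sets[idx] = tag

# Two addresses with the same index thrash a plain direct-mapped cache,
# but after the first pair of misses they keep hitting the victim cache.
c = VictimCached(victim_entries=2)
a, b = 0, SETS * LINE                    # same set index, different tags
for _ in range(100):
    c.access(a); c.access(b)
print(c.misses, c.victim_hits)           # 2 198
```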


Citations
Journal ArticleDOI
TL;DR: In this article, three different arbitration policies are presented, evaluated, and compared with respect to their real-time and average-case performance: a fixed-priority, a fairness-based, and a time-sliced arbiter.
Abstract: Chip-multiprocessors are an emerging trend for embedded systems. In this article, we introduce a real-time Java multiprocessor called JopCMP. It is a symmetric shared-memory multiprocessor, and consists of up to eight Java Optimized Processor (JOP) cores, an arbitration control device, and a shared memory. All components are interconnected via a system-on-chip bus. The arbiter synchronizes the access of multiple CPUs to the shared main memory. In this article, three different arbitration policies are presented, evaluated, and compared with respect to their real-time and average-case performance: a fixed-priority, a fairness-based, and a time-sliced arbiter.

Tasks running on different CPUs of a chip-multiprocessor (CMP) influence each other's execution times when accessing a shared memory. Therefore, the system needs an arbiter that is able to limit the worst-case execution time of a task running on a CPU, even though tasks executing simultaneously on other CPUs access the main memory. Our research shows that timing analysis is in fact possible for homogeneous multiprocessor systems with a shared memory. The timing analysis of tasks executing on the CMP using time-sliced memory arbitration leads to viable worst-case execution time bounds.

The time-sliced arbiter divides the memory access time into equal time slots, one time slot for each CPU. This memory arbitration scheme allows for a calculation of upper bounds of Java application worst-case execution times, depending on the number of CPUs, the time slot size, and the memory access time. Examples of worst-case execution time calculation are presented, and the analyzed results of a real-world application task are compared to measured execution time results. Finally, we evaluate the tradeoffs when using a time-predictable solution compared to using average-case optimized chip-multiprocessors, applying three different benchmarks. These experiments are carried out by executing the programs on the CMP prototype.
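
A rough sketch of the per-access bound such a time-sliced arbiter makes possible may help; the formula below is a conservative illustration of the dependence on CPU count, slot size, and memory access time that the abstract mentions, not the paper's exact analysis.

```python
def worst_case_access(n_cpus: int, slot: int, t_mem: int) -> int:
    """Upper bound, in cycles, for one shared-memory access.

    Assumes round-robin slots of `slot` cycles, one per CPU, and that an
    access of `t_mem` cycles must complete within a single slot.
    """
    assert slot >= t_mem
    # Worst case: the request arrives just too late to finish in the
    # current slot, waits out the rest of its own slot plus every other
    # CPU's slot, then performs the access in its next slot.
    return (slot - 1) + (n_cpus - 1) * slot + t_mem

# A task performing k memory accesses then has a memory-delay bound of
# k * worst_case_access(...), independent of what the other CPUs run.
print(worst_case_access(n_cpus=4, slot=6, t_mem=4))   # 27 cycles
```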

78 citations

Proceedings ArticleDOI
10 Feb 2007
TL;DR: It is shown that cache and link (off-chip interconnect) compression can increase the effective cache capacity, increase the effective off-chip bandwidth, and reduce contention in chip multiprocessors.
Abstract: In chip multiprocessors (CMPs), multiple cores compete for shared resources such as on-chip caches and off-chip pin bandwidth. Stride-based hardware prefetching increases demand for these resources, causing contention that can degrade performance (up to 35% for one of our benchmarks). In this paper, we first show that cache and link (off-chip interconnect) compression can increase the effective cache capacity (thereby reducing off-chip misses) and increase the effective off-chip bandwidth (reducing contention). On an 8-processor CMP with no prefetching, compression improves performance by up to 18% for commercial workloads. Second, we propose a simple adaptive prefetching mechanism that uses cache compression's extra tags to detect useless and harmful prefetches. Furthermore, in the central result of this paper, we show that compression and prefetching interact in a strong positive way, resulting in a combined performance improvement of 10-51% for seven of our eight workloads.
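
The adaptive mechanism lends itself to a sketch. The Python below illustrates the classification idea described in the abstract (spare tags identify useless and harmful prefetches, and the prefetcher is throttled accordingly); the structures, thresholds, and policy are assumptions, not the paper's mechanism.

```python
from collections import deque

class AdaptivePrefetchFilter:
    def __init__(self, shadow_entries=64):
        self.shadow = deque(maxlen=shadow_entries)  # tags evicted by prefetches
        self.unused = set()          # prefetched lines not yet demanded
        self.useless = 0             # prefetched, evicted before any use
        self.harmful = 0             # prefetch evicted a line demanded later
        self.degree = 2              # current prefetch aggressiveness

    def on_prefetch_fill(self, line, victim_line):
        self.unused.add(line)
        if victim_line is not None:
            self.shadow.append(victim_line)

    def on_demand_access(self, line):
        self.unused.discard(line)    # the prefetch proved useful
        if line in self.shadow:      # a prefetch displaced this line
            self.harmful += 1

    def on_eviction(self, line):
        if line in self.unused:      # never used: wasted bandwidth/capacity
            self.unused.discard(line)
            self.useless += 1

    def adjust(self):
        # Crude policy: back off when prefetches hurt, ramp up otherwise.
        if self.useless + self.harmful > 8:
            self.degree = max(0, self.degree - 1)
        else:
            self.degree = min(4, self.degree + 1)
        self.useless = self.harmful = 0

f = AdaptivePrefetchFilter()
f.on_prefetch_fill(line=10, victim_line=3)
f.on_demand_access(3)                # line 3 demanded again: harmful
f.adjust()
print(f.harmful, f.degree)           # 0 3 (counters reset by adjust)
```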

77 citations

Patent
Sharad Mehrotra1
21 Jan 1998
TL;DR: In this article, a multi-level cache and a method for its operation are presented that process multiple cache-system accesses simultaneously and handle the interactions between the queues of the cache levels; controller logic governs the interactions among the miss, victim, and write queues.
Abstract: A multi-level cache and a method for its operation are presented for processing multiple cache system accesses simultaneously and handling the interactions between the queues of the cache levels. The cache unit includes a non-blocking cache receiving data access requests from a functional unit in a processor, and a miss queue storing entries corresponding to data access requests not serviced by the non-blocking cache. A victim queue stores entries of the non-blocking cache which have been evicted from the non-blocking cache, while a write queue buffers write requests into the non-blocking cache. Controller logic controls the interactions between the miss queue and the victim queue, between the miss queue and the write queue, and between the victim queue and the miss queue for processing cache misses.
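
A compact sketch of those queue interactions may make the structure clearer; the checks and names below are illustrative assumptions, not the patent's actual controller logic.

```python
from collections import deque

class CacheQueues:
    def __init__(self):
        self.miss_q = deque()     # outstanding misses to the next level
        self.victim_q = deque()   # evicted lines awaiting writeback
        self.write_q = deque()    # buffered writes into the cache

    def handle_miss(self, line):
        # Check the victim queue first: if the line was just evicted but
        # not yet written back, service the miss from there.
        for v_line, data in self.victim_q:
            if v_line == line:
                return data
        # A pending buffered write to the same line must be ordered
        # before the refill so the write is not lost.
        if any(w_line == line for w_line, _ in self.write_q):
            self.drain_writes()
        self.miss_q.append(line)  # non-blocking: record and continue
        return None

    def drain_writes(self):
        while self.write_q:
            self.write_q.popleft()   # apply the write to the cache (elided)

q = CacheQueues()
q.victim_q.append((7, b"old"))
print(q.handle_miss(7))              # b'old': serviced from the victim queue
```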

77 citations

Journal ArticleDOI
11 Jun 2006
TL;DR: A novel profile-based analysis is presented for producing a cache-conscious coallocation of a hot data stream H; the resulting solution generalizes common data restructuring techniques, such as field reordering, object splitting, and object merging, and allows their combination.
Abstract: The memory system performance of many programs can be improved by coallocating contemporaneously accessed heap objects in the same cache block. We present a novel profile-based analysis for producing such a layout. The analysis achieves cache-conscious coallocation of a hot data stream H (i.e., a regular data access pattern that frequently repeats) by isolating and combining allocation sites of object instances that appear in H such that intervening allocations coming from other sites are separated. The coallocation solution produced by the analysis is enforced by an automatic tool, cminstr, that redirects a program's heap allocations to a run-time coallocation library comalloc. We also extend the analysis to coallocation at object field granularity. The resulting field coallocation solution generalizes common data restructuring techniques, such as field reordering, object splitting, and object merging, and allows their combination. Furthermore, it provides insight into object restructuring by breaking down the coallocation benefit on a per-technique basis, which provides the opportunity to pick the "sweet spot" for each program. Experimental results using a set of memory-performance-limited benchmarks, including a few SPECInt2000 programs, and Microsoft VisualFoxPro, indicate that programs possess significant coallocation opportunities. Automatic object coallocation improves execution time by 13% on average in the presence of hardware prefetching. Hand-implemented field coallocation solutions for two of the benchmarks produced additional improvements (12% and 22%) but the effort involved suggests implementing an automated version for type-safe languages, such as Java and C#.
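
A toy sketch of the coallocation idea (not the cminstr/comalloc implementation): allocation sites whose objects appear in the hot data stream are redirected to one shared bump arena, so their instances end up contiguous and can share cache blocks. Site names, sizes, and the arena model are assumptions.

```python
class CoallocArena:
    """A bump allocator; objects allocated here end up contiguous."""
    def __init__(self):
        self.offset, self.objects = 0, []

    def alloc(self, site, size):
        addr = self.offset
        self.offset += size
        self.objects.append((site, addr))
        return addr

def build_coallocator(hot_sites):
    """Route hot-stream allocation sites to one arena, everything else
    to another, so cold allocations are not interleaved between hot ones."""
    hot, cold = CoallocArena(), CoallocArena()
    def allocate(site, size):
        return (hot if site in hot_sites else cold).alloc(site, size)
    return allocate, hot, cold

allocate, hot, cold = build_coallocator(hot_sites={"A", "B"})
for site, size in [("A", 16), ("C", 48), ("B", 16), ("A", 16)]:
    allocate(site, size)
# A hot stream touching A,B,A,... now walks consecutive bytes instead of
# addresses with a cold 48-byte C object interleaved between them.
print([addr for _, addr in hot.objects])   # [0, 16, 32]
```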

76 citations

Journal ArticleDOI
TL;DR: A few simple guidelines are proposed for revising a developed algorithm to increase cache utilization, and optimization techniques, including particle swarm optimization and the genetic algorithm, are employed.

76 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
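
Many of the parameters this survey studies (fetch policy, replacement, line size, cold- versus warm-start miss ratios) fit into a few lines of simulation. The sketch below is an illustrative miss-ratio simulator under assumed parameters, not a reconstruction of the paper's methodology.

```python
from collections import OrderedDict

def miss_ratio(trace, sets=64, ways=2, line=32, warmup=0):
    """Demand-fetch, LRU, set-associative miss ratio for an address trace.
    `warmup` accesses are simulated but not counted (warm-start ratio)."""
    cache = [OrderedDict() for _ in range(sets)]
    misses = total = 0
    for i, addr in enumerate(trace):
        ln = addr // line
        s, tag = cache[ln % sets], ln // sets
        hit = tag in s
        if hit:
            s.move_to_end(tag)           # refresh LRU position
        else:
            s[tag] = True
            if len(s) > ways:
                s.popitem(last=False)    # evict the LRU way
        if i >= warmup:
            total += 1
            misses += not hit
    return misses / total

trace = [i * 32 for i in range(128)] * 4   # loop over 128 distinct lines
print(miss_ratio(trace))                   # 0.25: cold-start misses counted
print(miss_ratio(trace, warmup=128))       # 0.0: warm-start, the loop fits
```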

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
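
A toy stand-in for the memory-bandwidth side of such benchmarks, timing one large in-memory copy; this is an illustrative probe, not a reproduction of the report's benchmark suite, and the buffer size is an assumption.

```python
import time

N = 64 * 1024 * 1024                    # 64 MiB buffer (assumed size)
src = bytearray(N)
t0 = time.perf_counter()
dst = bytes(src)                        # one full read pass + write pass
dt = time.perf_counter() - t0
print(f"{2 * N / dt / 1e9:.2f} GB/s")   # bytes read + bytes written
```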

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
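
One common reading of the average degree of superpipelining is an instruction-frequency-weighted average operation latency in cycles; the sketch below illustrates that arithmetic with a made-up instruction mix, not the paper's exact definition or measurements.

```python
# class: (frequency, latency in cycles) -- illustrative numbers only
mix = {
    "alu":    (0.40, 1),
    "load":   (0.25, 2),
    "branch": (0.15, 2),
    "fp":     (0.20, 3),
}
assert abs(sum(f for f, _ in mix.values()) - 1.0) < 1e-9
avg = sum(f * lat for f, lat in mix.values())
print(avg)   # 1.8: the machine already behaves as if ~1.8-way
             # superpipelined, leaving little parallelism for wider
             # issue or deeper pipelines to exploit in such code
```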

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
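
The policy evaluated here, prefetching on every reference, is easy to illustrate. The sketch below compares miss counts with and without next-line prefetching on a purely sequential trace; the cache size, FIFO replacement, and trace are assumptions, and the dramatic gain reflects the friendly trace rather than the paper's 10 to 25 percent figures.

```python
def misses(trace, lines=64, prefetch=True):
    cache, order, count = set(), [], 0   # FIFO replacement (assumed)
    def insert(ln):
        if ln in cache:
            return
        cache.add(ln)
        order.append(ln)
        if len(cache) > lines:
            cache.discard(order.pop(0))
    for ln in trace:
        if ln not in cache:
            count += 1
            insert(ln)
        if prefetch:
            insert(ln + 1)               # fetch the next sequential line
    return count

trace = list(range(1000))                # sequential line references
print(misses(trace, prefetch=False))     # 1000: every access is a cold miss
print(misses(trace, prefetch=True))      # 1: prefetch hides the stream
```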

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies; a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
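
The inclusion property is easy to violate even when L2 is no smaller than L1, which is why conditions and mechanisms like those in this paper are needed. The sketch below shows the classic failure: L1 hits are invisible to L2, so L2's LRU state goes stale and L2 evicts a line L1 still holds. The cache shapes and trace are assumptions.

```python
from collections import OrderedDict

class Cache:
    def __init__(self, sets, ways):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.n, self.ways = sets, ways

    def lines(self):
        return {ln for s in self.sets for ln in s}

    def access(self, ln):
        """Return True on a hit; on a miss, fill and LRU-evict if full."""
        s = self.sets[ln % self.n]
        if ln in s:
            s.move_to_end(ln)
            return True
        s[ln] = True
        if len(s) > self.ways:
            s.popitem(last=False)
        return False

l1 = Cache(sets=1, ways=2)
l2 = Cache(sets=1, ways=2)               # same capacity as L1
for ln in [0, 1, 0, 2]:                  # the second 0 hits in L1 only,
    if not l1.access(ln):                # so L2 never sees it and its
        l2.access(ln)                    # LRU ordering goes stale
print(l1.lines() <= l2.lines())          # False: L1 keeps 0, L2 evicted it
```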

236 citations