Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in separate buffers rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
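To make the victim-caching mechanism above concrete, here is a minimal software sketch: a direct-mapped L1 backed by a small fully-associative victim cache, where a line evicted from L1 moves into the victim cache and a victim-cache hit swaps it back. The sizes, the access interface, and the LRU bookkeeping are illustrative assumptions, not the paper's simulator; only the swap-on-victim-hit behavior follows the abstract.

```python
# Minimal sketch of a direct-mapped cache backed by a small victim cache.
# Sizes and the access interface are illustrative, not the paper's simulator.
from collections import OrderedDict

LINE_SIZE = 32          # bytes per cache line (assumed)
NUM_SETS = 256          # direct-mapped L1: one line per set (assumed)
VICTIM_ENTRIES = 4      # small fully-associative victim cache (1-5 typical)

l1 = [None] * NUM_SETS                 # l1[index] holds a tag or None
victim = OrderedDict()                 # tag -> True, insertion order ~ LRU

def access(addr):
    """Return 'l1_hit', 'victim_hit', or 'miss' for a byte address."""
    line = addr // LINE_SIZE
    index, tag = line % NUM_SETS, line
    if l1[index] == tag:
        return "l1_hit"                          # normal one-cycle hit
    evicted = l1[index]                          # the would-be victim
    if tag in victim:                            # small extra penalty only
        victim.pop(tag)
        if evicted is not None:
            victim[evicted] = True               # swap victim with L1 line
        l1[index] = tag
        return "victim_hit"
    # true miss: refill L1 from the next level, save the victim line
    l1[index] = tag
    if evicted is not None:
        victim[evicted] = True
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)           # evict oldest victim entry
    return "miss"

# Two addresses that conflict in the direct-mapped cache but not in the
# victim cache: alternating accesses hit after the first round trip.
a, b = 0x0000, NUM_SETS * LINE_SIZE
print([access(x) for x in (a, b, a, b)])  # ['miss', 'miss', 'victim_hit', 'victim_hit']
```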


Citations
Book ChapterDOI
01 Jan 2004
TL;DR: This chapter proposes an asymmetric cache structure in which each way can be a different size, achieving performance comparable to a conventional cache of similar size and equal associativity.
Abstract: Data caches are widely used in general-purpose processors as a means to hide long memory latencies. Set-associativity in these caches helps programs avoid performance problems due to cache-mapping conflicts. Current set-associative caches are symmetric in the sense that each way has the same number of cache lines. Moreover, each way is searched in parallel, so energy is consumed by all ways even though at most one way will hit. With this in mind, this chapter proposes an asymmetric cache structure in which the size of each way can be different. The ways of the cache are different powers of two and allow for a “tree-structured” cache in which extra associativity can be shared. We accomplish this by having two cache blocks from the larger ways align with individual cache blocks in the smaller ways. This structure achieves performance comparable to a conventional cache of similar size and equal associativity. Most notably, the asymmetric cache has the nice property that accesses that hit in the smaller ways can immediately terminate accesses to the larger ways so that power can be saved. For the SPEC2000 benchmarks, we found cache energy per access was reduced by as much as 23% on average. The characteristics of the asymmetric set-associative design (low power, uncompromised performance, compact layout) make it particularly attractive for low-power processors.
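A rough sketch of the lookup idea described above, assuming ways of different power-of-two sizes and a probe order that checks the smallest way first so a hit can skip the larger ways. The way sizes, the sequential probing, and the function names are invented for illustration; the chapter's hardware searches ways in parallel and terminates the larger accesses on an early hit.

```python
# Rough sketch of an asymmetric set-associative lookup: each way is a
# different power-of-two size, and a hit in a smaller way means the larger
# ways need not be examined (in hardware, their access is terminated).
LINE_SIZE = 32                      # bytes per line (assumed)
WAY_SETS = [128, 256, 512]          # sets per way: different powers of two

ways = [dict() for _ in WAY_SETS]   # each way: set index -> tag

def lookup(addr):
    line = addr // LINE_SIZE
    for w, nsets in enumerate(WAY_SETS):
        index, tag = line % nsets, line // nsets
        if ways[w].get(index) == tag:
            return f"hit in way {w}"    # smaller ways checked first; the
                                        # remaining larger ways are skipped
    return "miss"

def fill(addr, way):
    """Install a line into the chosen way (placement policy not modeled)."""
    line = addr // LINE_SIZE
    nsets = WAY_SETS[way]
    ways[way][line % nsets] = line // nsets

fill(0x1000, way=0)
print(lookup(0x1000))   # 'hit in way 0' -- ways 1 and 2 never probed
print(lookup(0x2000))   # 'miss'
```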

6 citations

Dissertation
01 Jan 2007
TL;DR: A split data cache architecture is proposed that will group memory accesses as scalar or array references according to their inherent locality and will subsequently map each group to a dedicated cache partition, to reduce area and power consumed by cache memories while retaining performance gains.
Abstract: Existing cache organizations suffer from the inability to distinguish different types of locality, and non-selectively cache all data rather than making any attempt to take special advantage of the locality type. This causes unnecessary movement of data among the levels of the memory hierarchy and increases the miss ratio. In this dissertation I propose a split data cache architecture that will group memory accesses as scalar or array references according to their inherent locality and will subsequently map each group to a dedicated cache partition. In this system, because scalar and array references will no longer negatively affect each other, cache interference is diminished, delivering better performance. Further improvement is achieved by the introduction of a victim cache, prefetching, data flattening and reconfigurability to tune the array and scalar caches for specific applications. The most significant contribution of my work is the introduction of a novel cache architecture for embedded microprocessor platforms. My proposed cache architecture uses reconfigurability coupled with split data caches to reduce the area and power consumed by cache memories while retaining performance gains. My results show excellent reductions in both memory size and memory access times, translating into reduced power consumption. Since there was a huge reduction in miss rates at L-1 caches, further power reduction is achieved by partially or completely shutting down L-2 data or L-2 instruction caches. The savings in cache sizes resulting from these designs can be used for other processor activities, including instruction and data prefetching and branch-prediction buffers. The potential benefits of such techniques for embedded applications have been evaluated in my work. I also explore how my cache organization performs for non-numeric data structures. I propose a novel idea called “data flattening”, which is a profile-based memory allocation technique to compress sparsely scattered pointer data into regular contiguous memory locations, and explore the potential of my proposed split cache organization for data treated with the data-flattening method.
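A toy illustration of the split-data-cache routing described above: references classified as scalar or array go to separate partitions so they cannot evict each other. The partition sizes, the direct-mapped organization, and the classification flag are assumptions made for the sketch; the dissertation derives the scalar/array grouping from the program's reference behavior.

```python
# Toy split data cache: scalar and array references are routed to separate
# partitions so they cannot interfere. Partition sizes are assumed values.
class DirectMappedCache:
    def __init__(self, num_sets, line_size=32):
        self.num_sets, self.line_size = num_sets, line_size
        self.lines = [None] * num_sets

    def access(self, addr):
        """Return True on hit; install the line on a miss."""
        line = addr // self.line_size
        index, tag = line % self.num_sets, line
        hit = self.lines[index] == tag
        self.lines[index] = tag
        return hit

scalar_cache = DirectMappedCache(num_sets=64)    # small scalar partition
array_cache = DirectMappedCache(num_sets=256)    # larger array partition

def access(addr, is_array):
    cache = array_cache if is_array else scalar_cache
    return cache.access(addr)

print(access(0x100, is_array=False))   # False: first touch misses
print(access(0x100, is_array=False))   # True: scalar partition hit
```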

6 citations

Book ChapterDOI
Gi-Ho Park, Kil-Whan Lee, Jae Hyuk Lee, Tack-Don Han, Shin-Dug Kim
18 Jun 2000
TL;DR: In this paper, a cooperative cache system consisting of two caches, i.e., a direct-mapped temporal oriented cache and a four-way set-associative spatial oriented cache, is proposed.
Abstract: A dual data cache system structure, called a cooperative cache system, is designed as a low-power cache structure for embedded processors. The cooperative cache system consists of two caches, i.e., a direct-mapped temporal oriented cache (TOC) and a four-way set-associative spatial oriented cache (SOC). These two caches are constructed with different block sizes as well as associativities. The block size of the TOC is 8 bytes and that of the SOC is 32 bytes, and the capacity of each cache is 8 Kbytes. The cooperative cache system achieves improvement in performance and reduces power consumption by virtue of the structural characteristics of the two caches, which are designed inherently to help each other. The cooperative cache system is adopted as the cache structure for the CalmRISC-32 embedded processor that is going to be manufactured by Samsung Electronics Co. with 0.25 µm technology.
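A small sketch of the two structures named above, assuming an 8 Kbyte direct-mapped TOC with 8-byte blocks and an 8 Kbyte four-way SOC with 32-byte blocks. The probe-both-and-report policy is a stand-in for illustration only; the paper's cooperation and fill policy is more involved.

```python
# Sketch of the two caches named in the abstract: an 8 KB direct-mapped
# temporal-oriented cache (TOC, 8-byte blocks) and an 8 KB 4-way
# spatial-oriented cache (SOC, 32-byte blocks). Lookup only; the fill and
# cooperation policy here is a placeholder, not the paper's design.
TOC_BLOCK, TOC_SETS = 8, 1024               # 8 * 1024 = 8 KB, direct-mapped
SOC_BLOCK, SOC_SETS, SOC_WAYS = 32, 64, 4   # 32 * 64 * 4 = 8 KB, 4-way

toc = [None] * TOC_SETS
soc = [[None] * SOC_WAYS for _ in range(SOC_SETS)]

def lookup(addr):
    t_line = addr // TOC_BLOCK
    t_idx, t_tag = t_line % TOC_SETS, t_line
    s_line = addr // SOC_BLOCK
    s_idx, s_tag = s_line % SOC_SETS, s_line
    if toc[t_idx] == t_tag:
        return "TOC hit"
    if s_tag in soc[s_idx]:
        return "SOC hit"
    return "miss"

# Illustrative fill: bring the 8-byte block holding 0x40 into the TOC.
toc[(0x40 // TOC_BLOCK) % TOC_SETS] = 0x40 // TOC_BLOCK
print(lookup(0x40), lookup(0x44), lookup(0x48))  # TOC hit TOC hit miss
```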

6 citations

Proceedings ArticleDOI
11 Nov 1997
TL;DR: Addresses the use of two latency hiding techniques, prefetching and weak consistency, for large-scale shared memory multiprocessors with compiler-controlled cache coherence management, and the interaction of latency hiding techniques and network bandwidth.
Abstract: Addresses the use of two latency hiding techniques, prefetching and weak consistency, for large-scale shared memory multiprocessors with compiler-controlled cache coherence management and the interaction of latency hiding techniques and network bandwidth. The performance effect of latency hiding is evaluated and compared while varying the network channel bandwidth. The interaction of reads, writes and prefetches, given a limited bandwidth, is studied, and an approach to better network bandwidth utilization by limiting the number of outstanding requests in each node is investigated. Increasing network (channel) bandwidth helps both prefetching and non-prefetching systems, with the initial 2× increase in bandwidth giving the most improvement. The use of prefetching can deliver a much larger improvement than increasing network bandwidth for a 128-processor system for some benchmarks, even with the minimal bandwidth. Controlling bandwidth utilization is shown to be important when prefetch and write request rates are high.
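A minimal sketch of the bandwidth-control idea mentioned above, assuming a per-node counter that caps outstanding requests and defers prefetches under pressure. The cap value, class names, and deferral policy are illustrative guesses, not the paper's mechanism.

```python
# Per-node throttle on outstanding memory requests: when the cap is reached,
# prefetches are deferred so they cannot saturate the network. The cap and
# the deferral policy are assumed values for illustration.
MAX_OUTSTANDING = 4      # per-node cap (assumed)

class Node:
    def __init__(self):
        self.outstanding = 0
        self.deferred = []          # prefetches held back under pressure

    def issue(self, request, is_prefetch=False):
        """Return True if the request goes to the network now."""
        if self.outstanding >= MAX_OUTSTANDING:
            if is_prefetch:
                self.deferred.append(request)   # quietly delay the prefetch
            return False                        # demand requests stall/retry
        self.outstanding += 1
        return True

    def complete(self, _reply):
        self.outstanding -= 1                   # a reply frees one slot

node = Node()
print([node.issue(i, is_prefetch=True) for i in range(6)])
# [True, True, True, True, False, False] -- the last two are deferred
```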

6 citations

Journal ArticleDOI
TL;DR: This brief proposes to use a recent type of neural network as a novel way to implement associative memories; owing to an efficient retrieval algorithm guided by the information being searched, these networks are a good candidate for low-power associative memory.
Abstract: Traditional memories use an address to index the stored data. Associative memories rely on a different principle: part of the previously stored data is used to retrieve the remaining part. They are widely used, for instance, in network routers for packet forwarding. A classical way to implement such memories is content-addressable memory (CAM). Since its operation is fully parallel, the response is obtained in a single clock cycle. However, this comes at the cost of energy consumption. This brief proposes to use a recent type of neural network as a novel way to implement associative memories. Owing to an efficient retrieval algorithm guided by the information being searched, they are a good candidate for low-power associative memory. Compared to the CAM-based system, the analog implementation of a 12-kb neuro-inspired memory designed for 65-nm CMOS technology offers 48% energy savings.
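A software analogue of the associative retrieval being described, where part of a stored entry is presented and the memory returns the remaining part. It models neither the CAM circuit nor the proposed neural-network implementation; the forwarding-table entries below are invented for illustration.

```python
# Functional view of associative retrieval: present part of a stored word,
# get the remaining part back. Entries are invented, router-style examples.
stored = [
    ("10.0.0.0/8", "port 3"),
    ("192.168.1.0/24", "port 1"),
]

def retrieve(key_part):
    """Return the completion(s) of entries whose key matches key_part."""
    return [rest for key, rest in stored if key == key_part]

print(retrieve("192.168.1.0/24"))   # ['port 1']
```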

6 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
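As a back-of-the-envelope illustration of how hiding miss latency translates into effective CPU speedup, the following worked example uses assumed values for miss rate, miss penalty, references per instruction, and prefetch coverage, chosen only so that the result lands in the 10 to 25 percent range quoted above.

```python
# Worked example: removing miss stalls versus effective speedup.
# All parameter values below are assumptions for illustration only.
base_cpi = 1.0
miss_rate = 0.03            # misses per memory reference (assumed)
refs_per_instr = 1.3        # instruction fetch + data references (assumed)
miss_penalty = 5.0          # cycles per miss (assumed)
covered = 0.8               # fraction of miss latency hidden by prefetching

stall_cpi = miss_rate * refs_per_instr * miss_penalty
cpi_no_prefetch = base_cpi + stall_cpi
cpi_prefetch = base_cpi + (1.0 - covered) * stall_cpi
speedup = cpi_no_prefetch / cpi_prefetch
print(f"effective speedup: {speedup:.2f}x")   # about 1.15x, i.e. ~15%
```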

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
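A small sketch of what the multilevel inclusion property requires, assuming an L2 block size that is a multiple of the L1 block size: every L1-resident block must be covered by some L2-resident block. The set-of-addresses representation is invented for illustration; the paper gives conditions under which the replacement hardware preserves this invariant automatically.

```python
# Inclusion check between two cache levels with different block sizes:
# every block cached at L1 must lie inside some block cached at L2.
L1_BLOCK, L2_BLOCK = 32, 128     # bytes; L2 blocks are larger (assumed)

def inclusion_holds(l1_lines, l2_lines):
    """l1_lines / l2_lines: sets of block-aligned byte addresses."""
    l2_blocks = {addr // L2_BLOCK for addr in l2_lines}
    return all(addr // L2_BLOCK in l2_blocks for addr in l1_lines)

l2 = {0, 128, 4096}
print(inclusion_holds({0, 32, 4096}, l2))    # True: every L1 line covered
print(inclusion_holds({256}, l2))            # False: L2 lacks that block
```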

236 citations