Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990, Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path (miss and victim caching), and prefetched data is placed in stream buffers rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
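The three mechanisms described in the abstract are simple enough to capture in a few lines. The following Python fragment is a minimal illustrative model (not taken from the paper) of a direct-mapped L1 cache backed by a small fully-associative victim cache and a single sequential stream buffer; the class and method names, the FIFO replacement, and the choice to push lines displaced by stream-buffer fills into the victim cache are assumptions made for brevity, and real hardware probes these structures in parallel with the L1 lookup rather than sequentially.

```python
# Hedged sketch: direct-mapped cache + victim cache + one stream buffer.
# Line addresses are plain integers; every structure stores line addresses only.
from collections import deque

class VictimCachedL1:
    def __init__(self, num_sets, victim_entries=4, stream_depth=4):
        self.num_sets = num_sets
        self.sets = [None] * num_sets                 # direct-mapped: one line per set
        self.victim = deque(maxlen=victim_entries)    # small fully-associative, FIFO
        self.stream = deque(maxlen=stream_depth)      # sequential prefetch addresses

    def _index(self, line_addr):
        return line_addr % self.num_sets

    def access(self, line_addr):
        """Return 'hit', 'victim_hit', 'stream_hit', or 'miss'."""
        idx = self._index(line_addr)
        if self.sets[idx] == line_addr:
            return 'hit'
        if line_addr in self.victim:                  # victim cache: swap with the conflicting line
            self.victim.remove(line_addr)
            if self.sets[idx] is not None:
                self.victim.append(self.sets[idx])
            self.sets[idx] = line_addr
            return 'victim_hit'
        if self.stream and self.stream[0] == line_addr:   # stream buffer: head-of-queue hit
            self.stream.popleft()
            self.stream.append(line_addr + len(self.stream) + 1)  # keep prefetching ahead
            self._fill(idx, line_addr)
            return 'stream_hit'
        self._fill(idx, line_addr)                    # genuine miss: restart the stream buffer
        self.stream.clear()
        self.stream.extend(line_addr + i for i in range(1, self.stream.maxlen + 1))
        return 'miss'

    def _fill(self, idx, line_addr):
        if self.sets[idx] is not None:
            self.victim.append(self.sets[idx])        # displaced line becomes the victim
        self.sets[idx] = line_addr
```

Running an address trace through access() and counting the non-'hit' outcomes gives a rough picture of how many conflict misses the victim cache absorbs and how many sequential misses the stream buffer converts into buffer hits.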


Citations
01 Jan 2011
TL;DR: This thesis demonstrates that cache utilisation is relatively poor over a wide range of benchmarks and cache configurations, presents a variety of predictors of cache-line liveness, mostly based upon the mature field of branch prediction, and compares them against previously proposed predictors.
Abstract: Microprocessors have long employed caches to help hide the increasing latency of accessing main memory. The vast majority of previous research has focussed on increasing cache hit rates to improve cache performance, while lately decreasing power consumption has become an equally important issue. This thesis examines the lifetime of cache lines in the memory hierarchy, considering whether they are live (will be referenced again before eviction) or dead (will not be referenced again before eviction). Using these two states, the cache utilisation (proportion of the cache which will be referenced again) can be calculated. This thesis demonstrates that cache utilisation is relatively poor over a wide range of benchmarks and cache configurations. By focussing on techniques to improve cache utilisation, cache hit rates are increased while overall power consumption may also be decreased. Key to improving cache utilisation is an accurate predictor of the state of a cache line. This thesis presents a variety of such predictors, mostly based upon the mature field of branch prediction, and compares them against previously proposed predictors. The most appropriate predictors are then demonstrated in two applications:
• Improving victim cache performance through filtering
• Reducing cache pollution during aggressive prefetching
These applications are primarily concerned with improving cache performance and are analysed using a detailed microprocessor simulator. Related applications, including decreasing power consumption, are also discussed, as is the applicability of these techniques to multiprogrammed and multiprocessor systems.
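The first of the two applications listed above, filtering victim-cache insertions with a liveness predictor, can be sketched roughly as follows. The two-bit saturating counter indexed by line address is an illustrative assumption rather than the thesis's actual predictor, and training it on every lookup is a simplification.

```python
# Hedged sketch: only insert evicted lines that are predicted live into the victim cache.
from collections import OrderedDict, defaultdict

class FilteredVictimCache:
    def __init__(self, entries=8):
        self.entries = entries
        self.lines = OrderedDict()                 # line_addr -> True, oldest first (FIFO)
        self.counter = defaultdict(lambda: 1)      # assumed 2-bit "liveness" counter per line

    def predict_live(self, line_addr):
        return self.counter[line_addr] >= 2

    def insert_victim(self, line_addr):
        if not self.predict_live(line_addr):
            return                                 # predicted dead: skip, avoid pollution
        if len(self.lines) >= self.entries:
            self.lines.popitem(last=False)
        self.lines[line_addr] = True

    def lookup(self, line_addr):
        hit = self.lines.pop(line_addr, None) is not None
        c = self.counter[line_addr]                # train: a hit here means the line was live
        self.counter[line_addr] = min(3, c + 1) if hit else max(0, c - 1)
        return hit
```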

7 citations

Patent
23 Nov 2009
TL;DR: In this patent, an external agent requests data from the memory system of a computer system at a target address, and a snoop cache determines whether the target address is within an address range known to be safe for external access.
Abstract: Methods and systems for efficiently processing direct memory access requests coherently. An external agent requests data from the memory system of a computer system at a target address. A snoop cache determines if the target address is within an address range known to be safe for external access. If the snoop cache determines that the target address is safe, the external agent proceeds with the direct memory access. If the snoop cache cannot determine that the target address is safe, it forwards the request on to the processor. After the processor resolves any coherency problems between itself and the memory system, the processor signals the external agent to proceed with the direct memory access. The snoop cache can determine safe address ranges from such processor activity. The snoop cache invalidates its safe address ranges by observing traffic between the processor and the memory system.
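The decision flow described above can be sketched as follows; the page-sized grain, the structure of the range list, and the callable interfaces standing in for the processor and the memory system are assumptions made for illustration.

```python
# Hedged sketch of a snoop cache holding address ranges known to be safe for DMA.
class SnoopCache:
    def __init__(self):
        self.safe_ranges = []                       # list of (start, end) half-open ranges

    def is_safe(self, addr):
        return any(start <= addr < end for start, end in self.safe_ranges)

    def mark_safe(self, start, end):
        self.safe_ranges.append((start, end))       # learned from processor activity

    def invalidate_overlapping(self, addr):
        # Processor-to-memory traffic touching a range makes it unsafe again.
        self.safe_ranges = [(s, e) for s, e in self.safe_ranges if not (s <= addr < e)]

def dma_read(snoop, addr, read_memory, resolve_coherency, grain=4096):
    """read_memory and resolve_coherency are caller-supplied callables standing in for
    the memory system and the processor's coherence logic (hypothetical interfaces)."""
    if snoop.is_safe(addr):
        return read_memory(addr)                    # fast path: no processor involvement
    resolve_coherency(addr)                         # processor writes back / invalidates copies
    snoop.mark_safe(addr, addr + grain)             # assumed grain for the safe range
    return read_memory(addr)
```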

7 citations

Journal ArticleDOI
TL;DR: Results show that the proposed Triangular D-NUCA Cache is particularly useful in the embedded application domain, as it permits the use of a half-sized NUCA cache with performance improvements.
Abstract: Future embedded applications will require high performance processors integrating fast and low-power caches. Dynamic Non-Uniform Cache Architectures (D-NUCA) have been proposed to overcome the performance limit introduced by wire delays when designing large caches. In this paper, we propose an alternative design of the D-NUCA cache, namely the Triangular D-NUCA cache, to reduce the power consumption and silicon area occupancy of D-NUCA caches. We compare the performance of the Triangular D-NUCA cache with that achieved by the conventional rectangular organization. Results show that our approach is particularly useful in the embedded application domain, as it permits the use of a half-sized NUCA cache with performance improvements.

7 citations

Proceedings ArticleDOI
17 Feb 2010
TL;DR: This work shows how to reduce the impact of prefetching techniques in terms of power (and energy) consumption in tiled CMPs by using a heterogeneous interconnect in which low-power wires carry prefetched lines, obtaining significant energy savings.
Abstract: In recent years, high-performance processor designs have evolved toward Chip-Multiprocessor (CMP) architectures that implement multiple processing cores on a single die. As the number of cores inside a CMP increases, the on-chip interconnection network will have a significant impact on both overall performance and power consumption, as previous studies have shown. On the other hand, CMP designs are likely to be equipped with latency hiding techniques like hardware prefetching in order to reduce the negative impact on performance that high cache miss rates would otherwise cause. Unfortunately, the extra number of network messages that prefetching entails can drastically increase the amount of power consumed in the interconnect. In this work, we show how to reduce the impact of prefetching techniques in terms of power (and energy) consumption in the context of tiled CMPs. Our proposal is based on the fact that the wires used in the on-chip interconnection network can be designed with varying latency, bandwidth and power characteristics. By using a heterogeneous interconnect, where low-power wires are used for dealing with prefetched lines, significant energy savings can be obtained. Detailed simulations of a 16-core CMP show that our proposal obtains improvements of up to 30% in the power consumed by the interconnect (15-23% on average) with almost negligible cost in terms of execution time (average degradation of 2%).
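The central routing decision, sending latency-tolerant prefetch traffic over slow low-power wires, can be illustrated with a short sketch; the message fields and wire-class names below are assumptions, not the authors' implementation.

```python
# Hedged sketch: choose a wire class per message in a heterogeneous interconnect.
from dataclasses import dataclass

@dataclass
class Message:
    dest_tile: int
    payload: bytes
    is_prefetch: bool          # set when the message carries a prefetched line

def select_wire_class(msg: Message) -> str:
    # Prefetched lines are latency-tolerant: if one arrives late, the worst case is
    # an ordinary demand miss, so it can travel on slower, lower-power wires.
    return "low_power_wires" if msg.is_prefetch else "baseline_wires"
```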

7 citations

01 Jan 2000
TL;DR: To enable effective use of caches in a multithreaded environment (giving high execution speed even in the context of high memory latencies), this work proposes to use a cache architecture where the cache can be divided into partitions.
Abstract: Multithreaded architectures have been developed as a way to hide latencies in memory access, communication, and long pipelines. Caches have been developed to hide latencies and reduce memory bandwidth requirements. Caches do not work well in multithreaded environments, because threads unintentionally evict each other's data and instructions. To enable effective use of caches in a multithreaded environment (giving high execution speed even in the context of high memory latencies), we propose to use a cache architecture where the cache can be divided into partitions. Each thread is assigned a set of partitions which are used to cache a view of data structures, or part of the instruction stream. The partition assignment is completely automated in the compiler. With our compiler and architecture, all forms of interference are eliminated and predictable execution of multithreaded programs is achieved in moderately sized caches.
1 Microprocessor Caches
Currently, both uniprocessor and multiprocessor computer architectures rely heavily on the use of caches. A cache is a small area of fast memory placed between the processor and main memory. The cache aims to exploit temporal and spatial locality of the application code and its data. A limiting problem in multithreaded architectures is cache interference between threads of execution. As accesses to data objects are interleaved, several data objects (and instruction segments) may map to the same area of the cache; the threads will then compete for cache space and each thread may displace information that is useful elsewhere. This interference will result in degraded performance, a problem that is exacerbated by the increasing speed gap between memory and processor. Intuitive steps to solve this problem, such as increasing the cache size or associativity, are not always effective and seldom scale well as task complexity increases. The following two programs are a faithful example of how interference occurs. Depending on how the data objects are arranged in memory, it is possible for them to compete for cache space with data objects referenced in the same thread, and with objects referenced in other threads. As the number and complexity of threads increases, this effect becomes harder to predict and model.
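The partitioning idea can be sketched as below; the direct-mapped partitions and the address-to-partition hash are illustrative assumptions, since in the paper the partition assignment is derived by the compiler.

```python
# Hedged sketch: a cache split into partitions, each thread confined to its own set of them.
class PartitionedCache:
    def __init__(self, num_partitions, sets_per_partition):
        self.sets_per_partition = sets_per_partition
        self.lines = [[None] * sets_per_partition for _ in range(num_partitions)]
        self.owner = {}                             # thread_id -> list of partition ids

    def assign(self, thread_id, partitions):
        self.owner[thread_id] = partitions          # done by the compiler in the paper

    def access(self, thread_id, line_addr):
        parts = self.owner[thread_id]
        part = parts[line_addr % len(parts)]        # hash only across this thread's partitions
        idx = (line_addr // len(parts)) % self.sets_per_partition
        hit = self.lines[part][idx] == line_addr
        if not hit:
            self.lines[part][idx] = line_addr       # evictions never leave the thread's partitions
        return hit
```

Because a thread can only index into its own partitions, none of its accesses can evict another thread's lines, which is the interference-elimination property the abstract claims.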

7 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, mulhcache consistency, the effect of input /output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: …design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism, and the average degree of superpipelining metric is introduced; the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
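A rough model of the policy the abstract refers to, prefetching the next line on every reference, is sketched below with an assumed fully-associative FIFO cache; the cache organization and single-line lookahead are simplifying assumptions.

```python
# Hedged sketch: next-line prefetch on every reference, measured over an address trace.
from collections import deque

def run_trace(trace, cache_lines=256):
    """Return the hit ratio of a small fully-associative FIFO cache with always-prefetch."""
    cache = deque(maxlen=cache_lines)       # holds line addresses, oldest evicted first
    hits = 0
    for line_addr in trace:
        if line_addr in cache:
            hits += 1
        else:
            cache.append(line_addr)         # demand fetch on a miss
        next_line = line_addr + 1           # prefetch the successor of every reference
        if next_line not in cache:
            cache.append(next_line)
    return hits / max(1, len(trace))
```

On a purely sequential trace such as run_trace(range(1000)), only the first reference misses, which illustrates why always-prefetching helps most when accesses are dominated by sequential runs.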

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
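The bookkeeping that the inclusion property demands can be sketched as follows: whenever the second-level cache evicts a block, any first-level copy must be invalidated so that L1 contents remain a subset of L2. The classes below are a hedged illustration of that invariant, not the paper's necessary and sufficient conditions.

```python
# Hedged sketch: back-invalidation keeps an L1 inclusive within an LRU L2.
class SimpleL1:
    def __init__(self):
        self.blocks = set()                 # block addresses currently cached in L1

    def invalidate(self, block_addr):
        self.blocks.discard(block_addr)

class InclusiveL2:
    def __init__(self, l1, capacity):
        self.l1 = l1
        self.capacity = capacity
        self.blocks = []                    # LRU order, most recently used last

    def fill(self, block_addr):
        if block_addr in self.blocks:
            self.blocks.remove(block_addr)  # refresh LRU position
        elif len(self.blocks) >= self.capacity:
            victim = self.blocks.pop(0)
            self.l1.invalidate(victim)      # back-invalidate: keep L1 a subset of L2
        self.blocks.append(block_addr)
```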

236 citations