Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache placed between a cache and its refill path (miss and victim caching), and stream buffers whose prefetched data is placed in the buffer and not in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.

Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.

Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.

Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.

Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
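As a rough illustration of the miss/victim caching scheme the abstract describes, here is a minimal Python sketch of a direct-mapped L1 backed by a tiny fully-associative victim cache: on an L1 miss that hits in the victim cache, the requested line is promoted and the displaced L1 line becomes the new victim. The sizes, names, and byte-address interface are illustrative assumptions, and stream buffers are not modeled.

    from collections import OrderedDict

    class VictimCachedL1:
        """Direct-mapped L1 plus a small fully-associative victim cache (sketch)."""
        def __init__(self, n_sets=256, line_size=32, victim_entries=4):
            self.n_sets = n_sets
            self.line_size = line_size
            self.lines = [None] * n_sets         # resident line address per set
            self.victim = OrderedDict()          # line address -> True, in LRU order
            self.victim_entries = victim_entries

        def access(self, addr):
            """Classify a byte-address access as 'hit', 'victim_hit', or 'miss'."""
            line = addr // self.line_size
            idx = line % self.n_sets
            if self.lines[idx] == line:
                return 'hit'
            if line in self.victim:
                # One-cycle penalty case: swap the victim entry with the L1 line.
                self.victim.pop(line)
                if self.lines[idx] is not None:
                    self._insert_victim(self.lines[idx])
                self.lines[idx] = line
                return 'victim_hit'
            # Full miss: refill from the next level; displaced line becomes a victim.
            if self.lines[idx] is not None:
                self._insert_victim(self.lines[idx])
            self.lines[idx] = line
            return 'miss'

        def _insert_victim(self, line):
            self.victim[line] = True
            self.victim.move_to_end(line)
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # evict the LRU victim entry

Two addresses that map to the same set, accessed alternately, would thrash a plain direct-mapped cache but hit in the victim cache here after the first round trip.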


Citations
Proceedings ArticleDOI
07 Jul 2010
TL;DR: It is shown that when not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch for as many references as possible than to use a fixed prefetch distance and skip prefetching for some references, as current approaches do.
Abstract: Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs), where per-core power and area budgets favor lighter cores with fewer resources. Support for hardware and software prefetch increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported, and optimized for it. We show that in situations where not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch for as many references as possible, rather than use a fixed prefetch distance and skip prefetching for some references, as in current approaches. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% and the state-of-the-art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show improvements of up to 24.61%.
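A minimal sketch of the trade-off this abstract argues for: given a hardware cap on outstanding prefetches, shrink the prefetch distance so that every reference in the loop is still prefetched, instead of keeping a fixed distance and dropping some references. The linear in-flight model and all names below are illustrative assumptions, not the published RAP algorithm.

    def plan_prefetch_distance(n_refs, max_outstanding, preferred_distance):
        """Pick one prefetch distance covering all n_refs loop references.

        Simplifying assumption: with distance d, each reference stream keeps
        about d prefetches in flight, so total in-flight ~= n_refs * d.
        """
        if n_refs == 0:
            return 0
        d = min(preferred_distance, max_outstanding // n_refs)
        return max(d, 1)  # always stay at least one iteration ahead

    # 8 references, 16 prefetch slots: the distance shrinks to 2 and all 8
    # references are covered; a fixed distance of 8 would cover only 2.
    print(plan_prefetch_distance(8, 16, 8))  # -> 2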

16 citations

Proceedings ArticleDOI
07 Oct 2012
TL;DR: This paper proposes new L0 data cache organizations using the assumption that an L0 hit/miss determination can be completed prior to the L1 access, which is a realistic assumption for very small L0 caches that can nevertheless deliver significant miss rate and/or energy reduction.
Abstract: Level-0 (L0) caches have been proposed in the past as an inexpensive way to improve performance and reduce energy consumption in resource-constrained embedded processors. This paper proposes new L0 data cache organizations using the assumption that an L0 hit/miss determination can be completed prior to the L1 access. This is a realistic assumption for very small L0 caches, which can nevertheless deliver significant miss rate and/or energy reductions. The key issue for such caches is how and when to move data between the L0 and L1 caches. The first new cache, a flow cache, targets conflict-miss reduction in a direct-mapped L1 cache. It offers a simpler hardware design and uses on average 10% less dynamic energy than the victim cache, with nearly identical performance. The second new cache, a hit cache, reduces the dynamic energy consumption in a set-associative L1 cache by 30% without impacting performance. A variant of this policy reduces the dynamic energy consumption by up to 50%, with a 5% performance degradation.
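The enabling assumption here is that the tiny L0 can report hit/miss before the L1 access begins, so the L1 is probed (and spends energy) only on an L0 miss. Below is a hedged sketch of that serial lookup; the refill-on-miss rule is just one illustrative movement policy, since how and when to move data between L0 and L1 is exactly the design point the paper varies.

    class TinyL0:
        """Serial L0 -> L1 lookup: L1 is accessed only after an L0 miss."""
        def __init__(self, l0_lines=8, line_size=32):
            self.l0 = [None] * l0_lines   # direct-mapped L0, line address per slot
            self.line_size = line_size

        def access(self, addr, l1_lookup):
            line = addr // self.line_size
            slot = line % len(self.l0)
            if self.l0[slot] == line:
                return 'l0_hit'           # L1 never probed: energy saved
            result = l1_lookup(line)      # caller's L1 returns 'l1_hit' or 'l1_miss'
            self.l0[slot] = line          # illustrative refill-on-miss policy
            return result

    # cache = TinyL0(); cache.access(0x1000, l1_lookup=lambda line: 'l1_hit')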

16 citations

Patent
03 Oct 2002
TL;DR: In this article, a directory maintains status information over memory blocks in a shared memory computer system and is organized into a main region and a write-back region; the main region has an owner field identifying the current owner of the block, and the write-back region has a writer field identifying the last owner to have written the block back to memory.
Abstract: A directory maintains status information over memory blocks in a shared memory computer system. The directory has a plurality of entries, each corresponding to a respective block, and is organized into a main region and a write-back region. The main region has an owner field identifying the current owner of the block. The write-back region has a writer field identifying the last owner to have written the block back to memory. To write a block back to memory, the owner enters its identifier in the writer field and writes the data back to memory without checking or modifying the owner field. In response to a memory operation, if the contents of the owner field and the writer field match, memory concludes that it is itself the owner; otherwise, it concludes that the entity identified in the owner field is the owner.
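The ownership rule in this abstract is mechanical enough to state directly. The sketch below follows the text — a write-back records the writer without touching the owner field, and a later memory operation compares the two fields to decide who owns the block — while the class shape and identifier types are assumptions.

    class DirectoryEntry:
        """One directory entry: main region (owner) + write-back region (writer)."""
        def __init__(self, owner, writer=None):
            self.owner = owner     # main region: current owner of the block
            self.writer = writer   # write-back region: last owner to write back

        def write_back(self, owner_id):
            # The owner enters its identifier in the writer field and writes the
            # data back without checking or modifying the owner field.
            self.writer = owner_id

        def resolve_owner(self, memory_id):
            # Owner and writer fields match: memory itself owns the block.
            # Otherwise: the entity named in the owner field owns it.
            return memory_id if self.owner == self.writer else self.owner

    entry = DirectoryEntry(owner='P3')
    entry.write_back('P3')                      # P3 writes the block back
    assert entry.resolve_owner('MEM') == 'MEM'  # fields match: memory owns it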

16 citations

Journal ArticleDOI
TL;DR: A spatial and temporal locality-aware adaptive cache is proposed, which dynamically partitions each private last-level cache bank into a prefetch region and a victim region at runtime to exploit locality characteristics.
Abstract: The spatial and temporal locality of workloads is what allows cache designs to overcome the memory wall problem. However, the actual memory access behavior of each application can be very different, which creates opportunities for further performance improvement through different cache organizations. To address this, a spatial and temporal locality-aware adaptive cache is proposed, which dynamically partitions each private last-level cache bank into a prefetch region and a victim region at runtime to exploit locality characteristics. The prefetch region speculatively holds data blocks from subsequent addresses to exploit spatial locality, while the victim region collects data blocks evicted from the upper memory hierarchy to exploit temporal locality. Fast data prefetch with prioritized dynamic buffer management and adaptive burst-aware routing is realized in the proposed hybrid burst-support network-on-chip (HBNoC). By combining the adaptive cache partition with the HBNoC, off-chip misses and on-chip network usage are greatly reduced. Experimental results demonstrate that the proposed adaptive cache design reduces off-chip misses by up to 25% and improves performance by 11.3% on average compared with the prior design.
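One illustrative reading of the runtime partitioning this abstract describes: a bank's ways are split between a prefetch region (spatial locality) and a victim region (temporal locality), and the split is revisited periodically. The hit-counting heuristic below is an assumption made for the sketch, not the paper's algorithm.

    class AdaptiveBank:
        """LLC bank whose ways are split between prefetch and victim regions."""
        def __init__(self, total_ways=16):
            self.prefetch_ways = total_ways // 2                # speculated next lines
            self.victim_ways = total_ways - self.prefetch_ways  # evicted blocks
            self.prefetch_hits = 0
            self.victim_hits = 0

        def repartition(self):
            # Illustrative heuristic: at each epoch boundary, grow whichever
            # region earned more hits, one way at a time.
            if self.prefetch_hits > self.victim_hits and self.victim_ways > 1:
                self.prefetch_ways += 1
                self.victim_ways -= 1
            elif self.victim_hits > self.prefetch_hits and self.prefetch_ways > 1:
                self.victim_ways += 1
                self.prefetch_ways -= 1
            self.prefetch_hits = self.victim_hits = 0  # start a new epoch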

15 citations

01 Jan 2007
TL;DR: It is shown that implicit operating system awareness within a VMM can be used to implement a variety of useful applications like sophisticated I/O scheduling, flexible memory management, efficient caching, and reliable security monitoring that significantly enhance the value of the virtualization layer.
Abstract: Commodity server and desktop computer systems have become powerful enough in recent years to profitably make use of system virtualization technology. System software vendors are enthusiastically embracing system virtualization to address some of the key issues facing today's enterprises like manageability, rapid service deployment, and disaster recovery. Widespread adoption of virtualization has a disruptive influence on system organization. In a virtualized environment, the virtual machine monitor (VMM) supplants the operating system as the primary resource manager. When a virtualization layer is present, certain system features like resource scheduling, cache management, and security monitoring can often be implemented most naturally within the VMM. While a VMM understands and controls system hardware resources, it currently knows very little about the high-level software abstractions implemented within guest operating systems, a fact referred to as the "semantic gap". Information pertaining to OS constructs like processes, threads, users, and caches is often useful, however, when implementing services at the VMM layer. Hence, researchers have invented ways of directly exporting relevant information from the operating system to an underlying VMM. This direct approach, while effective, has some important drawbacks. For example, it leads to close coupling between VMM-layer services and specific OS vendors and versions, reducing the applicability of services and complicating deployment and management. We have invented and implemented techniques that can be used by a VMM to infer useful information about selected operating system abstractions and achieve a level of implicit operating system awareness. Our approach uses observation of architectural events and the fact that modern operating systems share many basic features and responsibilities. This dissertation describes our techniques in detail and presents the results of a careful experimental evaluation of them. Using case studies, we show that implicit operating system awareness within a VMM can be used to implement a variety of useful applications like sophisticated I/O scheduling, flexible memory management, efficient caching, and reliable security monitoring that significantly enhance the value of the virtualization layer.

15 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
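A one-line worked check of the rough-equivalence claim: issuing n instructions per cycle (superscalar) and cutting the cycle time by a factor of n (superpipelined) yield the same peak throughput. The numbers below are illustrative, not from the paper.

    cycle_ns = 10.0
    superscalar_rate = 2 / cycle_ns            # 2 instructions per 10 ns cycle
    superpipelined_rate = 1 / (cycle_ns / 2)   # 1 instruction per 5 ns cycle
    assert superscalar_rate == superpipelined_rate  # 0.2 instructions/ns each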

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
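The 10 to 25 percent figure is a claim about average memory access time, and a back-of-the-envelope CPI calculation shows how numbers in that range arise. The miss rate, penalty, and coverage values below are assumptions for illustration, not figures from the paper.

    def prefetch_speedup(base_cpi, misses_per_instr, miss_penalty, coverage):
        """Speedup when a fraction `coverage` of miss-penalty cycles is hidden."""
        cpi_without = base_cpi + misses_per_instr * miss_penalty
        cpi_with = base_cpi + misses_per_instr * miss_penalty * (1 - coverage)
        return cpi_without / cpi_with

    # Base CPI 1.0, 2 misses per 100 instructions, 10-cycle penalty, 70% of
    # the penalty hidden by prefetching -> about a 13% effective speedup.
    print(round(prefetch_speedup(1.0, 0.02, 10, 0.70), 3))  # -> 1.132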

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
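For reference, the inclusion property itself (as opposed to the paper's conditions for imposing it) says that every block resident in an upper-level cache must also be resident in the level below it. A minimal checker, representing each cache as a set of block addresses — an assumption made for the sketch:

    def inclusion_holds(l1_blocks: set, l2_blocks: set) -> bool:
        """Multilevel inclusion: everything in L1 must also be present in L2."""
        return l1_blocks <= l2_blocks

    assert inclusion_holds({0x40, 0x80}, {0x40, 0x80, 0xC0})
    assert not inclusion_holds({0x40, 0x100}, {0x40, 0x80})

With different block sizes at different levels, which the paper allows, the check would instead map each L1 block to its enclosing L2 block before testing membership.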

236 citations