Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache (a miss or victim cache) is placed between a cache and its refill path, and stream buffers place prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
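To make the victim-caching idea concrete, here is a minimal Python sketch (not from the paper) of a direct-mapped cache whose displaced lines fall into a small fully-associative buffer; the class and parameter names (VictimCachedL1, victim_entries) are illustrative assumptions, and addresses are treated as line-aligned line numbers.

```python
from collections import OrderedDict

class VictimCachedL1:
    """Minimal model of a direct-mapped L1 backed by a small fully-associative victim cache."""

    def __init__(self, num_sets, victim_entries=4):
        self.num_sets = num_sets
        self.l1 = [None] * num_sets          # one line, stored by full line address, per set
        self.victim = OrderedDict()          # line address -> True, kept in LRU order
        self.victim_entries = victim_entries

    def access(self, line_addr):
        """Return 'hit', 'victim_hit', or 'miss' for one reference."""
        index = line_addr % self.num_sets
        if self.l1[index] == line_addr:
            return "hit"
        evicted = self.l1[index]
        self.l1[index] = line_addr
        if line_addr in self.victim:
            # Conflict miss caught by the victim cache: swap the two lines.
            del self.victim[line_addr]
            if evicted is not None:
                self._insert_victim(evicted)
            return "victim_hit"              # roughly a one-cycle penalty in the paper
        if evicted is not None:
            self._insert_victim(evicted)     # the displaced line becomes the new victim
        return "miss"                        # refill from the next level

    def _insert_victim(self, line_addr):
        self.victim[line_addr] = True
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)  # drop the oldest victim entry
```

Replaying a reference trace through two such objects, one with victim_entries=0 and one with a handful of entries, shows the qualitative effect described above: many conflict misses of the direct-mapped cache turn into cheap victim-cache hits.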


Citations
DOI
01 Jan 2015
TL;DR: This dissertation uses CCWS to demonstrate that improvements to the hardware thread scheduling policy in massively multithreaded systems offer a promising new design space to explore in locality management, and demonstrates that DAWS can effectively shift the burden of locality management from software to hardware by increasing the performance of simpler GPUs.
Abstract: Massively parallel processing devices, like Graphics Processing Units (GPUs), have the ability to accelerate highly parallel workloads in an energy-efficient manner. However, executing irregular or less tuned workloads poses performance and energy-efficiency challenges on contemporary GPUs. These inefficiencies come from two primary sources: ineffective management of locality and decreased functional unit utilization. To reduce these effects, GPU programmers are encouraged to restructure their code to fit the underlying hardware architecture, which affects the portability of their code and complicates the GPU programming process. This dissertation proposes three novel GPU microarchitecture enhancements for mitigating both the locality and utilization problems on an important class of irregular GPU applications. The first mechanism, Cache-Conscious Warp Scheduling (CCWS), is an adaptive hardware mechanism that makes use of a novel locality detector to capture memory reference locality that is lost by other schedulers due to excessive contention for cache capacity. On cache-sensitive, irregular GPU workloads, CCWS provides a 63% speedup over previous scheduling techniques. This dissertation uses CCWS to demonstrate that improvements to the hardware thread scheduling policy in massively multithreaded systems offer a promising new design space to explore in locality management. The second mechanism, Divergence-Aware Warp Scheduling (DAWS), introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture locality in loops. We demonstrate that the predictive, pre-emptive nature of DAWS can provide an additional 26% performance improvement over CCWS. This dissertation also demonstrates that DAWS can effectively shift the burden of locality management from software to hardware by increasing the performance of simpler GPUs.
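As a rough illustration of the throttling idea behind cache-conscious scheduling, here is a simplified Python sketch, not the dissertation's actual microarchitecture, in which a per-warp lost-locality score shrinks or grows the pool of warps allowed to issue; all names and thresholds are invented for illustration.

```python
class LocalityAwareScheduler:
    """Simplified sketch of cache-conscious warp throttling (not the exact CCWS design)."""

    def __init__(self, num_warps, base_active=8, min_active=2):
        self.scores = [0] * num_warps      # per-warp "lost locality" score
        self.base_active = base_active
        self.min_active = min_active
        self.active = base_active          # warps currently allowed to issue

    def on_lost_locality(self, warp_id):
        # A miss whose data was recently evicted by other warps signals lost
        # intra-warp locality; raise this warp's score and shrink the pool.
        self.scores[warp_id] += 1
        self.active = max(self.min_active, self.active - 1)

    def on_hit(self, warp_id):
        # Decay scores on hits and slowly re-open the scheduling pool.
        self.scores[warp_id] = max(0, self.scores[warp_id] - 1)
        if all(s == 0 for s in self.scores):
            self.active = min(self.base_active, self.active + 1)

    def schedulable_warps(self):
        # Prefer warps with the most locality at stake; only `active` may issue.
        order = sorted(range(len(self.scores)), key=lambda w: -self.scores[w])
        return order[:self.active]
```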

5 citations

Dissertation
01 Jan 2010
TL;DR: This work uses a fast destructive-read DRAM macro made by Hwang et al. and eliminates its large write-back buffer by reusing the processor's existing cache, at the cost of potential memory bank congestion.
Abstract: Much of the chip area and power in a modern processor are used by mechanisms that compensate for slow main memory, such as caches, out-of-order execution and prefetching. We attack this problem by utilizing a DRAM macro made by Hwang et al. that is faster than conventional DRAM macros, but does not conserve data in the DRAM cells after reading. Their prototype included a large write-back buffer for conserving data without degrading performance of read accesses. We eliminate this buffer by utilizing the already existing cache in processor designs, at the cost of potential memory bank congestion. Two implementable and one theoretical upper-bound scheme for cache write-back are evaluated. We find that the size of the cache can be greatly reduced without degrading performance when utilizing destructive-read DRAM. The large write-back buffer can be omitted when destructive-read DRAM is used with a processor with a cache, without significant degradation of performance.
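The core bookkeeping difference can be sketched in a few lines of Python: with destructive-read DRAM every line brought into the cache must eventually be restored, whereas a conventional copy-back cache only writes back dirty lines. The class below is an illustrative assumption, not the thesis's simulator.

```python
from collections import OrderedDict

class WriteBackCache:
    """Sketch contrasting conventional vs destructive-read DRAM behind an LRU cache."""

    def __init__(self, capacity_lines, destructive=True):
        self.lines = OrderedDict()        # line_addr -> needs_writeback
        self.capacity = capacity_lines
        self.destructive = destructive
        self.writebacks = 0

    def access(self, line_addr, is_store=False):
        if line_addr in self.lines:
            self.lines.move_to_end(line_addr)
            if is_store:
                self.lines[line_addr] = True
            return "hit"
        # On a fill, a destructive read leaves the cache holding the only copy,
        # so the line must be restored to DRAM when it is evicted.
        self.lines[line_addr] = self.destructive or is_store
        if len(self.lines) > self.capacity:
            _, needs_wb = self.lines.popitem(last=False)
            if needs_wb:
                self.writebacks += 1      # each write-back occupies a DRAM bank
        return "miss"
```

Comparing `writebacks` for `destructive=True` and `destructive=False` on the same trace gives a rough feel for the extra bank traffic the existing cache must absorb once the dedicated write-back buffer is removed.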

5 citations

Book ChapterDOI
13 Sep 2006
TL;DR: In this paper, a Data Parallel Coprocessor architecture based on SDTA is studied, and architecture decisions are made for the best performance/cost ratio across many embedded media processing applications.
Abstract: The wide adoption of media processing applications poses great challenges to high-performance embedded processor design. This paper studies a Data Parallel Coprocessor architecture based on SDTA, and architecture decisions are made for the best performance/cost ratio. Experimental results on a prototype show that SDTA delivers high performance on many embedded media processing applications. The simplicity and flexibility of SDTA encourage further development of its reconfigurable functionality.

5 citations

24 Aug 1994
TL;DR: This paper analyzes the tradeoffs involved in designing a multiprocessor system on a single chip, addresses whether to allocate available chip area to larger caches or to larger numbers of processors, and shows that adding processors at the expense of cache size improves performance up to a point.
Abstract: By the end of the decade, as VLSI integration levels continue to increase, building a multiprocessor system on a single chip will become feasible. In this paper, we propose to analyze the tradeoffs involved in designing such a chip, and specifically address whether to allocate available chip area to larger caches or to large numbers of processors. Using the dimensions of the Alpha 21064 microprocessor as a basis, we determine several candidate configurations which vary in cache size and number of processors, and evaluate them in terms of both processing power and cycle time. We then investigate fine-tuning the architecture in order to further improve performance, by trading off the number of processors for a larger TLB size. Our results show that for a coarse-grain execution environment, adding processors at the expense of cache size improves performance up to a point. We then show that increasing TLB size at the expense of the number of processors can further improve performance.
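A back-of-the-envelope version of this area tradeoff can be written in a few lines of Python; the die, core, and SRAM area figures below are invented placeholders, not the Alpha 21064 dimensions used in the paper.

```python
# Hypothetical area-budget sketch of the cache-vs-cores tradeoff discussed above.
# All numbers are illustrative assumptions.
DIE_AREA_MM2 = 200.0
CORE_AREA_MM2 = 25.0          # assumed area of one simple core
CACHE_AREA_PER_KB_MM2 = 0.5   # assumed SRAM density

for cores in range(1, 9):
    remaining = DIE_AREA_MM2 - cores * CORE_AREA_MM2
    if remaining < 0:
        break
    cache_kb_per_core = remaining / CACHE_AREA_PER_KB_MM2 / cores
    print(f"{cores} cores -> about {cache_kb_per_core:.0f} KB of cache per core")
```

The loop makes the paper's question explicit: every added core shrinks the per-core cache budget, so the win from extra parallelism must outweigh the extra misses.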

5 citations

01 Jan 2007
TL;DR: This paper proposes a prefetching framework where an L1-level prefetcher and an L2-level prefetcher work cooperatively to reduce the average access time more than either one alone can, and proposes a confidence-based adaptive prefetcher that can improve prefetch efficiency significantly with judicious use of available bus bandwidth.
Abstract: Data prefetching has been shown to be an effective tool in hiding part of the latency associated with cache misses in modern processors. Traditionally, data prefetchers fetch data into a small prefetch buffer near the L1 for low latency, or the L2 cache for greater coverage and less cache pollution. However, with the L1–L2 cache speed gap growing, significant performance gains can be obtained if the data prefetcher can operate as aggressively as an L2-level prefetcher but with the fast hit times of an L1-level prefetcher. In this paper, we propose a prefetching framework where an L1-level prefetcher and an L2-level prefetcher work cooperatively to reduce the average access time more than either one alone can. We evaluate several design alternatives suited to perform synergistically under different workloads. From the insight we gather from this analysis, we propose a confidence-based adaptive prefetcher that can improve prefetch efficiency significantly with judicious use of available bus bandwidth. Our results show that for certain prefetcher combinations, two-level prefetching can achieve the cumulative speedup attained from either prefetcher alone. Furthermore, when compared to other two-level prefetching models, the adaptive design provides similar speedups with appreciably less bus traffic.
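To illustrate the confidence-based throttling mentioned above, here is a small Python sketch, an assumption-laden simplification rather than the paper's design, in which prefetches for the next few lines are issued only while measured accuracy and spare bus bandwidth remain acceptable.

```python
class AdaptivePrefetcher:
    """Sketch of confidence-based prefetch throttling (illustrative, not the paper's design)."""

    def __init__(self, degree=4, min_confidence=0.5):
        self.degree = degree              # lines prefetched per triggering miss
        self.min_confidence = min_confidence
        self.issued = 0
        self.useful = 0

    def confidence(self):
        return self.useful / self.issued if self.issued else 1.0

    def on_miss(self, line_addr, bus_busy_fraction):
        """Return the list of line addresses to prefetch for this miss."""
        if self.confidence() < self.min_confidence or bus_busy_fraction > 0.8:
            return []                     # back off: low accuracy or scarce bandwidth
        self.issued += self.degree
        return [line_addr + i for i in range(1, self.degree + 1)]

    def on_prefetch_hit(self):
        self.useful += 1                  # a demand access consumed a prefetched line
```

The same counters could drive either an L1-level or an L2-level prefetcher, which is the sense in which the two can be tuned to cooperate rather than compete for bus bandwidth.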

5 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
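One of the design parameters listed above, store-through versus copy-back updating of main memory, can be illustrated with a short Python sketch that counts line transfers to memory for a reference trace; the function name and trace format are hypothetical and not taken from the survey.

```python
from collections import OrderedDict

def memory_traffic(trace, capacity_lines, copy_back=True):
    """Count line transfers to main memory under copy-back vs store-through.
    `trace` is a list of (line_addr, is_store) pairs; the cache is LRU."""
    lines = OrderedDict()                 # line_addr -> dirty
    traffic = 0
    for line_addr, is_store in trace:
        if line_addr not in lines:
            traffic += 1                  # fetch the missing line
            lines[line_addr] = False
            if len(lines) > capacity_lines:
                _, dirty = lines.popitem(last=False)
                if copy_back and dirty:
                    traffic += 1          # write back the dirty victim
        else:
            lines.move_to_end(line_addr)
        if is_store:
            if copy_back:
                lines[line_addr] = True   # defer the write until eviction
            else:
                traffic += 1              # store-through: every store reaches memory

    return traffic

# e.g. memory_traffic([(0, False), (0, True), (1, True), (0, True)], capacity_lines=1)
```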

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, and concludes that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism, and the average degree of superpipelining metric is introduced; the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
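The metric itself is not spelled out in this summary; one plausible reading, offered here purely as an assumption, is a frequency-weighted average of operation latencies in cycles, which the following snippet computes for an invented instruction mix.

```python
# Hedged sketch: one way to compute an "average degree of superpipelining"-style
# figure, assuming it is the frequency-weighted mean operation latency in cycles.
# The instruction mix and latencies below are invented for illustration.
mix = {
    "integer_alu": (0.45, 1),   # (fraction of dynamic instructions, latency in cycles)
    "load":        (0.25, 2),
    "store":       (0.10, 1),
    "branch":      (0.10, 2),
    "fp_op":       (0.10, 4),
}

average_degree = sum(freq * latency for freq, latency in mix.values())
print(f"assumed average degree of superpipelining: {average_degree:.2f}")
```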

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
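A compact way to state the inclusion property is as a containment check between the two levels' contents; the Python sketch below models cached blocks as integer addresses and lets the L2 block size be a multiple of the L1 block size. The function and variable names are illustrative, not from the paper.

```python
def inclusion_holds(l1_lines, l2_blocks, l2_block_size, l1_block_size=1):
    """Multilevel inclusion: every block cached in L1 must also reside in L2.
    Blocks are modelled as integer addresses; L2 blocks may be larger than L1 blocks."""
    scale = l2_block_size // l1_block_size
    # Map each L1 line to its enclosing L2 block address and check membership.
    return all((addr // scale) in l2_blocks for addr in l1_lines)

# Example: L2 blocks twice the size of L1 blocks.
l1 = {10, 11, 40}
l2 = {5, 20}          # L2 block 5 covers L1 lines 10-11, block 20 covers lines 40-41
print(inclusion_holds(l1, l2, l2_block_size=2))   # True
```

When this predicate is maintained as an invariant, an invalidation seen by L2 is guaranteed to cover anything L1 might hold, which is why inclusion simplifies coherence in the hierarchies the paper examines.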

236 citations