Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: miss and victim caching place a small fully-associative cache between a cache and its refill path, and stream buffers hold prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
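
The victim-caching mechanism the abstract describes is simple enough to sketch in a few lines. Below is a minimal Python simulation of a direct-mapped cache backed by a small fully-associative victim cache; the line size, cache size, and LRU replacement for the victim entries are assumptions chosen for the sketch, not parameters taken from the paper.

    from collections import OrderedDict

    LINE = 32                 # bytes per cache line (assumed)
    SETS = 256                # lines in the direct-mapped cache (assumed)
    VICTIM_ENTRIES = 4        # the paper finds 1 to 5 entries effective

    cache = [None] * SETS     # direct-mapped: one tag per set
    victims = OrderedDict()   # small fully-associative victim cache, LRU order

    def access(addr):
        """Classify one access as 'hit', 'victim_hit', or 'miss'."""
        line = addr // LINE
        idx, tag = line % SETS, line // SETS
        if cache[idx] == tag:
            return "hit"
        evicted = cache[idx]
        if line in victims:              # conflict miss caught by the victim cache:
            del victims[line]            # only a ~one-cycle penalty
            result = "victim_hit"
        else:
            result = "miss"              # full fetch from the next level
        cache[idx] = tag
        if evicted is not None:          # save the *victim*, not the requested line
            victims[evicted * SETS + idx] = True
            if len(victims) > VICTIM_ENTRIES:
                victims.popitem(last=False)   # drop least-recently-used victim
        return result

    # Two lines that collide in the same direct-mapped set:
    a, b = 0, SETS * LINE
    print([access(x) for x in (a, b, a, b)])
    # -> ['miss', 'miss', 'victim_hit', 'victim_hit']

After the first round trip, the ping-ponging conflict misses between a and b are absorbed by the victim cache, which is exactly the behavior that lets a 1-to-5-entry victim cache remove a large share of conflict misses.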


Citations
Proceedings ArticleDOI
27 Mar 1996
TL;DR: The motivation and initial design of the Multiprocessor with Reconfigurable Parallel Hardware (MORPH) system are reported; reconfigurable logic integrated with the system core allows direct support for application-specified patterns, objects, and structures.
Abstract: Achieving 100 TeraOps performance within a ten-year horizon will require massively-parallel architectures that exploit both commodity software and hardware technology for cost efficiency. Increasing clock rates and system diameter in clock periods will make efficient management of communication and coordination increasingly critical. Configurable logic presents a unique opportunity to customize bindings, mechanisms, and policies which comprise the interaction of processing, memory, I/O and communication resources. This programming flexibility, or customizability, can provide the key to achieving robust high performance. The Multiprocessor with Reconfigurable Parallel Hardware (MORPH) uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections. This integrated configurability can improve the performance of local memory hierarchy, increase the efficiency of interprocessor coordination, or better utilize the network bisection of the machine. MORPH provides a framework for exploring such integrated application-specific customizability. Rather than complicate the situation, MORPH's configurability supports component software and interoperability frameworks, allowing direct support for application-specified patterns, objects, and structures. This paper reports the motivation and initial design of the MORPH system.

9 citations

01 Jan 2014
TL;DR: A thread group-aware stride prefetching scheme for GPUs that predicts addresses with over 99.27% accuracy and is able to improve GPU performance by 10%.
Abstract: In this paper, we propose a thread group-aware stride prefetching scheme for GPUs. GPU kernels group threads into cooperative thread arrays (CTAs). Each thread typically uses its thread index and its associated CTA index to identify the data that it operates on. The starting base address accessed by the first warp in a CTA is difficult to predict, since that starting address is a complex function of thread index and CTA index and also depends on how the programmer distributes input data across CTAs. But threads within each CTA exhibit stride accesses. Hence, if the base address of each CTA can be computed early, it is possible to accurately predict prefetch addresses for threads within a CTA. To compute the base address of each CTA, a leading warp is used from each CTA. The leading warp is executed early by pairing it with warps from currently executing leading CTA. The warps in the leading CTA are used to compute the stride value. The stride value is then combined with base addresses computed from the leading warp of each CTA to prefetch the data for all the trailing warps in the trailing CTAs. Through simple enhancements to the existing two-level scheduler, prefetches can be issued sufficiently ahead of time before the demand requests. CTA-aware prefetch predicts addresses with over 99.27% accuracy and is able to improve GPU performance by 10%.
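
Once the per-CTA base address and the common stride are known, the prediction itself is plain address arithmetic. The following Python sketch shows how prefetch addresses for trailing warps might be generated; the warp size, parameter names, and example numbers are assumptions for illustration, and the hardware tables and scheduler changes from the paper are not modeled.

    WARP_SIZE = 32  # threads per warp (typical GPU value, assumed)

    def trailing_warp_prefetches(cta_base, stride, num_warps):
        """Predict the address touched by the first thread of each
        trailing warp in a CTA: the base observed from the CTA's leading
        warp plus a whole-warp multiple of the per-thread stride."""
        return [cta_base + w * WARP_SIZE * stride for w in range(1, num_warps)]

    # Leading warp of a CTA touched 0x10000; threads stride by 4 bytes,
    # so consecutive warps are 32 * 4 = 128 bytes apart:
    print([hex(a) for a in trailing_warp_prefetches(0x10000, 4, 4)])
    # -> ['0x10080', '0x10100', '0x10180']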

9 citations

Proceedings ArticleDOI
Gi-Ho Park, Oh-Young Kwon, Tack-Don Han, Shin-Dug Kim, Sung-Bong Yang
28 Apr 1997
TL;DR: A lookahead prefetching scheme that fetches multiple blocks per prefetch request and adopts a prefetch-on-miss mechanism is proposed to achieve balanced improvement of both factors.
Abstract: A new lookahead instruction prefetching mechanism is proposed in this paper. Though significant performance improvement can be obtained by improving both the cache miss ratio and the average access time for successfully prefetched blocks, most conventional prefetching mechanisms improve only one of the two factors. To achieve balanced improvement of both factors, a lookahead prefetching scheme is proposed that fetches multiple blocks per prefetch request and adopts a prefetch-on-miss mechanism. Performance evaluation is carried out through trace-driven simulation; the proposed prefetch scheme removes 32% to 56% of the memory access delay time of a cache system that does not perform any prefetching.
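
The core of a prefetch-on-miss policy that fetches multiple blocks per request can be stated compactly. This Python fragment is a deliberately simplified sketch: cache placement is ignored, addresses are block numbers, and the lookahead degree is an assumed value rather than one from the paper.

    DEGREE = 2               # extra sequential blocks fetched per miss (assumed)
    resident = set()         # block numbers currently in the cache

    def reference(block):
        """On a miss, fetch the demanded block plus the next DEGREE
        sequential blocks (lookahead prefetch on miss)."""
        if block in resident:
            return "hit"
        for b in range(block, block + 1 + DEGREE):
            resident.add(b)  # demand fetch plus lookahead prefetches
        return "miss"

    print([reference(b) for b in (10, 11, 12, 13)])
    # -> ['miss', 'hit', 'hit', 'miss']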

9 citations

Patent
06 Nov 2000
TL;DR: The register file of a processor includes embedded operand queues whose configuration into registers and queues is defined dynamically by a computer program; the programmer determines the trade-off between the number and size of the operand queue(s) and the number of registers used for the program.
Abstract: The register file of a processor includes embedded operand queues. The configuration of the register file into registers and operand queues is defined dynamically by a computer program. The programmer determines the trade-off between the number and size of the operand queue(s) versus the number of registers used for the program. The programmer partitions a portion of the registers into one or more operand queues. A given queue occupies a consecutive set of registers, although multiple queues need not occupy consecutive registers. An additional address bit is included to distinguish operand queue addresses from register addresses. Queue state logic tracks status information for each queue, including a header pointer, tail pointer, start address, end address and number of vacancies value. The program sets the locations and depth of a given operand queue within the register file.
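
The queue bookkeeping the patent describes (head and tail pointers, start and end addresses, and a vacancy count maintained by queue state logic) maps naturally onto a ring buffer over a register slice. The Python sketch below is illustrative only; the register-file size, the r24-r31 partition, and all names are assumptions, and the extra address bit that distinguishes queue addresses from register addresses is noted but not modeled.

    class OperandQueue:
        """One operand queue embedded in registers [start, end] of the
        register file. In hardware, an extra address bit would mark an
        access as a queue reference rather than a register reference."""
        def __init__(self, regfile, start, end):
            self.rf, self.start, self.end = regfile, start, end
            self.size = end - start + 1
            self.head = self.tail = start
            self.vacancies = self.size      # tracked by queue state logic

        def enqueue(self, value):
            assert self.vacancies > 0, "queue full"
            self.rf[self.tail] = value
            self.tail = self.start + (self.tail - self.start + 1) % self.size
            self.vacancies -= 1

        def dequeue(self):
            assert self.vacancies < self.size, "queue empty"
            value = self.rf[self.head]
            self.head = self.start + (self.head - self.start + 1) % self.size
            self.vacancies += 1
            return value

    regfile = [0] * 32                  # 32-entry register file (assumed)
    q = OperandQueue(regfile, 24, 31)   # r24-r31 partitioned as a queue
    q.enqueue(42); q.enqueue(7)
    print(q.dequeue(), q.dequeue())     # -> 42 7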

9 citations

Proceedings ArticleDOI
07 Nov 2002
TL;DR: An interleaved-block order data layout to improve cache performance and an algorithm to explicitly prefetch macroblocks for motion compensation are described, which improve the performance of an already optimized software MPEG-2 decoder by about a factor of two.
Abstract: This paper shows that the performance bottleneck in software MPEG-2 video decoders has shifted to memory operations, as microprocessor technologies have been improving at a fast rate during the past few years. We exploit concurrencies between the processor and the memory sub-system at macroblock level to alleviate the performance bottleneck. First, the paper introduces an interleaved-block order data layout to improve cache performance. Second, the paper describes an algorithm to explicitly prefetch macroblocks for motion compensation. Finally, the paper presents an algorithm to schedule interleaved decoding and output at macroblock level. Our implementation and experiments show that these methods successfully hide the latency of memory and frame buffer. These techniques improve the performance of an already optimized software MPEG-2 decoder by about a factor of two. On a 933 MHz Pentium III PC, the decoder can play 720p HDTV streams at over 62 frames per second.
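
The layout idea is that the pixels of one macroblock should be contiguous in memory, so a macroblock spans a handful of consecutive cache lines instead of sixteen scattered ones. Here is a hedged sketch of the address arithmetic for such a block-interleaved layout; the paper's exact interleaving may differ, and the frame width and one-byte pixels are assumed for the example.

    MB = 16  # MPEG-2 macroblock is 16x16 pixels

    def raster_offset(x, y, width):
        """Conventional raster layout: vertically adjacent pixels of a
        macroblock are a full frame row apart."""
        return y * width + x

    def interleaved_offset(x, y, width):
        """Block-interleaved layout: each macroblock's 256 pixels are
        stored contiguously, row by row within the block."""
        mbs_per_row = width // MB
        mb_index = (y // MB) * mbs_per_row + (x // MB)
        return mb_index * MB * MB + (y % MB) * MB + (x % MB)

    # Pixels (0,0) and (0,1) of the same macroblock in a 720-pixel-wide frame:
    print(raster_offset(0, 1, 720))       # -> 720 (a whole frame row away)
    print(interleaved_offset(0, 1, 720))  # -> 16  (same block, 16 bytes away)

With the interleaved layout, a motion-compensation reference block can also be fetched as a short run of consecutive lines, which is what makes the explicit macroblock prefetch described in the paper practical.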

9 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
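
Of the design axes the survey enumerates, the store-through versus copy-back choice lends itself to a quick worked comparison. The sketch below is a back-of-the-envelope model under assumed sizes (4-byte stores, one writeback per dirty line), not a result from the survey.

    def write_traffic(stores_per_line, lines_written, line_size):
        """Bytes sent to main memory under each policy: store-through
        forwards every store; copy-back writes each dirty line back
        once, on eviction."""
        store_through = stores_per_line * lines_written * 4   # 4-byte stores
        copy_back = lines_written * line_size
        return store_through, copy_back

    # 8 stores to each of 100 lines of 32 bytes: the policies tie.
    print(write_traffic(8, 100, 32))   # -> (3200, 3200)
    # At 16 stores per line, copy-back halves the write traffic.
    print(write_traffic(16, 100, 32))  # -> (6400, 3200)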

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism, and the average degree of superpipelining metric is introduced; the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
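
The metric can be illustrated with simple arithmetic. Our reading of the average degree of superpipelining is a frequency-weighted mean operation latency; the formula as written here and the instruction mix below are assumptions for illustration, not figures quoted from the paper.

    def avg_degree_of_superpipelining(op_mix):
        """op_mix: list of (frequency, latency_in_cycles) pairs whose
        frequencies sum to 1.0. A machine whose weighted latency exceeds
        1 is already 'superpipelined' even at one issue per cycle."""
        assert abs(sum(f for f, _ in op_mix) - 1.0) < 1e-9
        return sum(f * lat for f, lat in op_mix)

    # Assumed mix: 40% ALU (1 cycle), 30% loads (2), 20% branches (2), 10% FP (3)
    print(avg_degree_of_superpipelining([(0.4, 1), (0.3, 2), (0.2, 2), (0.1, 3)]))
    # -> 1.7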

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
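
Figures in the 10 to 25 percent range fall out of ordinary average-memory-access-time arithmetic. The following worked example uses assumed parameters, not the paper's measurements.

    def effective_speedup(miss_rate, miss_penalty, coverage):
        """Simple CPI model: base CPI of 1 plus memory stall cycles.
        'coverage' is the fraction of misses removed by prefetching."""
        cpi_before = 1 + miss_rate * miss_penalty
        cpi_after = 1 + miss_rate * (1 - coverage) * miss_penalty
        return cpi_before / cpi_after

    # 5% miss rate, 10-cycle penalty, prefetching removes 60% of misses:
    print(round(effective_speedup(0.05, 10, 0.6), 3))  # -> 1.25, i.e. 25% faster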

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
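
The coherence simplification rests on one invariant: every line resident in the first-level cache is also resident in the second level, so bus snooping can stop at L2. A minimal sketch of preserving that invariant by back-invalidation on L2 evictions follows; set indexing and block sizes are abstracted away, and the replacement stand-in is an assumption, so this is not the paper's mechanism in detail.

    l1, l2 = set(), set()   # resident line numbers at each cache level
    L2_CAPACITY = 4         # tiny, for illustration

    def fill(line):
        """Bring a line into both levels. When L2 must evict, the evicted
        line is also invalidated in L1 (back-invalidation), keeping L1's
        contents a subset of L2's."""
        if line not in l2 and len(l2) >= L2_CAPACITY:
            evicted = next(iter(l2))   # stand-in for a real replacement policy
            l2.discard(evicted)
            l1.discard(evicted)        # the step that enforces inclusion
        l2.add(line)
        l1.add(line)
        assert l1 <= l2                # the inclusion property holds

    for ln in range(6):
        fill(ln)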

236 citations