Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache placed between a cache and its refill path (miss and victim caching), and stream buffers that hold prefetched data in a buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
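A minimal Python sketch of the victim-caching idea described above (the line size, cache sizes, FIFO replacement, and trace below are illustrative assumptions, not the paper's configuration):

from collections import deque

LINE = 16          # bytes per cache line (assumed)
L1_LINES = 64      # direct-mapped L1 capacity in lines (assumed)
VICTIM_LINES = 4   # small fully-associative victim cache (assumed)

l1 = [None] * L1_LINES                  # one tag stored per direct-mapped set
victim = deque(maxlen=VICTIM_LINES)     # FIFO of evicted (tag, index) pairs

def access(addr):
    """Return 'hit', 'victim_hit', or 'miss' for a byte address."""
    line = addr // LINE
    index = line % L1_LINES
    tag = line // L1_LINES
    if l1[index] == tag:
        return "hit"
    if (tag, index) in victim:
        # Swap: promote the line back into L1, demote the current L1 resident.
        victim.remove((tag, index))
        if l1[index] is not None:
            victim.append((l1[index], index))
        l1[index] = tag
        return "victim_hit"
    # Miss: the evicted L1 line (the victim), not the requested line, goes to
    # the small fully-associative cache -- the difference from miss caching.
    if l1[index] is not None:
        victim.append((l1[index], index))
    l1[index] = tag
    return "miss"

# Two lines that map to the same direct-mapped set ping-pong without the
# victim cache; with it, only the first two references miss.
trace = [0x0000, 0x4000, 0x0000, 0x4000, 0x0000]
print([access(a) for a in trace])

Running it prints ['miss', 'miss', 'victim_hit', 'victim_hit', 'victim_hit']: the two conflicting lines would miss every time in the direct-mapped cache alone, but after the first round trips they are always caught by the small fully-associative victim cache.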


Citations
Journal ArticleDOI
TL;DR: ArchExplorer is a Web-based permanent and open design-space exploration framework that lets researchers compare their designs against others and explore the design space of an on-chip memory subsystem and a multicore processor.
Abstract: Growing architectural complexity and stringent time-to-market constraints suggest the need to move architecture design beyond parametric exploration to structural exploration. ArchExplorer is a Web-based permanent and open design-space exploration framework that lets researchers compare their designs against others. The authors demonstrate their approach by exploring the design space of an on-chip memory subsystem and a multicore processor.

9 citations

Proceedings ArticleDOI
20 Oct 2018
TL;DR: A systematic analysis shows that existing CPU optimizations targeting scientific/server workloads are not always well suited for mobile apps, and the paper proposes Critical Instruction Chains (CritICs), short, critical, and self-contained sequences of instructions, as the target of aggregate-level optimization.
Abstract: In this paper, we conduct a systematic analysis to show that existing CPU optimizations targeting scientific/server workloads are not always well suited for mobile apps. In particular, we observe that the well-known and very important concept of identifying and accelerating individual critical instructions in workloads such as SPEC is not as effective for mobile apps. Several differences in mobile app characteristics, including (i) dependencies between critical instructions interspersed with non-critical instructions in the dependence chain, (ii) temporal proximity of the critical instructions in the dynamic stream, and (iii) the bottleneck shifting to the front from the rear of the datapath pipeline, are key contributors to the ineffectiveness of traditional criticality-based optimizations. Instead, we propose the concept of Critical Instruction Chains (CritICs), which are short, critical, and self-contained sequences of instructions, as the target of aggregate-level optimization. With motivating results, we show that an offline profiler/analysis framework can easily identify these CritICs, and we propose a very simple software mechanism in the compiler that exploits ARM's 16-bit ISA format to nearly double the fetch bandwidth of these instructions. We have implemented this entire framework, both profiler and compiler passes, and evaluated its effectiveness for 10 popular apps from the Play Store. Experimental evaluations show that our approach is much more effective than two previously studied criticality optimizations, yielding a speedup of 12.65% and energy savings of 15% in the CPU (translating to a system-wide energy savings of 4.6%), while requiring very little additional hardware support.
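As a rough illustration of the offline chain-finding step described above, the sketch below groups nearby critical instructions from a profiled dynamic trace into short chains; the trace format, criticality flags, and gap/length thresholds are assumptions for the example, not the paper's actual profiler.

# Illustrative grouping of critical instructions into short chains.
MAX_GAP = 2      # max non-critical instructions tolerated inside a chain (assumed)
MAX_LEN = 6      # keep chains short (assumed)

def find_chains(trace):
    """trace: list of (pc, is_critical). Returns lists of pcs forming chains."""
    chains, current, gap = [], [], 0
    for pc, is_critical in trace:
        if is_critical:
            current.append(pc)
            gap = 0
            if len(current) >= MAX_LEN:
                chains.append(current)
                current = []
        elif current:
            gap += 1
            if gap > MAX_GAP:          # too far apart: close the current chain
                if len(current) > 1:
                    chains.append(current)
                current, gap = [], 0
    if len(current) > 1:
        chains.append(current)
    return chains

trace = [(0x100, True), (0x104, True), (0x108, False), (0x10c, True),
         (0x200, False), (0x204, False), (0x208, False), (0x300, True)]
print(find_chains(trace))   # one chain: the three nearby critical pcs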

9 citations

Proceedings ArticleDOI
24 Jan 2011
TL;DR: Cache Equalizer, a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs), decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference.
Abstract: This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in the usage of cache sets. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at the granularity of groups of cache sets, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. Simulation results using a full-system simulator demonstrate that CE achieves an average L2 miss rate reduction of 13.6% over a shared NUCA scheme, and of as much as 46.7%, for the benchmark programs we examined. Furthermore, evaluations showed that CE outperforms related cache designs.
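A minimal sketch of the pressure-guided placement idea (the per-group access counter used as the "pressure" statistic, the group count, and the epoch aging are assumptions for illustration, not the paper's mechanism):

NUM_GROUPS = 8

pressure = [0] * NUM_GROUPS          # accesses observed per group this epoch
placement = {}                       # block address -> group chosen for it

def record_access(group):
    pressure[group] += 1

def place_block(addr):
    """Steer an incoming block to the group currently showing the least pressure."""
    group = min(range(NUM_GROUPS), key=lambda g: pressure[g])
    placement[addr] = group          # location is decoupled from the address
    record_access(group)
    return group

def end_epoch():
    """Periodically age the counters, roughly mimicking periodic recording."""
    for g in range(NUM_GROUPS):
        pressure[g] //= 2

# Example: heavy traffic to group 0, so a new block is steered elsewhere.
for _ in range(10):
    record_access(0)
print(place_block(0xdeadbeef))   # prints a group other than the pressured group 0
end_epoch()                      # stale pressure fades before the next epoch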

9 citations

Dissertation
01 Jan 1995
TL;DR: This work will examine the issues confronting the designer of floating-point units for high-performance microprocessors, specifically by determining the mechanisms through which a floating-point unit can stall instruction execution, and by describing the implementation and verification of a GaAs floating-point design.
Abstract: This work examines the issues confronting the designer of floating-point units for high-performance microprocessors. Sophisticated hardware coprocessors for floating-point arithmetic have been pursued primarily within the past decade. The development of these coprocessors parallels that of integer processors; initially simple designs were altered to satisfy the demand for increased performance. Architectural optimizations and technology improvements have had the greatest effect on performance. This work will examine these issues specifically by determining the mechanisms through which a floating-point unit can stall instruction execution, and by describing the implementation and verification of a GaAs floating-point design. This work represents a unique, comprehensive, and accessible study of important issues for supporting high-performance floating-point execution. The culmination of this work has been the design of an IEEE-754 compliant double precision floating-point unit; the chip was designed in a 1.0 micrometer GaAs direct-coupled FET logic process. Most of the conclusions regarding architectural optimizations are independent of technology, though a number of trade-offs in the design were made within the constraints of integration levels, fanin, fanout, logic topologies, speed, and power of GaAs direct-coupled FET logic. The final FPU achieves a high level of performance that exceeds many current leading commercial processors.

9 citations

Proceedings ArticleDOI
17 Jul 2006
TL;DR: This paper proposes an instruction cache architecture, a modification of the hotspot architecture, that has both a faster access time and lower energy consumption compared to both the filter cache and the hotspot cache architectures.
Abstract: The cache memory plays a crucial role in the performance of any processor. The cache memory (SRAM), especially the on-chip cache, is 3-4 times faster than the main memory (DRAM); it can vastly improve processor performance and speed. The cache also consumes much less energy than the main memory, which leads to large power savings that are very important for embedded applications. In today's processors, although the cache memory reduces the overall energy consumption, the on-chip cache itself accounts for almost 40% of the processor's total energy consumption. In this paper, we propose an instruction cache architecture that is a modification of the hotspot architecture. Our proposed architecture consists of a small filter cache in parallel with the hotspot cache, between the L1 cache and the main memory. The small filter cache holds the code that was not captured by the hotspot cache. We also propose a prediction mechanism to steer each memory access to either the hotspot cache, the filter cache, or the L1 cache. Our design has both a faster access time and lower energy consumption compared to both the filter cache and the hotspot cache architectures. We use the Mibench and Mediabench benchmarks, together with the simplescalar simulator, to evaluate the performance of our proposed architecture and compare it with the filter cache and the hotspot cache architectures. The simulation results show that our design outperforms both the filter cache and the hotspot cache in both average memory access time and energy consumption.
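The steering idea can be pictured with a small sketch; the predictor below (a per-fetch-block saturating counter of recent hotspot hits, with hypothetical threshold values) is an assumption for illustration and not the prediction mechanism proposed in the paper.

from collections import defaultdict

hotspot_hits = defaultdict(int)   # fetch-block address -> saturating hit counter
THRESHOLD = 2                     # assumed promotion threshold

def predict(block_addr, recently_filtered):
    """Decide which structure to probe first for this instruction fetch."""
    if hotspot_hits[block_addr] >= THRESHOLD:
        return "hotspot"          # block looks like frequently executed (hot) code
    if recently_filtered:
        return "filter"           # cold code the filter cache captured recently
    return "l1"

def train(block_addr, hit_in_hotspot):
    """Update the counter after the access resolves."""
    if hit_in_hotspot:
        hotspot_hits[block_addr] = min(3, hotspot_hits[block_addr] + 1)
    else:
        hotspot_hits[block_addr] = max(0, hotspot_hits[block_addr] - 1)

train(0x40, True)
train(0x40, True)
print(predict(0x40, recently_filtered=False))   # -> "hotspot"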

9 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
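A toy sketch of the idea (the unbounded cache and the strictly sequential access pattern are illustrative assumptions): on every reference to line i, line i+1 is also fetched, so a sequential walk misses only on its very first reference.

cache = set()
fetches = 0

def reference(line):
    """Access a cache line and always prefetch its sequential successor."""
    global fetches
    hit = line in cache
    if not hit:
        cache.add(line)
        fetches += 1
    if line + 1 not in cache:     # prefetch on every reference, hit or miss
        cache.add(line + 1)
        fetches += 1
    return hit

hits = sum(reference(l) for l in range(8))   # sequential walk of 8 lines
print(hits, fetches)                          # 7 hits, 9 fetches from memory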

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
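A minimal sketch of maintaining the inclusion property in a two-level hierarchy (the capacities and FIFO replacement here are illustrative assumptions): whenever L2 evicts a block, any copy in L1 is back-invalidated, so L1 contents always remain a subset of L2.

from collections import OrderedDict

L1_CAP, L2_CAP = 4, 8
l1, l2 = OrderedDict(), OrderedDict()   # used as FIFO caches of block addresses

def fill(block):
    """Bring a block into L2 and then L1, evicting as needed."""
    if block not in l2:
        if len(l2) >= L2_CAP:
            victim, _ = l2.popitem(last=False)
            l1.pop(victim, None)        # back-invalidate to preserve inclusion
        l2[block] = True
    if block not in l1:
        if len(l1) >= L1_CAP:
            l1.popitem(last=False)
        l1[block] = True

def inclusion_holds():
    return all(b in l2 for b in l1)

for b in range(12):
    fill(b)
    assert inclusion_holds()
print("inclusion maintained across", len(l2), "L2-resident blocks")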

236 citations