Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache (a miss or victim cache) is placed between a cache and its refill path, and stream buffers hold prefetched data in a buffer rather than in the cache itself.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
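To make the victim-caching idea concrete, here is a minimal simulation sketch, not taken from the paper: a direct-mapped L1 backed by a small fully-associative victim cache with LRU replacement. The line size, cache size, and number of victim entries are illustrative assumptions.

from collections import OrderedDict

LINE_SIZE = 32          # bytes per cache line (assumed)
NUM_SETS = 256          # direct-mapped L1 with 256 lines (assumed)
VICTIM_ENTRIES = 4      # small fully-associative victim cache (assumed)

l1 = [None] * NUM_SETS  # l1[set] holds a tag or None
victim = OrderedDict()  # line address -> True, kept in LRU order

def access(addr):
    """Classify one reference as 'hit', 'victim_hit', or 'miss'."""
    line = addr // LINE_SIZE
    idx, tag = line % NUM_SETS, line // NUM_SETS

    if l1[idx] == tag:                      # ordinary L1 hit
        return "hit"

    if line in victim:                      # one-cycle hit in the victim cache:
        victim.pop(line)                    # swap that line back into L1 and
        if l1[idx] is not None:             # move the displaced L1 line out
            victim[l1[idx] * NUM_SETS + idx] = True
        l1[idx] = tag
        return "victim_hit"

    # Miss in both: fetch from the next level; the displaced L1 line (the
    # "victim") goes into the victim cache, evicting its LRU entry if full.
    if l1[idx] is not None:
        victim[l1[idx] * NUM_SETS + idx] = True
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)
    l1[idx] = tag
    return "miss"

# Two addresses that map to the same set would ping-pong as conflict misses in
# a plain direct-mapped cache; with the victim cache they become one-cycle hits.
trace = [0x0000, 0x2000, 0x0000, 0x2000, 0x0000]
print([access(a) for a in trace])   # ['miss', 'miss', 'victim_hit', 'victim_hit', 'victim_hit']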


Citations
Journal ArticleDOI
TL;DR: This work couples the exploration of memory modules with their connectivity, evaluating a wide range of connectivity architectures across cost, performance, and energy, and uses a heuristic to prune the design space, guiding the exploration towards the most promising designs.
Abstract: Memory accesses represent a major bottleneck in embedded systems power and performance. Traditionally, designers tried to alleviate this problem by relying on a simple cache hierarchy, or a limited use of special purpose memory modules such as stream buffers. Although real-life applications contain a large number of memory references to a diverse set of data structures, a significant percentage of all memory accesses in the application are generated from a few memory instructions that exhibit predictable, well-known access patterns; this creates an opportunity for memory customization, targeting the needs of these access patterns. We present APEX, an approach that extracts, analyzes and clusters the most active access patterns in the application, and aggressively customizes the memory architecture to match the needs of the application. Moreover, though the memory modules are important, the rate at which the memory system can produce the data for the CPU is significantly impacted by the connectivity architecture between the memory subsystem and the CPU. Thus, it is critical to consider the connectivity architecture early in the design flow, in conjunction with the memory architecture. We couple the exploration of memory modules together with their connectivity, to evaluate a wide range of cost, performance, and energy connectivity architectures. We use a heuristic to prune the design space, guiding the exploration towards the most promising designs. We present experiments on a set of large real-life benchmarks, showing significant performance improvements for varied cost and power characteristics, allowing the designer to evaluate customized memory and connectivity configurations for embedded systems.
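As a rough illustration of the access-pattern extraction step described above (a hedged sketch of my own, not the APEX tool itself), one can classify each memory instruction by the stride between its successive addresses; instructions with a stable stride are candidates for customized memory modules such as stream buffers.

from collections import defaultdict

def classify_patterns(trace):
    """trace: list of (pc, address) pairs. Return pc -> 'strided' or 'irregular'."""
    addrs = defaultdict(list)
    for pc, addr in trace:
        addrs[pc].append(addr)
    patterns = {}
    for pc, seq in addrs.items():
        strides = {b - a for a, b in zip(seq, seq[1:])}
        patterns[pc] = "strided" if len(strides) == 1 else "irregular"
    return patterns

# The load at PC 0x400 walks an array with stride 4; the load at PC 0x500
# chases pointers, so its addresses have no single stride.
trace = [(0x400, 0x1000), (0x500, 0x09A0), (0x400, 0x1004),
         (0x500, 0x0230), (0x400, 0x1008), (0x500, 0x07C4)]
print(classify_patterns(trace))   # PC 0x400 -> 'strided', PC 0x500 -> 'irregular'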

10 citations

Journal ArticleDOI
TL;DR: A novel balanced cache organization is proposed, which balances accesses to cache sets at the granularity of cache subarrays; it requires only one cycle to access all cache hits and has the same access time as a direct-mapped cache.
Abstract: It is observed that the limited memory space of direct-mapped caches is not used in a balanced way and therefore incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder through which the mapping of memory reference addresses to cache subarrays can be optimized, so that conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of the balanced cache is lower on average than that of a same-sized two-way set-associative cache and can be as low as that of a same-sized four-way set-associative cache for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to access all cache hits and has the same access time as a direct-mapped cache.
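The programmable subarray decoder can be pictured with a small sketch (an illustration under assumed sizes, not the paper's hardware design): a lookup table sits between the index bits that select a subarray and the subarrays themselves, so the address-to-subarray mapping can be reprogrammed to even out the load on heavily used subarrays.

NUM_SUBARRAYS = 8        # assumed
SETS_PER_SUBARRAY = 64   # assumed

# Programmable subarray decoder: identity mapping by default; entries can be
# reprogrammed so that index values crowding one subarray are steered elsewhere.
subarray_map = list(range(NUM_SUBARRAYS))

def cache_index(line_addr):
    """Map a cache-line address to (subarray, set within that subarray)."""
    set_bits = line_addr % (NUM_SUBARRAYS * SETS_PER_SUBARRAY)
    subarray = subarray_map[set_bits // SETS_PER_SUBARRAY]
    return subarray, set_bits % SETS_PER_SUBARRAY

print(cache_index(0x40))     # (1, 0) with the default identity decoder
subarray_map[1] = 5          # reprogram: steer that group of sets to subarray 5
print(cache_index(0x40))     # (5, 0) after remapping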

10 citations

Proceedings Article
30 Apr 2020
TL;DR: This work proposes a new approach that uses GNNs to learn fused representations of general source code and its execution; this leads to improved performance over similar methods that do not use execution and opens the door to applying GNN models to new tasks that would not be feasible from static code alone.
Abstract: As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn static representations of source code, these representations do not understand how code executes at runtime. In this work, we propose a new approach using GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related algorithm classification task.

10 citations

DissertationDOI
01 Jan 1994
TL;DR: This thesis presents a new hardware prefetching scheme based on dynamic interpretation of the instruction stream; it yields reductions in stall cycles of up to 30% or more for both scientific and general-purpose programs and reduces instruction stall cycles compared with the OBL technique.
Abstract: It is well known that memory latency is a major deterrent to achieving the maximum possible performance of today's high-speed RISC processors. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. Many methods, ranging from software to hardware solutions, have been studied with varying amounts of success. Most techniques have concentrated on data prefetching. However, our simulations show that the CPU is stalled up to 50% of the time waiting for instructions. The instruction memory latency reduction technique typically used in CPU designs today is the one-block look-ahead (OBL) method. In this thesis, I present a new hardware prefetching scheme based on dynamic interpretation of the instruction stream. This is done by adding a small pipeline to the cache that scans forward in the instruction stream, interpreting each instruction and predicting the future execution path. It then prefetches what it predicts the CPU will be executing in the near future. The pipelined prefetching engine has been shown to be a very effective technique for decreasing the instruction stall cycles in typical on-chip cache memories used today. It performs well, yielding reductions in stall cycles of up to 30% or more for both scientific and general-purpose programs, and has been shown to reduce the number of instruction stall cycles compared with the OBL technique as well. The idea of sub-line prefetching was also studied. It was thought that prefetching full cache lines might impose too much overhead in terms of bus bandwidth, so prefetches should fill only partial cache lines instead. However, it was determined that prefetching partial cache lines shows no benefit when dealing with cache lines smaller than 128 bytes.
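For contrast with the thesis' pipelined prefetcher, the baseline one-block look-ahead scheme is simple enough to sketch (the line size and bookkeeping below are my assumptions): on every instruction fetch of line N, line N+1 is prefetched if it is not already resident.

LINE_SIZE = 64              # bytes per instruction-cache line (assumed)

icache = set()              # resident instruction-cache lines (capacity ignored)
stats = {"misses": 0, "prefetches": 0}

def fetch(pc):
    line = pc // LINE_SIZE
    if line not in icache:          # demand miss
        stats["misses"] += 1
        icache.add(line)
    if line + 1 not in icache:      # one-block look-ahead prefetch
        stats["prefetches"] += 1
        icache.add(line + 1)

# Straight-line code benefits fully from OBL; the thesis' prefetch pipeline
# instead decodes ahead of the CPU so it can also follow predicted branches.
for pc in range(0, 1024, 4):
    fetch(pc)
print(stats)                        # {'misses': 1, 'prefetches': 16}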

10 citations

Journal Article
TL;DR: In this paper, the authors show that the deterministic competitive ratio for the (n, k)-companion cache is (n+1)(k+1)-1, and the randomized competitive ratio is O(log n log k) and Ω( log n + log k).
Abstract: This paper is concerned with online caching algorithms for the (n, k)-companion cache, defined by Brehob et al. [3]. In this model the cache is composed of two components: a k-way set-associative cache and a companion fully-associative cache of size n. We show that the deterministic competitive ratio for this problem is (n+1)(k+1)-1, and the randomized competitive ratio is O(log n log k) and Ω(log n + log k).

10 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This paper surveys cache design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
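Many of the design parameters surveyed above ultimately act through the miss ratio and the miss penalty; a small worked example (with assumed, illustrative numbers) shows how they combine into average memory-access time.

# AMAT = hit_time + miss_ratio * miss_penalty
hit_time = 1         # cycles for a cache hit (assumed)
miss_penalty = 20    # cycles to fetch a line from main memory (assumed)

for miss_ratio in (0.10, 0.05, 0.02):   # e.g. effect of a larger cache or a better fetch policy
    amat = hit_time + miss_ratio * miss_penalty
    print(f"miss ratio {miss_ratio:.2f} -> AMAT {amat:.1f} cycles")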

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
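A small worked example (illustrative numbers, not from the paper) shows the rough equivalence: doubling the issue width at a fixed cycle time and halving the cycle time at single issue both double the peak instruction rate.

base_cycle_ns = 10.0                           # assumed base cycle time

superscalar_rate = 2 / base_cycle_ns           # degree-2 superscalar: 2 instructions per cycle
superpipelined_rate = 1 / (base_cycle_ns / 2)  # degree-2 superpipelined: 1 per half-length cycle

print(superscalar_rate, superpipelined_rate)   # 0.2 and 0.2 instructions per ns: same peak rate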

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
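A short worked calculation (with assumed numbers, not the paper's measurements) shows how hiding part of the miss-stall time maps onto an effective CPU speedup in the range the abstract quotes.

miss_stall_fraction = 0.30   # fraction of run time spent stalled on misses (assumed)
stalls_hidden = 0.50         # fraction of those stalls hidden by prefetching (assumed)

old_time = 1.0
new_time = old_time - miss_stall_fraction * stalls_hidden
speedup = old_time / new_time
print(f"effective speedup: {speedup:.2f}x ({(speedup - 1) * 100:.0f}% faster)")   # about 18% faster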

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies; a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
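A minimal sketch (my own illustration, not the paper's formal conditions) of what the inclusion property buys: if every line resident in L1 is also resident in L2, a bus snoop that misses in L2 can safely be ignored without probing L1.

def inclusion_holds(l1_lines, l2_lines):
    """True if every L1-resident line is also resident in L2."""
    return set(l1_lines) <= set(l2_lines)

l2 = {0x100, 0x140, 0x180, 0x1C0}
l1 = {0x100, 0x180}
print(inclusion_holds(l1, l2))   # True: a snoop miss in L2 implies a miss in L1

# If L2 evicts a line still held by L1 without also invalidating the L1 copy,
# inclusion is violated and snoops can no longer be filtered by L2 alone.
l2.discard(0x180)
print(inclusion_holds(l1, l2))   # False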

236 citations