Report

Aspects of cache memory and instruction buffer performance

TL;DR: Techniques are developed in this dissertation to efficiently evaluate direct-mapped and set-associative caches and to examine instruction caches for single-chip RISC microprocessors; it is demonstrated that instruction buffers will be preferred to target instruction buffers in future RISC microprocessors implemented on single CMOS chips.
Abstract: Techniques are developed in this dissertation to efficiently evaluate direct-mapped and set-associative caches. These techniques are used to study associativity in CPU caches and to examine instruction caches for single-chip RISC microprocessors. This research is motivated in general by the importance of cache memories to computer performance, and more specifically by work done to design the caches in SPUR, a multiprocessor workstation designed at UC Berkeley. The studies focus not only on abstract measures of performance, such as miss ratios, but also include, when appropriate, detailed implementation factors, such as access times and gate delays. The simulation algorithms developed compute miss ratios for numerous alternative caches with one pass through an address trace, provided all caches have the same block size and use demand fetching and LRU replacement. One algorithm (forest simulation) simulates direct-mapped caches by relying on inclusion, the property that larger caches contain a superset of the data in smaller caches. The other algorithm (all-associativity simulation) simulates a broader class of direct-mapped and set-associative caches than could previously be studied with a one-pass algorithm, although somewhat less efficiently than forest simulation, since inclusion does not hold. The analysis of set-associative caches yields two major results. First, constant factors are obtained that relate the miss ratios of set-associative caches to the miss ratios of other set-associative caches. Those results are then combined with sample cache implementations to show that, above certain cache sizes, direct-mapped caches have lower effective access times than set-associative caches, despite having higher miss ratios. Finally, instruction buffers and target instruction buffers are examined as organizations for instruction memory on single-chip microprocessors. The analysis focuses closely on implementation considerations, including the interaction between instruction fetches, instruction prefetches, and data references, and uses the SPUR RISC design as the case study. Results show the effects of varying numerous design parameters, suggest some superior designs, and demonstrate that instruction buffers will be preferred to target instruction buffers in future RISC microprocessors implemented on single CMOS chips.
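As a minimal illustration of the one-pass idea (not the dissertation's forest-simulation algorithm itself), the sketch below measures miss ratios for several direct-mapped caches in a single pass over an address trace; forest simulation gains its efficiency by additionally exploiting inclusion so that not every configuration must be probed on each reference. All names and parameters here are illustrative assumptions.

```python
# One pass over an address trace, measuring miss ratios for several
# direct-mapped caches at once. Naive illustration of the one-pass idea;
# forest simulation avoids probing every configuration per reference.

BLOCK = 16  # bytes per block; all simulated caches share one block size

def simulate_direct_mapped(trace, set_counts):
    """Return {num_sets: miss_ratio} after a single pass over `trace`."""
    tags = {s: [None] * s for s in set_counts}   # one tag array per cache
    misses = {s: 0 for s in set_counts}
    for addr in trace:
        block = addr // BLOCK
        for s in set_counts:
            index = block % s            # set index in this configuration
            tag = block // s
            if tags[s][index] != tag:    # miss: demand-fetch the block
                misses[s] += 1
                tags[s][index] = tag
    return {s: m / len(trace) for s, m in misses.items()}

# Example: blocks 0 and 256 conflict in the same set of both caches here.
print(simulate_direct_mapped([0, 16, 0, 4096, 0], [64, 256]))
```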
Citations
Proceedings Article
01 May 1990
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path (miss and victim caching), and prefetched data is placed in stream buffers rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss rather than the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
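For concreteness, here is a hedged sketch of the victim-caching idea described above: a direct-mapped cache backed by a tiny fully-associative buffer that holds evicted victims rather than requested lines. Sizes, names, and the swap details are illustrative assumptions, not the paper's exact hardware.

```python
from collections import OrderedDict

# Victim caching sketch: on a direct-mapped miss, check a small
# fully-associative buffer of recently evicted lines before going to
# the next level of the hierarchy.

class VictimCache:
    def __init__(self, num_sets=64, victim_entries=4, block=32):
        self.block, self.num_sets = block, num_sets
        self.lines = [None] * num_sets          # direct-mapped tag array
        self.victims = OrderedDict()            # fully-associative buffer
        self.victim_entries = victim_entries

    def access(self, addr):
        """Return 'hit', 'victim_hit', or 'miss'."""
        blk = addr // self.block
        idx, tag = blk % self.num_sets, blk // self.num_sets
        if self.lines[idx] == tag:
            return "hit"
        evicted_tag = self.lines[idx]
        self.lines[idx] = tag                   # fill (swap on a victim hit)
        result = "miss"
        if blk in self.victims:                 # conflict miss caught here:
            del self.victims[blk]               # only a short miss penalty
            result = "victim_hit"
        if evicted_tag is not None:             # victim moves to the buffer
            self.victims[evicted_tag * self.num_sets + idx] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)  # replace oldest victim
        return result
```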

1,481 citations

Book
29 Sep 2011
TL;DR: The Fifth Edition of Computer Architecture focuses on the dramatic shift toward mobile clients and cloud computing, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.
Abstract: The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation. The Fifth Edition of Computer Architecture focuses on this dramatic shift, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices. Each chapter includes two real-world examples, one mobile and one datacenter, to illustrate this revolutionary change. The edition is updated to cover the mobile computing revolution; emphasizes the two most important topics in architecture today, memory hierarchy and parallelism in all its forms; develops common themes throughout each chapter (power, performance, cost, dependability, protection, programming models, and emerging trends, "What's Next"); includes three review appendices in the printed text, with additional reference appendices available online; and includes updated Case Studies and completely new exercises.

984 citations

Journal Article
TL;DR: Although all-associativity simulation is theoretically less efficient than forest simulation or stack simulation (a commonly used simulation algorithm), in practice it is not much slower and allows the simulation of many more caches with a single pass through an address trace.
Abstract: The authors present new and efficient algorithms for simulating alternative direct-mapped and set-associative caches and use them to quantify the effect of limited associativity on the cache miss ratio. They introduce an algorithm, forest simulation, for simulating alternative direct-mapped caches and generalize one, which they call all-associativity simulation, for simulating alternative direct-mapped, set-associative, and fully-associative caches. The authors find that although all-associativity simulation is theoretically less efficient than forest simulation or stack simulation (a commonly used simulation algorithm), in practice it is not much slower and allows the simulation of many more caches with a single pass through an address trace. The authors also provide data and insight into how varying associativity affects the miss ratio.
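The property that makes such one-pass simulation possible under LRU can be sketched directly: for a fixed number of sets, a reference hits in an A-way cache exactly when its depth in the per-set LRU stack is below A, so one pass yields miss ratios for every associativity at that set count. This is an illustrative simplification (all-associativity simulation also varies the number of sets); the names and parameters below are assumptions.

```python
from collections import defaultdict

# One pass, many associativities: record each reference's depth in the
# per-set LRU stack; a depth of at least A means a miss in an A-way cache.

def miss_ratios_by_assoc(trace, num_sets, max_assoc, block=16):
    stacks = defaultdict(list)              # per-set LRU stacks of blocks
    depth_counts = [0] * (max_assoc + 1)    # depth == max_assoc means miss
    for addr in trace:
        blk = addr // block
        stack = stacks[blk % num_sets]
        depth = stack.index(blk) if blk in stack else max_assoc
        if depth < max_assoc:
            stack.pop(depth)                # re-reference: remove old entry
        depth_counts[depth] += 1
        stack.insert(0, blk)                # most recently used on top
        del stack[max_assoc:]               # track only max_assoc entries
    n = len(trace)
    # A reference misses in an A-way cache iff its depth is at least A.
    return {a: sum(depth_counts[a:]) / n for a in range(1, max_assoc + 1)}
```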

555 citations

Proceedings Article
01 Aug 1991
TL;DR: In this article, a new hardware prefetching scheme based on the prediction of the execution of the instruction stream and its associated operand references is proposed; the scheme consists of a reference prediction table, a look-ahead program counter, and their associated logic.
Abstract: Conventional cache prefetching approaches can be either hardware-based, generally using a one-block-lookahead technique, or compiler-directed, with insertion of non-blocking prefetch instructions. We introduce a new hardware scheme based on the prediction of the execution of the instruction stream and associated operand references. It consists of a reference prediction table and a look-ahead program counter with its associated logic. With this scheme, data with regular access patterns is preloaded, independently of the stride size, and preloading of data with irregular access patterns is prevented. We evaluate our design through trace-driven simulation by comparing it with a pure data cache approach under three different memory access models. Our experiments show that this scheme is very effective in reducing the data access penalty for scientific programs and that it has moderate success for other applications.
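The following is a hedged sketch of a stride-based reference prediction table in the spirit of this scheme; the field names and the simple confidence rule are illustrative assumptions, and the look-ahead program counter that drives the table is omitted to keep the example short.

```python
# Each load PC tracks its last address and stride; a stride seen twice in
# a row is trusted, and further matches trigger a prefetch of addr+stride.

class ReferencePredictionTable:
    def __init__(self):
        self.table = {}   # load PC -> (last_addr, stride, confident)

    def access(self, pc, addr):
        """Record a reference by instruction `pc` to `addr`; return an
        address worth prefetching, or None for irregular patterns."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, stride, confident = self.table[pc]
        new_stride = addr - last_addr
        if confident and new_stride == stride:
            self.table[pc] = (addr, stride, True)
            return addr + stride          # regular pattern: preload ahead
        # Stride changed or not yet trusted: become confident only after
        # seeing the same stride twice in a row.
        self.table[pc] = (addr, new_stride, new_stride == stride)
        return None
```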

458 citations

Proceedings Article
01 Dec 1997
TL;DR: The results affirm that significant instruction-level parallelism can be exploited in integer programs (2 to 6 instructions per cycle) and quantify the value of successively doubling the number of distributed elements.
Abstract: Traces are dynamic instruction sequences constructed and cached by hardware. A microarchitecture organized around traces is presented as a means for efficiently executing many instructions per cycle. Trace processors exploit both control flow and data flow hierarchy to overcome complexity and architectural limitations of conventional superscalar processors by (1) distributing execution resources based on trace boundaries and (2) applying control and data prediction at the trace level rather than individual branches or instructions. Three sets of experiments using the SPECInt95 benchmarks are presented. (i) A detailed evaluation of trace processor configurations: the results affirm that significant instruction-level parallelism can be exploited in integer programs (2 to 6 instructions per cycle). We also isolate the impact of distributed resources, and quantify the value of successively doubling the number of distributed elements. (ii) A trace processor with data prediction applied to inter-trace dependences: potential performance improvement with perfect prediction is around 45% for all benchmarks. With realistic prediction, gcc achieves an actual improvement of 10%. (iii) Evaluation of aggressive control flow: some benchmarks benefit from control independence by as much as 10%.
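As a rough illustration of the trace abstraction, a trace can be carved out of the dynamic instruction stream as sketched below. The termination limits (16 instructions, 3 branches) are assumptions for the example, not the paper's exact parameters.

```python
# Build traces: contiguous dynamic instruction sequences that end at a
# maximum length or a maximum number of embedded branches.

def build_traces(dynamic_stream, max_len=16, max_branches=3):
    """`dynamic_stream` yields (pc, is_branch) pairs in execution order."""
    trace, branches = [], 0
    for pc, is_branch in dynamic_stream:
        trace.append(pc)
        branches += is_branch
        if len(trace) == max_len or branches == max_branches:
            yield tuple(trace)       # one unit for a distributed element
            trace, branches = [], 0
    if trace:
        yield tuple(trace)
```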

374 citations

References
Journal Article
TL;DR: One of the basic limitations of a digital computer is the size of its available memory; providing the programmer with a "virtual" memory whose address range is sufficiently large can remove this limitation, assuming that means are provided for automatic execution of the memory-overlay functions.
Abstract: One of the basic limitations of a digital computer is the size of its available memory. In most cases, it is neither feasible nor economical for a user to insist that every problem program fit into memory. The number of words of information in a program often exceeds the number of cells (i.e., word locations) in memory. The only way to solve this problem is to assign more than one program word to a cell. Since a cell can hold only one word at a time, extra words assigned to the cell must be held in external storage. Conventionally, overlay techniques are employed to exchange memory words and external-storage words whenever needed; this, of course, places an additional planning and coding burden on the programmer. For several reasons, it would be advantageous to rid the programmer of this function by providing him with a “virtual” memory larger than his program. An approach that permits him to use a sufficiently large address range can accomplish this objective, assuming that means are provided for automatic execution of the memory-overlay functions.
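A minimal sketch of the mechanism this implies, assuming a simple page table with FIFO replacement (the policy and sizes are illustrative): the fault handler performs the memory-overlay exchange automatically instead of the programmer coding it by hand.

```python
from collections import deque

# Demand paging sketch: the program sees a large virtual address space; a
# page table maps resident pages to frames, and faults trigger automatic
# eviction and (notional) loading from external storage.

PAGE = 4096

class DemandPagedMemory:
    def __init__(self, num_frames=4):
        self.page_table = {}                 # virtual page -> frame number
        self.free = deque(range(num_frames))
        self.resident = deque()              # FIFO order of resident pages

    def translate(self, vaddr):
        page, offset = divmod(vaddr, PAGE)
        if page not in self.page_table:      # page fault
            if not self.free:                # evict the oldest page
                victim = self.resident.popleft()
                self.free.append(self.page_table.pop(victim))
            frame = self.free.popleft()      # load page from external store
            self.page_table[page] = frame
            self.resident.append(page)
        return self.page_table[page] * PAGE + offset
```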

1,708 citations

Journal Article
TL;DR: A new and efficient method is described for determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Abstract: The design of efficient storage hierarchies generally involves the repeated running of "typical" program address traces through a simulated storage system while various hierarchy design parameters are adjusted. This paper describes a new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms. The technique depends on an algorithm classification, called "stack algorithms," examples of which are "least frequently used," "least recently used," "optimal," and "random replacement" algorithms. The techniques yield the exact access frequency to each storage device, which can be used to estimate the overall performance of actual storage hierarchies.
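For LRU, one of the stack algorithms named above, the stack property can be sketched directly: one pass over a page trace builds a histogram of stack distances, from which the miss ratio of every fully-associative LRU memory size follows, since a reference hits at capacity C exactly when its stack distance is at most C. This illustrative sketch covers LRU only, not the paper's full class of algorithms.

```python
# Stack simulation for LRU: one pass yields exact miss ratios for every
# memory capacity at once via the stack-distance histogram.

def lru_stack_distance_histogram(page_trace):
    stack, hist = [], {}                 # most recently used page at index 0
    for page in page_trace:
        if page in stack:
            d = stack.index(page) + 1    # 1-based stack distance
            stack.pop(d - 1)
        else:
            d = float("inf")             # first touch: miss at every size
        hist[d] = hist.get(d, 0) + 1
        stack.insert(0, page)
    return hist

def miss_ratio(hist, capacity):
    total = sum(hist.values())
    return sum(c for d, c in hist.items() if d > capacity) / total
```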

1,329 citations

Journal Article
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
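A minimal sketch of one "prefetch everything" policy of the flavor evaluated above: every reference to block b also fetches block b+1. The fully-associative LRU cache model here is an assumption made to keep the example short.

```python
from collections import OrderedDict

# One-block-lookahead prefetching: each demand reference also touches the
# next sequential block; only demand misses are counted.

def demand_miss_ratio_with_prefetch(block_trace, capacity):
    cache, misses = OrderedDict(), 0

    def touch(blk, demand):
        nonlocal misses
        if blk in cache:
            cache.move_to_end(blk)           # refresh LRU position
        else:
            if demand:
                misses += 1                  # count demand misses only
            cache[blk] = True
            if len(cache) > capacity:
                cache.popitem(last=False)    # evict least recently used

    for blk in block_trace:
        touch(blk, demand=True)
        touch(blk + 1, demand=False)         # one-block lookahead prefetch
    return misses / len(block_trace)
```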

315 citations

Journal Article
01 May 1986
TL;DR: A range of schemes for reducing branch cost, focusing on both static (compile-time) and dynamic (hardware-assisted) branch prediction, is examined from quantitative performance and implementation viewpoints.
Abstract: Pipelining is the major organizational technique that computers use to reach higher single-processor performance. A fundamental disadvantage of pipelining is the loss incurred due to branches that require stalling or flushing the pipeline. Both hardware solutions and architectural changes have been proposed to overcome these problems. This paper examines a range of schemes for reducing branch cost, focusing on both static (compile-time) and dynamic (hardware-assisted) prediction of branches. These schemes are investigated from quantitative performance and implementation viewpoints.
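One standard dynamic (hardware-assisted) scheme of the kind surveyed is a table of 2-bit saturating counters indexed by branch address; the sketch below is a textbook example of the idea, not necessarily one of the paper's exact configurations.

```python
# 2-bit saturating counter branch predictor: counters in 0..3, where
# values 2 and 3 predict taken; updates nudge the counter toward the
# actual outcome without overshooting the 0..3 range.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries        # start at "weakly taken"

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2   # True means taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```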

283 citations