Journal ArticleDOI

Cache Memories

01 Sep 1982 - ACM Computing Surveys - Vol. 14, Iss. 3, pp. 473-530
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: Consideration is given to practical implementation questions as well as to more abstract design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
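
For illustration only (not taken from the paper), a minimal trace-driven simulation of one design point, a direct-mapped, demand-fetch cache, showing how one of the surveyed parameters, line size, changes the miss ratio on a made-up sequential trace:

```python
# Minimal direct-mapped cache simulator illustrating two of the surveyed
# parameters: line size and cache size. Demand fetch, no prefetch; the
# address trace below is an invented example, not from the paper.

def miss_ratio(trace, cache_bytes, line_bytes):
    """Run an address trace through a direct-mapped cache; return miss ratio."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines              # one tag per cache line (direct mapping)
    misses = 0
    for addr in trace:
        line_addr = addr // line_bytes   # strip the offset within the line
        index = line_addr % n_lines      # placement: address mod number of lines
        tag = line_addr // n_lines
        if tags[index] != tag:           # cold or conflict miss
            misses += 1
            tags[index] = tag            # demand fetch: load on miss only
    return misses / len(trace)

# Sequential word scan: larger lines exploit spatial locality.
trace = list(range(0, 4096, 4))
for line in (16, 32, 64):
    print(f"line={line:3d}B  miss ratio={miss_ratio(trace, 1024, line):.3f}")
```
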
Citations
Book
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions. * synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development * presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs * describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies * illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs Table of Contents 1 Introduction 2 Parallel Programs 3 Programming for Performance 4 Workload-Driven Evaluation 5 Shared Memory Multiprocessors 6 Snoop-based Multiprocessor Design 7 Scalable Multiprocessors 8 Directory-based Cache Coherence 9 Hardware-Software Tradeoffs 10 Interconnection Network Design 11 Latency Tolerance 12 Future Directions APPENDIX A Parallel Benchmark Suites

1,571 citations

Proceedings ArticleDOI
01 May 1990
TL;DR: In this article, hardware techniques to improve the performance of caches are presented: a small fully-associative cache (miss or victim cache) placed between a cache and its refill path, and stream buffers that hold prefetched data outside the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
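
A sketch of the victim-caching idea in Python (the cache geometry, line size, and conflicting trace are invented for illustration; the paper's own evaluation is trace-driven on real benchmarks):

```python
# Direct-mapped L1 backed by a tiny fully-associative victim cache holding
# recently evicted lines; a victim hit swaps lines instead of going to the
# next memory level. All sizes here are illustrative assumptions.

from collections import OrderedDict

def simulate(trace, n_sets, victim_entries, line_bytes=32):
    l1 = [None] * n_sets                 # direct-mapped L1: one tag per set
    victims = OrderedDict()              # fully associative, LRU-replaced
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        index, tag = line % n_sets, line // n_sets
        if l1[index] == tag:
            continue                     # L1 hit
        if line in victims:              # victim hit: short penalty, swap lines
            victims.pop(line)
        else:
            misses += 1                  # true miss: go to next level
        evicted = l1[index]
        if evicted is not None:
            victims[evicted * n_sets + index] = True  # full line address of victim
            if len(victims) > victim_entries:
                victims.popitem(last=False)           # evict LRU victim entry
        l1[index] = tag
    return misses

# Two interleaved arrays whose lines collide in the same sets, several passes:
trace = [a for _ in range(4) for i in range(64) for a in (i * 32, i * 32 + 8192)]
print("no victim cache :", simulate(trace, n_sets=256, victim_entries=0))
print("4-entry victims :", simulate(trace, n_sets=256, victim_entries=4))
```
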

1,481 citations

Journal ArticleDOI
TL;DR: Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.
Abstract: The memory coherence problem in designing and implementing a shared virtual memory on loosely coupled multiprocessors is studied in depth. Two classes of algorithms, centralized and distributed, for solving the problem are presented. A prototype shared virtual memory on an Apollo ring based on these algorithms has been implemented. Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.
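
As a rough illustration of the centralized class of algorithms (the details here are assumptions for the sketch, not the paper's exact protocol): a single manager records, per shared page, the owning processor and the set of read copies, and page faults become requests to that manager.

```python
# Toy centralized-manager bookkeeping for shared virtual memory. Message
# passing and page transfer are elided; only the coherence state is shown.

class CentralManager:
    def __init__(self):
        self.owner = {}       # page -> processor with write access
        self.copy_set = {}    # page -> processors holding read copies

    def read_fault(self, page, proc):
        # Reader is granted a copy; returns who to fetch the page from.
        self.copy_set.setdefault(page, set()).add(proc)
        return self.owner.get(page)

    def write_fault(self, page, proc):
        # Writer invalidates all other read copies, then takes ownership.
        invalidate = self.copy_set.pop(page, set()) - {proc}
        prev = self.owner.get(page)
        self.owner[page] = proc
        return prev, invalidate           # fetch source, copies to invalidate

mgr = CentralManager()
mgr.owner[0] = "P0"                       # P0 initially owns page 0
print(mgr.read_fault(0, "P1"))            # P1 reads: copy comes from P0
print(mgr.write_fault(0, "P2"))           # P2 writes: P1's copy is invalidated
```
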

1,139 citations

Book
29 Sep 2011
TL;DR: The Fifth Edition of Computer Architecture focuses on this dramatic shift in the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.
Abstract: The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation today. The Fifth Edition of Computer Architecture focuses on this dramatic shift, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices. Each chapter includes two real-world examples, one mobile and one datacenter, to illustrate this revolutionary change. * Updated to cover the mobile computing revolution * Emphasizes the two most important topics in architecture today: memory hierarchy and parallelism in all its forms * Develops common themes throughout each chapter: power, performance, cost, dependability, protection, programming models, and emerging trends ("What's Next") * Includes three review appendices in the printed text; additional reference appendices are available online * Includes updated case studies and completely new exercises

984 citations

Posted Content
TL;DR: In this article, the authors give a detailed and structured presentation of the publicly available information on SGX, along with a series of intelligent guesses about some important but undocumented aspects of it.
Abstract: Intel’s Software Guard Extensions (SGX) is a set of extensions to the Intel architecture that aims to provide integrity and confidentiality guarantees to security-sensitive computation performed on a computer where all the privileged software (kernel, hypervisor, etc.) is potentially malicious. This paper analyzes Intel SGX, based on the 3 papers [14, 78, 137] that introduced it, on the Intel Software Developer’s Manual [100] (which supersedes the SGX manuals [94, 98]), on an ISCA 2015 tutorial [102], and on two patents [108, 136]. We use the papers, reference manuals, and tutorial as primary data sources, and only draw on the patents to fill in missing information. This paper’s contributions are a summary of the Intel-specific architectural and micro-architectural details needed to understand SGX, a detailed and structured presentation of the publicly available information on SGX, a series of intelligent guesses about some important but undocumented aspects of SGX, and an analysis of SGX’s security properties.

834 citations

References
Journal ArticleDOI
TL;DR: One of the basic limitations of a digital computer is the size of its available memory; providing the programmer with a "virtual" memory, through a sufficiently large address range, removes this burden, assuming that means are provided for automatic execution of the memory-overlay functions.
Abstract: One of the basic limitations of a digital computer is the size of its available memory. In most cases, it is neither feasible nor economical for a user to insist that every problem program fit into memory. The number of words of information in a program often exceeds the number of cells (i.e., word locations) in memory. The only way to solve this problem is to assign more than one program word to a cell. Since a cell can hold only one word at a time, extra words assigned to the cell must be held in external storage. Conventionally, overlay techniques are employed to exchange memory words and external-storage words whenever needed; this, of course, places an additional planning and coding burden on the programmer. For several reasons, it would be advantageous to rid the programmer of this function by providing him with a “virtual” memory larger than his program. An approach that permits him to use a sufficiently large address range can accomplish this objective, assuming that means are provided for automatic execution of the memory-overlay functions.
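
The "automatic execution of the memory-overlay functions" amounts to address translation plus a fault handler. A toy Python sketch, with the page size, frame count, and FIFO replacement chosen arbitrarily for illustration:

```python
# Virtual memory larger than physical memory, in miniature: addresses are
# translated through a page table, and a fault handler performs the overlay
# (eviction + load) automatically instead of the programmer coding it.

from collections import deque

PAGE = 256                      # words per page (illustrative)
FRAMES = 4                      # physical memory: 4 frames (illustrative)

page_table = {}                 # virtual page -> physical frame
resident = deque()              # FIFO queue of resident virtual pages
faults = 0

def translate(vaddr):
    """Return a physical address, transparently handling page faults."""
    global faults
    vpage, offset = divmod(vaddr, PAGE)
    if vpage not in page_table:             # page fault
        faults += 1
        if len(resident) == FRAMES:         # memory full: overlay needed
            victim = resident.popleft()     # FIFO replacement
            frame = page_table.pop(victim)  # (victim would go to external storage)
        else:
            frame = len(resident)
        page_table[vpage] = frame           # (page would be read from storage)
        resident.append(vpage)
    return page_table[vpage] * PAGE + offset

for vaddr in range(0, 8 * PAGE, 64):        # touch 8 pages with only 4 frames
    translate(vaddr)
print("page faults:", faults)               # overlays happened automatically
```
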

1,708 citations

Journal ArticleDOI
TL;DR: A new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Abstract: The design of efficient storage hierarchies generally involves the repeated running of "typical" program address traces through a simulated storage system while various hierarchy design parameters are adjusted. This paper describes a new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms. The technique depends on an algorithm classification, called "stack algorithms," examples of which are "least frequently used," "least recently used," "optimal," and "random replacement" algorithms. The techniques yield the exact access frequency to each storage device, which can be used to estimate the overall performance of actual storage hierarchies.
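
The one-pass technique is easiest to see for LRU, the canonical stack algorithm: recording each reference's stack distance yields the hit count for every memory capacity at once. A minimal sketch (the trace is a made-up example, and production implementations use faster stack-distance data structures than a list):

```python
# One-pass evaluation in the spirit of the stack-algorithm result: for LRU,
# a single pass over the trace determines the miss ratio of all capacities.

def stack_distances(trace):
    """Yield the LRU stack distance of each reference (None = first touch)."""
    stack = []                            # most recently used page at the front
    for page in trace:
        if page in stack:
            d = stack.index(page)         # depth in the stack = stack distance
            stack.pop(d)
        else:
            d = None                      # infinite distance: compulsory miss
        stack.insert(0, page)
        yield d

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
dists = list(stack_distances(trace))

# Inclusion property: a C-page LRU memory hits exactly the references with
# stack distance < C, so every capacity is evaluated from the same pass.
for c in range(1, 5):
    hits = sum(1 for d in dists if d is not None and d < c)
    print(f"capacity {c}: miss ratio {1 - hits / len(trace):.2f}")
```
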

1,329 citations

Journal ArticleDOI
TL;DR: A new model, the “working set model,” is developed; a process's working set, defined to be the collection of its most recently used pages, provides knowledge vital to the dynamic management of paged memories.
Abstract: Probably the most basic reason behind the absence of a general treatment of resource allocation in modern computer systems is the lack of an adequate model for program behavior. In this paper a new model, the “working set model,” is developed. The working set of pages associated with a process, defined to be the collection of its most recently used pages, provides knowledge vital to the dynamic management of paged memories. “Process” and “working set” are shown to be manifestations of the same ongoing computational activity; then “processor demand” and “memory demand” are defined; and resource allocation is formulated as the problem of balancing demands against available equipment.
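
The working set W(t, tau) is directly computable from a reference trace as the set of distinct pages touched in the last tau references. A tiny sketch (the trace and window size are made up):

```python
# Working set sizes |W(t, tau)| over a reference trace: the number of
# distinct pages referenced in the window (t - tau, t].

def working_set_sizes(trace, tau):
    """Size of the working set at each time t = 1..len(trace)."""
    return [len(set(trace[max(0, t - tau):t])) for t in range(1, len(trace) + 1)]

trace = [1, 2, 1, 3, 2, 1, 4, 4, 4, 5]
print(working_set_sizes(trace, tau=3))
# A policy that keeps each process's working set resident adapts its memory
# allocation as these sizes grow and shrink.
```
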

995 citations

Book
01 Mar 1995
TL;DR: In this article, the authors describe the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units, based on a common data bus and register tagging scheme.
Abstract: This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units. Basic to these techniques is a simple common data busing and register tagging scheme which permits simultaneous execution of independent instructions while preserving the essential precedences inherent in the instruction stream. The common data bus improves performance by efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by 'looking ahead' about eight instructions, automatically optimizes the program execution on a local basis. The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. It may be used in almost any computer having multiple execution units and one or more 'accumulators'. Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.
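
A minimal sketch of the tagging-and-broadcast mechanism the abstract describes, reduced to plain Python bookkeeping; the station names, two-instruction example, and all data values are invented for illustration, and this is not the Model 91 implementation:

```python
# Registers and reservation-station operands hold either a value or the tag
# of the unit that will produce it; one result broadcast on the common data
# bus (CDB) fills every waiting copy at once.

regs = {"F0": 1.0, "F2": 2.0, "F4": None}    # None means "value pending"
reg_tag = {"F4": "RS1"}                      # F4 will be produced by station RS1

stations = {
    "RS1": {"op": "MUL", "src": [("val", 2.0), ("val", 3.0)]},
    "RS2": {"op": "ADD", "src": [("tag", "RS1"), ("val", 2.0)]},  # waits on RS1
}

def broadcast(tag, value):
    """CDB: one result fills every register and operand tagged with its producer."""
    for r, t in list(reg_tag.items()):
        if t == tag:
            regs[r] = value
            del reg_tag[r]
    for rs in stations.values():
        rs["src"] = [("val", value) if s == ("tag", tag) else s for s in rs["src"]]

def ready(name):
    return all(kind == "val" for kind, _ in stations[name]["src"])

print("RS2 ready before broadcast:", ready("RS2"))   # False: operand is a tag
broadcast("RS1", 6.0)                                # RS1's multiply completes
print("RS2 ready after broadcast: ", ready("RS2"))   # True: tag replaced by 6.0
print("F4 =", regs["F4"])                            # register filled by the CDB
```
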

784 citations

Journal ArticleDOI
Censier, Feautrier
TL;DR: A memory hierarchy has coherence problems as soon as one of its levels is split in several independent units which are not equally accessible from faster levels or processors.
Abstract: A memory hierarchy has coherence problems as soon as one of its levels is split in several independent units which are not equally accessible from faster levels or processors. The classical solution to these problems, as found for instance in multiprocessor, multicache systems, is to restore a degree of interdependence between such units through a set of high speed interconnecting buses. This solution is not entirely satisfactory, as it tends to reduce the throughput of the memory hierarchy and to increase its cost.
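
The alternative this paper develops (beyond the excerpt above) is directory-based: main memory keeps presence flags and a modified bit per line, so coherence actions are directed only at the caches actually holding a copy rather than broadcast on shared buses. A sketch of that bookkeeping, with the surrounding details assumed for brevity:

```python
# Per-line directory state for a multicache system: presence flags plus a
# modified bit, consulted on every miss. Messages and data are elided.

N_CACHES = 4

class DirectoryEntry:
    def __init__(self):
        self.present = [False] * N_CACHES   # which caches hold the line
        self.modified = False               # exactly one holder if set

    def read_miss(self, cache):
        if self.modified:                   # force the dirty copy back first
            owner = self.present.index(True)
            print(f"  write-back from cache {owner}")
            self.modified = False
        self.present[cache] = True          # grant a shared copy

    def write_miss(self, cache):
        for c, held in enumerate(self.present):
            if held and c != cache:
                print(f"  invalidate cache {c}")   # selective, not broadcast
        self.present = [c == cache for c in range(N_CACHES)]
        self.modified = True

line = DirectoryEntry()
line.read_miss(0)
line.read_miss(1)      # two shared copies
line.write_miss(2)     # invalidates caches 0 and 1 only
line.read_miss(3)      # forces a write-back from cache 2
```
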

700 citations