Showing papers on "Smart Cache published in 1987"


Journal ArticleDOI
TL;DR: In this article, the authors examined the cache miss ratio as a function of line size, and found that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat.
Abstract: The line (block) size of a cache memory is one of the parameters that most strongly affects cache performance. In this paper, we study the factors that relate to the selection of a cache line size. Our primary focus is on the cache miss ratio, but we also consider influences such as logic complexity, address tags, line crossers, I/O overruns, etc. The behavior of the cache miss ratio as a function of line size is examined carefully through the use of trace driven simulation, using 27 traces from five different machine architectures. The change in cache miss ratio as the line size varies is found to be relatively stable across workloads, and tables of this function are presented for instruction caches, data caches, and unified caches. An empirical mathematical fit is obtained. This function is used to extend previously published design target miss ratios to cover line sizes from 4 to 128 bytes and cache sizes from 32 bytes to 32K bytes; design target miss ratios are to be used to guide new machine designs. Mean delays per memory reference and memory (bus) traffic rates are computed as a function of line and cache size, and memory access time parameters. We find that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat. Longer line sizes are suitable for mainframes because of the higher bandwidth to main memory.

180 citations
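To make the delay trade-off above concrete, here is a minimal sketch of the kind of mean-delay-per-reference calculation the paper describes. The miss-ratio values and timing parameters below are illustrative assumptions of mine, not the paper's fitted design target numbers.

```python
# Hedged sketch of a mean-delay-per-reference calculation in the spirit of
# the study above. Miss ratios and timing parameters are invented for
# illustration, not the paper's design target miss ratios.

def mean_delay(miss_ratio, latency_cycles, line_bytes, bus_bytes_per_cycle,
               hit_cycles=1):
    """Average cycles per reference: hit time plus expected miss penalty.
    Miss penalty = fixed memory latency + line transfer time over the bus
    (the term that grows with line size)."""
    transfer_cycles = line_bytes / bus_bytes_per_cycle
    return hit_cycles + miss_ratio * (latency_cycles + transfer_cycles)

# Assume (illustratively) that doubling the line size cuts the miss ratio
# by roughly 30%, a crude stand-in for the paper's empirical fit.
miss = {4: 0.20, 8: 0.14, 16: 0.10, 32: 0.07, 64: 0.05, 128: 0.04}
for line_bytes in sorted(miss):
    d = mean_delay(miss[line_bytes], latency_cycles=10,
                   line_bytes=line_bytes, bus_bytes_per_cycle=4)
    print(f"line={line_bytes:4d} B  miss={miss[line_bytes]:.2f}  "
          f"delay={d:.2f} cycles/reference")
```

With these assumed numbers the delay curve bottoms out in the 16-64 byte range: short lines pay the fixed memory latency too often, while very long lines pay for long bus transfers.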


BookDOI
01 Jan 1987
TL;DR: This work develops an analytical cache model and efficient trace-driven analysis techniques (microcode-based tracing, trace sampling and stitching, and trace compaction) for accurately characterizing large-cache, multiprogramming, and multiprocessor cache performance.
Abstract (table of contents):
1 Introduction
  1.1 Overview of Cache Design
    1.1.1 Cache Parameters
    1.1.2 Cache Performance Evaluation Methodology
  1.2 Review of Past Work
  1.3 Then, Why This Research?
    1.3.1 Accurately Characterizing Large Cache Performance
    1.3.2 Obtaining Trace Data for Cache Analysis
    1.3.3 Developing Efficient and Accurate Cache Analysis Methods
  1.4 Contributions
  1.5 Organization
2 Obtaining Accurate Trace Data
  2.1 Current Tracing Techniques
  2.2 Tracing Using Microcode
  2.3 An Experimental Implementation
    2.3.1 Storage of Trace Data
    2.3.2 Recording Memory References
    2.3.3 Tracing Control
  2.4 Trace Description
  2.5 Applications in Performance Evaluation
  2.6 Extensions and Summary
3 Cache Analysis Techniques - An Analytical Cache Model
  3.1 Motivation and Overview
    3.1.1 The Case for the Analytical Cache Model
    3.1.2 Overview of the Model
  3.2 A Basic Cache Model
    3.2.1 Start-Up Effects
    3.2.2 Non-Stationary Effects
    3.2.3 Intrinsic Interference
  3.3 A Comprehensive Cache Model
    3.3.1 Set Size
    3.3.2 Modeling Spatial Locality and the Effect of Block Size
    3.3.3 Multiprogramming
  3.4 Model Validation and Applications
  3.5 Summary
4 Transient Cache Analysis - Trace Sampling and Trace Stitching
  4.1 Introduction
  4.2 Transient Behavior Analysis and Trace Sampling
    4.2.1 Definitions
    4.2.2 Analysis of Start-up Effects in Single Process Traces
    4.2.3 Start-up Effects in Multiprocess Traces
  4.3 Obtaining Longer Samples Using Trace Stitching
  4.4 Trace Compaction - Cache Filtering with Blocking
    4.4.1 Cache Filter
    4.4.2 Block Filter
    4.4.3 Implementation of the Cache and Block Filters
    4.4.4 Miss Rate Estimation
    4.4.5 Compaction Results
5 Cache Performance Analysis for System References
  5.1 Motivation
  5.2 Analysis of the Miss Rate Components due to System References
  5.3 Analysis of System Miss Rate
  5.4 Associativity
  5.5 Block Size
  5.6 Evaluation of Split Caches
6 Impact of Multiprogramming on Cache Performance
  6.1 Relative Performance of Multiprogramming Cache Techniques
  6.2 More on Warm Start versus Cold Start
  6.3 Impact of Shared System Code on Multitasking Cache Performance
  6.4 Process Switch Statistics and Their Effects on Cache Modeling
  6.5 Associativity
  6.6 Block Size
  6.7 Improving the Multiprogramming Performance of Caches
    6.7.1 Hashing
    6.7.2 A Hash-Rehash Cache
    6.7.3 Split Caches
7 Multiprocessor Cache Analysis
  7.1 Tracing Multiprocessors
  7.2 Characteristics of Traces
  7.3 Analysis
    7.3.1 General Methodology
    7.3.2 Multiprocess Interference in Large Virtual and Physical Caches
    7.3.3 Analysis of Interference Between Multiple Processors
    7.3.4 Blocks Containing Semaphores
8 Conclusions and Suggestions for Future Work
  8.1 Concluding Remarks
  8.2 Suggestions for Future Work
Appendices
  B.1 On the Stability of the Collision Rate
  B.2 Estimating Variations in the Collision Rate
  C Inter-Run Intervals and Spatial Locality
  D Summary of Benchmark Characteristics
  E Features of ATUM-2
    E.1 Distributing Trace Control to All Processors
    E.2 Provision of Atomic Accesses to Trace Memory
    E.3 Instruction Stream Compaction Using a Cache Simulated in Microcode
    E.4 Microcode Patch Space Conservation

118 citations
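As one example of the techniques cataloged above, the "cache filter" of Chapter 4 compacts a trace by discarding references that hit in a small cache, so later simulations of larger caches run on a much shorter trace. A minimal sketch, with filter geometry of my own choosing:

```python
# Hedged sketch of the "cache filter" compaction idea: run the trace
# through a small direct-mapped cache and keep only the misses. Sizes are
# illustrative choices, not the book's parameters.

def cache_filter(trace, num_sets=64, block_bytes=16):
    """Return the sub-trace of addresses that miss in the small filter cache."""
    tags = [None] * num_sets
    kept = []
    for addr in trace:
        blk = addr // block_bytes
        s = blk % num_sets
        if tags[s] != blk:        # filter-cache miss: install and keep
            tags[s] = blk
            kept.append(addr)
    return kept                   # hits in the filter cache are dropped

trace = [0x100, 0x104, 0x108, 0x140, 0x100, 0x104]
print([hex(a) for a in cache_filter(trace)])   # ['0x100', '0x140']
```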


Journal ArticleDOI
TL;DR: The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.
Abstract: MIPS-X is a 32-bit RISC microprocessor implemented in a conservative 2-µm, two-level-metal, n-well CMOS technology. High performance is achieved by using a nonoverlapping two-phase 20-MHz clock and executing one instruction every cycle. To reduce its memory bandwidth requirements, MIPS-X includes a 2-kbyte on-chip instruction cache. The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.

98 citations


Journal ArticleDOI
Douglas B. Terry1
TL;DR: A new approach to managing caches of hints suggests maintaining a minimum level of cache accuracy, rather than maximizing the cache hit ratio, in order to guarantee performance improvements.
Abstract: Caching reduces the average cost of retrieving data by amortizing the lookup cost over several references to the data. Problems with maintaining strong cache consistency in a distributed system can be avoided by treating cached information as hints. A new approach to managing caches of hints suggests maintaining a minimum level of cache accuracy, rather than maximizing the cache hit ratio, in order to guarantee performance improvements. The desired accuracy is based on the ratio of lookup costs to the costs of detecting and recovering from invalid cache entries. Cache entries are aged so that they get purged when their estimated accuracy falls below the desired level. The age thresholds are dictated solely by clients' accuracy requirements instead of being suggested by data storage servers or system administrators.

83 citations
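A minimal sketch of the policy described above. The cost-ratio threshold follows directly from the abstract; the exponential decay used to estimate a hint's accuracy (and its half-life) is my own assumption, since the abstract does not specify an aging model.

```python
import time

# Hedged sketch of accuracy-driven hint caching. The decay model below is
# an assumption, not the paper's.

def required_accuracy(lookup_cost, recovery_cost):
    """Minimum accuracy at which using a hint still beats a fresh lookup:
    trusting a hint costs (1 - a) * recovery_cost in expectation, which
    should not exceed lookup_cost, so a >= 1 - lookup_cost / recovery_cost."""
    return max(0.0, 1.0 - lookup_cost / recovery_cost)

class HintCache:
    def __init__(self, half_life_s, lookup_cost, recovery_cost):
        self.half_life_s = half_life_s
        self.min_accuracy = required_accuracy(lookup_cost, recovery_cost)
        self.entries = {}                         # key -> (hint, insert time)

    def _accuracy(self, age_s):
        return 0.5 ** (age_s / self.half_life_s)  # assumed decay of validity

    def get(self, key):
        if key in self.entries:
            hint, t0 = self.entries[key]
            if self._accuracy(time.time() - t0) >= self.min_accuracy:
                return hint                       # still accurate enough
            del self.entries[key]                 # aged below threshold: purge
        return None                               # caller does a real lookup

    def put(self, key, hint):
        self.entries[key] = (hint, time.time())
```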


Patent
02 Dec 1987
TL;DR: In this paper, a broadband branch history table is organized by cache line, which determines from the history of branches the next cache line to be referenced and uses that information for prefetching lines into the cache.
Abstract: Apparatus for fetching instructions in a computing system. A broadband branch history table is organized by cache line. The broadband branch history table determines from the history of branches the next cache line to be referenced and uses that information for prefetching lines into the cache.

71 citations
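The patent abstract describes a prefetch table organized by cache line; a toy software model of that idea follows. The direct single-successor dictionary and the `on_fetch`/`prefetch` interface are illustrative simplifications of mine, not the patented broadband branch history table design.

```python
# Toy model, per my reading of the abstract: remember which cache line
# followed each line, and prefetch the remembered successor on a revisit.

LINE_BYTES = 64   # assumed line size

class LinePrefetcher:
    def __init__(self):
        self.next_line = {}     # history: cache line -> observed successor line
        self.prev_line = None

    def on_fetch(self, addr, prefetch):
        """Called per instruction fetch; `prefetch(line)` starts a line fill."""
        line = addr // LINE_BYTES
        if self.prev_line is not None and line != self.prev_line:
            self.next_line[self.prev_line] = line   # record the transition
        predicted = self.next_line.get(line)
        if predicted is not None:
            prefetch(predicted)                      # prefetch predicted successor
        self.prev_line = line

pf = LinePrefetcher()
for a in [0x000, 0x004, 0x200, 0x204, 0x000, 0x004]:
    pf.on_fetch(a, prefetch=lambda line: print(f"prefetch line {line}"))
```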


Proceedings ArticleDOI
J. H. Chang1, H. Chao1, K. So1
01 Jun 1987
TL;DR: An innovative cache accessing scheme based on a high MRU (most recently used) hit ratio is proposed for the design of a one-cycle cache in a CMOS implementation of System/370, and it is shown that with this scheme the cache access time is reduced by 30-35% while performance stays within 4% of a true one-cycle cache.
Abstract: An innovative cache accessing scheme based on high MRU (most recently used) hit ratio [1] is proposed for the design of a one-cycle cache in a CMOS implementation of System/370. It is shown that with this scheme the cache access time is reduced by 30-35% and the performance is within 4% of a true one-cycle cache. This cache scheme is proposed to be used in a VLSI System/370, which is organized to achieve high performance by taking advantage of the performance and integration level of an advanced CMOS technology with half-micron channel length [2]. Decisions on the system partition are based on technology limitations, performance considerations and future extendability. Design decisions on various aspects of the cache organization are based on trace simulations for both UP (uniprocessor) and MP (multiprocessor) configurations.

68 citations
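A sketch of what an MRU-based access scheme can look like: probe the most-recently-used way of the selected set first, and fall back to a slower full-set compare only when that fails. Geometry, cycle counts, and the replacement rule below are my own illustrative choices, not the paper's System/370 design.

```python
# Hedged sketch of MRU-first cache access: the fast path checks only the
# MRU way; a non-MRU hit or a miss costs an extra cycle.

class MRUCache:
    def __init__(self, num_sets=128, ways=4, block_bytes=32):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.mru = [0] * num_sets              # most-recently-used way per set

    def access(self, addr):
        """Return cycles spent: 1 on an MRU hit, 2 otherwise (fill omitted)."""
        blk = addr // self.block_bytes
        s, tag = blk % self.num_sets, blk // self.num_sets
        ways = self.tags[s]
        if ways[self.mru[s]] == tag:
            return 1                           # fast path: one-cycle MRU hit
        for w, t in enumerate(ways):
            if t == tag:
                self.mru[s] = w                # slower non-MRU hit
                return 2
        victim = (self.mru[s] + 1) % self.ways # naive victim choice
        ways[victim] = tag
        self.mru[s] = victim
        return 2
```

The scheme pays off because most hits are MRU hits, so the average access time approaches the one-cycle fast path.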


01 Jan 1987
TL;DR: These techniques are significant extensions to the stack analysis technique (Mattson et al., 1970) which computes the read miss ratio for all cache sizes in a single trace-driven simulation, and are used to study caching in a network file system.
Abstract: This dissertation describes innovative techniques for efficiently analyzing a wide variety of cache designs, and uses these techniques to study caching in a network file system. The techniques are significant extensions to the stack analysis technique (Mattson et al., 1970) which computes the read miss ratio for all cache sizes in a single trace-driven simulation. Stack analysis is extended to allow the one-pass analysis of: (1) writes in a write-back cache, including periodic write-back and deletions, important factors in file system cache performance. (2) sub-block or sector caches, including load-forward prefetching. (3) multi-processor caches in a shared-memory system, for an entire class of consistency protocols, including all of the well-known protocols. (4) client caches in a network file system, using a new class of consistency protocols. The techniques are completely general and apply to all levels of the memory hierarchy, from processor caches to disk and file system caches. The dissertation also discusses the use of hash tables and binary trees within the simulator to further improve performance for some types of traces. Using these techniques, the performance of all cache sizes can be computed in little more than twice the time required to simulate a single cache size, and often in just 10% more time. In addition to presenting techniques, this dissertation also demonstrates their use by studying client caching in a network file system. It first reports the extent of file sharing in a UNIX environment, showing that a few shared files account for two-thirds of all accesses, and nearly half of these are to files which are both read and written. It then studies different cache consistency protocols, write policies, and fetch policies, reporting the miss ratio and file server utilization for each. Four cache consistency protocols are considered: a polling protocol that uses the server for all consistency controls; a protocol designed for single-user files; one designed for read-only files; and one using write-broadcast to maintain consistency. It finds that the choice of consistency protocol has a substantial effect on performance; both the read-only and write-broadcast protocols showed half the misses and server load of the polling protocol. The choice of write or fetch policy made a much smaller difference.

64 citations
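The stack analysis being extended here is classic and compact enough to sketch: one pass over the trace maintains an LRU stack, and each reference's stack depth tells us it would hit in any fully associative LRU cache of at least that many blocks, so miss ratios for every cache size fall out of a single histogram. This covers only the basic read-miss case, not the dissertation's extensions.

```python
# Sketch of one-pass stack analysis (Mattson et al., 1970).

from collections import Counter

def stack_distances(trace):
    stack, hist, cold = [], Counter(), 0
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1   # 1 = top of stack
            hist[depth] += 1
            stack.remove(block)
        else:
            cold += 1                        # infinite distance: always a miss
        stack.insert(0, block)               # referenced block becomes MRU
    return hist, cold

def miss_ratio(trace, cache_blocks):
    hist, _cold = stack_distances(trace)
    hits = sum(n for depth, n in hist.items() if depth <= cache_blocks)
    return 1.0 - hits / len(trace)

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
for size in (1, 2, 3, 4):
    print(f"{size}-block LRU cache: miss ratio {miss_ratio(trace, size):.2f}")
```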


Patent
15 Sep 1987
TL;DR: In this paper, a mechanism for determining when the contents of a block in a cache memory have been rendered stale by DMA activity external to a processor and for marking the block stale in response to a positive determination is proposed.
Abstract: A mechanism for determining when the contents of a block in a cache memory have been rendered stale by DMA activity external to a processor and for marking the block stale in response to a positive determination. The commanding unit in the DMA transfer, prior to transmitting an address, asserts a cache control signal which conditions the processor to receive the address and determine whether there is a correspondence to the contents of the cache. If there is a correspondence, the processor marks the contents of that cache location for which there is a correspondence stale.

63 citations
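A toy model of the claimed mechanism: when the DMA master asserts the cache control signal and drives an address, the processor checks for a correspondence and, if found, marks that block stale. Class and method names are mine, and the model ignores bus timing entirely.

```python
# Hedged toy model of DMA-driven stale marking as described above.

class SnoopedCache:
    def __init__(self, num_sets=256, block_bytes=16):
        self.num_sets, self.block_bytes = num_sets, block_bytes
        self.lines = {}                          # set index -> (tag, stale?)

    def _locate(self, addr):
        blk = addr // self.block_bytes
        return blk % self.num_sets, blk // self.num_sets

    def dma_address_observed(self, addr):
        """Invoked when the cache control signal accompanies a DMA address."""
        idx, tag = self._locate(addr)
        if idx in self.lines and self.lines[idx][0] == tag:
            self.lines[idx] = (tag, True)        # correspondence: mark stale

    def fill(self, addr):
        idx, tag = self._locate(addr)
        self.lines[idx] = (tag, False)

    def cpu_read_hits(self, addr):
        idx, tag = self._locate(addr)
        return self.lines.get(idx) == (tag, False)   # stale counts as a miss
```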


Proceedings ArticleDOI
01 Jun 1987
TL;DR: In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multi-stage network, and it is shown that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.
Abstract: In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multi-stage network. The majority of the multiprocessor cache studies in the literature focus exclusively on the issue of cache coherence enforcement. However, there are other characteristics unique to such multiprocessors which create an environment for cache performance that is very different from that of many uniprocessors. Multiprocessor conditions are identified and modeled, including: 1) the cost of a cache coherence enforcement scheme; 2) the effect of a high degree of overlap between cache miss services; 3) the cost of a pin-limited data path between shared memory and caches; 4) the effect of a high degree of data prefetching; 5) the program behavior of a scientific workload as represented by 23 numerical subroutines; and 6) the parallel execution of programs. This model is used to show that the cache miss ratio is not a suitable performance measure in the multiprocessors of interest, and that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.

56 citations


Patent
Steven C. Steps1
16 Jun 1987
TL;DR: In this paper, a cache memory architecture which is two blocks wide and made up of a map RAM, two cache data RAMs (each one word wide), and a selection system was presented.
Abstract: Provided is a cache memory architecture which is two blocks wide and is made up of a map RAM, two cache data RAMs (each one word wide), and a selection system for selecting data from either one or both cache data RAMs, depending on whether the access is between cache and CPU, or between cache and main memory. The data stored in the two cache data RAMs has a particular address configuration. It consists of having data with even addresses of even pages and odd addresses of odd pages stored in one cache data RAM, with odd addresses and even addresses interleaved therein; and odd addresses of even pages and even addresses of odd pages stored in the other cache data RAM, with the odd addresses and even addresses interleaved but inverted relative to the other cache data RAM.

37 citations
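The address configuration in this patent can be captured in a few lines. A sketch, assuming an illustrative page size: within a page, even and odd words go to different RAMs, and the assignment inverts from one page to the next.

```python
# Sketch of the patent's mapping: RAM 0 holds even words of even pages and
# odd words of odd pages; RAM 1 holds the rest. Page size is an assumption.

WORDS_PER_PAGE = 1024   # assumed page size in words

def ram_select(addr):
    page, word = divmod(addr, WORDS_PER_PAGE)
    return (page ^ word) & 1     # parity of page and word picks the RAM

# Within a page, consecutive words alternate RAMs, so the two words of an
# aligned two-word-wide access can be fetched from both RAMs in parallel;
# the even/odd assignment flips between pages:
for a in (0, 1, 2, 3, 1024, 1025):
    print(f"addr {a:5d} -> cache data RAM {ram_select(a)}")
```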


Journal ArticleDOI
TL;DR: The role of cache memories and the factors that decide the success of a particular design are examined, and the operation of a cache memory is described and the specification of cache parameters is considered.
Abstract: The role of cache memories and the factors that decide the success of a particular design are examined. The operation of a cache memory is described. The specification of cache parameters is considered. Also discussed are the size of a cache, cache hierarchies, fetching and replacing, cache organization, updating the main memory, the use of two caches rather than one, virtual-address caches, and cache consistency.

Proceedings Article
01 Jan 1987
TL;DR: This work proposes a new architecture for shared memory multiprocessors, the crosspoint cache architecture, consisting of a crossbar interconnection network with a cache memory at each crosspoint switch; it also considers a two-level organization in which caches on the processor chips are used in addition to the caches in the crosspoints.
Abstract: We propose a new architecture for shared memory multiprocessors, the crosspoint cache architecture. This architecture consists of a crossbar interconnection network with a cache memory at each crosspoint switch. It assures cache coherence in hardware while avoiding the performance bottlenecks associated with previous hardware cache coherence solutions. We show this architecture is feasible for a 64 processor system. We also consider a two-level cache architecture in which caches on the processor chips are used in addition to the caches in the crosspoints. This two-level cache organization achieves the goals of fast memory access and low bus traffic in a cost effective way.
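A toy model of the coherence argument, under my own reading of the abstract: because every cached copy of memory module j's data lives at a crosspoint in column j, keeping copies coherent only requires acting within that one column. The write-invalidate policy below is an assumption; the paper may enforce coherence differently.

```python
# Hedged toy model of a crosspoint cache fabric: one cache per (processor,
# memory module) crossbar switch; writes invalidate along the module column.

class CrosspointFabric:
    def __init__(self, n_procs, n_mods):
        # caches[i][j]: blocks of module j cached at crosspoint (i, j)
        self.caches = [[set() for _ in range(n_mods)] for _ in range(n_procs)]

    def read(self, proc, mod, block):
        self.caches[proc][mod].add(block)      # fill crosspoint (proc, mod)

    def write(self, proc, mod, block):
        for i, row in enumerate(self.caches):
            if i != proc:
                row[mod].discard(block)        # invalidate along column `mod`
        self.caches[proc][mod].add(block)
```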



Patent
18 Dec 1987
TL;DR: In this paper, an improved interface between a processor and an external cache system is disclosed, having particular application for use in high speed computer systems, where a cache memory for storing frequently accessed data is coupled to a cache address register (CAR).
Abstract: An improved interface between a processor and an external cache system is disclosed, having particular application for use in high speed computer systems. A cache memory for storing frequently accessed data is coupled to a cache address register (CAR). A processor generates addresses which correspond to locations of desired data in the cache, and provides these addresses to the CAR. Upon the receipt of a clock signal, the CAR couples the address to the cache memory. The processor includes a data register for receiving accessed cache data over a data bus. Data is latched into the register upon the receipt of a clock signal. Due to inherent delays associated with digital logic comprising the processor, clock signals provided by an external clock are received by the CAR prior to their receipt by the processor's data register. This delay (a fraction of a clock cycle) provides additional time to access the cache memory before the data is expected on the data bus. The CAR is fabricated in a technology that allows it to drive the address into the large capacitive load of the cache memory in much less time than the processor itself could drive such a load. Thus, due to this buffering capability of the CAR, the cache can be much larger than what could be supported by the processor itself. The time expended sending the address from the processor to the CAR buffer, which would otherwise not be present if the processor addressed the cache directly from an internal register, does not subtract from the processor cycle time, since the processor can compute the cache address and send it to the CAR in less than the time required to access the cache.
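A back-of-the-envelope rendering of the timing argument: because the CAR sees the clock a fraction of a cycle before the processor's data register does, the cache gets more than one nominal cycle to respond. All numbers below are invented for illustration; the abstract gives no figures.

```python
# Hedged timing arithmetic for the CAR scheme above; every constant here is
# an assumption, not taken from the patent.

cycle_ns = 50.0       # assumed processor cycle time
skew_ns = 8.0         # clock reaches the CAR this much earlier than the
                      # processor's data register
car_drive_ns = 6.0    # assumed CAR clock-to-address-valid into the cache array

# Address is valid car_drive_ns after the early CAR clock edge; data must be
# ready at the late data-register edge one cycle later:
window_ns = cycle_ns + skew_ns - car_drive_ns
print(f"usable cache access window: {window_ns} ns (> one {cycle_ns} ns cycle)")
```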

Journal ArticleDOI
Reinder J. Bril1
TL;DR: An implementation independent description of states of blocks in a tightly coupled multi-processor system with private caches is presented, which distinguishes between (abstract) states of blocks and (implementation oriented) tags.
Abstract: This paper presents an implementation independent approach to cache memories with states of blocks and kinds of blocks. An implementation independent description of states of blocks in a tightly coupled multi-processor system with private caches is presented, which distinguishes between (abstract) states of blocks and (implementation oriented) tags. Two approaches to cache consistency protocols are described using abstract states: the ownership approach and the responsibleness approach. Blocks are looked at as constituents of logical entities such as segments. Different kinds of blocks are distinguished based on different kinds of segments. Whenever caches are able to distinguish between different kinds of blocks, cache schemes, block sizes, and other implementation related aspects may be chosen independently, facilitating a separation of concerns.
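A small sketch of the separation the paper argues for: protocol rules are stated over abstract block states, while each implementation maps those states onto its own tag encoding. The particular state names and bit patterns below are my examples, not the paper's.

```python
from enum import Enum

# Hedged illustration: abstract states vs. implementation-oriented tags.

class BlockState(Enum):              # abstract, implementation independent
    INVALID = "invalid"
    SHARED = "shared"
    OWNED = "owned"                  # ownership approach: the owner supplies data

# Two caches may realize the same abstract states with different tag bits:
TAGS_CACHE_A = {BlockState.INVALID: 0b00, BlockState.SHARED: 0b01,
                BlockState.OWNED: 0b11}
TAGS_CACHE_B = {BlockState.INVALID: 0b10, BlockState.SHARED: 0b00,
                BlockState.OWNED: 0b01}

def must_supply_data(state: BlockState) -> bool:
    """A protocol-level rule stated on abstract states, valid under any
    tag encoding a particular cache picks."""
    return state is BlockState.OWNED
```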


Proceedings Article
01 Jan 1987
TL;DR: A systematic measurement-based methodology for characterizing the amount of concurrency present in a workload, and the effect of concurrency on system performance indices such as cache miss rate and bus activity, is developed.
Abstract: A systematic measurement-based methodology for characterizing the amount of concurrency present in a workload, and the effect of concurrency on system performance indices such as cache miss rate and bus activity, is developed. Hardware and software instrumentation of an Alliant FX/8 was used to obtain data from a real workload environment. Results show that 35% of the workload is concurrent, with the concurrent periods typically using all available processors. Measurements of periods of change in concurrency show uneven usage of processors during these times. Other system measures, including cache miss rate and processor bus activity, are analyzed with respect to the concurrency measures. The probability of a cache miss is seen to increase with concurrency. The change in cache miss rate is much more sensitive to the fraction of concurrent code in the workload than to the number of processors active during concurrency. Regression models are developed to quantify the relationships between cache miss rate, bus activity, and the concurrency measures. The model for cache miss rate predicts an increase in the median miss rate value of as much as 300% for a 100% increase in concurrency in the workload.
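The regression-model result can be illustrated with a short fit. The data points below are invented placeholders, not the Alliant FX/8 measurements; a multiplicative-growth model in the concurrent fraction is merely one form consistent with the reported sensitivity.

```python
# Hedged illustration of fitting a miss-rate-vs-concurrency regression.
# The sample points are fabricated stand-ins for real measurements.

import numpy as np

conc_fraction = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # fraction of concurrent code
miss_rate     = np.array([0.020, 0.028, 0.039, 0.055, 0.078])

# Fit log(miss rate) linear in concurrency, i.e. a multiplicative model.
slope, intercept = np.polyfit(conc_fraction, np.log(miss_rate), 1)
print(f"each +0.1 of concurrency multiplies the miss rate by "
      f"~{np.exp(slope * 0.1):.2f} under this fit")
```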

01 Nov 1987
TL;DR: The external interface of MIPS-X is described, which has been designed to optimize the paths between the processor, the external cache and the coprocessors.
Abstract: MIPS-X is a 20-MIPS-peak VLSI processor designed at Stanford University. This document describes the external interface of MIPS-X and the organization of the MIPS-X processor system, including the external cache and coprocessors. The external interface has been designed to optimize the paths between the processor, the external cache and the coprocessors. The signals used by the processor and their timing are documented here. Signal use and timings during exceptions and cache misses are also shown.


Proceedings ArticleDOI
03 Feb 1987
TL;DR: Two statistical models are developed to estimate the effect of chip failures on cache memory systems; the first predicts the degradation in the expected read time, taking into account the different failure modes of a memory chip.
Abstract: Two statistical models are developed to estimate the effect of chip failures on cache memory systems. The first one predicts the degradation in the expected Read time taking into account the different failure modes of a memory chip. It is seen that there is a significant degradation in the expected access time after only four weeks of operation even if failed words are deallocated. The second model estimates the degradation in the Miss ratio due to the deallocation of failed sections of cache. Both models can help in setting suitable preventive maintenance schedules as well as in making design decisions.
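To show the flavor of the first model, here is a minimal expected-read-time calculation in which a reference to a deallocated (failed) word always misses. The failure probabilities and timing constants are illustrative assumptions of mine, not the paper's model parameters.

```python
# Hedged sketch: expected read time when failed words are deallocated and
# accesses to them become misses. All constants are illustrative.

def expected_read_ns(p_word_failed, hit_ns=50.0, miss_penalty_ns=500.0,
                     miss_ratio=0.05):
    """A reference to a deallocated word always misses; otherwise the cache
    behaves normally with the given baseline miss ratio."""
    p_miss = p_word_failed + (1.0 - p_word_failed) * miss_ratio
    return hit_ns + p_miss * miss_penalty_ns

for p in (0.0, 0.001, 0.01, 0.05):
    print(f"P(word deallocated)={p:.3f}: E[read]={expected_read_ns(p):.1f} ns")
```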

Book ChapterDOI
01 Jan 1987
TL;DR: This work presents new on-line algorithms to be used by the caches of snoopy cache multiprocessor systems to decide which blocks to retain and which to drop in order to minimize communication over the bus.
Abstract: In a snoopy cache multiprocessor system, each processor has a cache in which it stores blocks of data. Each cache is connected to a bus used to communicate with the other caches and with main memory. Each cache monitors the activity on the bus and in its own processor and decides which blocks of data to keep and which to discard. For several of the proposed architectures for snoopy caching systems, we present new on-line algorithms to be used by the caches to decide which blocks to retain and which to drop in order to minimize communication over the bus. We prove that, for any sequence of operations, our algorithms' communication costs are within a constant factor of the minimum required for that sequence; for some of our algorithms we prove that no on-line algorithm has this property with a smaller constant.
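A sketch of the rent-or-buy flavor of rule such competitive algorithms use (the details here are illustrative, not the paper's exact algorithms): keep paying per-word bus costs to retain a block until the accumulated cost reaches the cost of refetching the whole block, then drop it. Against any request sequence, a rule of this shape wastes at most a constant factor relative to dropping at the ideal moment.

```python
# Hedged sketch of a rent-or-buy retention counter for snoopy caching.

class BlockCounter:
    def __init__(self, block_words):
        self.refetch_cost = block_words   # bus cost to reread the block later
        self.spent = 0                    # bus cost paid so far to keep it
        self.cached = True

    def on_remote_write(self):
        """Another cache writes a word of this block; keeping our copy costs
        one bus word to stay consistent."""
        if not self.cached:
            return
        self.spent += 1
        if self.spent >= self.refetch_cost:
            self.cached = False           # drop: further remote writes are free
```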