
Showing papers on "Cache invalidation published in 1987"


Journal ArticleDOI
TL;DR: In this article, the authors examined the cache miss ratio as a function of line size, and found that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat.
Abstract: The line (block) size of a cache memory is one of the parameters that most strongly affects cache performance. In this paper, we study the factors that relate to the selection of a cache line size. Our primary focus is on the cache miss ratio, but we also consider influences such as logic complexity, address tags, line crossers, I/O overruns, etc. The behavior of the cache miss ratio as a function of line size is examined carefully through the use of trace driven simulation, using 27 traces from five different machine architectures. The change in cache miss ratio as the line size varies is found to be relatively stable across workloads, and tables of this function are presented for instruction caches, data caches, and unified caches. An empirical mathematical fit is obtained. This function is used to extend previously published design target miss ratios to cover line sizes from 4 to 128 bytes and cache sizes from 32 bytes to 32K bytes; design target miss ratios are to be used to guide new machine designs. Mean delays per memory reference and memory (bus) traffic rates are computed as a function of line and cache size, and memory access time parameters. We find that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat. Longer line sizes are suitable for mainframes because of the higher bandwidth to main memory.

180 citations
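The delay and traffic computation described above can be illustrated with a small sketch. The miss ratios and memory timing parameters below are illustrative assumptions, not the paper's measured design-target values; the functions only show how mean delay per reference and memory traffic trade off as line size grows.

```python
# A minimal sketch of the delay/traffic trade-off the paper computes; the
# miss-ratio values and memory timing parameters are assumed for illustration.

def mean_delay_per_reference(miss_ratio, latency_cycles, line_bytes, bus_bytes_per_cycle):
    """Average stall cycles per memory reference for a given line size."""
    transfer_cycles = line_bytes / bus_bytes_per_cycle
    return miss_ratio * (latency_cycles + transfer_cycles)

def memory_traffic_per_reference(miss_ratio, line_bytes):
    """Bytes moved between memory and cache per reference (fetch traffic only)."""
    return miss_ratio * line_bytes

# Assumed miss ratios for a fixed cache size: larger lines cut the miss ratio,
# but with diminishing returns.
assumed_miss_ratio = {8: 0.10, 16: 0.07, 32: 0.05, 64: 0.04, 128: 0.035}

for line in sorted(assumed_miss_ratio):
    d = mean_delay_per_reference(assumed_miss_ratio[line], latency_cycles=10,
                                 line_bytes=line, bus_bytes_per_cycle=4)
    t = memory_traffic_per_reference(assumed_miss_ratio[line], line)
    print(f"line={line:4d}B  delay={d:5.2f} cycles/ref  traffic={t:5.2f} B/ref")
```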


Journal ArticleDOI
TL;DR: This paper develops an analytical model for cache-reload transients and compares the model to observations based on several address traces and shows that the size of the transient is related to the normal distribution function.
Abstract: This paper develops an analytical model for cache-reload transients and compares the model to observations based on several address traces. The cache-reload transient is the set of cache misses that occur when a process is reinitiated after being suspended temporarily. For example, an interrupt program that runs periodically experiences a reload transient at each initiation. The reload transient depends on the cache size and on the sizes of the footprints in the cache of the competing programs, where a program footprint is defined to be the set of lines in the cache in active use by the program. The model shows that the size of the transient is related to the normal distribution function. A simulation based on program-address traces shows excellent agreement between the model and the observations.

131 citations
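A much simplified collision estimate conveys the footprint idea; the binomial survival model below is an assumption made for illustration and is not the paper's normal-distribution result.

```python
# Back-of-the-envelope reload-transient estimate: how many of program A's
# footprint lines are displaced while program B runs, assuming a direct-mapped
# cache and random, independent placement of both footprints.

def expected_reload_misses(cache_lines, footprint_a, footprint_b):
    """Expected extra misses when program A resumes after B ran."""
    survive = (1.0 - 1.0 / cache_lines) ** footprint_b   # P[a given A line untouched by B]
    return footprint_a * (1.0 - survive)

print(expected_reload_misses(cache_lines=1024, footprint_a=300, footprint_b=400))
```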


BookDOI
01 Jan 1987
TL;DR: This monograph develops efficient techniques for cache performance evaluation, including microcode-based address tracing, an analytical cache model covering start-up, non-stationary, and intrinsic interference effects, trace sampling and compaction methods, and analyses of system references, multiprogramming, and multiprocessor interference.
Abstract: 1 Introduction.- 1.1 Overview of Cache Design.- 1.1.1 Cache Parameters.- 1.1.2 Cache Performance Evaluation Methodology.- 1.2 Review of Past Work.- 1.3 Then, Why This Research?.- 1.3.1 Accurately Characterizing Large Cache Performance.- 1.3.2 Obtaining Trace Data for Cache Analysis.- 1.3.3 Developing Efficient and Accurate Cache Analysis Methods.- 1.4 Contributions.- 1.5 Organization.- 2 Obtaining Accurate Trace Data.- 2.1 Current Tracing Techniques.- 2.2 Tracing Using Microcode.- 2.3 An Experimental Implementation.- 2.3.1 Storage of Trace Data.- 2.3.2 Recording Memory References.- 2.3.3 Tracing Control.- 2.4 Trace Description.- 2.5 Applications in Performance Evaluation.- 2.6 Extensions and Summary.- 3 Cache Analyses Techniques - An Analytical Cache Model.- 3.1 Motivation and Overview.- 3.1.1 The Case for the Analytical Cache Model.- 3.1.2 Overview of the Model.- 3.2 A Basic Cache Model.- 3.2.1 Start-Up Effects.- 3.2.2 Non-Stationary Effects.- 3.2.3 Intrinsic Interference.- 3.3 A Comprehensive Cache Model.- 3.3.1 Set Size.- 3.3.2 Modeling Spatial Locality and the Effect of Block Size.- 3.3.3 Multiprogramming.- 3.4 Model Validation and Applications.- 3.5 Summary.- 4 Transient Cache Analysis - Trace Sampling and Trace Stitching.- 4.1 Introduction.- 4.2 Transient Behavior Analysis and Trace Sampling.- 4.2.1 Definitions.- 4.2.2 Analysis of Start-up Effects in Single Process Traces.- 4.2.3 Start-up Effects in Multiprocess Traces.- 4.3 Obtaining Longer Samples Using Trace Stitching.- 4.4 Trace Compaction - Cache Filtering with Blocking.- 4.4.1 Cache Filter.- 4.4.2 Block Filter.- 4.4.3 Implementation of the Cache and Block Filters.- 4.4.4 Miss Rate Estimation.- 4.4.5 Compaction Results.- 5 Cache Performance Analysis for System References.- 5.1 Motivation.- 5.2 Analysis of the Miss Rate Components due to System References.- 5.3 Analysis of System Miss Rate.- 5.4 Associativity.- 5.5 Block Size.- 5.6 Evaluation of Split Caches.- 6 Impact of Multiprogramming on Cache Performance.- 6.1 Relative Performance of Multiprogramming Cache Techniques.- 6.2 More on Warm Start versus Cold Start.- 6.3 Impact of Shared System Code on Multitasking Cache Performance.- 6.4 Process Switch Statistics and Their Effects on Cache ModeUng.- 6.5 Associativity.- 6.6 Block Size.- 6.7 Improving the Multiprogramming Performance of Caches.- 6.7.1 Hashing.- 6.7.2 A Hash-Rehash Cache.- 6.7.3 Split Caches.- 7 Multiprocessor Cache Analysis.- 7.1 Tracing Multiprocessors.- 7.2 Characteristics of Traces.- 7.3 Analysis.- 7.3.1 General Methodology.- 7.3.2 Multiprocess Interference in Large Virtual and Physical Caches.- 7.3.3 Analysis of Interference Between Multiple Processors.- 7.3.4 Blocks Containing Semaphores.- 8 Conclusions and Suggestions for Future Work.- 8.1 Concluding Remarks.- 8.2 Suggestions for Future Work.- Appendices.- B.1 On the Stability of the Collision Rate.- B.2 Estimating Variations in the Collision Rate.- C Inter-Run Intervals and Spatial Locality.- D Summary of Benchmark Characteristics.- E Features of ATUM-2.- E.1 Distributing Trace Control to All Processors.- E.2 Provision of Atomic Accesses to Trace Memory.- E.3 Instruction Stream Compaction Using a Cache Simulated in Microcode.- E.4 Microcode Patch Space Conservation.

118 citations


Journal ArticleDOI
TL;DR: The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.
Abstract: MIPS-X is a 32-b RISC microprocessor implemented in a conservative 2-µm, two-level-metal, n-well CMOS technology. High performance is achieved by using a nonoverlapping two-phase 20-MHz clock and executing one instruction every cycle. To reduce its memory bandwidth requirements, MIPS-X includes a 2-kbyte on-chip instruction cache. The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.

98 citations


Journal ArticleDOI
Douglas B. Terry1
TL;DR: A new approach to managing caches of hints suggests maintaining a minimum level of cache accuracy, rather than maximizing the cache hit ratio, in order to guarantee performance improvements.
Abstract: Caching reduces the average cost of retrieving data by amortizing the lookup cost over several references to the data. Problems with maintaining strong cache consistency in a distributed system can be avoided by treating cached information as hints. A new approach to managing caches of hints suggests maintaining a minimum level of cache accuracy, rather than maximizing the cache hit ratio, in order to guarantee performance improvements. The desired accuracy is based on the ratio of lookup costs to the costs of detecting and recovering from invalid cache entries. Cache entries are aged so that they get purged when their estimated accuracy falls below the desired level. The age thresholds are dictated solely by clients' accuracy requirements instead of being suggested by data storage servers or system administrators.

83 citations
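A rough sketch of the accuracy-driven aging policy described above. The minimum-accuracy formula and the exponential decay of hint validity are assumed interpretations of the approach, and the names (HintCache, minimum_accuracy) are hypothetical.

```python
# Hedged sketch: derive a minimum accuracy from the cost ratio, then purge
# hints whose estimated accuracy (modeled here with an assumed exponential
# decay of validity) has dropped below it.
import math, time

def minimum_accuracy(lookup_cost, recovery_cost):
    """Lowest hint accuracy at which using hints still beats always looking up.
    Assumed model: (1 - acc) * recovery_cost <= lookup_cost."""
    return max(0.0, 1.0 - lookup_cost / recovery_cost)

def age_threshold(mean_validity_lifetime, desired_accuracy):
    """Age at which estimated accuracy exp(-age/lifetime) falls to the target."""
    if desired_accuracy <= 0.0:
        return float("inf")            # hints are always worth keeping
    return -mean_validity_lifetime * math.log(desired_accuracy)

class HintCache:
    def __init__(self, lookup_cost, recovery_cost, mean_validity_lifetime):
        self.threshold = age_threshold(mean_validity_lifetime,
                                       minimum_accuracy(lookup_cost, recovery_cost))
        self.entries = {}              # key -> (value, insert_time)

    def put(self, key, value):
        self.entries[key] = (value, time.time())

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None:
            return None
        value, born = hit
        if time.time() - born > self.threshold:   # too old to trust: purge the hint
            del self.entries[key]
            return None
        return value
```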


Patent
02 Dec 1987
TL;DR: In this paper, a broadband branch history table is organized by cache line, which determines from the history of branches the next cache line to be referenced and uses that information for prefetching lines into the cache.
Abstract: Apparatus for fetching instructions in a computing system. A broadband branch history table is organized by cache line. The broadband branch history table determines from the history of branches the next cache line to be referenced and uses that information for prefetching lines into the cache.

71 citations
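A toy software analogue of the idea, not the patented hardware: remember which cache line followed each line and prefetch the predicted successor. The class and method names are hypothetical.

```python
# Branch-history-by-cache-line prefetcher sketch (assumed structure).

class LinePrefetcher:
    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        self.next_line = {}                 # current line -> predicted next line

    def line_of(self, addr):
        return addr // self.line_bytes

    def on_fetch(self, prev_addr, addr, prefetch):
        prev_line, line = self.line_of(prev_addr), self.line_of(addr)
        if prev_line != line:
            self.next_line[prev_line] = line    # record the observed line transition
        predicted = self.next_line.get(line)
        if predicted is not None and predicted != line:
            prefetch(predicted)                 # prefetch the predicted successor line

pf = LinePrefetcher(line_bytes=64)
prefetched = []
pf.on_fetch(prev_addr=0x0FC0, addr=0x2000, prefetch=prefetched.append)  # learn 0x0FC0's line -> 0x2000's line
pf.on_fetch(prev_addr=0x0F80, addr=0x0FC0, prefetch=prefetched.append)  # now predicts 0x2000's line next
print(prefetched)   # [128] with 64-byte lines
```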


Patent
27 Mar 1987
TL;DR: In this paper, the cache coherence system detects when the contents of storage locations in the cache memories of one or more of the data processors have been modified in conjunction with the activity of those data processors, and it responds to such detections by generating and storing in its cache invalidate table (CIT) memory a multiple-element linked list.
Abstract: A cache coherence system for a multiprocessor system including a plurality of data processors coupled to a common main memory. Each of the data processors includes an associated cache memory having storage locations therein corresponding to storage locations in the main memory. The cache coherence system for a data processor includes a cache invalidate table (CIT) memory having internal storage locations corresponding to locations in the cache memory of the data processor. The cache coherence system detects when the contents of storage locations in the cache memories of one or more of the data processors have been modified in conjunction with the activity of those data processors and is responsive to such detections to generate and store in its CIT memory a multiple-element linked list defining the locations in the cache memories of the data processors having modified contents. Each element of the list defines one of those cache storage locations and also identifies the location in the CIT memory of the next element in the list.

69 citations
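A rough software analogue of the cache invalidate table: one slot per cache location with a link field, threaded into a list of locations whose contents were modified elsewhere. The structure and names below are assumptions for illustration, not the patented circuit.

```python
class CacheInvalidateTable:
    """Sketch of a CIT: per-location link fields forming a list of modified locations."""

    def __init__(self, num_cache_locations):
        self.next = [None] * num_cache_locations   # link field per CIT slot
        self.listed = [False] * num_cache_locations
        self.head = None                           # first modified cache location

    def record_modified(self, cache_location):
        """Another processor modified data cached at this location."""
        if not self.listed[cache_location]:
            self.next[cache_location] = self.head
            self.head = cache_location
            self.listed[cache_location] = True

    def drain(self):
        """Walk the list, yielding locations to invalidate, and clear it."""
        loc = self.head
        while loc is not None:
            nxt = self.next[loc]
            yield loc
            self.listed[loc] = False
            self.next[loc] = None
            loc = nxt
        self.head = None

cit = CacheInvalidateTable(16)
cit.record_modified(3); cit.record_modified(9)
print(list(cit.drain()))   # [9, 3]
```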


01 Jan 1987
TL;DR: These techniques are significant extensions to the stack analysis technique (Mattson et al., 1970) which computes the read miss ratio for all cache sizes in a single trace-driven simulation, and are used to study caching in a network file system.
Abstract: This dissertation describes innovative techniques for efficiently analyzing a wide variety of cache designs, and uses these techniques to study caching in a network file system. The techniques are significant extensions to the stack analysis technique (Mattson et al., 1970) which computes the read miss ratio for all cache sizes in a single trace-driven simulation. Stack analysis is extended to allow the one-pass analysis of: (1) writes in a write-back cache, including periodic write-back and deletions, important factors in file system cache performance. (2) sub-block or sector caches, including load-forward prefetching. (3) multi-processor caches in a shared-memory system, for an entire class of consistency protocols, including all of the well-known protocols. (4) client caches in a network file system, using a new class of consistency protocols. The techniques are completely general and apply to all levels of the memory hierarchy, from processor caches to disk and file system caches. The dissertation also discusses the use of hash tables and binary trees within the simulator to further improve performance for some types of traces. Using these techniques, the performance of all cache sizes can be computed in little more than twice the time required to simulate a single cache size, and often in just 10% more time. In addition to presenting techniques, this dissertation also demonstrates their use by studying client caching in a network file system. It first reports the extent of file sharing in a UNIX environment, showing that a few shared files account for two-thirds of all accesses, and nearly half of these are to files which are both read and written. It then studies different cache consistency protocols, write policies, and fetch policies, reporting the miss ratio and file server utilization for each. Four cache consistency protocols are considered: a polling protocol that uses the server for all consistency controls; a protocol designed for single-user files; one designed for read-only files; and one using write-broadcast to maintain consistency. It finds that the choice of consistency protocol has a substantial effect on performance; both the read-only and write-broadcast protocols showed half the misses and server load of the polling protocol. The choice of write or fetch policy made a much smaller difference.

64 citations
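The core of the underlying stack analysis technique can be sketched in a few lines: one pass over the trace records LRU stack distances, from which the miss ratio of every fully associative LRU cache size follows. The dissertation's extensions (write-back policies, sector caches, multiprocessor consistency, client caches) are not shown here.

```python
# Minimal one-pass stack analysis in the spirit of Mattson et al. (1970).
from collections import defaultdict

def lru_stack_distances(trace):
    """Return {stack_distance: count}; a distance of None means a cold miss."""
    stack, counts = [], defaultdict(int)
    for block in trace:
        if block in stack:
            depth = stack.index(block)      # 0 = most recently used
            counts[depth] += 1
            stack.pop(depth)
        else:
            counts[None] += 1
        stack.insert(0, block)
    return counts

def miss_ratio(counts, cache_blocks):
    total = sum(counts.values())
    hits = sum(n for d, n in counts.items() if d is not None and d < cache_blocks)
    return 1.0 - hits / total

trace = [1, 2, 3, 1, 2, 4, 1, 5, 2, 1]
counts = lru_stack_distances(trace)        # single simulation pass
for size in (1, 2, 4, 8):                  # miss ratios for all sizes fall out at once
    print(size, round(miss_ratio(counts, size), 2))
```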


Patent
15 Sep 1987
TL;DR: In this paper, a mechanism for determining when the contents of a block in a cache memory have been rendered stale by DMA activity external to a processor and for marking the block stale in response to a positive determination is proposed.
Abstract: A mechanism for determining when the contents of a block in a cache memory have been rendered stale by DMA activity external to a processor and for marking the block stale in response to a positive determination. The commanding unit in the DMA transfer, prior to transmitting an address, asserts a cache control signal which conditions the processor to receive the address and determine whether there is a correspondence to the contents of the cache. If there is a correspondence, the processor marks the contents of that cache location for which there is a correspondence stale.

63 citations
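A simplified software model of the mechanism, with assumed direct-mapped placement: each DMA write address seen on the bus is looked up, and a matching block is marked stale.

```python
class SnoopingCache:
    """Sketch of a cache that marks blocks stale on matching DMA writes."""

    def __init__(self, num_blocks, block_bytes):
        self.num_blocks, self.block_bytes = num_blocks, block_bytes
        self.tags = [None] * num_blocks
        self.stale = [True] * num_blocks

    def _index_and_tag(self, addr):
        block = addr // self.block_bytes
        return block % self.num_blocks, block // self.num_blocks

    def fill(self, addr):
        idx, tag = self._index_and_tag(addr)
        self.tags[idx], self.stale[idx] = tag, False

    def snoop_dma_write(self, addr):
        """Called for each DMA write address broadcast with the cache control signal."""
        idx, tag = self._index_and_tag(addr)
        if self.tags[idx] == tag:
            self.stale[idx] = True          # block rendered stale by DMA activity

    def read_hits(self, addr):
        idx, tag = self._index_and_tag(addr)
        return self.tags[idx] == tag and not self.stale[idx]

c = SnoopingCache(num_blocks=8, block_bytes=16)
c.fill(0x40); c.snoop_dma_write(0x44)
print(c.read_hits(0x40))    # False: the DMA write made the block stale
```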


Dissertation
01 Jul 1987
TL;DR: This dissertation explores possible solutions to the cache coherence problem and identifies cache coherence protocols--solutions implemented entirely in hardware--as an attractive alternative.
Abstract: Shared-memory multiprocessors offer increased computational power and the programmability of the shared-memory model. However, sharing memory between processors leads to contention, which delays memory accesses. Adding a cache memory for each processor reduces the average access time, but it creates the possibility of inconsistency among cached copies. The cache coherence problem is keeping all cached copies of the same memory location identical. This dissertation explores possible solutions to the cache coherence problem and identifies cache coherence protocols--solutions implemented entirely in hardware--as an attractive alternative. Protocols for shared-bus systems are shown to be an interesting special case. Previously proposed shared-bus protocols are described using uniform terminology, and they are shown to divide into two categories: invalidation and distributed write. In invalidation protocols all other cached copies must be invalidated before any copy can be changed; in distributed write protocols all copies must be updated each time a shared block is modified. In each category, a new protocol is presented with better performance than previous schemes, based on simulation results. The simulation model and parameters are described in detail. Previous protocols for general interconnection networks are shown to contain flaws and to be costly to implement. A new class of protocols is presented that offers reduced implementation cost and expandability, while retaining a high level of performance, as illustrated by simulation results using a crossbar switch. All new protocols have been proven correct; one of the proofs is included. Previous definitions of cache coherence are shown to be inadequate and a new definition is presented. Coherence is compared and contrasted with other levels of consistency, which are also identified. The consistency of shared-bus protocols is shown to be naturally stronger than that of non-bus protocols. The first protocol of its kind is presented for a large hierarchical multiprocessor, using a bus-based protocol within each cluster and a general protocol in the network connecting the clusters to the shared main memory.

57 citations
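The two protocol categories can be contrasted with a toy sketch. This is not any specific protocol from the dissertation, just the defining write action of each category, with caches modeled as plain dictionaries.

```python
# Invalidation: purge other cached copies before a write completes.
def write_invalidate(caches, writer, block, value):
    for i, cache in enumerate(caches):
        if i != writer and block in cache:
            del cache[block]              # invalidate every other cached copy first
    caches[writer][block] = value

# Distributed write: push the new value into every cached copy on each write.
def write_update(caches, writer, block, value):
    caches[writer][block] = value
    for i, cache in enumerate(caches):
        if i != writer and block in cache:
            cache[block] = value          # update all other copies

caches = [{}, {"x": 1}, {"x": 1}]
write_invalidate(caches, 0, "x", 2)
print(caches)                             # copies in caches 1 and 2 are gone
```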


Proceedings ArticleDOI
01 Jun 1987
TL;DR: In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multi-stage network, and it is shown that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.
Abstract: In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multi-stage network. The majority of the multiprocessor cache studies in the literature focus exclusively on the issue of cache coherence enforcement. However, there are other characteristics unique to such multiprocessors which create an environment for cache performance that is very different from that of many uniprocessors. Multiprocessor conditions are identified and modeled, including: 1) the cost of a cache coherence enforcement scheme, 2) the effect of a high degree of overlap between cache miss services, 3) the cost of a pin-limited data path between shared memory and caches, 4) the effect of a high degree of data prefetching, 5) the program behavior of a scientific workload as represented by 23 numerical subroutines, and 6) the parallel execution of programs. This model is used to show that the cache miss ratio is not a suitable performance measure in the multiprocessors of interest and that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.
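A schematic cost comparison, with assumed numbers rather than the paper's model, of why miss ratio alone can mislead here: a larger block may lower the miss ratio yet cost more per reference once the network latency and pin-limited transfer time of each miss are charged.

```python
def time_per_reference(miss_ratio, network_latency, block_words, words_per_cycle,
                       overlap_factor=1.0):
    """Average cycles per reference; overlap_factor < 1 models overlapped miss services."""
    miss_penalty = network_latency + block_words / words_per_cycle
    return 1.0 + overlap_factor * miss_ratio * miss_penalty

small = time_per_reference(miss_ratio=0.06, network_latency=40, block_words=4,  words_per_cycle=1)
large = time_per_reference(miss_ratio=0.04, network_latency=40, block_words=32, words_per_cycle=1)
print(small, large)   # the larger block has the lower miss ratio but the higher cost per reference
```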

Patent
Steven C. Steps1
16 Jun 1987
TL;DR: In this paper, a cache memory architecture which is two blocks wide and made up of a map RAM, two cache data RAMs (each one word wide), and a selection system was presented.
Abstract: Provided is a cache memory architecture which is two blocks wide and is made up of a map RAM, two cache data RAMs (each one word wide), and a selection system for selecting data from either one or both cache data RAMs, depending on whether the access is between cache and CPU, or between cache and main memory. The data stored in the two cache data RAMs has a particular address configuration. It consists of having data with even addresses of even pages and odd addresses of odd pages stored in one cache data RAM, with odd addresses and even addresses interleaved therein; and odd addresses of even pages and even addresses of odd pages stored in the other cache data RAM, with the odd addresses and even addresses interleaved but inverted relative to the other cache data RAM.
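A small model of the address interleave described above. The parity rule below is an assumption about how the even/odd-page arrangement would be selected, used only to show that two adjacent words always fall in different RAMs.

```python
def data_ram_for(page, word_addr):
    """0 or 1: RAM 0 holds even words of even pages and odd words of odd pages."""
    return (word_addr & 1) ^ (page & 1)

# A CPU access reads a single RAM, while a cache <-> main-memory transfer of two
# adjacent words always finds them split across the two RAMs, so both RAMs can
# be accessed in the same cycle.
assert data_ram_for(page=2, word_addr=4) == 0   # even page, even word -> RAM 0
assert data_ram_for(page=2, word_addr=5) == 1   # even page, odd word  -> RAM 1
assert data_ram_for(page=3, word_addr=5) == 0   # odd page,  odd word  -> RAM 0
```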

Patent
31 Jul 1987
TL;DR: In this paper, a Hashing Indexer for a Branch Cache is proposed for use in a pipelined digital processor that employs macro-instructions utilizing interpretation by micro-insstructions.
Abstract: A Hashing Indexer For a Branch Cache for use in a pipelined digital processor that employs macro-instructions utilizing interpretation by micro-instructions. Each of the macro-instructions has an associated address and each of the micro instructions has an associated address. The hashing indexer includes a look-ahead-fetch system including a branch cache memory coupled to the prefetch section. An indexed table of branch target addressess each of which correspond to the address of a previously fetched instruction is stored in the branch cache memory. A predetermined number of bits representing the address of the macro-instruction being fetched is hashed with a predetermined number of bits representing the address of the micro-instruction being invoked. The indexer is used to apply the hashing result as an address to the branch memory in order to read out a unique predicted branch target address that is predictive of a branch for the hashed macro-instruction bits and micro-instruction bits. The hashing indexer disperses branch cache entries throughout the branch cache memory. Therefore, by hashing macro-instruction bits with micro-instruction bits and by dispersing the branch cache entries throughout the branch cache memory, the prediction rate of the system is increased.
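A toy version of the indexing idea, with an assumed table size and a simple XOR hash standing in for the patent's hashing circuit.

```python
BRANCH_CACHE_SLOTS = 1024             # assumed power-of-two table size

def branch_cache_index(macro_addr, micro_addr):
    """Mix macro- and micro-instruction address bits to spread entries over the table."""
    return (macro_addr ^ micro_addr) % BRANCH_CACHE_SLOTS

branch_cache = {}                      # index -> predicted branch target address

def record(macro_addr, micro_addr, target):
    branch_cache[branch_cache_index(macro_addr, micro_addr)] = target

def predict(macro_addr, micro_addr):
    return branch_cache.get(branch_cache_index(macro_addr, micro_addr))

record(0x4000, 0x12, target=0x4100)
print(hex(predict(0x4000, 0x12)))      # 0x4100
```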

Patent
21 Sep 1987
TL;DR: In this paper, a data processing system having a bus master, a cache, and a memory which is capable of transferring operands in bursts in response to a burst request signal provided by the bus master is described.
Abstract: A data processing system having a bus master, a cache, and a memory which is capable of transferring operands in bursts in response to a burst request signal provided by the bus master. The bus master will provide the burst request signal to the memory in order to fill a line in the cache only if there are no valid entries in that cache line. If a requested operand spans two cache lines, the bus master will defer the burst request signal until the end of the transfer of that operand, so that only the second cache line will be burst filled.
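A sketch of the bus master's decision rule as described above; the function name and boolean interface are assumptions made for illustration.

```python
def plan_access(cache_line_valid_entries, spans_two_lines):
    """Return (burst_now, burst_after_operand) for a missed operand access."""
    line_is_empty = cache_line_valid_entries == 0
    if not line_is_empty:
        return (False, False)          # line already holds valid entries: no burst fill
    if spans_two_lines:
        return (False, True)           # defer the burst; only the second line is burst filled
    return (True, False)               # empty line, single-line operand: burst fill now

print(plan_access(cache_line_valid_entries=0, spans_two_lines=False))  # (True, False)
print(plan_access(cache_line_valid_entries=2, spans_two_lines=False))  # (False, False)
```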

Patent
Stacey G. Lloyd1
16 Dec 1987
TL;DR: In this paper, a burst mode request is made for multiple words (k through n) included in an m-word line of data words (1 through m) to be transferred from the cache to the data processing unit.
Abstract: A method of updating a cache (10) backed by a main memory (12). The cache is used as an intermediate high-speed memory between the main memory and a data processing unit (14). A burst mode request is made for multiple words (k through n) included in an m-word line of data words (1 through m). The transfer takes place by first determining if the requested data words (k through n) reside in the cache. If they do, then the requested words (k through n) are transferred from the cache to the data processing unit. If they do not, then the requested words (k through n) are transferred simultaneously from the main memory both to the cache and to the data processing unit to thereby update the cache. This cache update is accomplished by first writing the last words of the line containing the requested words only to the cache (starting at word n+1 and ending at word k-1) and then writing the remaining words comprising the requested words (k through n) to the cache and the data processing unit simultaneously (starting at word k and ending at word n).
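The fill ordering can be made concrete with a small generator, using the word indices 1 through m from the text; the wrap-around arithmetic is the only assumption added.

```python
def fill_schedule(m, k, n):
    """Yield (word, destinations) in transfer order for a line of words 1..m."""
    word = n % m + 1                       # start at word n+1, wrapping past m
    while word != k:
        yield word, ("cache",)             # non-requested words go to the cache only
        word = word % m + 1
    while True:
        yield word, ("cache", "cpu")       # requested words k..n go to both at once
        if word == n:
            break
        word = word % m + 1

print(list(fill_schedule(m=8, k=3, n=5)))
# words 6,7,8,1,2 to the cache only, then words 3,4,5 to cache and CPU together
```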

Patent
19 Jun 1987
TL;DR: In this article, the index is utilized to access the cache to generate an output which includes a block corresponding to the index from each set of the cache, each block includes an address tag and data.
Abstract: A method of retrieving data from a multi-set cache memory in a computer system. An address, which includes an index, is presented by the processor to the cache memory. The index is utilized to access the cache to generate an output which includes a block corresponding to the index from each set of the cache. Each block includes an address tag and data. A portion of the address tag for all but one of the blocks is compared with a corresponding portion of the address. If the comparison results in a match, then the data from the block associated with the match is provided to the processor. If the comparison does not result in a match, then the data from the remaining block is provided to the processor. A full address tag comparison is done in parallel with the "lookaside tag" comparison to confirm a "hit."
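A toy model of the selection logic, with an assumed partial-tag width; in hardware the full-tag compare runs in parallel, here it is simply evaluated alongside.

```python
def select_block(blocks, addr_tag, partial_bits=4):
    """blocks: list of (full_tag, data) for one set; returns (data, hit)."""
    mask = (1 << partial_bits) - 1
    chosen = blocks[-1]                        # default: the remaining block
    for block in blocks[:-1]:                  # "lookaside tag" compare on all but one block
        if (block[0] & mask) == (addr_tag & mask):
            chosen = block
            break
    hit = chosen[0] == addr_tag                # full-tag compare confirms the hit
    return chosen[1], hit

blocks = [(0x3A1, "A"), (0x7B2, "B")]
print(select_block(blocks, addr_tag=0x3A1))    # ('A', True)
print(select_block(blocks, addr_tag=0x7B2))    # ('B', True) via the remaining block
```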

Patent
09 Jan 1987
TL;DR: In this article, a cache location selector selects locations in a cache for loading new information using either a valid chain, if not all locations already contain valid information, or a history loop otherwise.
Abstract: A cache location selector selects locations in a cache for loading new information using either a valid chain, if not all locations already contain valid information, or a history loop otherwise. The valid chain selects the "highest" location in the cache which does not already contain valid information. The history loop selects locations in accordance with a modified form of the First-In-Not-Used-First-Out (FINUFO) replacement scheme. Both the valid chain and the history loop are fully and efficiently implemented in hardware. During normal cache operation, both the valid chain and the history loop continuously seek an appropriate location to be used for the next load. As a result, that location is preselected well before the load is actually required.
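A software sketch of the two selection mechanisms; the FINUFO history loop is approximated here with a clock-style pointer over used bits, which simplifies the patented hardware.

```python
class LocationSelector:
    """Valid chain while empty locations remain, FINUFO-style loop afterwards."""

    def __init__(self, num_locations):
        self.valid = [False] * num_locations
        self.used = [False] * num_locations
        self.hand = 0                              # history-loop pointer

    def touch(self, loc):                          # called on a cache hit
        self.used[loc] = True

    def select(self):
        """Preselect the location to receive the next loaded line."""
        for loc in range(len(self.valid) - 1, -1, -1):   # valid chain: highest invalid location
            if not self.valid[loc]:
                return loc
        while self.used[self.hand]:                # history loop: first-in-not-used-first-out
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % len(self.valid)
        victim = self.hand
        self.hand = (self.hand + 1) % len(self.valid)
        return victim

    def load(self, loc):                           # called when a line is loaded
        self.valid[loc] = True
        self.used[loc] = True
```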

Journal ArticleDOI
TL;DR: The role of cache memories and the factors that decide the success of a particular design are examined, and the operation of a cache memory is described and the specification of cache parameters is considered.
Abstract: The role of cache memories and the factors that decide the success of a particular design are examined. The operation of a cache memory is described. The specification of cache parameters is considered. Also discussed are the size of a cache, cache hierarchies, fetching and replacing, cache organization, updating the main memory, the use of two caches rather than one, virtual-address caches, and cache consistency.


Proceedings Article
01 Jan 1987
TL;DR: This work proposes a new architecture for shared memory multiprocessors, the crosspoint cache architecture, which consists of a crossbar interconnection network with a cache memory at each crosspoint switch and considers a two-level cache architecture in which caches on the processor chips are used in addition to the caches in the crosspoints.
Abstract: We propose a new architecture for shared memory multiprocessors, the crosspoint cache architecture. This architecture consists of a crossbar interconnection network with a cache memory at each crosspoint switch. It assures cache coherence in hardware while avoiding the performance bottlenecks associated with previous hardware cache coherence solutions. We show this architecture is feasible for a 64 processor system. We also consider a two-level cache architecture in which caches on the processor chips are used in addition to the caches in the crosspoints. This two-level cache organization achieves the goals of fast memory access and low bus traffic in a cost-effective way.


Patent
James Gerald Brenza1
03 Apr 1987
TL;DR: In this paper, the authors propose a data processing system which contains a multi-level storage hierarchy, in which the two highest hierarchy levels (e.g. Ll and L2) are private to a single CPU, in order to be in close proximity to each other and to the CPU.
Abstract: The disclosure provides a data processing system which contains a multi-level storage hierarchy, in which the two highest hierarchy levels (e.g. L1 and L2) are private (not shared) to a single CPU, in order to be in close proximity to each other and to the CPU. Each cache has a data line length convenient to the respective cache. A common directory and an L1 control array (L1CA) are provided for the CPU to access both the L1 and L2 caches. The common directory contains and is addressed by the CPU requesting logical addresses, each of which is either a real/absolute address or a virtual address, according to whichever address mode the CPU is in. Each entry in the directory contains a logical address representation derived from a logical address that previously missed in the directory. A CPU request "hits" in the directory if its requested address is in any private cache (e.g. in L1 or L2). A line presence field (LPF) is included in each directory entry to aid in determining a hit in the L1 cache. The L1CA contains L1 cache information to supplement the corresponding common directory entry; the L1CA is used during an L1 LRU castout, but is not in the critical path of an L1 or L2 hit. A translation lookaside buffer (TLB) is not used to determine cache hits. The TLB output is used only during the infrequent times that a CPU request misses in the cache directory, and the translated address (i.e. absolute address) is then used to access the data in a synonym location in the same cache, or in main storage, or in the L1 or L2 cache in another CPU in a multiprocessor system using synonym/cross-interrogate directories.


Patent
18 Dec 1987
TL;DR: In this paper, an improved interface between a processor and an external cache system is disclosed, having particular application for use in high speed computer systems, where a cache memory for storing frequently accessed data is coupled to a cache address register (CAR).
Abstract: An improved interface between a processor and an external cache system is disclosed, having particular application for use in high-speed computer systems. A cache memory for storing frequently accessed data is coupled to a cache address register (CAR). A processor generates addresses which correspond to locations of desired data in the cache, and provides these addresses to the CAR. Upon the receipt of a clock signal, the CAR couples the address to the cache memory. The processor includes a data register for receiving accessed cache data over a data bus. Data is latched into the register upon the receipt of a clock signal. Due to inherent delays associated with the digital logic comprising the processor, clock signals provided by an external clock are received by the CAR prior to their receipt by the processor's data register. This delay (a fraction of a clock cycle) provides additional time to access the cache memory before the data is expected on the data bus. The CAR is fabricated out of a technology that allows it to drive the address to the large capacitive load of the cache memory in much less time than the processor itself could drive such a load. Thus, due to this buffering capability of the CAR, the cache can be much larger than what could be supported by the processor itself. The time expended sending the address from the processor to the CAR buffer, which would otherwise not be present if the processor addressed the cache directly from an internal register, does not subtract from the processor cycle time, since the processor can compute the cache address and send it to the CAR in less than the time required to access the cache.

01 Dec 1987
TL;DR: A technique called cache filtering with blocking is presented that compresses traces by exploiting both the temporal and spatial locality in the trace, and can reduce trace length by nearly two orders of magnitude.
Abstract: Trace-driven simulation is a popular method of estimating the performance of cache memories, translation lookaside buffers, and paging schemes. Because the cost of trace-driven simulation is directly proportional to trace length, reducing the number of references in the trace significantly impacts simulation time. This paper concentrates on trace-driven simulation for cache analysis. A technique called cache filtering with blocking is presented that compresses traces by exploiting both the temporal and spatial locality in the trace. Experimental results show that this scheme can reduce trace length by nearly two orders of magnitude while introducing less than 15% error in cache miss rate estimates.
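A minimal sketch of the cache-filter half of the technique under assumed filter parameters: references that hit in a small direct-mapped cache are dropped, and only the misses are kept in the compacted trace. The block filter and the error analysis from the paper are not modeled.

```python
def cache_filter(trace, filter_lines=64, line_bytes=16):
    """Keep only the references that miss in a small direct-mapped filter cache."""
    tags = [None] * filter_lines
    kept = []
    for addr in trace:
        block = addr // line_bytes
        idx = block % filter_lines
        if tags[idx] != block:            # miss in the filter cache
            tags[idx] = block
            kept.append(addr)             # keep this reference in the compacted trace
    return kept

trace = [0, 4, 8, 1024, 0, 4, 2048, 0]
print(cache_filter(trace, filter_lines=4, line_bytes=16))   # [0, 1024, 0, 2048, 0]
```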

Book ChapterDOI
08 Jun 1987
TL;DR: This paper discusses the performance evaluation of private cache memories for multiprocessors in general and presents the results of performance improvement for parallel numerical programs using the caches managed with compiler assistance.
Abstract: In recent years interest has grown in multiprocessor architectures that can have several hundred processors sharing memory and all working on solving a single problem. Such multiprocessors are characterized by a long memory access time, which makes the use of cache memories very important. However, the cache coherence problem makes the use of private caches difficult. The proposed solutions to the cache coherence problem are not suitable for a large-scale multiprocessor. We proposed a different solution that relies on a compiler to manage the caches during the execution of a parallel program. In this paper we discuss the performance evaluation of private cache memories for multiprocessors in general and present the results of performance improvement for parallel numerical programs using the caches managed with compiler assistance. The effect on system performance of cache organization and other system parameters, such as cache block size, cache size, and the number of processors, is shown.

Patent
04 Nov 1987
TL;DR: The danger of data corruption is eliminated and data safety is assured by having the two controllers communicate with each other about the state of the cache memory part and by using the cache memory part only when it is in the same state in both controllers.
Abstract: PURPOSE: To eliminate the danger of data corruption and to assure data safety by having the controllers communicate with each other about the state of a cache memory part and by using the cache memory part only when it is in the same state in both controllers. CONSTITUTION: The controllers 2a and 2b control a cache memory part 3 while communicating with each other, so no inconsistent state arises between the two controllers 2a and 2b with respect to the part 3, and the danger of data corruption is avoided. At the same time, both controllers 2a and 2b can change the state of the part 3, and even then no inconsistency of states is produced between them, so no operating mistake occurs at the part 3. In addition, if the controllers 2a and 2b become unable to recognize each other's state because a second signal line 11 or 12 is disconnected, the abnormality is detected; as long as the part 3 is isolated, no data corruption occurs even under such conditions.

Journal ArticleDOI
Reinder J. Bril1
TL;DR: An implementation-independent description of states of blocks in a tightly coupled multi-processor system with private caches is presented, which distinguishes between (abstract) states of blocks and (implementation-oriented) tags.
Abstract: This paper presents an implementation-independent approach to cache memories with states of blocks and kinds of blocks. An implementation-independent description of states of blocks in a tightly coupled multi-processor system with private caches is presented, which distinguishes between (abstract) states of blocks and (implementation-oriented) tags. Two approaches to cache consistency protocols are described using abstract states: the ownership approach and the responsibleness approach. Blocks are looked at as constituents of logical entities such as segments. Different kinds of blocks are distinguished based on different kinds of segments. Whenever caches are able to distinguish between different kinds of blocks, then cache schemes, block sizes, and other implementation-related aspects may be chosen independently, facilitating a separation of concerns.

01 Jul 1987
TL;DR: This work presents efficient algorithms for sharing variables on two very different types of distributed systems, a synchronous, bounded degree network of processors each with an associated local memory and an asynchronous broadcast model.
Abstract: Information exchange between processors is essential for any efficient parallel computation. One very convenient mechanism for exchanging information is for processors to utilize shared variables. In this work, we present efficient algorithms for sharing variables on two very different types of distributed systems. The first is a synchronous, bounded degree network of processors each with an associated local memory. The problem we consider in this case is how to distribute and route the shared variables among the processor's local memories. The solution combines the use of universal hash functions for distributing the variables and probabilistic two-phase routing. The second model we consider is an asynchronous broadcast model. Here each processor has a cache and the caches are connected to each other and to a main memory over a bus. Only one variable can be transmitted over the bus per bus cycle, and shared variables may reside in any number of caches. A block retention algorithm associated with each cache monitors bus and processor activity in order to determine which blocks of data to keep and which to discard. There is a tradeoff: If a certain variable is dropped from a cache, then a subsequent read for that variable requires bus communication. On the other hand, if a variable is not dropped from a certain cache, then a write by some other processor to that variable requires a bus cycle so that the first cache will obtain the updated value. For several variants of this model we present on-line block retention algorithms which incur a communication cost that is within a constant factor of the cost to the optimal off-line algorithm.