
Showing papers on "Cache pollution published in 1989"


Journal ArticleDOI
TL;DR: An analytical cache model is developed that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval.
Abstract: Trace-driven simulation and hardware measurement are the techniques most often used to obtain accurate performance figures for caches. The former requires a large amount of simulation time to evaluate each cache configuration while the latter is restricted to measurements of existing caches. An analytical cache model that uses parameters extracted from address traces of programs can efficiently provide estimates of cache performance and show the effects of varying cache parameters. By representing the factors that affect cache performance, we develop an analytical model that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval. The predicted values closely approximate the results of trace-driven simulations, while requiring only a small fraction of the computation cost.
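As a rough, purely illustrative companion to the abstract (not the paper's actual model), a trace-parameter miss-rate estimate might be sketched as follows; the function name, parameters, and the uniform-mapping conflict term are all assumptions:

```python
# Toy sketch only: estimate a miss rate from trace-derived parameters in the
# spirit of an analytical cache model. The uniform-mapping conflict term and all
# parameter names are illustrative assumptions, not the paper's equations.
def toy_miss_rate(total_refs, unique_blocks, cache_blocks, assoc=1):
    sets = cache_blocks // assoc
    cold = unique_blocks / total_refs            # compulsory (first-touch) misses
    live_per_set = unique_blocks / sets          # blocks competing for each set
    # Fraction of competing blocks a set cannot hold once it is over-subscribed.
    conflict = 1.0 - assoc / live_per_set if live_per_set > assoc else 0.0
    return min(1.0, cold + conflict * (1.0 - cold))

# Example: 1M references touching 4096 unique blocks, 2048-block direct-mapped cache.
print(toy_miss_rate(1_000_000, 4096, 2048))
```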

345 citations


Patent
06 Jun 1989
TL;DR: In this article, a superscalar processor with branch-prediction information in the instruction cache is described: each instruction cache block includes, in addition to instruction fields, branch-prediction fields that indicate the address of the instruction block's successor and the location of a branch instruction within the block.
Abstract: A superscalar processor is disclosed wherein branch-prediction information is provided within an instruction cache memory. Each instruction cache block stored in the instruction cache memory includes branch-prediction information fields in addition to instruction fields; these fields indicate the address of the instruction block's successor and the location of a branch instruction within the instruction block. Thus, the next cache block can be easily fetched without waiting on a decoder or execution unit to indicate the proper fetch action to be taken for correctly predicted branches.
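A minimal data-structure sketch of the idea described above; the field names and the fetch loop are invented here for illustration and are not taken from the patent:

```python
# Illustrative sketch (names are mine): each I-cache block carries its predicted
# successor address and the position of the branch within the block, so the
# fetcher can follow predictions without waiting for decode or execute.
from dataclasses import dataclass

@dataclass
class ICacheBlock:
    instructions: list          # instruction fields
    successor_addr: int         # predicted address of the next block to fetch
    branch_offset: int          # location of the branch instruction in this block

def fetch_stream(icache: dict, start_addr: int, n_blocks: int):
    """Follow the in-cache prediction chain for up to n_blocks fetches."""
    addr, stream = start_addr, []
    for _ in range(n_blocks):
        block = icache.get(addr)
        if block is None:       # I-cache miss: real hardware would stall and fill here
            break
        stream.append((addr, block.branch_offset))
        addr = block.successor_addr   # next fetch chosen purely from prediction info
    return stream
```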

254 citations


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding density, and this approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead.
Abstract: Increasing execution power requires a high instruction-issue bandwidth, while decreasing instruction-encoding density and applying code-improving techniques cause code expansion. Therefore, the performance of the instruction memory hierarchy has become an important factor in system performance. An instruction placement algorithm has been implemented in the IMPACT-I (Illinois Microarchitecture Project using Advanced Compiler Technology - Stage I) C compiler to maximize sequential and spatial locality and to minimize mapping conflicts. This approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead. For ten realistic UNIX programs, we report low miss ratios (average 0.5%) and low memory traffic ratios (average 8%) for a 2048-byte, direct-mapped instruction cache using 64-byte blocks. This result compares favorably with the fully associative cache results reported by other researchers. We also present the effect of cache size, block size, block sectoring, and partial loading on cache performance. Code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding densities.
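A simplified, hypothetical sketch of profile-guided placement in this spirit (the greedy weighting and line-alignment rule are my assumptions, not the IMPACT-I algorithm):

```python
# Simplified illustration, not the IMPACT-I algorithm: lay out basic blocks in
# descending profile weight so hot blocks pack into consecutive cache lines and
# rarely conflict with each other.
def place_blocks(blocks, line_size=64):
    """blocks: list of (name, size_bytes, exec_count). Returns name -> byte address."""
    layout, addr = {}, 0
    for name, size, _count in sorted(blocks, key=lambda b: -b[2]):
        # Avoid straddling a cache line when the block fits entirely in the next one.
        if addr // line_size != (addr + size - 1) // line_size and size <= line_size:
            addr = (addr // line_size + 1) * line_size
        layout[name] = addr
        addr += size
    return layout

print(place_blocks([("loop_body", 40, 100000), ("error_path", 120, 3), ("init", 60, 1)]))
```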

227 citations


Journal ArticleDOI
S. McFarling
01 Apr 1989
TL;DR: This paper presents an optimization algorithm for reducing instruction cache misses that uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future.
Abstract: This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of 10 programs for various cache sizes. The improvement depends on cache size. For a 512 word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as increasing the cache size by a factor of three. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.

217 citations


Dissertation
01 Jan 1989
TL;DR: Measurements of actual supercomputer cache performance have not previously been undertaken; PFC-Sim, a program-driven event tracing facility that can simulate the data cache performance of very long programs, is used to measure the performance of various cache structures.
Abstract: Measurements of actual supercomputer cache performance have not been previously undertaken. PFC-Sim is a program-driven event tracing facility that can simulate data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, allowing very long traces to be used. Programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures. PFC-Sim was used to measure the cache performance of array references in a benchmark set of supercomputer applications, RiCEPS. Data cache hit ratios varied on average between 70% for a 16K cache and 91% for a 256K cache. Programs with very large working sets generate poor cache performance even with large caches. The hit ratios of individual references are measured to be either 0% or 100%. By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the number of loop iterations which can execute without filling the cache, the overflow iteration. The overflow iteration combined with the dependence graph can be used to determine at each reference whether execution will result in hits or misses. Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often do this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked when the components of dependence vectors are bounded. When the cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction to preload values into the cache. A cache load resembles a register load in structure, but does not block computation and only moves the data at the given address into the cache, where a later register load will find it. The compiler can inform the cache (on average) over 100 cycles before a load is required. Cache misses can be serviced in parallel with computation.
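As a small worked example of the overflow-iteration idea (the uniform bytes-per-iteration formula below is a simplification I am assuming, not the dissertation's exact analysis):

```python
# Rough illustration of the "overflow iteration": how many loop iterations can
# run before the data touched by the loop exceeds cache capacity. The uniform
# bytes-per-iteration assumption is mine, not the dissertation's exact analysis.
def overflow_iteration(cache_bytes, bytes_per_iteration):
    return cache_bytes // bytes_per_iteration

# e.g. a 16 KB cache and a loop touching two 8-byte array elements per iteration:
print(overflow_iteration(16 * 1024, 2 * 8))   # -> 1024 iterations before overflow
```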

210 citations


Proceedings ArticleDOI
05 Dec 1989
TL;DR: The results of this research provide a scheme not only for utilizing the performance enhancement provided by hierarchical memory designs, but also for fine tuning these enhancements to provide increased benefit to the desired scheduling goal.
Abstract: A discussion is presented as to why the present approach to cache architecture design results in unpredictable performance improvements in real-time systems with priority-based preemptive scheduling algorithms. The SMART cache design is shown to be compatible with the goals of scheduling in a real-time system. The results of this research provide a scheme not only for utilizing the performance enhancement provided by hierarchical memory designs, but also for fine tuning these enhancements to provide increased benefit to the desired scheduling goal.

192 citations


Patent
Gregor Stephen Lee
17 Jan 1989
TL;DR: In this paper, a hierarchical first-level and second-level memory system is described that includes a first-level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), a second-level store queue (26A2), and write buffers ahead of a shared second level of cache.
Abstract: A multiprocessor system includes a system of store queues and write buffers in a hierarchical first-level and second-level memory system, including a first-level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), a second-level store queue (26A2) for storing the instructions and/or data from the first-level store queue (18B1), and a plurality of write buffers (26A2(A); 26A2(B)) for storing the instructions and/or data from the second-level store queue prior to storage in a second level of cache. The multiprocessor system includes hierarchical levels of caches and write buffers. When the data and/or instructions are stored in the second-level write buffers, access to the shared second-level cache is requested; and, when access is granted, they are moved from the second-level write buffers to the shared second-level cache. When stored in the shared second-level cache, corresponding obsolete entries in the first level of cache are invalidated before any other processor "sees" the obsolete data, and the new data and/or instructions are over-written in the first level of cache.

168 citations


Patent
15 May 1989
TL;DR: In this paper, a bus snoop control method for maintaining coherency between a write-back cache and main memory during memory accesses by an alternate bus master is proposed.
Abstract: A bus snoop control method for maintaining coherency between a write-back cache and main memory during memory accesses by an alternate bus master. The method and apparatus incorporate an option to source `dirty` or altered data from the write-back cache to the alternate bus master during a memory read operation, and simultaneously invalidate the `dirty` or altered data in the write-back cache. The method minimizes the number of cache accesses required to maintain coherency between the cache and main memory during page-out/page-in sequences initiated by the alternate bus master, thereby improving system performance.
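A behavioural sketch of the snoop response described above; the line-state structure and the decision to invalidate on every snooped read are assumptions for illustration:

```python
# Illustrative snoop handler (not the patent's implementation): on an alternate
# bus master's read, source dirty data from the write-back cache and invalidate
# the line in the same snoop access, so a later page-out needs no second lookup.
def snoop_read(cache, line_addr):
    """cache: dict addr -> {'data': ..., 'dirty': bool, 'valid': bool}."""
    line = cache.get(line_addr)
    if line is None or not line['valid']:
        return None                      # miss: the alternate master reads main memory
    data = line['data'] if line['dirty'] else None   # source data only if modified
    line['valid'] = False                # invalidate simultaneously (single access)
    return data
```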

159 citations


Patent
18 Jan 1989
TL;DR: In this article, the authors present a user-oriented approach to flexible cache system design by specifying desired cache features through the setting of appropriate cache option bits, which allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.
Abstract: Methods and apparatus are disclosed for realizing an integrated cache unit which may be flexibly used for cache system design. The preferred embodiment of the invention comprises both a cache memory and a cache controller on a single chip. In accordance with an alternative embodiment of the invention, the cache memory may be externally located. Flexible cache system design is achieved by the specification of desired cache features through the setting of appropriate cache option bits. The disclosed methods and apparatus support this user oriented approach to flexible system design. The actual setting of option bits may be peformed under software control and allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.

148 citations


Journal ArticleDOI
01 Apr 1989
TL;DR: Traces of parallel programs are used to evaluate the cache and bus performance of shared memory multiprocessors in which coherency is maintained by a write-invalidate protocol; the results show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs.
Abstract: Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.

130 citations


Patent
22 Jun 1989
TL;DR: In this paper, a system is described wherein a CPU, a main memory means and a bus means are provided; a cache memory means couples the CPU to the bus means and includes means to indicate the status of a data unit stored within the cache memory means.
Abstract: A system is described wherein a CPU, a main memory means and a bus means are provided. Cache memory means is employed to couple the CPU to the bus means and is further provided with means to indicate the status of a data unit stored within the cache memory means. One status indication tells whether the contents of a storage position have been modified since those contents were received from main memory, and another indicates whether the contents of the storage position may be present elsewhere in the system. Control means are provided to assure that when a data unit from a CPU is received and stored in the CPU's associated cache memory means, and that data unit is indicated as being also stored in a cache memory means associated with another CPU, such CPU data unit is also written into main memory means. During that process, other cache memory means monitor the bus means and update their corresponding data units. Bus monitor means are provided and monitor all writes to main memory and reads from main memory to aid in the assurance of system-wide data integrity.

Patent
18 Jan 1989
TL;DR: In this paper, a cache block status field is provided for each cache block to indicate the cache block's state, such as shared or exclusive, and to control the write mode used when a write hit access to the block occurs; the field can be updated either by a TLB write policy field contained within a translation look-aside buffer entry or by a second input, independent of the TLB entry, which may be provided from the system on a line basis.
Abstract: A computer system having a cache memory subsystem which allows flexible setting of caching policies on a page basis and a line basis. A cache block status field is provided for each cache block to indicate the cache block's state, such as shared or exclusive. The cache block status field controls whether the cache control unit operates in a write-through write mode or in a copy-back write mode when a write hit access to the block occurs. The cache block status field may be updated by either a TLB write policy field contained within a translation look-aside buffer entry which corresponds to the page of the access, or by a second input independent of the TLB entry which may be provided from the system on a line basis.
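A minimal sketch of the write-hit decision implied by the abstract; the field names, status values, and update precedence shown are assumptions rather than the patent's encoding:

```python
# Illustrative only: the block's status decides write-through vs copy-back on a
# write hit; the status itself can be set from a TLB page-level policy field or
# overridden per line by a system-supplied input. Names and structure are assumed.
SHARED, EXCLUSIVE = "shared", "exclusive"

def on_write_hit(block, data, write_to_memory):
    block["data"] = data
    if block["status"] == SHARED:
        write_to_memory(block["addr"], data)     # write-through keeps sharers coherent
    else:
        block["dirty"] = True                    # copy-back: defer the memory write

def update_block_status(block, tlb_policy_shared=None, line_override=None):
    # A line-basis override wins over the page-level TLB policy when supplied.
    if line_override is not None:
        block["status"] = SHARED if line_override else EXCLUSIVE
    elif tlb_policy_shared is not None:
        block["status"] = SHARED if tlb_policy_shared else EXCLUSIVE
```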

Proceedings ArticleDOI
01 Apr 1989
TL;DR: It is shown how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level and how this organization has a performance advantage over a hierarchy of physically-addressed caches in a multiprocessor environment.
Abstract: We propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to match the processor speed. The virtually-addressed cache is backed up by a large physically-addressed cache; this second-level cache provides a high hit ratio and greatly reduces memory traffic. We show how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level. Moreover, the second-level cache can be used to shield the virtually-addressed first-level cache from irrelevant cache coherence interference. Finally, simulation results show that this organization has a performance advantage over a hierarchy of physically-addressed caches in a multiprocessor environment.

Patent
15 May 1989
TL;DR: In this article, the authors propose a data cache controller that uses the multiple dirty bits to determine the quantity and type of accesses required to write the dirty data to memory, so as to minimize the number of memory accesses used to unload a dirty entry.
Abstract: A data cache capable of operation in a write-back (copyback) mode. The data cache design provides a mechanism for making the data cache coherent with memory, without writing the entire cache entry to memory, thereby reducing bus utilization. Each data cache entry is comprised of three items: data, a tag address, and a mixed size status field. The mixed size status fields provide one bit to indicate the validity of the data cache entry and multiple bits to indicate if the entry contains data that has not been written to memory (dirtiness). Multiple dirty bits provide a data cache controller with sufficient information to minimize the number of memory accesses used to unload a dirty entry. The data cache controller uses the multiple dirty bits to determine the quantity and type of accesses required to write the dirty data to memory. The portions of the entry being replaced that are clean (unmodified) are not written to memory.
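A small sketch of how per-sub-block dirty bits can cut write-back traffic; the sub-block size and burst-merging rule are assumptions, not the patent's controller logic:

```python
# Illustration (not the patent's controller): with one dirty bit per sub-block,
# only modified sub-blocks are written back, and adjacent dirty sub-blocks are
# merged into a single burst write.
def writeback_bursts(base_addr, dirty_bits, subblock_bytes=4):
    """Return (address, length_bytes) bursts covering only the dirty sub-blocks."""
    bursts, i = [], 0
    while i < len(dirty_bits):
        if dirty_bits[i]:
            start = i
            while i < len(dirty_bits) and dirty_bits[i]:
                i += 1
            bursts.append((base_addr + start * subblock_bytes, (i - start) * subblock_bytes))
        else:
            i += 1
    return bursts

print(writeback_bursts(0x1000, [1, 1, 0, 1]))   # -> [(4096, 8), (4108, 4)]
```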

Patent
26 May 1989
TL;DR: In this paper, an instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested, and one of a plurality of replacement schemes is selected for swapping a data block out of the cache.
Abstract: An instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested. Based on the cache control specifier, one of a plurality of replacement schemes is selected for swapping a data block out of the cache.
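A sketch of the dispatch implied by the claim; the specifier values and the two replacement policies are invented for illustration:

```python
# Illustrative only: the instruction's cache control specifier selects which
# replacement scheme evicts a block. The specifier names and the two policies
# shown are assumptions for the sketch, not the patent's actual encodings.
def choose_victim(set_blocks, specifier):
    """set_blocks: list of dicts with a 'last_used' timestamp."""
    if specifier == "sequential_data":
        # Streaming data gains little from retention: evict the most recent block.
        return max(set_blocks, key=lambda b: b["last_used"])
    return min(set_blocks, key=lambda b: b["last_used"])   # default: LRU
```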

Patent
05 Jun 1989
TL;DR: In this article, a block descriptor table (40 ) is divided into a plurality of sets (42), depending upon the size of the memory cache, and each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information.
Abstract: A controller (10) for use with a hard disk (38) or other mass storage medium provides a memory cache (36). A block descriptor table (40 ) is divided into a plurality of sets (42), depending upon the size of the memory cache (36). Each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information. Upon a read command, an index is generated corresponding to the address requested by the host computer, and the tag information is matched with a generated tag from the address. Each set is checked until a hit occurs or a miss occurs in every set. After each miss, the usage information (50) corresponding to the memory group (44) is decremented. When reading information from the storage device (32) to the memory cache (36), the controller (10) may selectively read additional sectors. The number of sectors read from the storage device may be selectively controlled by the user or the host processor. Further, a cap may be provided to provide a maximum number of sectors to be read.
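A condensed sketch of the lookup-and-decrement behaviour described above; the table layout and indexing are assumptions, not the controller's firmware:

```python
# Illustrative sketch of the block-descriptor-table lookup, not the controller's
# firmware: each set is indexed the same way; a hit matches the stored tag, and
# each set that misses has its indexed group's usage counter decremented.
def lookup(bdt_sets, disk_addr, groups_per_set):
    index = disk_addr % groups_per_set
    tag = disk_addr // groups_per_set
    for s in bdt_sets:
        group = s[index]                     # the same index is checked in every set
        if group["tag"] == tag:
            return group["cache_addr"]       # hit: data is already in the memory cache
        group["usage"] = max(0, group["usage"] - 1)   # miss in this set: age the group
    return None                              # miss in every set: read from the disk
```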

Journal ArticleDOI
TL;DR: By tolerating such defects without a noticeable performance degradation, the yield of VLSI processors can be enhanced considerably, and a scheme that allows a cache to continue operation in the presence of defective/faulty blocks is suggested.
Abstract: The authors study the tolerance of defects/faults in cache memories. They argue that, even though the major components of a cache are linear RAMs (random-access memories), traditional techniques used for fault/defect tolerance in RAMs may be neither appropriate nor necessary for cache memories. They suggest a scheme that allows a cache to continue operation in the presence of defective/faulty blocks. Results are presented of an extensive trace-driven simulation analysis that evaluates the performance degradation of a cache due to defective blocks. From the results it is seen that the on-chip caches of VLSI processors can be organized so that the performance degradation due to a few defective blocks is negligible. The authors conclude that by tolerating such defects without a noticeable performance degradation, the yield of VLSI processors can be enhanced considerably.

Patent
08 Aug 1989
TL;DR: In this article, a method for selecting how data stored in an external storage device is loaded into the cache memory, in accordance with an access pattern to the data, and an apparatus therefor are disclosed.
Abstract: In a control unit having an external storage device, a method for selecting a loading method for data stored in the external storage device into the cache memory, in accordance with an access pattern to the data, and an apparatus therefor are disclosed. The selection of the loading method is a selection of a control mode or procedure corresponding to the loading method, and it is attained by a learning function.

Proceedings ArticleDOI
Chi-Hung Chi, Henry G. Dietz
03 Jan 1989
TL;DR: A technique is proposed to prevent the return of infrequently used items to the cache after they are bumped from it; it relies on hardware called a bypass-cache which, under program control, determines whether each reference should go through the cache or should bypass the cache and reference main memory directly.
Abstract: A technique is proposed to prevent the return of infrequently used items to cache after they are bumped from it. Simulations have shown that the return of these items, called cache pollution, typically degrades cache-based system performance (average reference time) by 10% to 30%. The technique proposed involves the use of hardware called a bypass-cache, which, under program control, will determine whether each reference should be through the cache or should bypass the cache and reference main memory directly. Several inexpensive heuristics for the compiler to determine how to make each reference are given. It is shown that much of the performance loss can be regained.
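A sketch of the bypass decision; the compiler-supplied predicate and the dictionary-based cache model are illustrative assumptions:

```python
# Illustrative sketch of a bypass-cache: references the compiler marks as
# low-reuse go straight to main memory so they cannot pollute the cache.
# The marking predicate and data structures here are assumptions for clarity.
def load(addr, cache, memory, compiler_says_bypass):
    if compiler_says_bypass(addr):
        return memory[addr]                  # bypass: no cache fill, no pollution
    if addr in cache:
        return cache[addr]                   # normal cached hit
    cache[addr] = memory[addr]               # normal miss: fill the cache
    return cache[addr]
```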

Journal Article
TL;DR: In this paper, a random tester was developed to generate memory references by randomly selecting from a script of actions and checks that verify correct completion of their corresponding actions; the tester detected over half of the functional bugs uncovered during simulation.
Abstract: The newest generation of cache controller chips provide coherency to support multiprocessor systems, i.e., the controllers coordinate access to the cache memories to guarantee a single global view of memory. The cache coherency protocols they implement complicate the controller design, making design verification difficult. In the design of the cache controller for SPUR, a shared memory multiprocessor designed and built at U.C. Berkeley, the authors developed a random tester to generate and verify the complex interactions between multiple processors in the functional simulation. Replacing the CPU model, the tester generates memory references by randomly selecting from a script of actions and checks. The checks verify correct completion of their corresponding actions. The tester was easy to develop, and detected over half of the functional bugs uncovered during simulation. They used an assembly language version of the random tester to verify the prototype hardware. A multiprocessor system is operational; it runs the Sprite operating system and is being used for experiments in parallel programming.
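A miniature sketch of script-driven random testing in this style; the action/check mix and the memory_model interface are invented here, not taken from the SPUR tester:

```python
import random

# Toy version of a script-driven random tester (not the SPUR tester itself):
# writes go through a simulated coherent memory interface, and checks verify
# that any processor later observes the most recent value written to an address.
# `memory_model` is an assumed object exposing read(cpu, addr) and write(cpu, addr, value).
def random_test(memory_model, n_processors, n_ops, seed=0):
    rng, last_written = random.Random(seed), {}
    for _ in range(n_ops):
        cpu, addr = rng.randrange(n_processors), rng.randrange(16)
        if last_written and rng.random() < 0.5:
            check_addr = rng.choice(sorted(last_written))       # check action
            got = memory_model.read(cpu, check_addr)
            assert got == last_written[check_addr], f"coherence bug at {check_addr:#x}"
        else:
            value = rng.randrange(256)                          # write action
            memory_model.write(cpu, addr, value)
            last_written[addr] = value
```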

Patent
19 Jul 1989
TL;DR: In this paper, the stack cache is implemented as a set of contiguously addressable registers and two stack pointers are used to implement allocation space in the stack as a circulating buffer.
Abstract: A computer system arranged for faster processing operations by providing a stack cache in internal register memory. A full stack is provided in main memory. The stack cache provides a cache representation of part of the main memory stack. Stack relative addresses contained in procedure instructions are converted to absolute main memory stack addresses. A subset of the absolute main memory stack address is used to directly address the stack cache when a "hit" is detected. Otherwise, the main memory stack is addressed. The stack cache is implemented as a set of contiguously addressable registers. Two stack pointers are used to implement allocation space in the stack as a circulating buffer. Cache hits are detected by comparing the absolute stack address to the contents of the two circular buffer pointers. Space for a procedure is allocated upon entering a procedure. The amount of space to allocate is stored in the first instruction. Space is deallocated when a procedure is terminated. The deallocation space is stored in the first instruction executed after procedure termination.
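A sketch of the hit test against the two circular-buffer pointers; the modular indexing below is my simplification of the patent's comparator logic:

```python
# Illustrative sketch (not the patent's comparator logic): the two pointers
# bound the slice of the main-memory stack currently held in the register-file
# stack cache; a subset of the absolute address then indexes the circular buffer.
def stack_cache_access(abs_addr, low_ptr, high_ptr, n_registers):
    if low_ptr <= abs_addr < high_ptr:            # hit: address lies inside the cached window
        return ("cache", abs_addr % n_registers)  # low-order bits select the register directly
    return ("memory", abs_addr)                   # miss: access the main-memory stack

print(stack_cache_access(0x1F08, low_ptr=0x1F00, high_ptr=0x1F40, n_registers=64))
```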

Proceedings ArticleDOI
01 Apr 1989
TL;DR: It is shown that a first-level cache dramatically reduces the number of references seen by a second-level cache, without having a large effect on the number of second-level cache misses, which makes associativity more attractive and increases the optimal cache size for second-level caches over what they would be for an equivalent single-level cache system.
Abstract: The increasing speed of new generation processors will exacerbate the already large difference between CPU cycle times and main memory access times. As this difference grows, it will be increasingly difficult to build single-level caches that are both fast enough to match these fast cycle times and large enough to effectively hide the slow main memory access times. One solution to this problem is to use a multi-level cache hierarchy. This paper examines the relationship between cache organization and program execution time for multi-level caches. We show that a first-level cache dramatically reduces the number of references seen by a second-level cache, without having a large effect on the number of second-level cache misses. This reduction in the number of second-level cache hits changes the optimal design point by decreasing the importance of the cycle-time of the second-level cache relative to its size. The lower the first-level cache miss rate, the less important the second-level cycle time becomes. This change in relative importance of cycle time and miss rate makes associativity more attractive and increases the optimal cache size for second-level caches over what they would be for an equivalent single-level cache system.
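The filtering effect can be stated compactly; the relations below are the standard two-level formulation (my notation, consistent with the abstract's argument):

```latex
% Standard two-level cache relations (my formulation of the abstract's argument):
% L1 filters references, so L2 sees only L1 misses, and the L2 access time is
% weighted by the L1 miss rate in the average memory access time.
\[
  \text{refs}_{L2} = \text{refs}_{L1}\, m_{L1}, \qquad
  m_{\text{global}} = m_{L1}\, m_{L2}^{\text{local}}, \qquad
  \text{AMAT} = t_{L1} + m_{L1}\left(t_{L2} + m_{L2}^{\text{local}}\, t_{\text{mem}}\right).
\]
```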

Patent
12 Sep 1989
TL;DR: In this paper, the cache data is transferred to a write back buffer (WBB) during the request for data from the main memory and actual delivery of the requested data, and the ECC hardware also operates on cache data being written to the WBB.
Abstract: A digital computer having a high speed cache (18) and a main memory (10) uses error correction code (ECC) hardware to ensure the integrity of the data delivered between the cache and main memory. To prevent the ECC hardware from slowing the overall operation of the CPU, the error correction is performed underneath a write back operation. Data contained in the cache (18), which will be displaced by data received from main memory (10), is transferred to a write back buffer (WBB) (22) during that period of time between the request for data from the main memory and actual delivery of the requested data. Further, the ECC hardware also operates on the cache data being written to the WBB. Accordingly, a performance penalty is avoided by performing error correction and pre-removing the cache data during that idle period of time.

Proceedings ArticleDOI
01 Nov 1989
TL;DR: By modifying NFS to use the Sprite cache consistency protocols, this work finds dramatic improvements on some, although not all, benchmarks, suggesting that an explicit cache consistency protocol is necessary for both correctness and good performance.
Abstract: File caching is essential to good performance in a distributed system, especially as processor speeds and memory sizes continue to improve rapidly while disk latencies do not. Stateless-server systems, such as NFS, cannot properly manage client file caches. Stateful systems, such as Sprite, can use explicit cache consistency protocols to improve both cache consistency and overall performance.By modifying NFS to use the Sprite cache consistency protocols, we isolate the effects of the consistency mechanism from the other features of Sprite. We find dramatic improvements on some, although not all, benchmarks, suggesting that an explicit cache consistency protocol is necessary for both correctness and good performance.

Journal ArticleDOI
TL;DR: This work defines multilevel inclusion properties, gives some necessary and sufficient conditions for these properties to hold in multiprocessor environments, and shows their importance in reducing the complexities of cache coherence protocols.
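The core property can be written informally as follows (my phrasing of the standard definition, not quoted from the paper):

```latex
% Multilevel inclusion, informally (standard statement, not quoted from the paper):
% every block resident at level i is also resident at level i+1, so coherence
% traffic that misses in the larger cache need not probe the smaller one.
\[
  \forall i:\quad \text{contents}(L_i) \subseteq \text{contents}(L_{i+1}).
\]
```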

Patent
17 Apr 1989
TL;DR: In this article, a first processing system is coupled to a plurality of integrated circuits along a P bus and external TAGs coupled between the M bus and the secondary bus are used to maintain coherency between the first and second processing systems.
Abstract: A first processing system is coupled to a plurality of integrated circuits along a P bus. Each of these integrated circuits has a combination cache and memory management unit (MMU). The cache/MMU integrated circuits are also connected to a main memory via an M bus. A second processing system is also coupled to the main memory primarily via a secondary bus but also via the M bus. External TAGs coupled between the M bus and the secondary bus are used to maintain coherency between the first and second processing systems. Each external TAG corresponds to a particular cache/MMU integrated circuit and maintains information as to the status of its corresponding cache/MMU integrated circuit. The cache/MMU integrated circuit provides the necessary status information to its corresponding external TAG in a very efficient manner. Each cache/MMU integrated circuit can also be converted to an SRAM mode in which the cache performs like a conventional high-speed static random-access memory (SRAM). This ability to convert to an SRAM provides the first processing system with a very efficient scratch pad capability. Each cache/MMU integrated circuit also provides hit information external to the cache/MMU integrated circuit with respect to transactions on the P bus. This hit information is useful in determining system performance.

01 Jan 1989
TL;DR: Examination of shared memory reference patterns in parallel programs that run on bus-based, shared memory multiprocessors reveals two distinct modes of sharing behavior: sequential sharing and fine-grain sharing.
Abstract: This dissertation examines shared memory reference patterns in parallel programs that run on bus-based, shared memory multiprocessors. The study reveals two distinct modes of sharing behavior. In sequential sharing, a processor makes multiple, sequential writes to the words within a block, uninterrupted by accesses from other processors. Under fine-grain sharing, processors contend for these words, and the number of per-processor sequential writes is low. Whether a program exhibits sequential or fine-grain sharing affects several factors relating to multiprocessor performance: the accuracy of sharing models that predict cache coherency overhead, the cache miss ratio and bus utilization of parallel programs, and the choice of coherency protocol. An architecture-independent model of write sharing was developed, based on the inter-processor activity to write-shared data. The model was used to predict the relative coherency overhead of write activity under write-invalidate and write-broadcast protocols. Architecturally detailed simulations validated the model for write-broadcast. Successive refinements incorporating architecture-dependent parameters, most importantly cache block size, produced acceptable predictions for write-invalidate. Block size was crucial for modeling write-invalidate, because the pattern of memory references within a block determines protocol performance. The cache and bus behavior of parallel programs running under write-invalidate protocols was evaluated over various block and cache sizes. The analysis determined the effect of shared memory accesses on cache miss ratio and bus utilization by focusing on the sharing component of these metrics. The studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of the metrics proportionally increases with cache and block size, and for some cache configurations determines both their magnitude and trend. Again, the amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit sequential sharing perform better than those whose sharing is fine-grain. A cross-protocol comparison provided empirical evidence of the performance loss caused by increasing block size in write-invalidate protocols and cache size in write-broadcast. It then measured the extent to which read-broadcast improved write-invalidate performance and competitive snooping helped write-broadcast. The results indicated that read-broadcast reduced the number of invalidation misses, but at a high cost in processor lockout from the cache. The surprising net effect was an increase in total execution cycles. Competitive snooping benefited only those programs that exhibited sequential sharing; both bus utilization and total execution time dropped moderately. For programs characterized by fine-grain sharing, competitive snooping degraded performance by causing a slight increase in these metrics.
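A small sketch of how the distinction can be measured from a trace; the write-run metric below is my simplification of the dissertation's analysis:

```python
# Illustrative metric (my simplification of the dissertation's analysis): the
# average write run is the number of consecutive writes a processor makes to a
# shared block before another processor touches it. Long runs indicate
# sequential sharing; runs near one indicate fine-grain sharing.
def average_write_run(trace):
    """trace: list of (processor_id, block_addr, is_write) in global order."""
    runs, current = [], {}                    # block -> (last_owner, run_length)
    for cpu, block, is_write in trace:
        owner, length = current.get(block, (None, 0))
        if cpu != owner:                      # access by another processor ends the run
            if length:
                runs.append(length)
            current[block] = (cpu, 1 if is_write else 0)
        elif is_write:                        # same owner keeps writing: extend the run
            current[block] = (cpu, length + 1)
    runs.extend(length for _, length in current.values() if length)
    return sum(runs) / len(runs) if runs else 0.0
```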

Patent
21 Aug 1989
TL;DR: In this article, the replacement selection logic examines the bit pattern in all the cache cells in a set and selects a cache cell to be replaced using a first-in first-out algorithm.
Abstract: A high-availability set-associative cache memory for use as a buffer between a main memory and a central processing unit includes multiple sets of cache cells contained in two or more cache memory elements. Each of the cache cells includes a data field, a tag field and a status field. The status field includes a force bit which indicates a defective cache cell when it is set. Output from a cache cell is suppressed when its force bit is set. The defective cache cell is effectively mapped out so that data is not stored in it. As long as one cell in a set remains operational, the system can continue operation. The status field also includes an update bit which indicates the update status of the respective cache cell. Replacement selection logic examines the bit pattern in all the cache cells in a set and selects a cache cell to be replaced using a first-in first-out algorithm. The state of the update bit is changed each time the data in the respective cache cell is replaced unless the cache cell was modified on a previous store cycle.
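A sketch of the replacement selection described above; 'fill_order' is my stand-in for the FIFO order that the patent encodes in the update bits:

```python
# Illustrative selection logic (the patent encodes FIFO order in the update
# bits; 'fill_order' below is my stand-in for that encoding): cells whose force
# bit marks them defective are mapped out, and the oldest remaining cell is replaced.
def select_victim(set_cells):
    """set_cells: list of dicts with 'force' (defective) and 'fill_order' (lower = older)."""
    usable = [c for c in set_cells if not c["force"]]
    if not usable:
        raise RuntimeError("no operational cell left in this set")
    return min(usable, key=lambda c: c["fill_order"])     # first-in, first-out
```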

Patent
23 Jan 1989
TL;DR: In this article, segments of memory are allocated to a stack cache as needed to accommodate additional continuation frames during execution of a program, and when a continuation is captured, flags in all segments of the stack cache are set to indicate the segments are shared by a captured continuation.
Abstract: Segments of memory are allocated to a stack cache as needed to accommodate additional continuation frames during execution of a program. When a continuation is captured, flags in all segments of the stack cache are set to indicate the segments are shared by a captured continuation, the top segment of the stack cache is copied, and the copy is made the top continuation frame of the stack cache. To invoke a continuation, the top segment of the invoked continuation is copied into the current stack cache segment. When the stack cache is ready to underflow into a segment shared by a captured continuation, the shared segment is copied and the stack cache underflows into the copy.

Proceedings ArticleDOI
01 Mar 1989
TL;DR: An algorithm is presented that exploits a weaker condition than is normally required to achieve greater concurrency and a proof that the algorithm satisfies the safety condition is concluded.
Abstract: This paper examines cache consistency conditions (safety conditions) for multiprocessor shared memory systems. It states and motivates a weaker condition than is normally required. An algorithm is presented that exploits the weaker condition to achieve greater concurrency. The paper concludes with a proof that the algorithm satisfies the safety condition.