
Showing papers on "Cache invalidation published in 1989"


Journal ArticleDOI
TL;DR: An analytical cache model is developed that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval.
Abstract: Trace-driven simulation and hardware measurement are the techniques most often used to obtain accurate performance figures for caches. The former requires a large amount of simulation time to evaluate each cache configuration while the latter is restricted to measurements of existing caches. An analytical cache model that uses parameters extracted from address traces of programs can efficiently provide estimates of cache performance and show the effects of varying cache parameters. By representing the factors that affect cache performance, we develop an analytical model that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval. The predicted values closely approximate the results of trace-driven simulations, while requiring only a small fraction of the computation cost.
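
As a purely illustrative aside (not the model from the paper), the flavor of such an analytical estimate can be sketched in C: a single closed-form expression over a trace-derived parameter (here, the number of unique blocks touched in an interval, a hypothetical value) replaces a full simulation of every cache configuration.

```c
/* Illustrative sketch only -- not the paper's model.  Given one parameter
 * extracted from a trace (the number of unique blocks touched in an
 * interval), a closed-form expression estimates how many of those blocks
 * avoid mapping conflicts in a direct-mapped cache with s sets, assuming
 * uniform random placement: (1 - 1/s)^(u-1) per block. */
#include <math.h>
#include <stdio.h>

static double conflict_free_fraction(double u, double s)
{
    return pow(1.0 - 1.0 / s, u - 1.0);
}

int main(void)
{
    double unique_blocks = 900.0;   /* hypothetical trace-derived parameter */
    for (int sets = 256; sets <= 4096; sets *= 2)
        printf("%4d sets: ~%.1f%% of blocks conflict-free\n",
               sets, 100.0 * conflict_free_fraction(unique_blocks, sets));
    return 0;
}
```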

345 citations


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding density, and this approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead.
Abstract: Increasing execution power requires high instruction issue bandwidth, while lower instruction encoding density and code-improving transformations both cause code expansion. The performance of the instruction memory hierarchy has therefore become an important factor in overall system performance. An instruction placement algorithm has been implemented in the IMPACT-I (Illinois Microarchitecture Project using Advanced Compiler Technology - Stage I) C compiler to maximize sequential and spatial locality and to minimize mapping conflicts. This approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead. For ten realistic UNIX programs, we report low miss ratios (average 0.5%) and low memory traffic ratios (average 8%) for a 2048-byte, direct-mapped instruction cache using 64-byte blocks. This result compares favorably with the fully associative cache results reported by other researchers. We also present the effect of cache size, block size, block sectoring, and partial loading on cache performance. Code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding densities.
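
A minimal sketch of the general idea, under the assumption of a simple greedy layout (the IMPACT-I algorithm itself is more elaborate, using profile-weighted traces and function layout): placing the hottest code units contiguously improves sequential and spatial locality and reduces mapping conflicts in a small direct-mapped instruction cache. The profile data and unit names below are invented.

```c
/* A minimal sketch of profile-guided code placement, not the IMPACT-I
 * algorithm itself: lay out the hottest code contiguously so frequently
 * executed blocks enjoy sequential/spatial locality and are less likely to
 * conflict in a small direct-mapped instruction cache. */
#include <stdio.h>
#include <stdlib.h>

struct unit { const char *name; unsigned bytes; unsigned long exec_count; };

static int by_heat(const void *a, const void *b)
{
    const struct unit *x = a, *y = b;
    return (x->exec_count < y->exec_count) - (x->exec_count > y->exec_count);
}

int main(void)
{
    struct unit code[] = {                      /* hypothetical profile data */
        { "parse", 512,  90000 }, { "error", 1024,      3 },
        { "lex",   256, 120000 }, { "emit",   768,  40000 },
    };
    size_t n = sizeof code / sizeof code[0];
    qsort(code, n, sizeof code[0], by_heat);    /* hottest units first */

    unsigned addr = 0;                          /* assign contiguous addresses */
    for (size_t i = 0; i < n; i++) {
        printf("0x%05x  %-6s (%lu executions)\n", addr, code[i].name,
               code[i].exec_count);
        addr += code[i].bytes;
    }
    return 0;
}
```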

227 citations


Journal ArticleDOI
S. McFarling
01 Apr 1989
TL;DR: This paper presents an optimization algorithm for reducing instruction cache misses that uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future.
Abstract: This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst-case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of ten programs over various cache sizes. The improvement depends on cache size. For a 512-word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K-word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as tripling the cache size. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.
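
The measurement such a repositioning algorithm is judged against can be sketched as follows; this is not McFarling's algorithm, only a hedged illustration of how a candidate layout can be scored by replaying a profile-derived block-address stream through a direct-mapped tag array and counting misses. Sizes and addresses are illustrative.

```c
/* Sketch of the evaluation step such a placement algorithm optimizes against:
 * replay a (profile-derived) stream of instruction-block addresses through a
 * direct-mapped tag array and count misses.  Sizes and addresses are
 * illustrative, not taken from the paper. */
#include <stdint.h>
#include <stdio.h>

#define LINES      512            /* direct-mapped cache: 512 blocks */
#define BLOCK_BITS 5              /* 32-byte blocks */

static uint32_t tag[LINES];
static int      valid[LINES];

static int access_icache(uint32_t addr)
{
    uint32_t block = addr >> BLOCK_BITS;
    uint32_t set   = block % LINES;
    if (valid[set] && tag[set] == block)
        return 1;                              /* hit */
    valid[set] = 1;
    tag[set]   = block;                        /* fill on miss */
    return 0;
}

int main(void)
{
    /* toy trace: 0x1020 and 0x9020 map to the same set and keep colliding */
    uint32_t trace[] = { 0x1000, 0x1020, 0x9020, 0x1020, 0x9020 };
    unsigned misses = 0, n = sizeof trace / sizeof trace[0];
    for (unsigned i = 0; i < n; i++)
        misses += !access_icache(trace[i]);
    printf("miss ratio: %u/%u\n", misses, n);
    return 0;
}
```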

217 citations


Dissertation
01 Jan 1989
TL;DR: Measurements of actual supercomputer cache performance have not previously been undertaken; PFC-Sim, a program-driven event tracing facility that can simulate data cache performance of very long programs, is used to measure the performance of various cache structures.
Abstract: Measurements of actual supercomputer cache performance have not previously been undertaken. PFC-Sim is a program-driven event tracing facility that can simulate data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, allowing very long traces to be used. Programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures. PFC-Sim was used to measure the cache performance of array references in a benchmark set of supercomputer applications, RiCEPS. Data cache hit ratios varied on average between 70% for a 16K cache and 91% for a 256K cache. Programs with very large working sets generate poor cache performance even with large caches. The hit ratios of individual references are measured to be close to either 0% or 100%. By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the number of loop iterations that can execute without filling the cache, the overflow iteration. The overflow iteration, combined with the dependence graph, can be used to determine at each reference whether execution will result in hits or misses. Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often perform this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked when the components of its dependence vectors are bounded. When cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction to preload values into the cache. A cache load resembles a register load in structure, but it does not block computation and only moves the referenced data into the cache; a later register load is still required to bring the value into a register. The compiler can inform the cache (on average) over 100 cycles before a load is required. Cache misses can thus be serviced in parallel with computation.
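
The dissertation's "cache load" is a proposed instruction; on current compilers a similar effect can be sketched with the GCC/Clang __builtin_prefetch intrinsic, with the prefetch distance standing in (as an assumption) for issuing the preload well ahead of the consuming load.

```c
/* Sketch of compiler-inserted software prefetching in the spirit described
 * above, using the GCC/Clang __builtin_prefetch intrinsic rather than the
 * dissertation's proposed "cache load" instruction.  The prefetch distance
 * (DIST) is a tuning assumption standing in for "over 100 cycles ahead". */
#define DIST 16

double sum_with_prefetch(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low reuse */);
        s += a[i];                 /* the later "register load" now hits in cache */
    }
    return s;
}
```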

210 citations


Patent
Gregor Stephen Lee
17 Jan 1989
TL;DR: In this paper, a hierarchical first-level and second-level memory system includes a first-level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), and a second-level store queue (26A2) that feeds write buffers in front of a shared second level of cache.

Abstract: A multiprocessor system includes a system of store queues and write buffers in a hierarchical first-level and second-level memory system, including a first-level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), a second-level store queue (26A2) for storing the instructions and/or data from the first-level store queue (18B1), and a plurality of write buffers (26A2(A); 26A2(B)) for storing the instructions and/or data from the second-level store queue prior to storage in a second level of cache. The multiprocessor system includes hierarchical levels of caches and write buffers. When data is placed in the second-level write buffers, access to the shared second-level cache is requested; when access is granted, the data and/or instructions are moved from the second-level write buffers to the shared second-level cache. When stored in the shared second-level cache, corresponding obsolete entries in the first level of cache are invalidated before any other processor "sees" the obsolete data, and the new data and/or instructions are then written into the first level of cache.
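
A rough structural sketch of the flow described, with invented type and function names: stores drain from a per-processor first-level store queue through a second-level store queue and write buffers into the shared second-level cache, and the obsolete first-level copy is invalidated before the new data becomes visible.

```c
/* Rough structural sketch with invented types and names.  Stores drain from a
 * first-level store queue through a second-level store queue into write
 * buffers; when access to the shared second-level cache is granted, the
 * obsolete first-level entry is invalidated before the new value is written
 * to the second level, so no processor can observe stale data. */
#include <stdint.h>

struct store { uint32_t addr; uint32_t data; };
struct queue { struct store entry[16]; int head, tail; };

static void l1_invalidate(uint32_t addr)           { (void)addr; }              /* stub */
static void l2_write(uint32_t addr, uint32_t data) { (void)addr; (void)data; }  /* stub */

/* Drain one pending store from a write buffer into the shared second level. */
void drain_one(struct queue *write_buffer)
{
    if (write_buffer->head == write_buffer->tail)
        return;                                    /* nothing pending */
    struct store st = write_buffer->entry[write_buffer->head];
    l1_invalidate(st.addr);                        /* obsolete L1 copy first */
    l2_write(st.addr, st.data);                    /* then publish to L2     */
    write_buffer->head = (write_buffer->head + 1) % 16;
}
```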

168 citations


Journal ArticleDOI
01 Apr 1989
TL;DR: This paper analyzes the cache invalidation patterns caused by several parallel applications and investigates the effect of these patterns on a directory-based protocol, and proposes a classification scheme for data objects found in parallel programs and links the invalidation traffic patterns observed in the traces back to these high-level objects.
Abstract: To make shared-memory multiprocessors scalable, researchers are now exploring cache coherence protocols that do not rely on broadcast, but instead send invalidation messages to individual caches that contain stale data. The feasibility of such directory-based protocols is highly sensitive to the cache invalidation patterns that parallel programs exhibit. In this paper, we analyze the cache invalidation patterns caused by several parallel applications and investigate the effect of these patterns on a directory-based protocol. Our results are based on multiprocessor traces with 4, 8 and 16 processors. To gain insight into what the invalidation patterns would look like beyond 16 processors, we propose a classification scheme for data objects found in parallel applications and link the invalidation traffic patterns observed in the traces back to these high-level objects. Our results show that synchronization objects have very different invalidation patterns from those of other data objects. A write reference to a synchronization object usually causes invalidations in many more caches. We point out situations where restructuring the application seems appropriate to reduce the invalidation traffic, and others where hardware support is more appropriate. Our results also show that it should be possible to scale “well-written” parallel programs to a large number of processors without an explosion in invalidation traffic.
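
A minimal sketch of the directory mechanism such protocols depend on (illustrative, with a stubbed network interface): each memory block has a presence-bit vector, and a write triggers point-to-point invalidations only to the caches whose bits are set, rather than a broadcast.

```c
/* Minimal sketch of a directory entry with presence bits: a write sends
 * invalidations only to the caches recorded as sharers.  send_invalidate()
 * is a stand-in for the real interconnect interface. */
#include <stdint.h>
#include <stdio.h>

#define NPROC 16

struct dir_entry { uint16_t present; int owner; };  /* one bit per processor */

static void send_invalidate(int cache_id, uint32_t block)
{
    printf("invalidate block %u in cache %d\n", block, cache_id);  /* stub */
}

/* Processor `writer` writes `block`: invalidate all other sharers. */
static void handle_write(struct dir_entry *e, uint32_t block, int writer)
{
    for (int p = 0; p < NPROC; p++)
        if (p != writer && (e->present & (1u << p)))
            send_invalidate(p, block);
    e->present = (uint16_t)(1u << writer);         /* writer is now sole owner */
    e->owner   = writer;
}

int main(void)
{
    struct dir_entry e = { .present = 0x00F0, .owner = -1 };  /* caches 4-7 share */
    handle_write(&e, 42, 2);
    return 0;
}
```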

168 citations


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The results indicate that the benefits of the extensions to the protocols are limited, and read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache.
Abstract: Write-invalidate and write-broadcast coherency protocols have been criticized for being unable to achieve good bus performance across all cache configurations. In particular, write-invalidate performance can suffer as block size increases, and large cache sizes will hurt write-broadcast. Read-broadcast and competitive snooping extensions to the protocols have been proposed to solve each problem. Our results indicate that the benefits of the extensions are limited. Read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache. The net effect can be an increase in total execution cycles. Competitive snooping benefits only those programs with high per-processor locality of reference to shared data. For programs characterized by inter-processor contention for shared addresses, competitive snooping can degrade performance by causing a slight increase in bus utilization and total execution time.
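
The competitive-snooping rule evaluated here can be sketched roughly as follows (the refetch-cost constant is an assumption): under write-broadcast, a cache counts updates that arrive for a line with no intervening local reference, and once the accumulated update cost reaches the cost of one refetch it invalidates the line locally.

```c
/* Sketch of the competitive-snooping rule, with an illustrative cost constant:
 * a cache keeps accepting broadcast updates for a line it holds, but counts
 * updates that arrive with no intervening local reference; once that cost
 * reaches the cost of one refetch, it drops the line so it stops paying for
 * updates it never reads. */
#include <stdbool.h>

#define REFETCH_COST 8            /* assumed miss penalty, in update-equivalents */

struct snoop_line { bool valid; unsigned unread_updates; };

void local_reference(struct snoop_line *l) { l->unread_updates = 0; }

/* Called when another processor's write to this line is snooped. */
void snooped_update(struct snoop_line *l)
{
    if (!l->valid)
        return;
    if (++l->unread_updates >= REFETCH_COST)
        l->valid = false;          /* switch from updating to invalidating */
}
```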

165 citations


Patent
18 Jan 1989
TL;DR: In this article, the authors present a user-oriented approach to flexible cache system design by specifying desired cache features through the setting of appropriate cache option bits, which allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.
Abstract: Methods and apparatus are disclosed for realizing an integrated cache unit which may be flexibly used for cache system design. The preferred embodiment of the invention comprises both a cache memory and a cache controller on a single chip. In accordance with an alternative embodiment of the invention, the cache memory may be externally located. Flexible cache system design is achieved by the specification of desired cache features through the setting of appropriate cache option bits. The disclosed methods and apparatus support this user-oriented approach to flexible system design. The actual setting of option bits may be performed under software control and allows a high-performance cache system to be designed with few parts, at low cost, and with the ability to perform with high efficiency.
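
What "setting cache option bits under software control" can look like is sketched below; the field names, bit positions, and register are invented for illustration and are not taken from the patent.

```c
/* Hypothetical illustration of software-settable cache option bits; the field
 * names, positions, and register are invented and are not the patent's. */
#include <stdint.h>

#define OPT_WRITE_THROUGH   (1u << 0)   /* else copy-back      */
#define OPT_EXTERNAL_RAM    (1u << 1)   /* data array off-chip */
#define OPT_PREFETCH_ENABLE (1u << 2)
#define OPT_LINE_SIZE_32    (1u << 3)   /* else 16-byte lines  */

static uint32_t cache_option_reg;       /* stand-in for the real option register */

void configure_cache(void)
{
    cache_option_reg = OPT_WRITE_THROUGH | OPT_LINE_SIZE_32;
}
```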

148 citations


Journal ArticleDOI
01 Apr 1989
TL;DR: Traces of parallel programs are used to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol, and the results show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs.
Abstract: Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics increases proportionally with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.

130 citations


Patent
18 Jan 1989
TL;DR: In this paper, a cache block status field is provided for each cache block to indicate the cache block's state, such as shared or exclusive, when a write hit access to the block occurs, which can be updated by either a TLB write policy field contained within a translation look-aside buffer entry, or by a second input independent of the TLB entry which may be provided from the system on a line basis.
Abstract: A computer system having a cache memory subsystem which allows flexible setting of caching policies on a page basis and a line basis. A cache block status field is provided for each cache block to indicate the cache block's state, such as shared or exclusive. The cache block status field controls whether the cache control unit operates in a write-through write mode or in a copy-back write mode when a write hit access to the block occurs. The cache block status field may be updated by either a TLB write policy field contained within a translation look-aside buffer entry which corresponds to the page of the access, or by a second input independent of the TLB entry which may be provided from the system on a line basis.
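
A sketch of the write-hit path implied by the abstract, with illustrative names: the block's status field selects write-through versus copy-back behavior, and the field itself can be loaded from the page's TLB write-policy bit or overridden on a per-line basis.

```c
/* Sketch of the write-hit behaviour described above: each cache block carries
 * a status field that selects the write mode; the field can be set from the
 * page's TLB write-policy bit or overridden per line.  Names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

enum blk_status { BLK_SHARED /* write-through */, BLK_EXCLUSIVE /* copy-back */ };

struct cache_block { enum blk_status status; bool dirty; uint8_t data[32]; };

static void memory_write(uint32_t addr, uint8_t v) { (void)addr; (void)v; }  /* stub */

void write_hit(struct cache_block *b, uint32_t addr, unsigned off, uint8_t v)
{
    b->data[off] = v;
    if (b->status == BLK_SHARED)
        memory_write(addr, v);     /* write-through: update memory now   */
    else
        b->dirty = true;           /* copy-back: defer until replacement */
}

/* Policy source 1: page-level bit from the TLB entry for this access.
 * Policy source 2: a per-line override supplied by the system. */
void set_block_policy(struct cache_block *b, bool tlb_copy_back,
                      bool line_override, bool override_copy_back)
{
    bool copy_back = line_override ? override_copy_back : tlb_copy_back;
    b->status = copy_back ? BLK_EXCLUSIVE : BLK_SHARED;
}
```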

118 citations


Patent
John T. Robinson
08 Aug 1989
TL;DR: In this paper, a cache directory keeps track of which blocks are in the cache, the number of times each block in cache has been referenced after aging at least a predetermined amount (reference count), and the age of each block since the last reference to that block, for use in determining which of the cache blocks is replaced when there is a cache miss.
Abstract: A cache directory keeps track of which blocks are in the cache, the number of times each block in the cache has been referenced after aging at least a predetermined amount (reference count), and the age of each block since the last reference to that block, for use in determining which of the cache blocks is replaced when there is a cache miss. At least one preselected age boundary threshold is utilized to determine when to adjust the reference count for a given block on a cache hit and to select a cache block for replacement as a function of reference count value and block age.
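
A rough approximation of the described policy, with invented constants: a block's reference count is bumped only when it is re-referenced after aging past the threshold, and on a miss the victim is chosen from sufficiently old blocks, preferring the lowest reference count.

```c
/* Rough approximation of the described policy; AGE_LIMIT and the table size
 * are invented.  A block's reference count is incremented only when it is
 * re-referenced after aging past the threshold; the victim on a miss is the
 * sufficiently old block with the lowest reference count (oldest on ties). */
#define NBLOCKS   8
#define AGE_LIMIT 4

struct cdir { unsigned ref_count; unsigned age; };

void tick(struct cdir *d)                  /* advance every block's age */
{
    for (int i = 0; i < NBLOCKS; i++)
        d[i].age++;
}

void on_hit(struct cdir *d, int i)
{
    if (d[i].age >= AGE_LIMIT)
        d[i].ref_count++;                  /* count only re-use after aging */
    d[i].age = 0;
}

int pick_victim(const struct cdir *d)
{
    int victim = -1;
    for (int i = 0; i < NBLOCKS; i++) {
        if (d[i].age < AGE_LIMIT)
            continue;                      /* keep blocks that are still young */
        if (victim < 0 || d[i].ref_count < d[victim].ref_count ||
            (d[i].ref_count == d[victim].ref_count && d[i].age > d[victim].age))
            victim = i;
    }
    if (victim < 0) {                      /* everything is young: evict the oldest */
        victim = 0;
        for (int i = 1; i < NBLOCKS; i++)
            if (d[i].age > d[victim].age)
                victim = i;
    }
    return victim;
}
```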

Proceedings ArticleDOI
01 Apr 1989
TL;DR: It is shown how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level and how this organization has a performance advantage over a hierarchy of physically-add addressed caches in a multiprocessor environment.
Abstract: We propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to match the processor speed. The virtually-addressed cache is backed up by a large physically-addressed cache; this second-level cache provides a high hit ratio and greatly reduces memory traffic. We show how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level. Moreover, the second-level cache can be used to shield the virtually-addressed first-level cache from irrelevant cache coherence interference. Finally, simulation results show that this organization has a performance advantage over a hierarchy of physically-addressed caches in a multiprocessor environment.
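
One way such a physically-addressed second level can resolve synonyms for a virtually-addressed first level is sketched below; this is a hedged illustration in the spirit of the description, not necessarily the paper's exact mechanism: each second-level line remembers the first-level index under which the block currently resides, and a hit under a different alias first invalidates the stale first-level copy.

```c
/* Hedged sketch (not necessarily the paper's exact mechanism): each L2 line
 * records the L1 index under which the block currently resides; if a
 * different virtual alias misses in L1 and hits in L2, the stale L1 copy is
 * invalidated before the block is supplied under the new virtual index. */
#include <stdbool.h>
#include <stdint.h>

struct l2_line { uint32_t phys_tag; bool valid; bool in_l1; uint32_t l1_index; };

static void l1_invalidate_index(uint32_t idx) { (void)idx; }  /* stub hook */

/* Called on an L1 miss that hits in L2; returns the L1 index to fill. */
uint32_t l2_hit_fill(struct l2_line *l2, uint32_t new_l1_index)
{
    if (l2->in_l1 && l2->l1_index != new_l1_index)
        l1_invalidate_index(l2->l1_index);     /* kill the synonym's old copy */
    l2->in_l1 = true;
    l2->l1_index = new_l1_index;               /* remember the new location   */
    return new_l1_index;
}
```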

Patent
26 May 1989
TL;DR: In this paper, an instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested, and one of a plurality of replacement schemes is selected for swapping a data block out of the cache.
Abstract: An instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested. Based on the cache control specifier, one of a plurality of replacement schemes is selected for swapping a data block out of the cache.
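
A tiny sketch of the idea, with invented specifier values and stubbed replacement routines: the cache control specifier carried by the instruction names the kind of data being requested, and that selects which replacement scheme picks the victim block.

```c
/* Stubs standing in for real replacement policies. */
static int pick_victim_lru(void)  { return 0; }
static int pick_victim_mru(void)  { return 1; }
static int pick_victim_fifo(void) { return 2; }

/* Illustrative specifier values: the instruction tells the cache what kind of
 * data it is fetching, and that choice selects the replacement scheme. */
enum cc_specifier { CC_SEQUENTIAL, CC_STACK, CC_RANDOM_ACCESS };

int select_victim(enum cc_specifier cc)
{
    switch (cc) {
    case CC_SEQUENTIAL:    return pick_victim_mru();   /* streaming: evict what was just used */
    case CC_STACK:         return pick_victim_fifo();
    case CC_RANDOM_ACCESS: return pick_victim_lru();
    }
    return pick_victim_lru();
}
```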

Patent
16 Mar 1989
TL;DR: In this article, a Record Lock Processor (RLP) is used in a multi-host data processing system to control the locking of Objects upon request of each of the multiple host data processors in non-conflicting manner.
Abstract: A Record Lock Processor is utilized in a multi-host data processing system to control the locking of Objects upon request of each of the multiple host data processors in non-conflicting manner. The Record Lock Processor has storage provisions which include a Lock List for storing bits that identify the Objects and bits that identify the requesting processor, a Queue List that stores entries that are formatted like the Lock List entry when a prior Lock List entry has been made for the same Object, and a Cache List for each processor that stores Cache List entries that identify each Object that is stored in the cache memories, each of which Cache List entries is associated with one of the requesting processors, wherein such Cache List entries include validity bits that identify whether each Object stored in a Cache List has a Valid or an Invalid status.
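
A structural sketch of the three stores described, with invented field widths and sizes: a Lock List naming which processor holds each Object, a Queue List of waiting requests formatted like Lock List entries, and one Cache List per processor whose validity bits record whether each cached Object is still Valid.

```c
/* Structural sketch (field widths and list sizes invented) of the three
 * stores described: a Lock List naming which processor holds each Object, a
 * Queue List of waiting requests formatted like Lock List entries, and one
 * Cache List per processor recording which Objects its cache holds and
 * whether each copy is still Valid. */
#include <stdbool.h>
#include <stdint.h>

#define NPROC 4
#define NLOCK 1024

struct lock_entry  { uint64_t object_id; uint8_t processor_id; };
struct queue_entry { uint64_t object_id; uint8_t processor_id; };
struct cache_entry { uint64_t object_id; bool valid; };

struct record_lock_processor {
    struct lock_entry  lock_list[NLOCK];
    struct queue_entry queue_list[NLOCK];
    struct cache_entry cache_list[NPROC][NLOCK];   /* one list per processor */
};
```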

Proceedings ArticleDOI
Chi-Hung Chi, Henry G. Dietz
03 Jan 1989
TL;DR: A technique is proposed to prevent the return of infrequently used items to cache after they are bumped from it, which involves the use of hardware called a bypass-cache, which will determine whether each reference should be through the cache or should bypass the cache and reference main memory directly.
Abstract: A technique is proposed to prevent the return of infrequently used items to the cache after they are bumped from it. Simulations have shown that the return of these items, called cache pollution, typically degrades cache-based system performance (average reference time) by 10% to 30%. The proposed technique involves the use of hardware called a bypass-cache which, under program control, determines whether each reference should go through the cache or should bypass the cache and reference main memory directly. Several inexpensive heuristics that let the compiler decide how to make each reference are given. It is shown that much of the performance loss can be regained.
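
The bypass decision can be sketched as follows (interfaces invented, accesses stubbed): references the compiler marks as having little expected reuse read memory directly and are never allocated in the cache, so they cannot pollute it.

```c
/* Sketch of the bypass decision: the compiler marks references it expects to
 * have little reuse, and marked references read main memory directly without
 * allocating a cache block.  Interfaces and stubs are invented. */
#include <stdbool.h>
#include <stdint.h>

static uint32_t cache_read(uint32_t addr)  { (void)addr; return 0; }  /* stub: cached path   */
static uint32_t memory_read(uint32_t addr) { (void)addr; return 0; }  /* stub: uncached path */

uint32_t load(uint32_t addr, bool compiler_says_bypass)
{
    return compiler_says_bypass ? memory_read(addr)   /* no allocation, no pollution */
                                : cache_read(addr);
}
```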

Journal Article
TL;DR: In this paper, a random tester was developed that generates memory references by randomly selecting from a script of actions and checks; the checks verify correct completion of their corresponding actions, and the tester detected over half of the functional bugs uncovered during simulation.
Abstract: The newest generation of cache controller chips provides coherency to support multiprocessor systems; i.e., the controllers coordinate access to the cache memories to guarantee a single global view of memory. The cache coherency protocols they implement complicate the controller design, making design verification difficult. In the design of the cache controller for SPUR, a shared memory multiprocessor designed and built at U.C. Berkeley, the authors developed a random tester to generate and verify the complex interactions between multiple processors in the functional simulation. Replacing the CPU model, the tester generates memory references by randomly selecting from a script of actions and checks. The checks verify correct completion of their corresponding actions. The tester was easy to develop, and detected over half of the functional bugs uncovered during simulation. They used an assembly language version of the random tester to verify the prototype hardware. A multiprocessor system is operational; it runs the Sprite operating system and is being used for experiments in parallel programming.


Proceedings ArticleDOI
01 Apr 1989
TL;DR: It is shown that a first-level cache dramatically reduces the number of references seen by a second-level cache without having a large effect on the number of second-level cache misses, which makes associativity more attractive and increases the optimal cache size for second-level caches over what they would be for an equivalent single-level cache system.
Abstract: The increasing speed of new generation processors will exacerbate the already large difference between CPU cycle times and main memory access times. As this difference grows, it will be increasingly difficult to build single-level caches that are both fast enough to match these fast cycle times and large enough to effectively hide the slow main memory access times. One solution to this problem is to use a multi-level cache hierarchy. This paper examines the relationship between cache organization and program execution time for multi-level caches. We show that a first-level cache dramatically reduces the number of references seen by a second-level cache, without having a large effect on the number of second-level cache misses. This reduction in the number of second-level cache hits changes the optimal design point by decreasing the importance of the cycle-time of the second-level cache relative to its size. The lower the first-level cache miss rate, the less important the second-level cycle time becomes. This change in relative importance of cycle time and miss rate makes associativity more attractive and increases the optimal cache size for second-level caches over what they would be for an equivalent single-level cache system.

Proceedings ArticleDOI
01 Nov 1989
TL;DR: By modifying NFS to use the Sprite cache consistency protocols, this work finds dramatic improvements on some, although not all, benchmarks, suggesting that an explicit cache consistency protocol is necessary for both correctness and good performance.
Abstract: File caching is essential to good performance in a distributed system, especially as processor speeds and memory sizes continue to improve rapidly while disk latencies do not. Stateless-server systems, such as NFS, cannot properly manage client file caches. Stateful systems, such as Sprite, can use explicit cache consistency protocols to improve both cache consistency and overall performance.By modifying NFS to use the Sprite cache consistency protocols, we isolate the effects of the consistency mechanism from the other features of Sprite. We find dramatic improvements on some, although not all, benchmarks, suggesting that an explicit cache consistency protocol is necessary for both correctness and good performance.

Journal ArticleDOI
TL;DR: This work defines multilevel inclusion properties, gives some necessary and sufficient conditions for these properties to hold in multiprocessor environments, and shows their importance in reducing the complexities of cache coherence protocols.

Patent
17 Apr 1989
TL;DR: In this article, a first processing system is coupled to a plurality of integrated circuits along a P bus and external TAGs coupled between the M bus and the secondary bus are used to maintain coherency between the first and second processing systems.
Abstract: A first processing system is coupled to a plurality of integrated circuits along a P bus. Each of these integrated circuits has a combination cache and memory management unit (MMU). The cache/MMU integrated circuits are also connected to a main memory via an M bus. A second processing system is also coupled to the main memory primarily via a secondary bus but also via the M bus. External TAGs coupled between the M bus and the secondary bus are used to maintain coherency between the first and second processing systems. Each external TAG corresponds to a particular cache/MMU integrated circuit and maintains information as to the status of its corresponding cache/MMU integrated circuit. The cache/MMU integrated circuit provides the necessary status information to its corresponding external TAG in a very efficient manner. Each cache/MMU integrated circuit can also be converted to a SRAM mode in which the cache performs like a conventional high speed static random access memory (SRAM). This ability to convert to a SRAM provides the first processing system with a very efficient scratch pad capability. Each cache/MMU integrated circuit also provides hit information external to the cache/MMU integrated circuit with respect to transactions on the P bus. This hit information is useful in determining system performance.

Dissertation
01 Jan 1989
TL;DR: Examination of shared memory reference patterns in parallel programs that run on bus-based, shared memory multiprocessors reveals two distinct modes of sharing behavior: sequential sharing and fine-grain sharing.
Abstract: This dissertation examines shared memory reference patterns in parallel programs that run on bus-based, shared memory multiprocessors. The study reveals two distinct modes of sharing behavior. In sequential sharing, a processor makes multiple, sequential writes to the words within a block, uninterrupted by accesses from other processors. Under fine-grain sharing, processors contend for these words, and the number of per-processor sequential writes is low. Whether a program exhibits sequential or fine-grain sharing affects several factors relating to multiprocessor performance: the accuracy of sharing models that predict cache coherency overhead, the cache miss ratio and bus utilization of parallel programs, and the choice of coherency protocol. An architecture-independent model of write sharing was developed, based on the inter-processor activity to write-shared data. The model was used to predict the relative coherency overhead of write activity under write-invalidate and write-broadcast protocols. Architecturally detailed simulations validated the model for write-broadcast. Successive refinements incorporating architecture-dependent parameters, most importantly cache block size, produced acceptable predictions for write-invalidate. Block size was crucial for modeling write-invalidate, because the pattern of memory references within a block determines protocol performance. The cache and bus behavior of parallel programs running under write-invalidate protocols was evaluated over various block and cache sizes. The analysis determined the effect of shared memory accesses on cache miss ratio and bus utilization by focusing on the sharing component of these metrics. The studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of the metrics increases proportionally with cache and block size, and for some cache configurations determines both their magnitude and trend. Again, the amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit sequential sharing perform better than those whose sharing is fine-grain. A cross-protocol comparison provided empirical evidence of the performance loss caused by increasing block size in write-invalidate protocols and cache size in write-broadcast. It then measured the extent to which read-broadcast improved write-invalidate performance and competitive snooping helped write-broadcast. The results indicated that read-broadcast reduced the number of invalidation misses, but at a high cost in processor lockout from the cache. The surprising net effect was an increase in total execution cycles. Competitive snooping benefited only those programs that exhibited sequential sharing; both bus utilization and total execution time dropped moderately. For programs characterized by fine-grain sharing, competitive snooping degraded performance by causing a slight increase in these metrics.

Patent
21 Aug 1989
TL;DR: In this article, the replacement selection logic examines the bit pattern in all the cache cells in a set and selects a cache cell to be replaced using a first-in first-out algorithm.
Abstract: A high availability set associative cache memory for use as a buffer between a main memory and a central processing unit includes multiple sets of cache cells contained in two or more cache memory elements. Each of the cache cells includes a data field, a tag field and a status field. The status field includes a force bit which indicates a defective cache cell when it is set. Output from a cache cell is suppressed when its force bit is set. The defective cache cell is effectively mapped out so that data is not stored in it. As long as one cell in a set remains operational, the system can continue operation. The status field also includes an update bit which indicates the update status of the respective cache cell. Replacement selection logic examines the bit pattern in all the cache cells in a set and selects a cache cell to be replaced using a first-in first-out algorithm. The state of the update bit is changed each time the data in the respective cache cell is replaced unless the cache cell was modified on a previous store cycle.
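
A sketch of the victim selection described, simplifying the patent's update-bit encoding to an explicit fill-order counter: ways whose force bit marks them defective are never chosen, and among the working ways a FIFO-like rule picks the replacement.

```c
/* Sketch of the victim selection described above: a set's defective cells are
 * marked by a force bit and never chosen; among the working cells a FIFO-like
 * rule (here, an explicit fill-order counter) picks the replacement.  The
 * patent's update-bit encoding is simplified to a counter for illustration. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct cell { bool force; bool valid; uint8_t fill_order; uint32_t tag; };

/* Returns the way to replace, or -1 if every cell in the set is mapped out. */
int choose_victim(struct cell set[WAYS])
{
    int victim = -1;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].force)
            continue;                          /* defective cell: never use   */
        if (!set[w].valid)
            return w;                          /* empty working cell first    */
        if (victim < 0 || set[w].fill_order < set[victim].fill_order)
            victim = w;                        /* oldest fill = FIFO victim   */
    }
    return victim;
}
```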

Patent
23 Jan 1989
TL;DR: In this article, the stack cache is allocated to a stack cache as needed to accommodate additional continuation frames during execution of a program, and when a continuation is captured, flags in all segments of stack cache are set to indicate the signals are shared by a captured continuation.
Abstract: Segments of memory are allocated to a stack cache as needed to accommodate additional continuation frames during execution of a program. When a continuation is captured, flags in all segments of the stack cache are set to indicate the segments are shared by a captured continuation, the top segment of the stack cache is copied, and the copy is made the top continuation frame of the stack cache. To invoke a continuation, the top segment of the invoked continuation is copied into the current stack cache segment. When the stack cache is ready to underflow into a segment shared by a captured continuation, the shared segment is copied and the stack cache underflows into the copy.

Proceedings ArticleDOI
01 Mar 1989
TL;DR: An algorithm is presented that exploits a weaker condition than is normally required to achieve greater concurrency and a proof that the algorithm satisfies the safety condition is concluded.
Abstract: This paper examines cache consistency conditions (safety conditions) for multiprocessor shared memory systems. It states and motivates a weaker condition than is normally required. An algorithm is presented that exploits the weaker condition to achieve greater concurrency. The paper concludes with a proof that the algorithm satisfies the safety condition.

Proceedings ArticleDOI
01 Jun 1989
TL;DR: A version control approach to maintain cache coherence for large-scale shared-memory multiprocessor systems with interconnection networks makes it possible to exploit temporal locality across synchronization boundaries and achieves a data cache hit ratio closest to maximum possible.
Abstract: A version control approach to maintaining cache coherence is proposed for large-scale shared-memory multiprocessor systems with interconnection networks. The new approach, unlike existing approaches for this class of systems, makes it possible to exploit temporal locality across synchronization boundaries. As with the other software-directed approaches, each processor independently manages its cache, i.e., there is no interprocessor communication involved in maintaining cache coherence. The hardware required per processor in the version control approach stays constant as the number of processors increases; hence, it scales up to larger systems. Furthermore, the new approach incurs low overhead. Simulation results of several schemes for large-scale systems show that the new approach achieves a data cache hit ratio closest to the maximum possible.

Proceedings ArticleDOI
21 Jun 1989
TL;DR: By using a single model to manage these two memory structures, most redundant copies of values in cache can be eliminated and bus traffic and memory traffic in data cache are greatly reduced and cache effectiveness is improved.
Abstract: In the current computer memory system hierarchy, registers and cache are both used to bridge the reference delay gap between the fast processor(s) and the slow main memory. While registers are managed by the compiler using program flow analysis, cache is mainly controlled by hardware without any program understanding. Due to the lack of coordination in managing these two memory structures, significant loss of system performance results, because cache space is wasted holding inaccessible copies of values that reside in registers, and these inaccessible copies displace accessible ones from the cache. Despite the fact that register allocation has long recognized the benefits of live range analysis, current cache management has completely ignored live range information. In this paper, we propose a unified management of registers and cache using liveness and cache bypass. By using a single model to manage these two memory structures, most redundant copies of values in cache can be eliminated. Consequently, bus traffic and memory traffic in the data cache are greatly reduced and cache effectiveness is improved.

Patent
Lishing Liu
17 May 1989
TL;DR: In this article, the authors present a cache memory system with two additional statuses, temporary exclusive and temporary read-only, which allow the cache system to assign an exclusive status on an anticipatory basis without incurring penalties when this assignment is not appropriate.
Abstract: A store-in cache memory system for a multiprocessor computer system has a status entry in the cache directory which is RO (read-only) when a line of data is read-only, and thus accessible by all processors on the system, or EX (exclusive) when the line is accessible for reading or writing but only by one processor. In addition, each directory entry has a bit, CH, which is set when data in the line is changed. The cache memory system includes two additional statuses: TEX, or temporary exclusive, and TRO, or temporary read-only. When a data fetch instruction results in a cache miss, and a line containing the requested data is found in a remote cache with an EX status and with its CH bit set, the line is copied to the requesting cache and assigned a status of TEX. The line of data in the remote cache receives a status of TRO. If a store operation for the data occurs within a short time frame, the status in the requesting cache changes to EX and the line in the remote cache is invalidated. Otherwise, the data in the line is cast out to main storage and the status of the line becomes RO in both the requesting and remote caches. The addition of these statuses allows the cache system to assign an exclusive status on an anticipatory basis without incurring penalties when this assignment is not appropriate.
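
The two extra states and the two outcomes described can be condensed into the following sketch (timing abstracted away): a store arriving soon after the fetch promotes the TEX copy to EX and invalidates the remote TRO copy, while expiry of the window casts the changed data out and demotes both copies to RO.

```c
/* Condensed sketch of the extra line states and the two outcomes described:
 * a store arriving soon after the fetch promotes the TEX copy to EX and
 * invalidates the remote TRO copy, while a timeout demotes both copies to RO
 * after the changed data is cast out to main storage.  Timing is abstracted. */
enum line_state { INV, RO, EX, TEX, TRO };

struct line { enum line_state local, remote; };

/* Fetch misses locally and hits a changed EX line in a remote cache. */
void fetch_from_remote_changed(struct line *l)
{
    l->local  = TEX;               /* anticipatory exclusive in requester */
    l->remote = TRO;               /* temporary read-only at former owner */
}

void store_within_window(struct line *l)
{
    l->local  = EX;                /* anticipation paid off */
    l->remote = INV;
}

void window_expires(struct line *l)   /* castout to main storage happens here */
{
    l->local  = RO;
    l->remote = RO;
}
```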

Patent
16 May 1989
TL;DR: In this paper, the cache can be flushed simply by resetting all the context tags to a null value, which ensures that the data cannot be accessed, but it remains physically in the cache, and will eventually be copied back to the main memory when it is about to be overwritten with different data or when the physical address is next used.
Abstract: A data memory system includes a main memory and a copy-back cache. Each line of the cache has a context tag, which is compared with a current context number to test whether the line contains the required data. The cache can be flushed simply by resetting all the context tags to a null value, which ensures that the data cannot be accessed. However, it remains physically in the cache, and will eventually be copied back to the main memory when it is about to be overwritten with different data or when the physical address is next used.
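
A sketch of context-tagged flushing as described (sizes illustrative): a lookup must match both the address tag and the current context number, so a flush only has to store a null context in every line; the data becomes unreachable immediately but stays resident and is still copied back when the line is eventually reused.

```c
/* Sketch of context-tagged flushing: a lookup must match both the address tag
 * and the current context number; a flush just stores a null context in every
 * line, so the data becomes unreachable but stays resident and is still
 * copied back to memory when the line is eventually reused.  Sizes invented. */
#include <stdbool.h>
#include <stdint.h>

#define LINES    256
#define CTX_NULL 0u

struct line { uint32_t tag; uint32_t ctx; bool dirty; };

static struct line cache[LINES];
static uint32_t current_ctx = 1;

bool lookup(uint32_t idx, uint32_t tag)
{
    return cache[idx].ctx == current_ctx && cache[idx].tag == tag;
}

void flush(void)                          /* no copy-back needed at flush time */
{
    for (int i = 0; i < LINES; i++)
        cache[i].ctx = CTX_NULL;
}

static void copy_back(uint32_t idx) { (void)idx; }   /* stub write to memory */

void replace(uint32_t idx, uint32_t tag)
{
    if (cache[idx].dirty)
        copy_back(idx);                   /* flushed data still written back here */
    cache[idx] = (struct line){ .tag = tag, .ctx = current_ctx, .dirty = false };
}
```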

Patent
30 Mar 1989
TL;DR: In this paper, the authors propose a method for dynamically selecting either N or P blocks of data to move into the cache at high speed, maintaining the currency of the data in the cache while avoiding overwriting data that is already resident.
Abstract: A method for optimizing the performance of a cache memory system is disclosed. During operation of a computer system whose processor (120) is supported by virtual cache memory (100), the cache must be cleared and refilled to allow replacement of old data with more current data. The cache is filled with either P or N (N>P) blocks of data. Numerous methods for dynamically selecting N or P blocks of data are possible. For instance, immediately after the cache is flushed, a miss is refilled with N blocks, moving data to the cache at high speed. Once the cache is mostly full, a miss tends to be refilled with P blocks. This maintains the currency of the data in the cache while simultaneously avoiding writing over data already in the cache. The invention is useful in a multi-user/multi-tasking system where the program being run changes frequently, necessitating flushing and clearing the cache frequently.
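
One of the "numerous possible" selection rules mentioned can be sketched in a few lines; the block counts and the 75% occupancy threshold are assumptions, not values from the patent.

```c
/* Sketch of one possible selection rule: right after a flush, a miss is
 * refilled with the larger N blocks to reload the cache quickly; once the
 * cache is mostly full, the smaller P-block fill is used so resident data is
 * not written over.  The constants and threshold are assumptions. */
#define N_BLOCKS 8
#define P_BLOCKS 2

int refill_size(unsigned valid_lines, unsigned total_lines)
{
    /* "mostly full" threshold is an assumed 75% occupancy */
    return (valid_lines * 4 < total_lines * 3) ? N_BLOCKS : P_BLOCKS;
}
```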