
Showing papers on "Cache invalidation published in 1991"


Proceedings ArticleDOI
01 Apr 1991
TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained from a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
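A minimal sketch of the blocking idea the paper analyzes, written in Python for readability. The block size `bs` is the tunable parameter; the paper's point is that a good choice depends on the matrix size and cache parameters, not just a fixed fraction of the cache. This is illustrative only, not the paper's analytical model.

```python
# Blocked (tiled) matrix multiply: C += A * B for n x n matrices.
# Submatrices of size bs x bs are reused while they are resident in cache.
def blocked_matmul(A, B, C, n, bs=32):
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]

# Example use with a deliberately small block size.
n = 64
A = [[1.0] * n for _ in range(n)]
B = [[1.0] * n for _ in range(n)]
C = [[0.0] * n for _ in range(n)]
blocked_matmul(A, B, C, n, bs=16)
```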

982 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: The LimitLESS scheme uses a combination of hardware and software techniques to realize the performance of a full-map directory with the memory overhead of a limited directory, supported by Alewife, a large-scale multiprocessor.
Abstract: Caches enhance the performance of multiprocessors by reducing network traffic and average memory access latency. However, cache-based systems must address the problem of cache coherence. We propose the LimitLESS directory protocol to solve this problem. The LimitLESS scheme uses a combination of hardware and software techniques to realize the performance of a full-map directory with the memory overhead of a limited directory. This protocol is supported by Alewife, a large-scale multiprocessor. We describe the architectural interfaces needed to implement the LimitLESS directory, and evaluate its performance through simulations of the Alewife machine.
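A toy sketch of the LimitLESS idea as described above: a small, fixed number of sharer pointers is handled "in hardware", and a pointer overflow falls back to a software-managed full sharer set. The pointer count, field names, and the invalidation routine are assumptions for illustration, not the Alewife implementation.

```python
HW_POINTERS = 4  # assumed capacity of the hardware pointer array

class DirectoryEntry:
    def __init__(self):
        self.hw_sharers = []     # limited hardware pointers
        self.sw_sharers = None   # software-extended full map (None = not in use)

    def add_sharer(self, cpu):
        if self.sw_sharers is not None:
            self.sw_sharers.add(cpu)          # already trapped to software
        elif cpu in self.hw_sharers:
            pass                              # already recorded
        elif len(self.hw_sharers) < HW_POINTERS:
            self.hw_sharers.append(cpu)       # fast hardware path
        else:
            # Pointer overflow: emulate a full-map directory in software.
            self.sw_sharers = set(self.hw_sharers) | {cpu}

    def invalidate_all(self):
        sharers = self.sw_sharers if self.sw_sharers is not None else self.hw_sharers
        for cpu in sharers:
            send_invalidate(cpu)              # hypothetical network operation
        self.hw_sharers, self.sw_sharers = [], None

def send_invalidate(cpu):
    print(f"invalidate sent to CPU {cpu}")
```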

348 citations


Patent
06 Feb 1991
TL;DR: The branch prediction cache (BPC) as mentioned in this paper provides a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address.
Abstract: The present invention provides for the updating of both the instructions in a branch prediction cache and instructions recently provided to an instruction pipeline from the cache when an instruction being executed attempts to change such instructions ("Store-Into-Instruction-Stream"). The branch prediction cache (BPC) includes a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address. A separate instruction cache is provided for normal execution of instructions, and all of the instructions written into the branch prediction cache from the system bus must also be stored in the instruction cache. The instruction cache monitors the system bus for attempts to write to the address of an instruction contained in the instruction cache. Upon such a detection, that entry in the instruction cache is invalidated, and the corresponding entry in the branch prediction cache is invalidated. A subsequent attempt to use an instruction in the branch prediction cache which has been invalidated will detect that it is not valid, and will instead go to main memory to fetch the instruction, where it has been modified.
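A sketch of the invalidation path described above: a snooped bus write to an address held in the instruction cache invalidates both the I-cache line and any branch prediction cache entry built from it, and a later lookup that finds an invalid BPC entry refetches from memory. The data structures and address arithmetic are illustrative assumptions, not the patent's hardware.

```python
class BPCEntry:
    def __init__(self, branch_addr, target_addr, target_instrs):
        self.branch_addr = branch_addr
        self.target_addr = target_addr
        self.target_instrs = target_instrs   # copy of the first few target instructions
        self.valid = True

class Frontend:
    def __init__(self):
        self.icache = {}   # addr -> (instr, valid)
        self.bpc = {}      # branch_addr -> BPCEntry

    def snoop_bus_write(self, addr):
        if addr in self.icache:
            instr, _ = self.icache[addr]
            self.icache[addr] = (instr, False)          # invalidate the I-cache entry
            for entry in self.bpc.values():             # and any BPC copy of that code
                if entry.target_addr <= addr < entry.target_addr + len(entry.target_instrs):
                    entry.valid = False

    def fetch_predicted_target(self, branch_addr, memory, n=4):
        entry = self.bpc.get(branch_addr)
        if entry is None:
            return None                                 # no prediction available
        if not entry.valid:                             # modified since cached: refetch
            entry.target_instrs = [memory[entry.target_addr + i] for i in range(n)]
            entry.valid = True
        return entry.target_instrs
```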

279 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This work fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals, and estimated the cache performance reduction caused by a context switch.
Abstract: The sustained performance of fast processors is critically dependent on cache performance. Cache performance in turn depends on locality of reference. When an operating system switches contexts, the assumption of locality may be violated because the instructions and data of the newly-scheduled process may no longer be in the cache(s). Context-switching thus has a cost above that associated with the operations performed by the kernel. We fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals. By marking the output of such a simulation whenever a context switch occurs, and then aggregating the post-context-switch results of a large number of context switches, it is possible to estimate the cache performance reduction caused by a switch. Depending on cache parameters, the net cost of a context switch appears to be in the thousands of cycles, or tens to hundreds of microseconds.
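A minimal sketch of the measurement method: run an address trace through a simple cache simulator, mark each context switch, and aggregate the misses seen in a window after every switch. The direct-mapped geometry, window size, and trace format are assumptions for illustration.

```python
def post_switch_miss_cost(trace, lines=1024, line_bytes=32, window=1000):
    """trace: iterable of (pid, address) pairs in reference order."""
    tags = [None] * lines
    last_pid, since_switch = None, None
    post_switch_misses, switches = 0, 0

    for pid, addr in trace:
        if pid != last_pid:                    # the first process also counts as a "switch"
            last_pid, since_switch = pid, 0
            switches += 1
        block = addr // line_bytes
        index, tag = block % lines, block // lines
        miss = tags[index] != tag
        if miss:
            tags[index] = tag
        if since_switch is not None and since_switch < window:
            post_switch_misses += miss         # count misses in the post-switch window
            since_switch += 1

    # Rough average of post-switch misses per switch, not a precise cost model.
    return post_switch_misses / max(switches, 1)
```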

272 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper presents a range of lock-based cache consistency algorithms that arise by viewing cache consistency as a variant of the well-understood problem of replicated data management, and uses a detailed simulation model to study the performance of these algorithms over a wide range of workloads and system resource configurations.
Abstract: In this paper, we examine the performance tradeoffs that are raised by caching data in the client workstations of a client-server DBMS. We begin by presenting a range of lock-based cache consistency algorithms that arise by viewing cache consistency as a variant of the well-understood problem of replicated data management. We then use a detailed simulation model to study the performance of these algorithms over a wide range of workloads and system resource configurations. The results illustrate the key performance tradeoffs related to client-server cache consistency, and should be of use to designers of next-generation DBMS prototypes and products.

230 citations


Patent
16 May 1991
TL;DR: In this paper, a microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache, each storage location in the instruction cache includes two slots for decoded instructions.
Abstract: A microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache. Each storage location in the instruction cache includes two slots for decoded instructions. One slot controls one of the microprocessor's integer pipelines and a port to the microprocessor's data cache. A second slot controls the second integer pipeline or one of the microprocessor's floating point units. The instructions retrieved from main memory are decoded by a loader unit which decodes the instructions from the compact form as stored in main memory and places them into the two slots of the instruction cache entry according to their functions. In addition, auxiliary information is placed in the cache entry along with the instruction to control parallel execution as well as emulation of complex instructions. A bit in each instruction cache entry indicates whether the instructions in the two slots are independent, so that they can be executed in parallel, or dependent, so that they must be executed sequentially. Using a single bit for this purpose allows two dependent instructions to be stored in the slots of the single cache entry.
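A sketch of the cache-entry layout described above: two decoded-instruction slots plus a single bit that says whether they may issue in parallel. Field names and the dispatch rule are illustrative assumptions, not the patent's encoding.

```python
from dataclasses import dataclass

@dataclass
class ICacheEntry:
    slot0: str          # decoded op for integer pipe 0 and the data-cache port
    slot1: str          # decoded op for integer pipe 1 or a floating-point unit
    independent: bool   # True -> both slots may issue in the same cycle
    aux: dict           # auxiliary info (pairing, emulation control); contents assumed

def dispatch(entry):
    if entry.independent:
        return [(entry.slot0, entry.slot1)]                 # one dual-issue cycle
    return [(entry.slot0, None), (entry.slot1, None)]       # two sequential cycles

print(dispatch(ICacheEntry("add r1,r2", "fmul f0,f1", True, {})))
```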

203 citations



Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks and describes two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches.
Abstract: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks. Using a low cost trace-driven simulation technique, we show how a non-prefetching vector cache can result in unpredictable performance and how this unpredictability makes it difficult to find a good block size. We describe two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches. These two schemes are shown to have better performance than a non-prefetching cache.
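A sketch of a simple stride-directed prefetch policy of the general kind evaluated here: on a miss to a strided vector access, the block holding the next element along the stride is fetched as well. The cache model and policy details are assumptions, not the paper's exact schemes.

```python
class VectorCache:
    def __init__(self, lines=256, line_bytes=64):
        self.lines, self.line_bytes = lines, line_bytes
        self.tags = [None] * lines
        self.misses = 0

    def _touch(self, addr):
        block = addr // self.line_bytes
        index, tag = block % self.lines, block // self.lines
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag
        return hit

    def access(self, addr, stride, prefetch=True):
        if not self._touch(addr):
            self.misses += 1
            if prefetch:
                # Prefetch the block of the next element along the stride,
                # hiding part of the miss penalty for long-stride accesses.
                self._touch(addr + stride)
```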

163 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: Five application cache consistency algorithms are examined in a client/server database system: two-phase locking, certification, callback locking, no-wait locking, and no-wait locking with notification.
Abstract: This paper examines five application cache consistency algorithms in a client/server database system: two-phase locking, certification, callback locking, no-wait locking, and no-wait locking with notification. A simulator was developed to compare the average transaction response time and server throughput for these algorithms under different workloads and system configurations. Two-phase locking and callback locking dominate no-wait locking and no-wait locking with notification when the server or the network is a bottleneck. Callback locking is better than two-phase locking when the inter-transaction locality is high or when inter-transaction locality is medium and the probability of object update is low. When there is no network delay and the server is very fast, no-wait locking with notification and callback locking dominate two-phase and no-wait locking.
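A toy sketch of callback locking, one of the algorithms compared above: the server remembers which clients hold a cached copy of each object and calls back to revoke those copies before granting a write lock. Message handling is collapsed into direct method calls and all names are illustrative.

```python
class Client:
    def __init__(self, cid):
        self.cid, self.cache = cid, {}

    def invalidate(self, obj):
        self.cache.pop(obj, None)       # drop the cached copy on a callback

class Server:
    def __init__(self):
        self.cached_at = {}             # object id -> set of client ids holding a copy

    def register_read(self, client_id, obj):
        self.cached_at.setdefault(obj, set()).add(client_id)

    def request_write(self, client_id, obj, clients):
        # Call back every other client caching the object before granting the lock.
        for other in self.cached_at.get(obj, set()) - {client_id}:
            clients[other].invalidate(obj)
        self.cached_at[obj] = {client_id}
        return True                     # write lock granted

clients = {1: Client(1), 2: Client(2)}
srv = Server()
srv.register_read(1, "page7")
clients[1].cache["page7"] = "v1"
srv.request_write(2, "page7", clients)  # revokes client 1's copy first
```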

146 citations


Patent
Richard Lewis Mattson1
20 May 1991
TL;DR: In this paper, the cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches, where each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType.
Abstract: A method for managing a cache hierarchy having a fixed total storage capacity is disclosed. The cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches. The global cache stores objects of all types and maintains them in LRU order. In contrast, each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType. Read and write accesses by referencing processors or central processing units (CPU's) are made to the global cache. Data not available in the global cache is staged thereto either from one of the local caches or from external storage. When a cache full condition is reached, placement of the most recently used (MRU) data element to the top of the global cache results in an LRU data element of type T(i) being destaged from the global cache to a corresponding one of the local caches storing type T(i) data. Likewise, when a cache full condition is reached in any one or more of the local caches, the local caches in turn will destage their LRU data elements to external storage. The parameters defining the partitions are externally supplied.
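A sketch of the two-level arrangement described above: an LRU global cache holding objects of all types, plus per-type LRU local caches that receive objects destaged from the global cache. Fixed sizes and the treatment of "external storage" (objects silently dropped) are simplifying assumptions.

```python
from collections import OrderedDict

class PartitionedCache:
    def __init__(self, global_size, local_size):
        self.glob = OrderedDict()    # key -> (data_type, value), kept in LRU order
        self.local = {}              # data_type -> OrderedDict(key -> value)
        self.global_size, self.local_size = global_size, local_size

    def access(self, key, data_type, value):
        if key in self.glob:                         # global hit: move to MRU position
            self.glob.move_to_end(key)
            return
        loc = self.local.setdefault(data_type, OrderedDict())
        loc.pop(key, None)                           # stage up from the local cache if present
        self.glob[key] = (data_type, value)
        if len(self.glob) > self.global_size:        # destage global LRU to its local cache
            old_key, (old_type, old_val) = self.glob.popitem(last=False)
            dest = self.local.setdefault(old_type, OrderedDict())
            dest[old_key] = old_val
            if len(dest) > self.local_size:
                dest.popitem(last=False)             # local LRU goes to external storage
```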

132 citations


Patent
19 Aug 1991
TL;DR: In this paper, a scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability, which increases computing system performance and reduces bus traffic.
Abstract: A computing system (50) includes N number of symmetrical computing engines having N number of cache memories joined by a system bus (12). The computing system includes a global run queue (54), an FPA global run queue, and N number of affinity run queues (58). Each engine is associated with one affinity run queue, which includes multiple slots. When a process first becomes runnable, it is typically attached to one of the global run queues. A scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability. An engine typically stops running a process before it is complete. When the process becomes runnable again the scheduler estimates the remaining cache context for the process in the cache of the engine. The scheduler uses the estimated amount of cache context in deciding in which run queue a process is to be enqueued. The process is enqueued to the affinity run queue of the engine when the estimated cache context of the process is sufficiently high, and is enqueued onto the global run queue when the cache context is sufficiently low. The procedure increases computing system performance and reduces bus traffic because processes will run on engines having sufficient cache affinity, but will also run on the best available engine when there is insufficient cache context.
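A sketch of the enqueue decision described above. The decay model used to estimate remaining cache context and the threshold value are placeholders for whatever the patented scheduler actually computes.

```python
from dataclasses import dataclass, field
from collections import deque

CACHE_CONTEXT_THRESHOLD = 0.25   # assumed cutoff; the real scheduler's value is not given

@dataclass
class Engine:
    affinity_queue: deque = field(default_factory=deque)
    runs_since_last: dict = field(default_factory=dict)  # pid -> other runs since it last ran here

@dataclass
class Process:
    pid: int
    last_engine: int

def estimate_cache_context(proc, engine):
    # Hypothetical decay model: context fades with the number of other
    # processes the engine has run since this process last ran there.
    return 0.5 ** engine.runs_since_last.get(proc.pid, 99)

def enqueue(proc, engines, global_queue):
    engine = engines.get(proc.last_engine)
    if engine and estimate_cache_context(proc, engine) >= CACHE_CONTEXT_THRESHOLD:
        engine.affinity_queue.append(proc)    # warm cache: keep affinity
    else:
        global_queue.append(proc)             # cold cache: best available engine will take it
```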

Patent
Kevin Frank Smith1
18 Nov 1991
TL;DR: In this article, the signed difference between an assigned and actual hit ratio is used to dynamically allocate cache size to a class of data and its correlation with priority of that class, and cache space is dynamically reallocated among the partitions in a direction that forces the weighted hit rate slope for each partition into equality over the entire cache.
Abstract: Methods for managing a Least Recently Used (LRU) cache in a staged storage system on a prioritized basis permitting management of data using multiple cache priorities assigned at the data set level. One method uses the signed difference between an assigned and actual hit ratio to dynamically allocate cache size to a class of data and its correlation with priority of that class. Advantageously, the hit ratio performance of a class of data can not be degraded beyond a predetermined level by a class of data having lower priority. Another method dynamically reallocates cache space among partitions asymptotically to the general equality of a weighted hit rate slope function attributable to each partition. This function is the product of the slope of a weighting factor and the partition hit rate versus partition space allocation function. Cache space is dynamically reallocated among the partitions in a direction that forces the weighted hit rate slope for each partition into equality over the entire cache. Partition space is left substantially unchanged in those partitions where the weighted hit rate slope is non-positive. Hit rate is defined as hit ratio times I/O rate.

Patent
20 Aug 1991
TL;DR: In this paper, a multilevel cache buffer for a multiprocessor system is described, where each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors.
Abstract: A multilevel cache buffer for a multiprocessor system in which each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors. The multiprocessors share the level two cache according to a priority algorithm. When data in the level two cache is updated, corresponding data in level one caches is invalidated until it is updated.

Patent
30 Aug 1991
TL;DR: In this article, a method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system is presented, which can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each way to distinguish between least recently used groups.
Abstract: A method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system. In a 2 way set-associative cache, one bit in each way's tag RAM is reserved for LRU information, and the bits are manipulated such that the Exclusive-OR of each way's bits points to the actual LRU cache way. Since all of these bits must be read when the cache controller determines whether a hit or miss has occurred, the bits are available when a cache miss occurs and a cache line replacement is required. The method can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each of the ways to distinguish between least recently used groups. Cache write policy information is stored in the tag RAM's to designate various memory areas as write-back or write-through. In this manner, system memory situated on an I/O bus which does not recognize inhibit cycles can have its data cached.
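A sketch of the XOR-based LRU trick for a two-way set-associative cache: each way stores one LRU bit alongside its tag, the XOR of the two bits names the LRU way, and only the accessed way's bit needs rewriting on an access. The specific update rule below is one consistent choice, not necessarily the patent's exact encoding.

```python
class TwoWaySet:
    def __init__(self):
        self.tags = [None, None]
        self.lru_bits = [0, 0]

    def lru_way(self):
        return self.lru_bits[0] ^ self.lru_bits[1]

    def _mark_used(self, way):
        # Rewrite only the accessed way's bit so that the XOR points to the other way.
        self.lru_bits[way] = self.lru_bits[1 - way] ^ (1 - way)

    def access(self, tag):
        for way in (0, 1):
            if self.tags[way] == tag:       # hit
                self._mark_used(way)
                return True
        victim = self.lru_way()             # miss: replace the LRU way
        self.tags[victim] = tag
        self._mark_used(victim)
        return False

s = TwoWaySet()
s.access("A"); s.access("B"); s.access("A")
print(s.lru_way())   # way 1 (holding "B") is now least recently used
```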

Patent
09 Jul 1991
TL;DR: In this paper, a method for data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and correcting the data inconsistencies states so that the operation may be executed in a correct and consistent manner.
Abstract: A method for insuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of (1) detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and (2) correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner. In particular, the method is adapted to address two kinds of data inconsistency states: (1) a request for a write operation from a system unit to main memory when the location to be written to is present in the cache of some processor unit; in such a case, data in the cache is "stale" and the data inconsistency is avoided by preventing the associated processor from using the "stale" data; and (2) when a read operation is requested of main memory by a system unit and the location to be read may be written or has already been written in the cache of some processor; in this case, the data in main memory is "stale" and the data inconsistency is avoided by insuring that the data returned to the requesting unit is the updated data in the cache. The presence of one of the above-described data inconsistency states is detected in an SCU-based multi-processing system by providing the SCU with means for maintaining a copy of the cache directories for each of the processor caches. The SCU continually compares address data accompanying memory access requests with what is stored in the SCU cache directories in order to determine the presence of predefined conditions indicative of data inconsistencies, and subsequently executes corresponding predefined fix-up sequences.

Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper proposes an efficient cache-based access anomaly detection scheme that piggybacks on the overhead already paid by the underlying cache coherence protocol.
Abstract: One important issue in parallel program debugging is the efficient detection of access anomalies caused by uncoordinated accesses to shared variables. On-the-fly detection of access anomalies has two advantages over static analysis or post-mortem trace analysis. First, it reports only actual anomalies during execution. Second, it produces shorter traces for post-mortem analysis purposes if an anomaly is detected, since generating further trace information after the detection of an anomaly is of dubious value. Existing methods for on-the-fly access anomaly detection suffer from performance penalties since the execution of the program being debugged has to be interrupted on every access to shared variables. In this paper, we propose an efficient cache-based access anomaly detection scheme that piggybacks on the overhead already paid by the underlying cache coherence protocol.

Patent
Igal Megory-Cohen1
01 Nov 1991
TL;DR: In this paper, a modified steepest descent method is proposed to handle unpredictable local cache activities prior to cache repartitioning to avoid readjustments which would result in unacceptably small or negative cache sizes in cases where a local cache is extremely underutilized.
Abstract: Dynamic partitioning of cache storage into a plurality of local caches for respective classes of competing processes is performed by a step of dynamically determining adjustments to the cache partitioning using a steepest descent method. A modified steepest descent method allows unpredictable local cache activities prior to cache repartitioning to be taken into account to avoid readjustments which would result in unacceptably small or, even worse, negative cache sizes in cases where a local cache is extremely underutilized. The method presupposes a unimodal distribution of cache misses.
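A sketch of a steepest-descent style repartitioning step with the safeguard the patent emphasizes: adjustments are clamped so no local cache can shrink to an unacceptably small or negative size. The finite-difference gradient, step size, and floor are illustrative assumptions, not the patented formula.

```python
MIN_SIZE = 64   # assumed floor on any local cache size (e.g., in frames)

def repartition(sizes, miss_gradient, step=32):
    """sizes: dict class -> current size; miss_gradient: dict class -> d(misses)/d(size)."""
    # Move space from the class whose misses are least sensitive to its size
    # (gradient closest to zero) toward the class that benefits most
    # (most negative gradient).
    loser = max(miss_gradient, key=lambda c: miss_gradient[c])
    gainer = min(miss_gradient, key=lambda c: miss_gradient[c])
    if loser == gainer:
        return sizes
    delta = min(step, sizes[loser] - MIN_SIZE)   # clamp: never drop below the floor
    if delta > 0:
        sizes = dict(sizes)
        sizes[loser] -= delta
        sizes[gainer] += delta
    return sizes

print(repartition({"A": 512, "B": 512}, {"A": -0.01, "B": -0.30}))
```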

Patent
23 May 1991
TL;DR: In this paper, a directory-based protocol for maintaining data coherency in a multiprocessing (MP) system having a number of processors with associated write-back caches, a multistage interconnection network (MIN) leading to a shared memory, and a global directory associated with the main memory to keep track of state and control information of cache lines.
Abstract: A directory-based protocol is provided for maintaining data coherency in a multiprocessing (MP) system having a number of processors with associated write-back caches, a multistage interconnection network (MIN) leading to a shared memory, and a global directory associated with the main memory to keep track of state and control information of cache lines. Upon a request by a requesting cache for a cache line which has been exclusively modified by a source cache, two buffers are situated in the global directory to collectively intercept modified data words of the modified cache line during the write-back to memory. A modified word buffer is used to capture modified words within the modified cache line. Moreover, a line buffer stores an old cache line transferred from the memory, during the write back operation. Finally, the line buffer and the modified word buffer, together, provide the entire modified line to a requesting cache.
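A sketch of how the two buffers combine during the write-back intercept: the line buffer holds the old memory copy of the cache line, the modified word buffer holds just the words the owning cache changed, and the requester receives the old line with those words overlaid. The word granularity and names are assumptions.

```python
WORDS_PER_LINE = 8   # assumed line geometry

def merge_line(line_buffer, modified_words):
    """line_buffer: list of WORDS_PER_LINE words (old copy from memory);
    modified_words: dict word_index -> new value captured during the write-back."""
    merged = list(line_buffer)
    for idx, value in modified_words.items():
        merged[idx] = value          # overlay each captured modified word
    return merged                    # full up-to-date line for the requesting cache

old_line = [0] * WORDS_PER_LINE
print(merge_line(old_line, {2: 0xDEAD, 5: 0xBEEF}))
```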

Patent
Jamshed H. Mirza1
15 Apr 1991
TL;DR: In this paper, a cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio, and this record is used to decide whether its future references should be cached or not.
Abstract: A cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio. The mechanism keeps a record of an instruction's behavior in the immediate past, and this record is used to decide whether its future references should be cached or not. If an instruction is experiencing bad cache hit ratio, it is marked as non-cacheable, and its data references are made to bypass the cache. This avoids the additional penalty of unnecessarily fetching the remaining words in the line, reduces the demand on the memory bandwidth, avoids flushing the cache of useful data and, in parallel processing environments, prevents line thrashing. The cache management scheme is automatic and requires no compiler or user intervention.
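A sketch of the bypass idea: keep a short per-instruction record of recent hit/miss behaviour and, when an instruction's hit ratio is poor, let its data references bypass the cache. The window size, threshold, and dictionary-based cache model are assumptions for illustration.

```python
from collections import defaultdict, deque

WINDOW = 16        # recent references remembered per instruction (assumed)
THRESHOLD = 0.25   # hit ratio below which references bypass the cache (assumed)

history = defaultdict(lambda: deque(maxlen=WINDOW))   # instruction PC -> recent hit flags

def should_bypass(pc):
    h = history[pc]
    return len(h) == WINDOW and sum(h) / WINDOW < THRESHOLD

def load(pc, addr, cache, memory):
    if should_bypass(pc):
        return memory[addr]              # marked non-cacheable: go straight to memory
    hit = addr in cache
    history[pc].append(1 if hit else 0)  # record this reference's outcome
    if not hit:
        cache[addr] = memory[addr]       # normal fill on a miss
    return cache[addr]
```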

Proceedings ArticleDOI
01 Aug 1991
TL;DR: In this article, the authors introduce several implementations of delayed consistency for cache-based systems in the framework of a weakly ordered consistency model, and a performance comparison of the delayed protocols with the corresponding On-the-Fly (non-delayed) consistency protocol is made.
Abstract: In cache-based multiprocessors a protocol must maintain coherence among replicated copies of shared writable data. In delayed consistency protocols the effects of out-going and in-coming invalidations or updates are delayed. Delayed coherence can reduce processor blocking time as well as the effects of false sharing. In this paper, we introduce several implementations of delayed consistency for cache-based systems in the framework of a weakly ordered consistency model. A performance comparison of the delayed protocols with the corresponding On-the-Fly (non-delayed) consistency protocol is made, through execution-driven simulations of four parallel algorithms. The results show that, for parallel programs in which false sharing is a problem, significant reductions in the data miss rate can be obtained with just a small increase in the cost and complexity of the cache system.

Patent
16 May 1991
TL;DR: In this article, a microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference is presented.
Abstract: A microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference. The microprocessor also includes the capability for locking instruction cache entries without requiring that the instructions be executed during the locking process.

Proceedings Article
01 Jan 1991
TL;DR: This paper introduces the Express Ring architecture and presents a snooping cache coherence protocol for this machine, and shows how consistency of shared memory accesses can be efficiently maintained in a ring-connected multiprocessor.
Abstract: The Express Ring is a new architecture under investigation at the University of Southern California. Its main goal is to demonstrate that a slotted unidirectional ring with very fast point-to-point interconnections can be at least ten times faster than a shared bus, using the same technology, and may be the topology of choice for future shared-memory multiprocessors. In this paper we introduce the Express Ring architecture and present a snooping cache coherence protocol for this machine. This protocol shows how consistency of shared memory accesses can be efficiently maintained in a ring-connected multiprocessor. We analyze the proposed protocol and compare it to other more usual alternatives for point-to-point connected machines, such as the SCI cache coherence protocol and directory-based protocols.

Patent
31 Dec 1991
TL;DR: In this paper, a data processing system which includes a microprocessor fabricated on an integrated circuit chip, a main memory external to the integrated circuit (IC) chip, and a backside cache external to an IC chip is described.
Abstract: A data processing system which includes a microprocessor fabricated on an integrated circuit chip, a main memory external to the integrated circuit chip, and a backside cache external to the integrated circuit chip. The backside cache includes a directory RAM for storing cache address tag and encoded cache state bits. A first bus connects the microprocessor to the cache, the first bus including backside bus cache directory tags signals comprised of address bits used for a cache hit comparison in the directory RAM and backside bus cache directory state bits for determining a state encoding of a set in the directory RAM. A second bus connects the microprocessor to the main memory. The directory includes means for comparing the cache directory tags on the first bus with the tags stored in the directory and for asserting a Bmiss signal upon the condition that the directory tag stored in the backside bus cache directory does not match the backside bus cache directory tags signals. The microprocessor responds to the Bmiss signal by issuing the access onto the second bus in the event of a cache miss.

Patent
30 Dec 1991
TL;DR: In this paper, a Predictive Track Table (PTT) search is proposed to reduce cache write misses of CKD formatted DASD records by using a predictive track table to reduce host delays.
Abstract: A method for managing cache accessing of CKD formatted records that uses a Predictive Track Table to reduce host delays resulting from cache write misses. Because a significant portion of CKD formatted DASD tracks contain records having no key fields, identical logical and physical cylinder and head (CCHH) fields and similar-sized data fields, a compact description of such records by record count and length data, indexed by track, can be quickly searched to determine the physical track location of a record update that misses the cache. The Predictive Track Table search is much faster than the host wait state imposed by access and search of the DASD to read the missing track into cache. If the updated record that misses cache is found within the set of records in the Predictive Track Table, then the update may be immediately written to cache and to a Non-Volatile Store (NVS) without a DASD read access. This update then may be later destaged asynchronously to the DASD from either the cache or the NVS. Otherwise, if not found in a predictive track, the update record is written directly to the disk and the cache, subject to the LRU/MRU discipline, incurring the normal cache write-miss host wait state.
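A sketch of the fast path described above: a compact per-track table of record counts and lengths is consulted on a cache write miss, and if the updated record can be located from the table alone, the update is written to cache and the non-volatile store without first reading the track from DASD. The table layout and return values are assumptions.

```python
# Predictive Track Table: track id -> (record_count, data_length_per_record)
ptt = {
    ("cyl7", "head3"): (12, 4096),
}

def write_update(track, record_no, data, cache, nvs):
    entry = ptt.get(track)
    if entry and 1 <= record_no <= entry[0] and len(data) <= entry[1]:
        # Predicted layout matches: fast-write to cache and NVS now,
        # and destage asynchronously to DASD later.
        cache[(track, record_no)] = data
        nvs[(track, record_no)] = data
        return "fast-write"
    return "stage-from-dasd"   # fall back to the normal cache write-miss path

cache, nvs = {}, {}
print(write_update(("cyl7", "head3"), 5, b"x" * 4096, cache, nvs))
```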

ReportDOI
01 May 1991
TL;DR: Results suggest that garbage collection algorithms will play an important part in improving cache performance as processor speeds increase and two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms often by a factor of two and sometimes by almost five over direct-mapped caches.
Abstract: Cache performance is an important part of total performance in modern computer systems. This paper describes the use of trace-driven simulation to estimate the effect of garbage collection algorithms on cache performance. Traces from four large Common Lisp programs have been collected and analyzed with an all-associativity cache simulator. While previous work has focused on the effect of garbage collection on page reference locality, this evaluation unambiguously shows that garbage collection algorithms can have a profound effect on cache performance as well. On processors with a direct-mapped cache, a generation stop-and-copy algorithm exhibits a miss rate up to four times higher than a comparable generation mark-and-sweep algorithm. Furthermore, two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms, often by a factor of two and sometimes by a factor of almost five, over direct-mapped caches. As processor speeds increase, cache performance will play an increasing role in total performance. These results suggest that garbage collection algorithms will play an important part in improving that performance.

Journal ArticleDOI
TL;DR: Two distributed directory schemes, the tree directory and the hierarchical full-map directory, to deal with the storage overhead problem are presented and should lend themselves to the design and implementation of large-scale cache coherent multiprocessors.
Abstract: The cache coherence problem is a major issue in the design of shared-memory multiprocessors. As the number of processors grows, traditional bus-based snoopy schemes for cache coherence are no longer adequate. Instead, the directory-based scheme is a promising alternative for the large-scale cache coherence problem. However, the storage overhead of a (full-map) directory scheme may become prohibitive as the system size goes up. This paper presents two distributed directory schemes, the tree directory and the hierarchical full-map directory, to deal with the storage overhead problem. Preliminary trace-driven evaluations show that the performance of our schemes compares favorably to the full-map directory scheme, while reducing the storage overhead by over 90%. These two schemes should lend themselves to the design and implementation of large-scale cache coherent multiprocessors.


Patent
19 Jun 1991
TL;DR: In this paper, the address bus bit field is configured based upon the dimensions of the cache memory and includes information as to where the data would be stored within cache memory. And the means for modifying the bit field generated by the cache control means are provided so that the cache subsystem may be readily configured to operate with different sized cache memories.
Abstract: A cache subsystem for a computer system which includes a cache memory and a cache control means. When the processor subsystem of the computer system requests data, information related to the location of the data within the memory subsystem of the computer system is input to the cache subsystem. The control means receives an address bus bit field and transmits control signals which vary depending on the received address bus bit field to the cache memory to look for the requested data. The address bus bit field is configured based upon the dimensions of the cache memory and includes information as to where the data would be stored within the cache memory. As different cache memories are of different dimensions, means for modifying the address bus bit field generated by the cache control means based on the dimensions of the cache memory are provided so that the cache subsystem may be readily configured to operate with different sized cache memories.
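A sketch of what the configurable bit-field logic above has to compute for each cache size: how many address bits select a byte within a line, how many select a line, and how many remain as the tag. A direct-mapped cache and 32-bit addresses are assumed for simplicity.

```python
def bit_fields(cache_bytes, line_bytes, addr_bits=32):
    offset_bits = (line_bytes - 1).bit_length()                  # byte within a line
    index_bits = (cache_bytes // line_bytes - 1).bit_length()    # which cache line
    tag_bits = addr_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits

def split_address(addr, cache_bytes, line_bytes):
    offset_bits, index_bits, _ = bit_fields(cache_bytes, line_bytes)
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(bit_fields(64 * 1024, 32))    # a 64 KB cache with 32-byte lines
print(bit_fields(256 * 1024, 32))   # a larger cache shifts bits from tag to index
```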

Patent
10 Jan 1991
TL;DR: In this paper, a data processing system (10) is provided having a secondary cache (34) for performing a deferred cache load, where the primary cache (26) is compared with the indexed entries in a primary cache, and the physical address corresponding to the single cache line stored in the secondary cache(34).
Abstract: A data processing system (10) is provided having a secondary cache (34) for performing a deferred cache load. The data processing system (10) has a pipelined integer unit (12) which uses an instruction prefetch unit (IPU) (12). The (IPU) (12) issues prefetch requests to a cache controller (22) and transfers a prefetch address to a cache address memory management unit (CAMMU) (24), for translation into a corresponding physical address. The physical address is compared with the indexed entries in a primary cache (26), and compared with the physical address corresponding to the single cache line stored in the secondary cache (34). When a prefetch miss occurs in both the primary (26) and the secondary cache (34), the cache controller (22) issues a bus transfer request to retrieve the requested cache line from an external memory (20). While a bus controller (16) performs the bus transfer, the cache controller (22) loads the primary cache (26) with the cache line currently stored in the secondary cache (34).

Patent
James E. Bohner1, Thang T. Do1, Richard J. Gusefski1, Kevin Huang1, Chon I. Lei1 
25 Feb 1991
TL;DR: In this paper, an inpage buffer is used between a cache and a slower storage device to provide data corresponding to subsequent requests, provided that the buffer is also able to contain such data.
Abstract: An inpage buffer is used between a cache and slower storage device. When a processor requests data, the cache is checked to see if the data is already in the cache. If not, a request for the data is sent to the slower storage device. The buffer receives the data from the slower storage device and provides the data to the processor that requested the data. The buffer then provides the data to the cache for storage provided that the cache is not working on a separate storage request from the processor. The data will be written into the cache from the buffer when the cache is free from such requests. The buffer is also able to provide data corresponding to subsequent requests provided it contains such data. This may happen if a request for the same data occurs, and the buffer has not yet written the data into the cache. It can also occur if the areas of the cache which can hold data from an area of the slower storage is inoperable for some reason. The buffer acts as a minicache when such a catastrophic error in the cache occurs.