
Showing papers on "Cache" published in 1991


Proceedings ArticleDOI
01 Apr 1991
TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained from a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
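
A minimal sketch of the blocking idea described above (not code from the paper; the block size is a tuning parameter, which the paper argues should be chosen from the matrix size and cache parameters rather than fixed):

```python
# Minimal sketch of a blocked (tiled) matrix multiply, C = A * B.
# Operating on B x B submatrices lets each block be reused while it is
# still resident in the cache, instead of streaming whole rows/columns.

def blocked_matmul(A, B, n, block=32):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one block of A, B, and C so the data loaded
                # into the cache is reused before being evicted.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a * B[k][j]
    return C
```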

982 citations


Proceedings ArticleDOI
01 Aug 1991
TL;DR: In this article, a new hardware prefetching scheme based on the prediction of the execution of the instruction stream and associated operand references is proposed. But this scheme requires the use of a reference prediction table and its associated logic.
Abstract: Conventional cache prefetching approaches can be either hardware-based, generally by using a one-block-lookahead technique, or compiler-directed, with insertions of non-blocking prefetch instructions. We introduce a new hardware scheme based on the prediction of the execution of the instruction stream and associated operand references. It consists of a reference prediction table and a look-ahead program counter and its associated logic. With this scheme, data with regular access patterns is preloaded, independently of the stride size, and preloading of data with irregular access patterns is prevented. We evaluate our design through trace-driven simulation by comparing it with a pure data cache approach under three different memory access models. Our experiments show that this scheme is very effective for reducing the data access penalty for scientific programs and that it has moderate success for other applications.
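
A rough software model of the kind of reference prediction table the abstract describes (class and field names are illustrative, not taken from the paper): each load instruction's PC indexes an entry that remembers the last operand address and stride, and once the same stride is observed twice the next address is predicted for preloading.

```python
# Illustrative model of a per-instruction reference prediction table (RPT).
# Each entry remembers the last address and stride seen for a load PC; once
# the stride repeats, the next address is predicted for prefetching, and
# irregular patterns suppress preloading.

class RPTEntry:
    def __init__(self):
        self.last_addr = None
        self.stride = 0
        self.state = "initial"   # initial -> transient -> steady

class ReferencePredictionTable:
    def __init__(self):
        self.entries = {}

    def access(self, pc, addr):
        """Record a data reference; return a predicted prefetch address or None."""
        e = self.entries.setdefault(pc, RPTEntry())
        prediction = None
        if e.last_addr is not None:
            stride = addr - e.last_addr
            if stride == e.stride and e.state in ("transient", "steady"):
                e.state = "steady"
                prediction = addr + stride   # regular pattern: preload next address
            else:
                e.state = "transient"        # irregular pattern: no preload
                e.stride = stride
        e.last_addr = addr
        return prediction
```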

458 citations


Patent
31 Dec 1991
TL;DR: In this paper, the authors present a fault-tolerant storage device array using a copyback cache storage unit for temporary storage, where Write data is copied from a controller buffer to a reserved area of each storage unit comprising the array.
Abstract: A fault-tolerant storage device array using a copyback cache storage unit for temporary storage. When a Write occurs to the RAID system, the data is immediately written to the first available location in the copyback cache storage unit. Upon completion of the Write to the copyback cache storage unit, the host CPU is immediately informed that the Write was successful. Thereafter, further storage unit accesses by the CPU can continue without waiting for an error-correction block update for the data just written. In a first embodiment of the invention, Read-Modify-Write operations are performed during idle time. In a second embodiment of the invention, normal Read-Modify-Write operations by the RAID system controller continue to use the Write data in the controller's buffer memory. In a third embodiment, at least two controllers, each associated with one copyback cache storage unit, copy Write data from controller buffers to the associated copyback cache storage unit; if a copyback cache storage unit fails, more than one controller can share a single copyback cache storage unit. In a fourth embodiment, Write data is copied from a controller buffer to a reserved area of each storage unit comprising the array.

356 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: The LimitLESS scheme uses a combination of hardware and software techniques to realize the performance of a full-map directory with the memory overhead of a limited directory, supported by Alewife, a large-scale multiprocessor.
Abstract: Caches enhance the performance of multiprocessors by reducing network traffic and average memory access latency. However, cache-based systems must address the problem of cache coherence. We propose the LimitLESS directory protocol to solve this problem. The LimitLESS scheme uses a combination of hardware and software techniques to realize the performance of a full-map directory with the memory overhead of a limited directory. This protocol is supported by Alewife, a large-scale multiprocessor. We describe the architectural interfaces needed to implement the LimitLESS directory, and evaluate its performance through simulations of the Alewife machine.
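
The LimitLESS idea can be caricatured in software (a sketch under assumptions, not the Alewife implementation; the pointer limit below is illustrative): a directory entry keeps a small fixed number of hardware sharer pointers, and when more sharers than that exist the overflow is handled by a software-extended directory.

```python
# Caricature of a LimitLESS-style directory entry: a few "hardware" pointers
# handle the common case, and overflow is handled in software (here a plain
# Python set stands in for the software-extended directory).

HARDWARE_POINTERS = 4   # illustrative limit, not Alewife's actual value

class DirectoryEntry:
    def __init__(self):
        self.pointers = []        # hardware sharer pointers
        self.overflow = set()     # software-extended sharers

    def add_sharer(self, node):
        if node in self.pointers or node in self.overflow:
            return
        if len(self.pointers) < HARDWARE_POINTERS:
            self.pointers.append(node)        # common case: handled in hardware
        else:
            self.overflow.add(node)           # rare case: trap to software

    def invalidate_all(self):
        sharers = list(self.pointers) + sorted(self.overflow)
        self.pointers.clear()
        self.overflow.clear()
        return sharers    # nodes that must receive invalidations
```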

348 citations


Journal ArticleDOI
TL;DR: The results show that for applications with regular data access patterns (the authors evaluate a particle-based simulator used in aeronautics and an LU-decomposition application) prefetching can be very effective, whereas the performance of a distributed-time logic simulation application that made extensive use of pointers and linked lists could be increased by only 30%.

318 citations


Patent
06 Feb 1991
TL;DR: The branch prediction cache (BPC) as mentioned in this paper provides a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address.
Abstract: The present invention provides for the updating of both the instructions in a branch prediction cache and instructions recently provided to an instruction pipeline from the cache when an instruction being executed attempts to change such instructions ("Store-Into-Instruction-Stream"). The branch prediction cache (BPC) includes a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address. A separate instruction cache is provided for normal execution of instructions, and all of the instructions written into the branch prediction cache from the system bus must also be stored in the instruction cache. The instruction cache monitors the system bus for attempts to write to the address of an instruction contained in the instruction cache. Upon such a detection, that entry in the instruction cache is invalidated, and the corresponding entry in the branch prediction cache is invalidated. A subsequent attempt to use an instruction in the branch prediction cache which has been invalidated will detect that it is not valid, and will instead go to main memory to fetch the instruction, where it has been modified.

279 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This work fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals, and estimated the cache performance reduction caused by a context switch.
Abstract: The sustained performance of fast processors is critically dependent on cache performance. Cache performance in turn depends on locality of reference. When an operating system switches contexts, the assumption of locality may be violated because the instructions and data of the newly-scheduled process may no longer be in the cache(s). Context-switching thus has a cost above that of the operations performed by the kernel. We fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals. By marking the output of such a simulation whenever a context switch occurs, and then aggregating the post-context-switch results of a large number of context switches, it is possible to estimate the cache performance reduction caused by a switch. Depending on cache parameters, the net cost of a context switch appears to be in the thousands of cycles, or tens to hundreds of microseconds.
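
A toy version of the aggregation step (an assumed trace format; the paper's simulator is trace-driven and far more detailed): mark each simulated reference with its distance from the last context switch, then aggregate hits and misses by that distance to expose the post-switch miss-rate transient.

```python
# Toy aggregation of cache misses by distance from the last context switch.
# Each trace record is (is_context_switch, was_miss); the result estimates
# how the miss rate decays as the newly scheduled process refills the cache.

def miss_rate_after_switch(trace, window=1000):
    hits = [0] * window
    misses = [0] * window
    since_switch = window            # start "far" from any switch
    for is_switch, was_miss in trace:
        if is_switch:
            since_switch = 0
        elif since_switch < window:
            if was_miss:
                misses[since_switch] += 1
            else:
                hits[since_switch] += 1
            since_switch += 1
    return [m / (h + m) if (h + m) else 0.0
            for h, m in zip(hits, misses)]
```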

272 citations


Journal ArticleDOI
Sean Quinlan
TL;DR: A general‐purpose file system that uses a write‐once‐read‐many (WORM) optical disk accessed via a magnetic disk cache that enables blocks to be modified a number of times before they are written to the WORM and increases performance.
Abstract: This paper describes a general-purpose file system that uses a write-once-read-many (WORM) optical disk accessed via a magnetic disk cache. The cache enables blocks to be modified a number of times before they are written to the WORM and increases performance. Snapshots of the file system can be made at any time without limiting the users' access to files. These snapshots reside entirely on the WORM, are accessible to the user via a second read-only file system, do not contain multiple copies of unchanged data, and can be used to rebuild the file system in the event that the disk cache is destroyed. The file system has been implemented as part of Plan 9, an experimental operating system under development at AT&T Bell Laboratories.

255 citations


Patent
30 Aug 1991
TL;DR: In this paper, prefetches to a cache memory subsystem are made from predictions which are based on access patterns stored by context, which serve in making future predictions and to identify statistics such as pattern accuracy for each unit pattern.
Abstract: Prefetches to a cache memory subsystem are made from predictions which are based on access patterns stored by context. An access pattern is generated from prior accesses of a data processing system processing in a like context. During a training sequence, an actual trace of memory accesses is processed to generate unit patterns which serve in making future predictions and to identify statistics such as pattern accuracy for each unit pattern. In a replacement list, prefetched objects are included at the head of the list. Within a prefetch, objects are listed by order of expected time of access, with alternatives at predicted times of access. When an object is used, it is moved to the head of the list and any prefetched alternatives to that object, indicated by like time marks, are moved to the tail of the list. Alternatives may be listed according to degree of match of a current access pattern and a stored access pattern and by prior accuracy of the unit pattern. A server includes a demand access queue which preempts fetches of objects identified by a prefetch queue.
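
The replacement-list behaviour described in the abstract can be sketched as follows (names and structure are illustrative, not from the patent): prefetched objects enter at the head ordered by expected access time; when one of them is actually used it moves to the head and its prefetched alternatives, identified by the same time mark, are demoted to the tail.

```python
# Sketch of the replacement list described above: prefetched objects carry a
# time mark; using one object demotes its alternatives (same time mark) to
# the tail, making them the first victims.

class ReplacementList:
    def __init__(self):
        self.order = []        # index 0 = head (safest), last = tail (next victim)
        self.time_mark = {}    # object -> time mark; None for demand-fetched objects

    def insert_prefetch(self, objects_with_times):
        # Prefetched objects go at the head, ordered by expected access time.
        for obj, t in sorted(objects_with_times, key=lambda p: p[1], reverse=True):
            self.order.insert(0, obj)
            self.time_mark[obj] = t

    def reference(self, obj):
        if obj not in self.order:
            self.order.insert(0, obj)          # demand fetch: place at the head
            self.time_mark[obj] = None
            return
        mark = self.time_mark.get(obj)
        self.order.remove(obj)
        self.order.insert(0, obj)              # used object moves to the head
        self.time_mark[obj] = None
        if mark is not None:
            # Alternatives prefetched for the same predicted access time are
            # now unlikely to be used: demote them to the tail.
            for o in [x for x in self.order if self.time_mark.get(x) == mark]:
                self.order.remove(o)
                self.order.append(o)

    def victim(self):
        return self.order[-1] if self.order else None
```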

252 citations


Proceedings Article
03 Sep 1991
TL;DR: FIDO is described, an experimental predictive cache that predicts access for individuals during a session by employing an associative memory to assimilate regularities in the access pattern of an individual over time, and it is concluded that predictive caching holds great promise.
Abstract: Accurately fetching data objects or pages in advance of their use is a powerful means of improving performance, but this capability has been difficult to realize. Current OODBs maintain object caches that employ fetch and replacement policies derived from those used for virtual-memory demand paging. These policies usually assume no knowledge of the future. Object cache managers often employ demand fetching combined with data clustering to effect prefetching, but cluster prefetching can be ineffective when the access patterns serviced are incompatible. This paper describes FIDO, an experimental predictive cache that predicts access for individuals during a session by employing an associative memory to assimilate regularities in the access pattern of an individual over time. By dint of continual training, the associative memory adapts to changes in the database and in the user's access pattern, enabling on-line access predictions for prefetching. We discuss two salient components of Fido: (1) MLP, a replacement policy for managing pre-fetched objects, and (2) Estimating Prophet, an associative memory that recognizes patterns in access sequences adaptively over time and provides on-line predictions used for prefetching. We then present some early simulation results which suggest that predictive caching works well, especially for sequential access patterns, and conclude that predictive caching holds great promise.

247 citations


Proceedings ArticleDOI
02 Apr 1991
TL;DR: This paper uses detailed simulation studies to evaluate the performance of several different scheduling strategies, and shows that in situations where the number of processes exceeds the number of processors, regular priority-based scheduling in conjunction with busy-waiting synchronization primitives results in extremely poor processor utilization.
Abstract: Shared-memory multiprocessors are frequently used as compute servers with multiple parallel applications executing at the same time. In such environments, the efficiency of a parallel application can be significantly affected by the operating system scheduling policy. In this paper, we use detailed simulation studies to evaluate the performance of several different scheduling strategies. These include regular priority scheduling, coscheduling or gang scheduling, process control with processor partitioning, handoff scheduling, and affinity-based scheduling. We also explore tradeoffs between the use of busy-waiting and blocking synchronization primitives and their interactions with the scheduling strategies. Since effective use of caches is essential to achieving high performance, a key focus is on the impact of the scheduling strategies on the caching behavior of the applications. Our results show that in situations where the number of processes exceeds the number of processors, regular priority-based scheduling in conjunction with busy-waiting synchronization primitives results in extremely poor processor utilization. In such situations, use of blocking synchronization primitives can significantly improve performance. Process control and gang scheduling strategies are shown to offer the highest performance, and their performance is relatively independent of the synchronization method used. However, for applications that have sizable working sets that fit into the cache, process control performs better than gang scheduling. For the applications considered, the performance gains due to handoff scheduling and processor affinity are shown to be small.

Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper presents a range of lock-based cache consistency algorithms that arise by viewing cache consistency as a variant of the well-understood problem of replicated data management, and uses a detailed simulation model to study the performance of these algorithms over a wide range of workloads and system resource configurations.
Abstract: In this paper, we examine the performance tradeoffs that are raised by caching data in the client workstations of a client-server DBMS. We begin by presenting a range of lock-based cache consistency algorithms that arise by viewing cache consistency as a variant of the well-understood problem of replicated data management. We then use a detailed simulation model to study the performance of these algorithms over a wide range of workloads and system resource configurations. The results illustrate the key performance tradeoffs related to client-server cache consistency, and should be of use to designers of next-generation DBMS prototypes and products.

Proceedings ArticleDOI
01 Apr 1991
TL;DR: Multi-port, non-blocking (MPNB) L1 caches introduced in this paper for the top of the data memory hierarchy appear to be capable of supporting the bandwidth demands of future-generation superscalar processors.
Abstract: This paper considers the design of a data memory hierarchy, with a level 1 (L1) data cache at the top, to support the data bandwidth demands of a future-generation superscalar processor capable of issuing about ten instructions per clock cycle. It introduces the notion of cache bandwidth — the bandwidth with which a cache can accept requests from the processor — and shows how the bandwidth of a standard, blocking cache can degrade greatly because of its inability to overlap the service of misses. Non-blocking or lockup-free caches are discussed as a way of reducing the bandwidth degradation due to misses. To improve the data bandwidth to greater than 1 request per cycle, multi-port, interleaved caches are introduced. Simulation results from a cycle-by-cycle simulator, using the MIPS R2000 instruction set, suggest that memory hierarchies with blocking L1 caches will be unable to support the bandwidth demands of future-generation superscalar processors. Multi-port, non-blocking (MPNB) L1 caches introduced in this paper for the top of the data memory hierarchy appear to be capable of supporting such data bandwidth demands.

Patent
16 May 1991
TL;DR: In this paper, a microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache, each storage location in the instruction cache includes two slots for decoded instructions.
Abstract: A microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache. Each storage location in the instruction cache includes two slots for decoded instructions. One slot controls one of the microprocessor's integer pipelines and a port to the microprocessor's data cache. A second slot controls the second integer pipeline or one of the microprocessor's floating point units. The instructions retrieved from main memory are decoded by a loader unit which decodes the instructions from the compact form as stored in main memory and places them into the two slots of the instruction cache entry according to their functions. In addition, auxiliary information is placed in the cache entry along with the instruction to control parallel execution as well as emulation of complex instructions. A bit in each instruction cache entry indicates whether the instructions in the two slots are independent, so that they can be executed in parallel, or dependent, so that they must be executed sequentially. Using a single bit for this purpose allows two dependent instructions to be stored in the slots of the single cache entry.

Journal ArticleDOI
TL;DR: An area model suitable for comparing data buffers of different organizations and arbitrary sizes is described and it is shown that, comparing caches and register files in terms of area for the same storage capacity, caches generally occupy more area per bit than register files for small caches because the overhead dominates the cache area at these sizes.
Abstract: An area model suitable for comparing data buffers of different organizations (e.g. caches versus register files) and arbitrary sizes is described. The area model considers the supplied bandwidth of a memory cell and includes such buffer overhead as control logic, driver logic and tag storage. The model gave less than 10% error when verified against real caches and register files. It is shown that, comparing caches and register files in terms of area for the same storage capacity, caches generally occupy more area per bit than register files for small caches because the overhead dominates the cache area at these sizes. For larger caches, the smaller storage cells in the cache provide a smaller total cache area per bit than the register set. Studying cache performance (traffic ratio) as a function of area, it is shown that, for small caches (less than the area occupied by 256 register bits (r.b.e.), or 32 bytes), direct-mapped caches perform significantly better than four-way set-associative caches and, for caches of medium areas (between 256 r.b.e. and 4096 r.b.e.), both direct-mapped and set-associative caches perform better than fully associative caches.

Book ChapterDOI
07 Aug 1991
TL;DR: It is shown how to estimate efficiently the number of distinct cache lines used by a given loop in a nest of loops, and this estimate can be used to guide program transformations such as loop interchange to achieve greater cache effectiveness.
Abstract: In this paper, we consider automatic analysis of a program's cache usage to achieve greater cache effectiveness. We show how to estimate efficiently the number of distinct cache lines used by a given loop in a nest of loops. Given this estimate of the number of cache lines needed, we can estimate the number of cache misses for a nest of loops. Our estimates can be used to guide program transformations such as loop interchange to achieve greater cache effectiveness. We present simulation results that show our estimates are reasonable for simple cases such as matrix multiply. We analyze the array sizes for which our estimates differ from our simulation results, and provide recommendations on how to handle such arrays in practice.
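
A simplified version of that estimate (assumptions: a single array reference with a constant stride, no conflict effects; the paper's analysis handles general loop nests): the number of distinct lines a loop touches depends on how the stride compares with the number of array elements per cache line.

```python
# Simplified estimate of the number of distinct cache lines touched by a
# loop that accesses an array with a constant stride, ignoring conflicts.
# line_size and element_size are in bytes; stride is in array elements.

def distinct_lines(iterations, stride, element_size, line_size):
    stride_bytes = abs(stride) * element_size
    if stride_bytes >= line_size:
        # Each iteration lands on a different cache line.
        return iterations
    # Consecutive iterations share lines: ceil(span of the accesses / line size).
    span_bytes = (iterations - 1) * stride_bytes + element_size
    return (span_bytes + line_size - 1) // line_size

# Example: 100 iterations over 8-byte doubles with 64-byte lines touch about
# 13 lines at unit stride, but 100 lines at a stride of 8 elements.
print(distinct_lines(100, 1, 8, 64))   # -> 13
print(distinct_lines(100, 8, 8, 64))   # -> 100
```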

Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks and describes two simple prefetch schemes to reduce the influence of long-stride vector accesses and misses due to block invalidations in multiprocessor vector caches.
Abstract: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks. Using a low-cost trace-driven simulation technique we show how a non-prefetching vector cache can result in unpredictable performance and how this unpredictability makes it difficult to find a good block size. We describe two simple prefetch schemes to reduce the influence of long-stride vector accesses and misses due to block invalidations in multiprocessor vector caches. These two schemes are shown to have better performance than a non-prefetching cache.

Proceedings ArticleDOI
01 May 1991
TL;DR: This paper describes a method of determining which procedures to merge for machines with instruction caches that uses profile information, the structure of the program, the cache size, and the cache miss penalty to guide the choice.
Abstract: This paper describes a method of determining which procedures to merge for machines with instruction caches. The method uses profile information, the structure of the program, the cache size, and the cache miss penalty to guide the choice. Optimization for the cache is assumed to follow procedure merging. The method weighs the benefit of removing calls against the increase in the instruction cache miss rate. Better performance is achieved than with previous schemes that do not consider the cache. Merging always results in a savings, unlike simpler schemes that can make programs slower once cache effects are considered. The new method also has better performance even when the parameters of the simpler algorithms are varied to get their best performance.
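
The weighing step can be expressed as a small cost model (a sketch with hypothetical numbers; the paper's method additionally uses the program structure and cache size): merge a procedure into its caller only when the cycles saved on removed calls exceed the cycles added by extra instruction-cache misses.

```python
# Sketch of the benefit test for merging (inlining) a procedure into its
# caller: cycles saved by removing call/return overhead versus cycles lost
# to additional instruction-cache misses caused by the larger code.

def worth_merging(call_count, call_overhead_cycles,
                  extra_icache_misses, miss_penalty_cycles):
    saved = call_count * call_overhead_cycles
    lost = extra_icache_misses * miss_penalty_cycles
    return saved > lost

# Hypothetical numbers: a hot call site (profile count 1e6, 10 cycles per
# call) tolerates up to 200,000 extra misses at a 50-cycle miss penalty.
print(worth_merging(1_000_000, 10, 150_000, 50))   # True
print(worth_merging(1_000_000, 10, 250_000, 50))   # False
```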

Proceedings ArticleDOI
01 Sep 1991
TL;DR: It is concluded that on current machines processor affinity has only a very weak influence on the choice of scheduling discipline, and that the benefits of frequent processor reallocation (in response to the changing parallelism of jobs) outweigh the penalties imposed by such reallocation.
Abstract: In a shared memory multiprocessor with caches, executing tasks develop "affinity" to processors by filling their caches with data and instructions during execution. A scheduling policy that ignores this affinity may waste processing power by causing excessive cache refilling. Our work focuses on quantifying the effect of processor reallocation on the performance of various parallel applications multiprogrammed on a shared memory multiprocessor, and on evaluating how the magnitude of this cost affects the choice of scheduling policy. We first identify the components of application response time, including processor reallocation costs. Next, we measure the impact of reallocation on the cache behavior of several parallel applications executing on a Sequent Symmetry multiprocessor. We also measure the performance of these applications under a number of alternative allocation policies. These experiments lead us to conclude that on current machines processor affinity has only a very weak influence on the choice of scheduling discipline, and that the benefits of frequent processor reallocation (in response to the changing parallelism of jobs) outweigh the penalties imposed by such reallocation. Finally, we use this experimental data to parameterize a simple analytic model, allowing us to evaluate the effect of processor affinity on future machines, those containing faster processors and larger caches.

Patent
Richard Lewis Mattson
20 May 1991
TL;DR: In this paper, the cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches, where each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType.
Abstract: A method for managing a cache hierarchy having a fixed total storage capacity is disclosed. The cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches. The global cache stores objects of all types and maintains them in LRU order. In contrast, each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType. Read and write accesses by referencing processors or central processing units (CPU's) are made to the global cache. Data not available in the global cache is staged thereto either from one of the local caches or from external storage. When a cache full condition is reached, placement of the most recently used (MRU) data element to the top of the global cache results in an LRU data element of type T(i) being destaged from the global cache to a corresponding one of the local caches storing type T(i) data. Likewise, when a cache full condition is reached in any one or more of the local caches, the local caches in turn will destage their LRU data elements to external storage. The parameters defining the partitions are externally supplied.
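
A compact sketch of the partitioning scheme described above (using Python OrderedDicts as stand-ins for LRU stores; sizes and type names are illustrative): the global cache holds all types, its LRU victim is destaged to the local cache for that element's data type, and local caches in turn destage to external storage.

```python
# Sketch of a global LRU cache backed by per-type local LRU caches.
# Overflow from the global cache destages to the local cache of the
# element's data type; overflow from a local cache destages to storage.

from collections import OrderedDict

class PartitionedCache:
    def __init__(self, global_size, local_sizes):
        self.global_cache = OrderedDict()                # key -> (dtype, value)
        self.global_size = global_size
        self.local = {t: OrderedDict() for t in local_sizes}
        self.local_sizes = local_sizes
        self.external = {}                               # stands in for disk

    def access(self, key, dtype, value=None):
        if key in self.global_cache:
            self.global_cache.move_to_end(key)           # MRU in the global cache
            return self.global_cache[key][1]
        # Stage the data from the local cache, or from external storage.
        staged = self.local[dtype].pop(key, None)
        if staged is None:
            staged = self.external.get(key, value)
        self.global_cache[key] = (dtype, staged)
        self._destage_global()
        return staged

    def _destage_global(self):
        while len(self.global_cache) > self.global_size:
            old_key, (dtype, val) = self.global_cache.popitem(last=False)
            local = self.local[dtype]
            local[old_key] = val                          # destage to the local cache
            while len(local) > self.local_sizes[dtype]:
                k, v = local.popitem(last=False)
                self.external[k] = v                      # destage to external storage
```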

Patent
19 Aug 1991
TL;DR: In this paper, a scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability, which increases computing system performance and reduces bus traffic.
Abstract: A computing system (50) includes N number of symmetrical computing engines having N number of cache memories joined by a system bus (12). The computing system includes a global run queue (54), an FPA global run queue, and N number of affinity run queues (58). Each engine is associated with one affinity run queue, which includes multiple slots. When a process first becomes runnable, it is typically attached to one of the global run queues. A scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability. An engine typically stops running a process before it is complete. When the process becomes runnable again the scheduler estimates the remaining cache context for the process in the cache of the engine. The scheduler uses the estimated amount of cache context in deciding in which run queue a process is to be enqueued. The process is enqueued to the affinity run queue of the engine when the estimated cache context of the process is sufficiently high, and is enqueued onto the global run queue when the cache context is sufficiently low. The procedure increases computing system performance and reduces bus traffic because processes will run on engines having sufficient cache affinity, but will also run on the best available engine when there is insufficient cache context.
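
The enqueue decision can be sketched as a simple rule (the decay model and threshold below are illustrative assumptions, not the patented estimator): estimate how much of the process's cache footprint plausibly survives on its last engine and pick the affinity queue only when that estimate is high enough.

```python
# Sketch of the enqueue decision: estimate the surviving cache context for a
# process on its last engine and choose between that engine's affinity run
# queue and the global run queue.

def estimate_cache_context(footprint_kb, cache_kb, other_refs_kb):
    # Assume the process's lines are displaced by whatever other work has
    # referenced the cache since the process last ran on this engine.
    surviving = max(0.0, min(footprint_kb, cache_kb) - other_refs_kb)
    return surviving / footprint_kb if footprint_kb else 0.0

def choose_queue(process, threshold=0.25):
    context = estimate_cache_context(process["footprint_kb"],
                                     process["cache_kb"],
                                     process["other_refs_kb"])
    return "affinity" if context >= threshold else "global"

p = {"footprint_kb": 64, "cache_kb": 256, "other_refs_kb": 32}
print(choose_queue(p))   # "affinity": half the footprint is estimated to survive
```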

Patent
Kevin Frank Smith
18 Nov 1991
TL;DR: In this article, the signed difference between an assigned and actual hit ratio is used to dynamically allocate cache size to a class of data and its correlation with priority of that class, and cache space is dynamically reallocated among the partitions in a direction that forces the weighted hit rate slope for each partition into equality over the entire cache.
Abstract: Methods for managing a Least Recently Used (LRU) cache in a staged storage system on a prioritized basis permitting management of data using multiple cache priorities assigned at the data set level. One method uses the signed difference between an assigned and actual hit ratio to dynamically allocate cache size to a class of data and its correlation with priority of that class. Advantageously, the hit ratio performance of a class of data can not be degraded beyond a predetermined level by a class of data having lower priority. Another method dynamically reallocates cache space among partitions asymptotically to the general equality of a weighted hit rate slope function attributable to each partition. This function is the product of the slope of a weighting factor and the partition hit rate versus partition space allocation function. Cache space is dynamically reallocated among the partitions in a direction that forces the weighted hit rate slope for each partition into equality over the entire cache. Partition space is left substantially unchanged in those partitions where the weighted hit rate slope is non-positive. Hit rate is defined as hit ratio times I/O rate.
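
The first method can be sketched as a simple feedback loop (a sketch under assumptions; the step constant and class names are illustrative, and the patent's priority protection is omitted): each data class has a target hit ratio, and cache space shifts toward classes whose measured hit ratio falls below target.

```python
# Sketch of the signed-difference adjustment: classes running below their
# assigned (target) hit ratio grow at the expense of classes running above
# it, in proportion to the difference. Constants are illustrative.

def rebalance(allocations, targets, measured, step=0.05):
    """allocations/targets/measured are dicts keyed by data class."""
    deficit = {c: targets[c] - measured[c] for c in allocations}
    total = sum(allocations.values())
    # Positive deficit -> under-performing class gets more space.
    adjust = {c: step * total * deficit[c] for c in allocations}
    # Keep total space approximately constant by removing the mean adjustment.
    mean = sum(adjust.values()) / len(adjust)
    return {c: max(0.0, allocations[c] + adjust[c] - mean) for c in allocations}

alloc = {"high": 600.0, "low": 400.0}          # cache segments per class
new = rebalance(alloc,
                {"high": 0.90, "low": 0.70},   # assigned hit ratios
                {"high": 0.80, "low": 0.75})   # measured hit ratios
print(new)   # the "high" class grows, the "low" class shrinks
```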

Patent
20 Aug 1991
TL;DR: In this paper, a multilevel cache buffer for a multiprocessor system is described, where each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors.
Abstract: A multilevel cache buffer for a multiprocessor system in which each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors. The multiprocessors share the level two cache according to a priority algorithm. When data in the level two cache is updated, corresponding data in level one caches is invalidated until it is updated.

Patent
30 Aug 1991
TL;DR: In this article, a method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system is presented, which can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each way to distinguish between least recently used groups.
Abstract: A method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system. In a 2-way set-associative cache, one bit in each way's tag RAM is reserved for LRU information, and the bits are manipulated such that the Exclusive-OR of each way's bits points to the actual LRU cache way. Since all of these bits must be read when the cache controller determines whether a hit or miss has occurred, the bits are available when a cache miss occurs and a cache line replacement is required. The method can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each of the ways to distinguish between least recently used groups. Cache write policy information is stored in the tag RAMs to designate various memory areas as write-back or write-through. In this manner, system memory situated on an I/O bus which does not recognize inhibit cycles can have its data cached.
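
The two-way XOR encoding can be demonstrated in a few lines (a sketch; the patent stores the bits in the tag RAMs, modelled here as two arrays of per-set bits): the XOR of the two ways' bits names the LRU way, so touching a way only needs to rewrite that way's own bit, using the other way's bit already read during the lookup.

```python
# Demonstration of the XOR LRU encoding for a 2-way set-associative cache:
# each way holds one LRU bit per set, and the XOR of the two bits gives the
# index of the least recently used way.

class TwoWayLRU:
    def __init__(self, num_sets):
        self.bit = [[0] * num_sets, [0] * num_sets]   # one LRU bit per way per set

    def lru_way(self, s):
        return self.bit[0][s] ^ self.bit[1][s]

    def touch(self, s, way):
        other = 1 - way
        # Rewrite only our own bit so that the XOR now points at the other way.
        self.bit[way][s] = self.bit[other][s] ^ other

lru = TwoWayLRU(num_sets=4)
lru.touch(0, 0)           # way 0 used -> way 1 becomes LRU
print(lru.lru_way(0))     # 1
lru.touch(0, 1)           # way 1 used -> way 0 becomes LRU
print(lru.lru_way(0))     # 0
```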

Patent
09 Jul 1991
TL;DR: In this paper, a method for ensuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner.
Abstract: A method for insuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of (1) detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and (2) correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner. In particular, the method is adapted to address two kinds of data inconsistency states: (1) a request for a write operation from a system unit to main memory when the location to be written to is present in the cache of some processor unit; in such a case, data in the cache is "stale" and the data inconsistency is avoided by preventing the associated processor from using the "stale" data; and (2) a read operation requested of main memory by a system unit when the location to be read may be written or has already been written in the cache of some processor; in this case, the data in main memory is "stale" and the data inconsistency is avoided by insuring that the data returned to the requesting unit is the updated data in the cache. The presence of one of the above-described data inconsistency states is detected in an SCU-based multi-processing system by providing the SCU with means for maintaining a copy of the cache directories for each of the processor caches. The SCU continually compares address data accompanying memory access requests with what is stored in the SCU cache directories in order to determine the presence of predefined conditions indicative of data inconsistencies, and subsequently executes corresponding predefined fix-up sequences.

Patent
12 Apr 1991
TL;DR: In this article, a method of dynamically prefetching data for a cache memory is controlled by the past history of data requests; the prefetch algorithm is limited to eight blocks, and each additional sequential request for which less than all of the data is already in the cache will cause eight blocks to be prefetched.
Abstract: A method of dynamically prefetching data for a cache memory is controlled by the past history of data requests. If the previous fetch and current fetch request are not sequential, no data is prefetched. If the previous fetch and current fetch request are sequential and less than all of the current fetch request is already in the cache, two blocks of data sequentially beyond the current fetch request are prefetched. If the previous two blocks fetched and current fetch request are sequential and less than all of the current fetch request is already in the cache, four blocks of data sequentially beyond the current fetch request are prefetched. If the previous three blocks fetched and the current fetch request are sequential and less than all of the current fetch request is already in the cache, eight blocks of data sequentially beyond the current fetch request are prefetched. The prefetch algorithm is limited at eight blocks. Each additional sequential request less than all of which is already in the cache will cause eight blocks to be prefetched.
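
The history-driven prefetch depth reads naturally as a small state machine (a sketch; the block numbering and the "fully in cache" test are simplified): the length of the run of sequential fetches selects a depth of 0, 2, 4, or 8 blocks, capped at eight.

```python
# Sketch of the history-driven prefetch policy: the prefetch depth grows
# with the run of sequential fetch requests (2, 4, 8 blocks) and is capped
# at eight blocks; nothing is prefetched for non-sequential requests or
# requests already fully satisfied from the cache.

class SequentialPrefetcher:
    def __init__(self):
        self.prev_block = None
        self.run = 0            # length of the current sequential run

    def on_fetch(self, block, fully_in_cache):
        if self.prev_block is not None and block == self.prev_block + 1:
            self.run += 1
        else:
            self.run = 0
        self.prev_block = block
        if fully_in_cache or self.run == 0:
            return []                            # non-sequential or already cached
        depth = min(2 ** self.run, 8)            # 2, 4, 8, then capped at 8
        return [block + i for i in range(1, depth + 1)]

p = SequentialPrefetcher()
print(p.on_fetch(10, False))   # [] - no sequential history yet
print(p.on_fetch(11, False))   # 2 blocks: [12, 13]
print(p.on_fetch(12, False))   # 4 blocks: [13, 14, 15, 16]
print(p.on_fetch(13, False))   # 8 blocks: [14, ..., 21]
```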

Patent
30 May 1991
TL;DR: In this article, a method and apparatus for storing an instruction word in a compacted form on a storage medium is presented; the instruction word has a plurality of instruction fields, and each instruction word is associated with a mask word whose length in bits is at least equal to the number of instruction fields in the instruction word.
Abstract: A method and apparatus for storing an instruction word in a compacted form on a storage medium, the instruction word having a plurality of instruction fields, features associating, with each instruction word, a mask word having a length in bits at least equal to the number of instruction fields in the instruction word. Each instruction field is associated with a bit of the mask word and accordingly, using the mask word, only non-zero instruction fields need to be stored in memory. The instruction compaction method is advantageously used in a high-speed cache miss engine for refilling portions of the instruction cache after a cache miss occurs.
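
The mask-word compaction can be illustrated directly (a sketch; the field count and values are made up for the example): the mask carries one bit per instruction field, only fields whose bit is set are stored, and zero fields are reinserted on expansion, for example when a cache-miss engine refills the instruction cache.

```python
# Sketch of mask-word compaction: an instruction word is a tuple of fields,
# the mask word has one bit per field, and only non-zero fields are stored.

def compact(fields):
    mask = 0
    stored = []
    for i, f in enumerate(fields):
        if f != 0:
            mask |= 1 << i          # mark this field as present
            stored.append(f)
    return mask, stored

def expand(mask, stored, num_fields):
    fields = []
    it = iter(stored)
    for i in range(num_fields):
        fields.append(next(it) if mask & (1 << i) else 0)
    return fields

word = [0, 0x3A, 0, 0, 0x07, 0]           # mostly-zero instruction fields
mask, stored = compact(word)
print(bin(mask), stored)                  # 0b10010 [58, 7]
print(expand(mask, stored, len(word)))    # original instruction word restored
```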

Patent
Leslie D. Kohn
08 Aug 1991
TL;DR: In this paper, a microprocessor with a pipelined architecture, an on-chip data cache, a floating-point unit, a data latch and an instruction for accessing infrequently used data from an external memory system is disclosed.
Abstract: A microprocessor having a pipelined architecture, an on-chip data cache, a floating-point unit, a floating-point data latch and an instruction for accessing infrequently used data from an external memory system is disclosed. The instruction comprises a first-in-first-out memory for accumulating data in a pipeline manner, a first circuit means for coupling data from the external bus to the first-in-first-out memory and a second circuit means for transferring the data stored in the first-in-first-out memory to the floating-point data latch. The second circuit means also couples data from the cache to the first-in-first-out memory in the event of a cache hit. Finally, a bus control means is provided for controlling the orderly flow of data in accordance with the architecture of the microprocessor.

Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper proposes an efficient cache-based access anomaly detection scheme that piggybacks on the overhead already paid by the underlying cache coherence protocol.
Abstract: One important issue in parallel program debugging is the efficient detection of access anomalies caused by uncoordinated accesses to shared variables. On-the-fly detection of access anomalies has two advantages over static analysis or post-mortem trace analysis. First, it reports only actual anomalies during execution. Second, it produces shorter traces for post-mortem analysis purposes if an anomaly is detected, since generating further trace information after the detection of an anomaly is of dubious value. Existing methods for on-the-fly access anomaly detection suffer from performance penalties since the execution of the program being debugged has to be interrupted on every access to shared variables. In this paper, we propose an efficient cache-based access anomaly detection scheme that piggybacks on the overhead already paid by the underlying cache coherence protocol.