
Showing papers on "Cache coloring published in 1992"


Book ChapterDOI
01 Jan 1992
TL;DR: As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes that use a limited number of pointers per directory entry to keep track of all processors caching a memory block.
Abstract: As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes. These schemes rely on a directory to keep track of all processors caching a memory block. When a write to that block occurs, point-to-point invalidation messages are sent to keep the caches coherent. A straightforward way of recording the identities of processors caching a memory block is to use a bit vector per memory block, with one bit per processor. Unfortunately, when the main memory grows linearly with the number of processors, the total size of the directory memory grows as the square of the number of processors, which is prohibitive for large machines. To remedy this problem several schemes that use a limited number of pointers per directory entry have been suggested. These schemes often cause excessive invalidation traffic.
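
To make the quadratic-growth argument concrete, here is the arithmetic behind a full bit-vector directory, under the abstract's assumption that main memory scales linearly with the processor count. The constant k and the block size B are illustrative symbols, not values from the paper:

```latex
% Full-map directory overhead. Assume P processors, main memory of M bytes
% with M = k*P (memory grows linearly with the processor count), and a block
% size of B bytes, so there are M/B memory blocks.
\[
\text{directory bits}
\;=\;
\underbrace{\frac{M}{B}}_{\text{blocks}} \times \underbrace{P}_{\text{bits per entry}}
\;=\;
\frac{kP}{B}\cdot P
\;=\;
\frac{k}{B}\,P^{2},
\]
% i.e. the full bit-vector directory grows as the square of the number of
% processors, which is why limited-pointer entries (a small fixed number of
% processor IDs per block) are attractive for large machines.
```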

321 citations


Journal ArticleDOI
TL;DR: This work develops several page placement algorithms, called careful-mapping algorithms, that try to select a page frame from a pool of available page frames that is likely to reduce cache contention.
Abstract: When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by the page replacement algorithm. We give a simple model that shows that this naive (arbitrary) page placement leads to up to 30% unnecessary cache conflicts. We develop several page placement algorithms, called careful-mapping algorithms, that try to select a page frame (from the pool of available page frames) that is likely to reduce cache contention. Using trace-driven simulation, we find that careful mapping results in 10–20% fewer (dynamic) cache misses than naive mapping (for a direct-mapped real-indexed multimegabyte cache). Thus, our results suggest that careful mapping by the operating system can get about half the cache miss reduction that a cache size (or associativity) doubling can.
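
One classic careful-mapping policy is page coloring: choose a free frame whose low-order frame-number bits (its cache "color") match those of the faulting virtual page, so conflicts in the real-indexed cache mirror those of a virtually indexed cache. The sketch below is a minimal illustration under assumed cache and page sizes; the free-list layout, fallback policy, and function names are hypothetical, not the paper's exact algorithms.

```c
/* Minimal page-coloring sketch for a direct-mapped, real-indexed cache. */
#include <stddef.h>

#define CACHE_SIZE (4UL * 1024 * 1024)          /* multimegabyte cache (assumed) */
#define PAGE_SIZE  (4UL * 1024)
#define NUM_COLORS (CACHE_SIZE / PAGE_SIZE)     /* pages per cache "pass"        */

struct frame { unsigned long pfn; struct frame *next; };

/* One free list per cache color; a frame's color is pfn mod NUM_COLORS. */
static struct frame *free_lists[NUM_COLORS];

/* Prefer a frame whose color matches the faulting virtual page, so virtual
 * pages that are NUM_COLORS apart (and only those) contend for the same
 * cache bins. Falls back to nearby colors when the preferred list is empty. */
struct frame *careful_alloc(unsigned long vpn)
{
    unsigned long want = vpn % NUM_COLORS;

    for (unsigned long i = 0; i < NUM_COLORS; i++) {
        unsigned long c = (want + i) % NUM_COLORS;
        if (free_lists[c]) {
            struct frame *f = free_lists[c];
            free_lists[c] = f->next;
            return f;
        }
    }
    return NULL;                                /* no free frames at all */
}
```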

289 citations


Journal ArticleDOI
Harold S. Stone1, John Turek1, Joel L. Wolf1
TL;DR: An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described and generalizes to multilevel cache memories.
Abstract: A model for studying the optimal allocation of cache memory among two or more competing processes is developed and used to show that, for the examples studied, the least recently used (LRU) replacement strategy produces cache allocations that are very close to optimal. It is also shown that when program behavior changes, LRU replacement moves quickly toward the steady-state allocation if it is far from optimal, but converges slowly as the allocation approaches the steady-state allocation. An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described. The algorithm generalizes to multilevel cache memories. For multiprogrammed systems, a cache-replacement policy better than LRU replacement is given. The policy increases the memory available to the running process until the allocation reaches a threshold time beyond which the replacement policy does not increase the cache memory allocated to the running process.
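
The following toy allocator is not the paper's combinatorial algorithm; it is a greedy illustration of the underlying idea, handing out cache blocks one at a time to the process whose marginal miss-rate reduction is currently largest. The miss-ratio curves are invented, and greedy allocation like this matches the optimum only when those curves show diminishing returns.

```c
/* Greedy cache partitioning sketch with hypothetical miss-ratio curves. */
#include <stdio.h>

#define NPROC 2
#define CACHE_BLOCKS 1024

/* Hypothetical miss-ratio curve m_i(x): misses per reference as a function
 * of the number of blocks allocated to process i (toy diminishing returns). */
static double miss_rate(int proc, int blocks)
{
    static const double base[NPROC] = { 0.20, 0.35 };
    return base[proc] / (1.0 + blocks / 64.0);
}

int main(void)
{
    int alloc[NPROC] = { 0 };

    for (int b = 0; b < CACHE_BLOCKS; b++) {
        int best = 0;
        double best_gain = -1.0;
        for (int p = 0; p < NPROC; p++) {
            double gain = miss_rate(p, alloc[p]) - miss_rate(p, alloc[p] + 1);
            if (gain > best_gain) { best_gain = gain; best = p; }
        }
        alloc[best]++;                 /* give the block to the biggest winner */
    }
    printf("steady-state allocation: %d / %d blocks\n", alloc[0], alloc[1]);
    return 0;
}
```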

212 citations


Proceedings ArticleDOI
01 Jun 1992
TL;DR: MemSpy is described, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs and introduces the notion of data oriented, in addition to code oriented, performance tuning.
Abstract: To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior—if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.

193 citations


Journal ArticleDOI
01 Sep 1992
TL;DR: This work describes the design, implementation and evaluation of a virtual memory system that provides application control of physical memory using external page-cache management and claims that this approach can significantly improve performance for many memory-bound applications while reducing kernel complexity, yet does not complicate other applications or reduce their performance.
Abstract: Next generation computer systems will have gigabytes of physical memory and processors in the 100 MIPS range or higher. Contrary to some conjectures, this trend requires more sophisticated memory management support for memory-bound computations such as scientific simulations and systems such as large-scale database systems, even though memory management for most programs will be less of a concern. We describe the design, implementation and evaluation of a virtual memory system that provides application control of physical memory using external page-cache management. In this approach, a sophisticated application is able to monitor and control the amount of physical memory it has available for execution, the exact contents of this memory, and the scheduling and nature of page-in and page-out using the abstraction of a physical page cache provided by the kernel. We claim that this approach can significantly improve performance for many memory-bound applications while reducing kernel complexity, yet does not complicate other applications or reduce their performance.

178 citations


Journal ArticleDOI
TL;DR: An analytical access time model for on-chip cache memories that shows the dependence of the cache access time on the cache parameters is described and it is shown that for given C, B, and A, optimum array configuration parameters can be used to minimize the access time.
Abstract: An analytical access time model for on-chip cache memories that shows the dependence of the cache access time on the cache parameters is described. The model includes general cache parameters, such as cache size (C), block size (B), and associativity (A), and array configuration parameters that are responsible for determining the subarray aspect ratio and the number of subarrays. With this model, a large cache design space can be covered, which cannot be done using only SPICE circuit simulation within a limited time. Using the model, it is shown that for given C, B, and A, optimum array configuration parameters can be chosen to minimize the access time; with these optimum array parameters, the access time is roughly proportional to log(cache size), and a larger block size gives a smaller access time, but larger associativity does not, because of the increase in the data-bus capacitances.

136 citations


Journal ArticleDOI
TL;DR: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented and it is shown that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.
Abstract: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented. The on-line algorithm extends an earlier model due to H.S. Stone et al. (1989) and partitions the cache storage into disjoint blocks whose sizes are determined by the locality of the processes accessing the cache. Simulation results of traces for 32-MB disk caches show a relative improvement in the overall and read hit-ratios in the range of 1% to 2% over those generated by a conventional least recently used replacement algorithm. The analysis of a queuing network model shows that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.

130 citations


Patent
05 Jun 1992
TL;DR: In this article, the posted write cache is mirrored and parity-checked to assure data validity, a mirror test is executed to verify data validity on power-up, and posted write operations are enabled only when error-free operation is assured.
Abstract: A host computer including a posted write cache for a disk drive system where the posted write cache includes battery backup to protect against potential loss of data in case of a power failure, and also including means for performing a method for determining if live data is present in the posted write cache upon power-up. The posted write cache is further mirrored and parity-checked to assure data validity. Performance increase is achieved since during normal operation data is written to the much faster cache and a completion indication is returned, and the data is flushed to the slower disk drive system at a more opportune time. Batteries provide power to the posted write cache in the event of a power failure. Upon subsequent power-up, a cache signature previously written in the posted write cache indicates that live data still resides in the posted write cache. If the cache signature is not present and the batteries are not fully discharged, a normal power up condition exists. If the cache signature is not present and the batteries are fully discharged, then the user is warned of possible data loss. A configuration identification code assures a proper correspondence between the posted write cache board and the disk drive system. A mirror test is executed to verify data validity. Temporary and permanent error conditions are monitored so that posted write operations are only enabled when error-free operation is assured.
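
The power-up decision described in the abstract reduces to a small amount of logic. The sketch below captures only that decision; the flag names and the battery/cache interfaces are assumptions, not the patent's actual implementation.

```c
/* Sketch of the posted-write-cache power-up policy: signature present means
 * live data must be flushed; no signature plus charged batteries is a normal
 * power-up; no signature plus drained batteries means data may have been lost. */
#include <stdbool.h>

enum powerup_action { FLUSH_LIVE_DATA, NORMAL_POWERUP, WARN_POSSIBLE_DATA_LOSS };

enum powerup_action posted_write_powerup(bool cache_signature_present,
                                         bool batteries_discharged)
{
    if (cache_signature_present)
        return FLUSH_LIVE_DATA;          /* live dirty data survived the outage */
    if (!batteries_discharged)
        return NORMAL_POWERUP;           /* no signature, batteries OK: clean start */
    return WARN_POSSIBLE_DATA_LOSS;      /* no signature and batteries drained  */
}
```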

129 citations


Patent
30 Mar 1992
TL;DR: In this article, a high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex, where each data system has access both to the highspeed cache and the lower-speed, secondary storage for obtaining and storing data.
Abstract: A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data. Management logic and the high-speed cache assure that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache, with non-blocking serialization allowing access to a changed version in the cache while castout is being performed. Castout classes are provided to facilitate efficient movement from the shared cache to DASD.

129 citations


Patent
22 Jun 1992
TL;DR: In this article, a cache coherency request is received from another processor, and the address of the request is compared to addresses stored in the content addressable memory, and when there is a match, a bit in the matching entry is set to indicate a delayed request that is executed after the lock is unlocked or the cache is refilled.
Abstract: A processor and method for preventing access to a locked memory block in a multiprocessor computer system. The processor has a cache memory and records a memory lock in a content-addressable memory separate from the cache memory. Preferably, outstanding cache fills are recorded in the same content addressable memory as memory locks, and a memory lock or an outstanding cache fill delays the execution of a cache coherency request upon the same memory block. When a cache coherency request is received from another processor, the address of the cache coherency request is compared to addresses stored in the content addressable memory, and when there is a match, a bit in the matching entry is set to indicate a delayed request that is executed after the lock is unlocked or the cache is refilled. In a specific embodiment, a memory lock or an outstanding cache fill also stalls a processor read or write to the same memory block.
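
A rough sketch of the content-addressable structure this abstract describes is shown below: each entry records a locked (or fill-pending) block address plus a "delayed request" bit that is serviced once the lock is released or the fill completes. The entry count and field names are assumptions.

```c
/* Memory-lock / outstanding-fill CAM with delayed coherency requests. */
#include <stdbool.h>
#include <stdint.h>

#define CAM_ENTRIES 4

struct cam_entry {
    bool     valid;
    bool     is_fill;          /* outstanding cache fill vs. memory lock      */
    bool     delayed_request;  /* a matching coherency request is pending     */
    uint64_t block_addr;
};

static struct cam_entry cam[CAM_ENTRIES];

/* Called when a cache coherency (invalidate) request arrives from another
 * processor. Returns true if the request must be delayed. */
bool coherency_request(uint64_t block_addr)
{
    for (int i = 0; i < CAM_ENTRIES; i++) {
        if (cam[i].valid && cam[i].block_addr == block_addr) {
            cam[i].delayed_request = true;   /* execute after unlock / fill  */
            return true;
        }
    }
    return false;                            /* no conflict: execute now     */
}
```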

125 citations


Patent
15 Jul 1992
TL;DR: In this paper, an error transition mode (ETM) is used to prevent cache data not owned in the cache from being accessed by the main memory after the first write request while permitting writeback of the data owned by the cache.
Abstract: Writeback transactions from a processor and cache are fed to a main memory through a writeback queue, and non-writeback transactions from the processor and cache are fed to the main memory through a non-writeback queue. When a cache error is detected, an error transition mode (ETM) is entered that provides limited use of the data in the cache; a read or write request for data not owned in the cache is made to the main memory instead of the cache, even when the data is valid in the cache, although owned data is read from the cache. In ETM, when the processor makes a first write request to data not owned in the cache followed by a second write request to data owned in the cache, write data of the first write request is prevented from being received by the main memory after write data of the second request while permitting writeback of the data owned by the cache. Preferably this is done by sending the write requests from the processor through the non-writeback queue, and when a write request accesses data in a block of data owned by the cache, disowning the block of data in the cache and writing the disowned block of data back to the main memory.

Patent
28 May 1992
TL;DR: In this article, a dynamic determination is made on a cycle-by-cycle basis of whether data should be written to the cache with a dirty bit asserted, or written to both the cache and main memory.
Abstract: Method and apparatus for reducing the access time required to write to memory and read from memory in a computer system having a cache-based memory. A dynamic determination is made on a cycle by cycle basis of whether data should be written to the cache with a dirty bit asserted, or the data should be written to both the cache and main memory. The write-through method is chosen where the write-through method is approximately as fast as the write-back method. Where the write-back method is substantially faster than the write-through method, the write-back method is chosen.
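
The per-cycle choice amounts to a comparison of the two paths' costs. The sketch below shows only that decision rule; the cycle-count inputs and the one-cycle slack defining "approximately as fast" are assumptions, since the patent states the rule but not the thresholds.

```c
/* Cycle-by-cycle write policy selection, as a minimal decision function. */
enum write_policy { WRITE_THROUGH, WRITE_BACK };

enum write_policy choose_write_policy(unsigned wt_cycles, unsigned wb_cycles)
{
    /* "Approximately as fast": within a small slack, here 1 cycle (assumed). */
    if (wt_cycles <= wb_cycles + 1)
        return WRITE_THROUGH;     /* write cache + main memory, no dirty bit */
    return WRITE_BACK;            /* write cache only, assert dirty bit      */
}
```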

Patent
21 Jan 1992
TL;DR: A simple mixed first level cache memory system as mentioned in this paper includes a level 1 cache (52) connected to a processor (54) by read data and write data lines (56) and (58).
Abstract: A simple mixed first level cache memory system (50) includes a level 1 cache (52) connected to a processor (54) by read data and write data lines (56) and (58). The level 1 cache (52) is connected to level 2 cache (60) by swap tag lines (62) and (64), swap data lines (66) and (68), multiplexer (70) and swap/read line (72). The level 2 cache (60) is connected to the next lower level in the memory hierarchy by write tag and write data lines (74) and (76). The next lower level in the memory hierarchy below the level 2 cache (60) is also connected by a read data line (78) through the multiplexer (70) and the swap/read line (72) to the level 1 cache (52). When processor (54) requires an instruction or data, it puts out an address on lines (80). If the instruction or data is present in the level 1 cache (52), it is supplied to the processor (54) on read data line (56). If the instruction or data is not present in the level 1 cache (52), the processor looks for it in the level 2 cache (60) by putting out the address of the instruction or data on lines (80). If the instruction or data is in the level 2 cache, it is supplied to the processor (54) through the level 1 cache (52) by means of a swap operation on tag swap lines (62) and (64), swap data lines (66) and (68), multiplexer (70) and swap/read data line (72). If the instruction or data is present in neither the level 1 cache (52) nor the level 2 cache (60), the address on lines (80) fetches the instruction or data from successively lower levels in the memory hierarchy as required via read data line (78), multiplexer (70) and swap/read data line (72). The instruction or data is then supplied from the level 1 cache to the processor (54).
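
The lookup-and-swap flow can be sketched in a few lines. The code below is a simplification under assumed sizes and a direct-mapped organization; in particular, the evicted L1 line simply takes the slot of the L2 line it was swapped with, and the fetch from lower memory levels is not modelled.

```c
/* Two-level cache lookup where misses in L1 are serviced by swapping with L2. */
#include <stdint.h>
#include <string.h>

#define L1_LINES   64
#define L2_LINES   1024
#define LINE_BYTES 32

struct line { uint64_t tag; int valid; uint8_t data[LINE_BYTES]; };

static struct line l1[L1_LINES], l2[L2_LINES];

static void swap_lines(struct line *a, struct line *b)
{
    struct line t = *a; *a = *b; *b = t;          /* swap tag and data        */
}

/* Returns a pointer to the L1 line holding the requested block. */
struct line *lookup(uint64_t addr)
{
    uint64_t block = addr / LINE_BYTES;
    struct line *c1 = &l1[block % L1_LINES];
    if (c1->valid && c1->tag == block)
        return c1;                                /* L1 hit: read data line   */

    struct line *c2 = &l2[block % L2_LINES];
    if (!(c2->valid && c2->tag == block)) {       /* L2 miss: fill from below */
        c2->tag = block; c2->valid = 1;
        memset(c2->data, 0, sizeof c2->data);     /* placeholder for fetched data */
    }
    swap_lines(c1, c2);                           /* victim moves down to L2  */
    return c1;
}
```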

Patent
27 Apr 1992
TL;DR: In this paper, a write generate mode is implemented for updating cache by first allocating lines of shared memory as write before read areas and cache tags are updated directly on cache misses without reading from memory.
Abstract: A plurality of program processors, shared memory, dual port memory, external cache memory and a control processor form a multiprocessor system. A shared memory bus links the program processors, shared memory, dual port memory and external cache memory. Program processor I/O occurs through a pair of serial I/O channels coupled to one port of the dual port memory. A write generate mode is implemented for updating cache by first allocating lines of shared memory as write-before-read areas. For such lines, cache tags are updated directly on cache misses without reading from memory. A hit is forced for such a line, resulting in valid data at the updated part and invalid data at the remaining portion. Thus, part of the line is written to and the rest invalidated. The invalid portions are not read, unless preceded by a write operation. The mode reduces the number of bus cycles by making write misses more efficient.

Patent
24 Apr 1992
TL;DR: In this article, the bus interface is coupled to the processor, the backup cache memory and to the bus in accordance with a SNOOPY protocol to monitor transactions on the bus for write transactions affecting data items in the corresponding secondary cache memory having set VALID indicators.
Abstract: A processor apparatus for use in a multiprocessor computer system having a main memory storing a plurality of data items and being coupled to a bus operating according to a SNOOPY protocol. The processor apparatus includes a processor, a primary cache, a backup cache and a bus interface. The backup cache memory includes a first TAG store comprising a plurality of VALID indicators, one VALID indicator for each of the data items currently contained in the backup cache memory. The primary cache memory includes a second TAG store comprising a plurality of address indicators and a plurality of VALID indicators, one address indicator and one VALID indicator for each of the data items currently contained in the primary cache memory. The interface includes a duplicate TAG store coupled to the primary cache memory, the duplicate TAG store consisting of a copy of the address indicators of the second TAG store. The bus interface is coupled to the processor, the backup cache memory and to the bus. The bus interface operates in accordance with a SNOOPY protocol to monitor transactions on the bus for write transactions affecting data items in the corresponding backup cache memory having set VALID indicators. The bus interface will invalidate or update each VALID data item of the backup cache memory when there is a write transaction affecting that data item, and assert an invalidate signal for an affected data item indicated by the address indicators of the duplicate TAG store. The invalidate signal causes the VALID indicator in the second TAG store for the affected data item to be cleared.
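
The value of the duplicate TAG store is that bus writes can be filtered without probing the CPU's primary cache. The sketch below assumes direct-mapped tag arrays of invented sizes and reduces the bus interface to a single snoop routine.

```c
/* Snoop filtering with a duplicate copy of the primary cache's address tags. */
#include <stdbool.h>
#include <stdint.h>

#define PCACHE_LINES 256    /* primary cache lines (assumed direct-mapped)  */
#define BCACHE_LINES 4096   /* backup cache lines (assumed direct-mapped)   */

struct tag_entry { uint64_t tag; bool valid; };

static struct tag_entry backup_tags[BCACHE_LINES];   /* first TAG store           */
static struct tag_entry primary_tags[PCACHE_LINES];  /* second TAG store (in CPU) */
static uint64_t dup_tag[PCACHE_LINES];               /* duplicate of primary tags */

/* Bus interface: called for each write transaction observed on the bus. */
void snoop_write(uint64_t block)
{
    /* Invalidate the block in the backup cache if it is VALID there. */
    struct tag_entry *b = &backup_tags[block % BCACHE_LINES];
    if (b->valid && b->tag == block)
        b->valid = false;

    /* Only a match in the duplicate TAG store bothers the primary cache:
     * the invalidate signal clears the VALID bit in the second TAG store. */
    unsigned idx = block % PCACHE_LINES;
    if (dup_tag[idx] == block) {
        struct tag_entry *p = &primary_tags[idx];
        if (p->valid && p->tag == block)
            p->valid = false;
    }
}
```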

Proceedings ArticleDOI
01 Sep 1992
TL;DR: The cache performance of a commercial System V UNIX running on a four-CPU multiprocessor is characterized and three major sources of OS misses are revealed: instruction fetches, process migration and data accesses in block operations.
Abstract: Good cache memory performance is essential to achieving high CPU utilization in shared-memory multiprocessors. While the performance of caches is determined by both application and operating system (OS) references, most research has focused on the cache performance of applications alone. This is partially due to the difficulty of measuring OS activity and, as a result, the cache performance of the OS is largely unknown. In this paper, we characterize the cache performance of a commercial System V UNIX running on a four-CPU multiprocessor. The related issue of the performance impact of the OS synchronization activity is also studied. For our study, we use a hardware monitor that records the cache misses in the machine without perturbing it. We study three multiprocessor workloads: a parallel compile, a multiprogrammed load and a commercial database. Our results show that OS misses occur frequently enough to stall CPUs for 17-21% of their non-idle time. Further, if we include application misses induced by OS interference in the cache, then the stall time reaches 25%. A detailed analysis reveals three major sources of OS misses: instruction fetches, process migration and data accesses in block operations. As for synchronization behavior, we find that OS synchronization has low overhead if supported correctly and that OS locks show good locality and low contention.

Patent
Tadaaki Bandoh1
24 Aug 1992
TL;DR: In this paper, a multi-port cache memory of multicore memory structure is connected to and shared with a plurality of processors, and two sets of interface signal lines, for instruction fetch and for data read/write, to each processor.
Abstract: A multi-port cache memory of multi-port memory structure is connected to and shared with a plurality of processors. The multi-port cache memory may have two sets of interface signal lines, for instruction fetch and for data read/write, to each processor. The multi-port cache memory may also be used only for data read/write. The system performance is further improved if a plurality of processors and a multi-port cache memory are fabricated on a single LSI chip.

Patent
22 Jun 1992
TL;DR: In this article, a vector logic is used to keep track of the vector length and block extra memory addresses generated by the execution unit for the vector elements, and also blocks the memory addresses of masked vector elements so that these addresses are not translated by the memory management unit.
Abstract: A digital computer system includes a scalar CPU, a vector processor, and a shared cache memory. The scalar CPU has an execution unit, a memory management unit, and a cache controller unit. The execution unit generates load/store memory addresses for vector load/store instructions. The load/store addresses are translated by the memory management unit, and stored in a write buffer that is also used for buffering scalar write addresses and write data. The cache controller coordinates loads and stores between the vector processor and the shared cache with scalar reads and writes to the cache. Preferably the cache controller permits scalar reads to precede scalar writes and vector load/stores by checking for conflicts with scalar writes and vector load/stores in the write queue, and also permits vector load/stores to precede vector operates by checking for conflicts with vector operate information stored in a vector register scoreboard. Preferably the cache controller includes vector logic which is responsive to vector information written in intra-processor registers by the execution unit. The vector logic keeps track of the vector length and blocks extra memory addresses generated by the execution unit for the vector elements. The vector logic also blocks the memory addresses of masked vector elements so that these addresses are not translated by the memory management unit.

Proceedings ArticleDOI
01 Apr 1992
TL;DR: It is concluded that an adjustable block size cache offers significantly better performance than any fixed block size cache, especially when there is variability in the granularity of sharing exhibited by applications.
Abstract: Several studies have shown that the performance of coherent caches depends on the relationship between the granularity of sharing and locality exhibited by the program and the cache block size. Large cache blocks exploit processor and spatial locality, but may cause unnecessary cache invalidations due to false sharing. Small cache blocks can reduce the number of cache invalidations, but increase the number of bus or network transactions required to load data into the cache. In this paper we describe a cache organization that dynamically adjusts the cache block size according to recently observed reference behavior. Cache blocks are split across cache lines when false sharing occurs, and merged back into a single cache line to exploit spatial locality. To evaluate this cache organization, we simulate a scalable multiprocessor with coherent caches, using a suite of memory reference traces to model program behavior. We show that for every fixed block size, some program suffers a 33% increase in the average waiting time per reference, and a factor of 2 increase in the average number of words transferred per reference, when compared against the performance of an adjustable block size cache. In the few cases where adjusting the block size does not provide superior performance, it comes within 7% of the best fixed block size alternative. We conclude that an adjustable block size cache offers significantly better performance than any fixed block size cache, especially when there is variability in the granularity of sharing exhibited by applications.
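
A toy version of the split/merge policy is sketched below for a line divided into two halves: an incoming invalidation that hits one half triggers a split (suspected false sharing), and evidence that both halves are being referenced together again triggers a merge. The detection heuristic and merge threshold are assumptions, not the paper's mechanism.

```c
/* Adjustable block size: split a line on false sharing, merge on spatial reuse. */
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    uint64_t tag;
    bool     split;            /* true: the two halves are tracked separately  */
    bool     valid[2];         /* per-half valid bits (both set when merged)   */
    int      last_half;        /* last half touched locally, -1 initially      */
    unsigned both_used;        /* recent evidence that both halves are reused  */
};

/* Coherence invalidation caused by another processor writing one half. */
void on_remote_write(struct cache_line *l, int half)
{
    l->split = true;           /* suspected false sharing: keep the other half */
    l->valid[half] = false;    /* invalidate only the written sub-block        */
    l->both_used = 0;
}

/* Local reference to one half of the line. */
void on_local_access(struct cache_line *l, int half)
{
    if (l->split && l->last_half >= 0 && l->last_half != half)
        l->both_used++;        /* the two halves are being used together again */
    if (l->split && l->both_used > 8) {      /* merge threshold: an assumption */
        l->split = false;      /* re-merge to exploit spatial locality         */
        l->both_used = 0;
    }
    l->last_half = half;
}
```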

Proceedings ArticleDOI
01 Apr 1992
TL;DR: A new technique for reducing direct-mapped cache misses caused by conflicts for a particular cache line is presented, which shows an average reduction in miss rate of 33% for a 32KB instruction cache with 16B lines.
Abstract: Most recent cache designs use direct-mapped caches to provide the fast access time required by modern high-speed CPUs. Unfortunately, direct-mapped caches have higher miss rates than set-associative caches, largely because direct-mapped caches are more sensitive to conflicts between items needed frequently in the same phase of program execution. This paper presents a new technique for reducing direct-mapped cache misses caused by conflicts for a particular cache line. A small finite state machine recognizes the common instruction reference patterns where storing an instruction in the cache actually harms performance. Such instructions are dynamically excluded, that is they are passed directly through the cache without being stored. This reduces misses to the instructions that would have been replaced. The effectiveness of dynamic exclusion is dependent on the severity of cache conflicts and thus on the particular program and cache size of interest. However, across the SPEC benchmarks, simulation results show an average reduction in miss rate of 33% for a 32KB instruction cache with 16B lines. In addition, applying dynamic exclusion to one level of a cache hierarchy can improve the performance of the next level since instructions do not need to be stored on both levels. Finally, dynamic exclusion also improves combined instruction and data cache miss rates.
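
The decision the small state machine makes is, per line: store the conflicting fetch, or pass it through and keep the resident block. The sketch below is a hedged illustration of that bypass decision with one "sticky" bit per line; it is not claimed to be the paper's exact state machine.

```c
/* Per-line dynamic-exclusion decision: bypass the cache when the resident
 * block is still proving useful and the newcomer would only cause thrashing. */
#include <stdbool.h>
#include <stdint.h>

struct iline {
    uint64_t tag;
    bool     sticky;           /* resident block has shown reuse: protect it  */
};

/* Returns true if the fetched instruction block should be stored in the
 * cache, false if it should bypass (be excluded from) the cache. */
bool should_store(struct iline *l, uint64_t new_tag, bool resident_hit_recently)
{
    if (l->tag == new_tag)
        return true;                      /* ordinary hit path                */
    if (l->sticky && resident_hit_recently)
        return false;                     /* conflict: keep the useful block,
                                             pass the new one straight through */
    l->sticky = resident_hit_recently;    /* otherwise replace as usual        */
    l->tag = new_tag;
    return true;
}
```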

Patent
Lishing Liu1
30 Apr 1992
TL;DR: In this article, the authors propose an approach to predict virtual address translation information with high accuracy using a history table SETLAT and a similar hashing history table, which can not only allow efficient implementation of the cache access path but also offer the opportunity of achieving multiple accesses per cycle.
Abstract: A cache control maintains a history table SETLAT for the prediction of line entry (i.e., set member) within a congruence class for cache accessing. For a given cache access, a SETLAT entry can be selected based on the requesting logical address bits directly. The selection of a SETLAT entry may also be based on the hashing of such logical address bits together with other information in order to achieve sufficient randomization. A similar hashing history table may be devised to predict virtual address translation information with high accuracy. Such prediction mechanisms not only allow efficient implementation of the cache access path but also offer the opportunity of achieving multiple accesses per cycle. The proposed prediction method also provides a generic approach to efficient implementations for various directory based table accesses.
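
The core of the SETLAT idea is a history table indexed (directly or via a hash) by the requesting logical address, whose entry predicts which member of the congruence class to probe first. The sketch below uses an invented table size, hash, and associativity; only the lookup/update shape is meant to match the description.

```c
/* Set-member prediction via a hashed history table (SETLAT-style sketch). */
#include <stdint.h>

#define SETLAT_ENTRIES 1024
#define ASSOC          4                      /* members per congruence class */

static uint8_t setlat[SETLAT_ENTRIES];        /* last-used set member         */

static unsigned hash_addr(uint64_t laddr)
{
    /* Hash of logical address bits for randomization (illustrative only). */
    return (unsigned)((laddr >> 6) ^ (laddr >> 16)) % SETLAT_ENTRIES;
}

/* Predicted set member for this access; the cache probes it first and falls
 * back to a full directory search (plus a table update) on a misprediction. */
unsigned predict_set_member(uint64_t logical_addr)
{
    return setlat[hash_addr(logical_addr)] % ASSOC;
}

void update_on_mispredict(uint64_t logical_addr, unsigned actual_member)
{
    setlat[hash_addr(logical_addr)] = (uint8_t)actual_member;
}
```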

Patent
17 Dec 1992
TL;DR: In this article, the inventive controller includes a first cluster for directing data from a host computer to a storage device and a second cluster for transferring data from the host to the storage device.
Abstract: A storage controller having additional cache memory and a system for recovering from failure and reconfiguring a control unit thereof in response thereto. The inventive controller includes a first cluster for directing data from a host computer to a storage device and a second cluster for directing data from a host computer to a storage device. A first cache memory is connected to the first cluster and a second cache memory is connected to the second cluster. A first nonvolatile memory is connected to the second cluster and a second nonvolatile memory is connected to the first cluster. Data is directed to the first cache and backed up to the first nonvolatile memory. The second cache is similarly backed up by the second nonvolatile memory. In the event of failure of the first cache memory, data is directed to the second cache and backed up in the second nonvolatile memory.

Patent
22 Jun 1992
TL;DR: In this paper, an intelligent cache memory system and associated method for reducing a central processing unit (CPU) idle time is proposed, which performs prefetches based on data fetching characteristics of the CPU.
Abstract: An intelligent cache memory system and associated method for reducing a central processing unit (CPU) idle time. The system performs prefetches based on data fetching characteristics of the CPU. The system includes cache control logic, a first and a second cache memory, each having a number of cache lines, and a first and a second cache tag array, each having cache tag entries corresponding to the cache lines. The cache tag entries comprise cache tags and valid bits. The cache tag entries of the second cache tag array further comprise interest bits. In addition to their traditional functions, the cache tags and the valid bits, in conjunction with the interest bits, are used to track the data fetching history of the CPU. For each read cycle, the cache control logic returns the data being fetched by the CPU from either the first or the second cache memory or the main memory. Additionally, the cache control logic initiates prefetch and updates the data fetching history conditionally. The data fetched from either the second cache memory or the main memory are also stored in the first cache memory, whereas the data prefetched are stored in the second cache memory. Prefetch is conditioned on the data fetching history, while data fetching history update is conditioned on where the data requested by the CPU are fetched. As a result, CPU idle time is further reduced and system performance is further improved.
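
A simplified read-cycle flow for the two-cache arrangement is sketched below: demand fetches fill the first cache, prefetches fill the second, and an "interest" bit in the second tag array records whether prefetched data was actually used, gating further prefetching. The next-block prefetch choice, sizes, and exact history policy are assumptions.

```c
/* Demand cache + prefetch cache with interest bits tracking fetch history. */
#include <stdbool.h>
#include <stdint.h>

#define LINES 256

struct tag1 { uint64_t tag; bool valid; };
struct tag2 { uint64_t tag; bool valid; bool interest; };

static struct tag1 c1[LINES];          /* first cache: demand-fetched data    */
static struct tag2 c2[LINES];          /* second cache: prefetched data       */

void read_cycle(uint64_t block)
{
    unsigned i = block % LINES;

    if (c1[i].valid && c1[i].tag == block)
        return;                                   /* hit in first cache        */

    if (c2[i].valid && c2[i].tag == block)
        c2[i].interest = true;                    /* prefetching is paying off */
    /* else: miss, fetch 'block' from main memory (not modelled) */

    c1[i].tag = block; c1[i].valid = true;        /* demand data -> first cache */

    if (c2[i].interest) {                         /* history says: keep prefetching */
        uint64_t next = block + 1;                /* sequential prefetch (assumed)  */
        unsigned j = next % LINES;
        c2[j].tag = next; c2[j].valid = true;     /* prefetched data -> second cache */
        c2[j].interest = false;                   /* must prove itself useful        */
    }
}
```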

Patent
07 May 1992
TL;DR: In this article, a cache controller maintains cache consistency in a cache memory structure having a single copy of a cache tag memory while supporting multiple outstanding operations in a multiple processor computer system.
Abstract: Apparatus and methods for a cache controller to maintain cache consistency in a cache memory structure having a single copy of a cache tag memory while supporting multiple outstanding operations in a multiple processor computer system. The CPU includes a small internal cache memory structure. A substantially larger external cache array is coupled to both the CPU and the CC via a first, integrated address and data bus. The CC is in turn coupled to a second bus interconnecting, among other devices, processors, I/O devices, and a main memory. The external cache is subblocked. A cache directory in the CC tracks usage of the external cache. An input buffer in the CC is connected to the first bus to provide buffering of commands sent by the CPUs. An output buffer in the CC is coupled to the second bus for buffering commands directed by the CC to devices operating on the second bus. A virtual bus interface (VBI) receives entries made in the input buffer, whereafter the input buffer is relieved to accept other commands. A cache invalidation queue (CIQ) register stores addresses of cache subblocks to which incoming invalidate operations have been directed. The address of the destination device is also written to the output buffer. If the address of the destination device stored in the output buffer matches the address in the CIQ register, the CC will issue a read-invalidate command, wherein the invalidated block of cache is again filled with data corresponding to the prior-accessing processor, thus invalidating the intervening overwrite issued by the later accessing CPU. Response time to snooping requests is thereby bounded, and data consistency between cache and processor is thereby maintained.

Patent
23 Oct 1992
TL;DR: In this paper, the authors present a method for continuously reading data from a compact disk drive (12) to a host computer (14) in which the data is transferred to the computer's cache memory (32) so that it is available to the central processor (26) for processing without interruption.
Abstract: Method and apparatus for continuously reading data from a compact disk drive (12) to a host computer (14) in which the data is transferred to the computer's cache memory (32) so that it is available to the computer's central processor (26) for processing without interruption. A communication link is established between a processor in the disk drive (22) and a direct memory access (DMA) controller (36) in the host computer (14). With this link established, the DMA controller directs the transfer of the incoming data stream from the drive to the cache memory (32) of the host computer. Once the data is in the cache memory (32), it is moved to the application workspace of the computer's system random access memory (RAM) (34), for processing by the central processor (26). The transfer of data by the DMA controller (36) and the temporary storage of the transferred data in the cache memory (32), allows for continuous transfer of data from the compact disk drive (12), without interruption, and without the need for reseeks of the data.

Patent
22 Jun 1992
TL;DR: In this article, a cache coherency request is made by a processor to a second processor to invalidate an addressed block of data without retaining a validated block of fill data in the cache.
Abstract: A processor and method for delaying the processing of cache coherency transactions during outstanding cache fills in a multi-processor system using a shared memory. A first processor fetches data having a specified address by addressing a cache memory, and when the specified address is not in the cache, saving the specified address in a fill address memory, and sending a fill request to the shared memory. Before return of fill data, the first processor receives a cache coherency request including the specified address from a second processor requesting invalidation of an addressed block of data. The first processor responds by checking whether the fill address memory includes the specified address, and upon finding the specified address in the fill address memory, delaying execution of the cache coherency request until the fill data is returned, and when the fill data is returned, using the fill data without retaining a validated block of the fill data in the cache. In a preferred embodiment, the fill memory is a content-addressable memory including a plurality of entries, and each entry has a fill address, an ownership fill bit (OREAD), an ownership-read invalidate pending bit (OIP), and a read invalidate pending bit (RIP). The OIP or RIP bit is set when execution of a cache coherency request is delayed, and these bits are read upon completion of a fill to execute the delayed request.

Patent
28 Feb 1992
TL;DR: In this paper, the cache control logic (60) provides an external transfer code signal which allows a user to know when a cache transaction is performed, and the external signals are used to provide external signals which are necessary to execute each of the control instructions.
Abstract: A circuit for allowing greater user control over a cache memory is implemented in a data processor (20). Cache control instructions have been implemented to perform touch load, flush, and allocate operations in data cache (54) of data cache unit (24). The control instructions are decoded by both instruction cache unit (26) and sequencer (34) to provide necessary control and address information to load/store unit (28). Load/store unit (28) sequences execution of each of the instructions, and provides necessary control and address information to data cache unit (24) at an appropriate point in time. Cache control logic (60) subsequently processes both the address and control information to provide external signals which are necessary to execute each of the cache control instructions. Additionally, cache control logic (60) provides an external transfer code signal which allows a user to know when a cache transaction is performed.

Journal ArticleDOI
TL;DR: Results of a performance comparison indicate that the proposed timestamp-based cache coherence scheme performs significantly better than previous software-assisted schemes, especially when the processors are carefully scheduled so as to maximize the reuse of cache contents.
Abstract: A timestamp-based software-assisted cache coherence scheme that does not require any global communication to enforce the coherence of multiple private caches is proposed. It is intended for shared memory multiprocessors. The scheme is based on a compile-time marking of references and a hardware-based local incoherence detection scheme. The possible incoherence of a cache entry is detected and the associated entry is implicitly invalidated by comparing a clock (related to program flow) and a timestamp (related to the time of update in the cache). Results of a performance comparison, which is based on a trace-driven simulation using actual traces, between the proposed timestamp-based scheme and other software-assisted schemes indicate that the proposed scheme performs significantly better than previous software-assisted schemes, especially when the processors are carefully scheduled so as to maximize the reuse of cache contents. This scheme requires neither a shared resource nor global communication and is, therefore, scalable up to a large number of processors.
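
The local incoherence test reduces to comparing a per-entry timestamp against a clock advanced at compiler-marked points. The sketch below collapses the scheme to a single per-processor epoch clock, which is a simplification of the paper's per-variable clocks; the field names are assumptions.

```c
/* Timestamp-based local invalidation: stale entries are detected by a
 * clock/timestamp comparison, with no global coherence traffic. */
#include <stdbool.h>
#include <stdint.h>

struct ts_entry {
    uint64_t tag;
    uint64_t timestamp;        /* clock value when the entry was filled/updated */
    bool     valid;
};

/* Per-processor clock, advanced by compiler-inserted instructions at points
 * where shared data may have been modified by other processors. */
static uint64_t epoch_clock;

bool entry_usable(const struct ts_entry *e, uint64_t block)
{
    return e->valid &&
           e->tag == block &&
           e->timestamp >= epoch_clock;   /* older entries are implicitly invalid */
}
```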

Patent
30 Apr 1992
TL;DR: In this article, an I/O write back cache memory and a data coherency method is provided to a computer system having a cache and a main memory, which includes partitioning the main memory into memory segments, dynamically assigning and reassigning the ownership of the memory segments either to the cache memory or the write back memory.
Abstract: An I/O write back cache memory and a data coherency method is provided to a computer system having a cache and a main memory. The data coherency method includes partitioning the main memory into memory segments, dynamically assigning and reassigning the ownership of the memory segments either to the cache memory or the I/O write back cache memory. The ownership of the memory segments controls the accessibility and cacheability of the memory segments for read and write cycles performed by the CPU and I/O devices. During reassignment, various data management actions are taken to ensure data coherency. As a result, the I/O devices can perform read and write cycles addressed against the cache and main memory in a manner that increases system performance with minimal increase in hardware and complexity cost.

Patent
Osamu Nishii1, Kunio Uchiyama1, Hirokazu Aoki1, Kanji Oishi1, Jun Kitano1, Susumu Hatano1 
24 Sep 1992
TL;DR: In this article, a multiprocessor system consisting of first and second processors (1001 and 1002), first-and second cache memories (100:#1 and #2), an address bus (123), a data bus (126), an invalidating signal line (PURGE:131), and a main memory (1004) is described.
Abstract: Herein disclosed is a multiprocessor system which comprises first and second processors (1001 and 1002), first and second cache memories (100:#1 and #2), an address bus (123), a data bus (126), an invalidating signal line (PURGE:131) and a main memory (1004). The first and second cache memories are operated by the copy-back method. The state of the data of the first cache (100:#1) exists in one state selected from a group consisting of an invalid first state, a valid and non-updated second state and a valid and updated third state. The second cache (100:#2) is constructed like the first cache. When the write access of the first processor hits the first cache, the state of the data of the first cache is shifted from the second state to the third state, and the first cache outputs the address of the write hit and the invalidating signal to the address bus and the invalidating signal line, respectively. When the write access from the first processor misses the first cache, a data of one block is block-transferred from the main memory to the first cache, and the invalidating signal is outputted. After this, the first cache executes the write of the data in the transfer block. In case the first and second caches hold the data in the third state relating to the pertinent address when an address of an access request is fed to the address bus (123), the pertinent cache writes back the pertinent data in the main memory.
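
The three-state, copy-back protocol in this abstract can be summarized as a small state transition function: invalid (first state), valid/non-updated (second state), valid/updated (third state). The sketch below reduces the bus, PURGE signalling, and block transfers to comments; the state entered after a write-back is an assumption, since the abstract does not specify it.

```c
/* Copy-back protocol state transitions for one cache line. */
typedef enum { INVALID, CLEAN, DIRTY } line_state;   /* first/second/third state */

/* Write access by the local processor. */
line_state on_local_write(line_state s)
{
    switch (s) {
    case CLEAN:                 /* write hit on non-updated data               */
        /* drive the write-hit address and the PURGE (invalidating) signal     */
        return DIRTY;
    case INVALID:               /* write miss                                  */
        /* block-transfer the line from main memory, output PURGE, then write  */
        return DIRTY;
    case DIRTY:                 /* write hit on already-updated data           */
    default:
        return DIRTY;
    }
}

/* Another cache's access request hits an address this cache holds DIRTY:
 * write the updated block back to main memory before the access proceeds. */
line_state on_remote_access(line_state s)
{
    if (s == DIRTY) {
        /* write the pertinent data back to main memory */
        return CLEAN;           /* assumed resulting state after write-back    */
    }
    return s;
}
```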