
Showing papers on "Cache pollution published in 1991"


Proceedings ArticleDOI
01 Apr 1991
TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.

982 citations
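As a concrete illustration of the blocking technique the paper studies, here is a minimal sketch (not taken from the paper) of a blocked matrix multiply in Python; the block size is the tunable parameter whose sensitivity to cache and matrix size the paper analyzes.

```python
# Minimal sketch of a blocked (tiled) matrix multiply. The block size is a
# tunable parameter; the paper's point is that the best choice depends on the
# cache size, the matrix dimensions, and interference between tiles, and is
# often much smaller than "whatever fills the cache".
import numpy as np

def blocked_matmul(A, B, block=32):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for jj in range(0, m, block):
                # Each small tile of A and B is reused while it is still
                # resident in the faster levels of the memory hierarchy.
                C[ii:ii+block, jj:jj+block] += (
                    A[ii:ii+block, kk:kk+block] @ B[kk:kk+block, jj:jj+block]
                )
    return C
```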


Patent
06 Feb 1991
TL;DR: The branch prediction cache (BPC) as mentioned in this paper provides a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address.
Abstract: The present invention provides for the updating of both the instructions in a branch prediction cache and instructions recently provided to an instruction pipeline from the cache when an instruction being executed attempts to change such instructions ("Store-Into-Instruction-Stream"). The branch prediction cache (BPC) includes a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address. A separate instruction cache is provided for normal execution of instructions, and all of the instructions written into the branch prediction cache from the system bus must also be stored in the instruction cache. The instruction cache monitors the system bus for attempts to write to the address of an instruction contained in the instruction cache. Upon such a detection, that entry in the instruction cache is invalidated, and the corresponding entry in the branch prediction cache is invalidated. A subsequent attempt to use an instruction in the branch prediction cache which has been invalidated will detect that it is not valid, and will instead go to main memory to fetch the instruction, where it has been modified.

279 citations
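A hedged sketch of the invalidation path described in this patent (field names and the flat dictionary layout are illustrative assumptions, not the patent's structures): each branch prediction cache entry records the branch address, its last target, and a copy of the first target instructions, and a detected store into a cached instruction address invalidates the matching entries so a later lookup refetches the modified instructions from memory.

```python
# Sketch of a branch prediction cache (BPC) with store-into-instruction-stream
# invalidation. Entries whose cached instruction copy covers a stored-to
# address are marked invalid; a subsequent lookup then misses.
class BranchPredictionCache:
    def __init__(self):
        self.entries = {}    # branch_addr -> {"target": int, "insts": list, "valid": bool}

    def install(self, branch_addr, target, first_insts):
        self.entries[branch_addr] = {"target": target,
                                     "insts": list(first_insts),
                                     "valid": True}

    def snoop_store(self, store_addr, inst_size=4):
        # Invalidate any entry whose cached instruction copy covers store_addr.
        for entry in self.entries.values():
            span = len(entry["insts"]) * inst_size
            if entry["target"] <= store_addr < entry["target"] + span:
                entry["valid"] = False

    def lookup(self, branch_addr):
        entry = self.entries.get(branch_addr)
        return entry if entry and entry["valid"] else None

bpc = BranchPredictionCache()
bpc.install(0x1000, target=0x2000, first_insts=[0xA, 0xB, 0xC])
bpc.snoop_store(0x2004)              # write into the cached target instructions
print(bpc.lookup(0x1000))            # -> None (entry invalidated)
```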


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This work fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals, and estimated the cache performance reduction caused by a context switch.
Abstract: The sustained performance of fast processors is critically dependent on cache performance. Cache performance in turn depends on locality of reference. When an operating system switches contexts, the assumption of locality may be violated because the instructions and data of the newly-scheduled process may no longer be in the cache(s). Context-switching thus has a cost beyond that of the operations performed by the kernel. We fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals. By marking the output of such a simulation whenever a context switch occurs, and then aggregating the post-context-switch results of a large number of context switches, it is possible to estimate the cache performance reduction caused by a switch. Depending on cache parameters the net cost of a context switch appears to be in the thousands of cycles, or tens to hundreds of microseconds.

272 citations
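The measurement idea can be illustrated with a much simpler setup than the paper's simulator. The sketch below (parameters, the trace format, and the fixed post-switch window are assumptions of this note) runs a direct-mapped cache over a trace of (process id, address) pairs and aggregates the miss rate observed in a window of accesses after each context switch.

```python
# Illustrative sketch, not the paper's methodology: estimate the post-switch
# miss rate of a direct-mapped cache from a multi-process address trace.
def post_switch_miss_rate(trace, num_lines=1024, line_size=32, window=1000):
    tags = [None] * num_lines
    last_pid = None
    since_switch = None               # accesses seen since the last context switch
    accesses, misses = 0, 0           # counts inside the post-switch windows
    for pid, addr in trace:
        if pid != last_pid:           # a context switch in the trace
            last_pid, since_switch = pid, 0
        block = addr // line_size
        idx, tag = block % num_lines, block // num_lines
        miss = tags[idx] != tag
        if miss:
            tags[idx] = tag
        if since_switch is not None and since_switch < window:
            accesses += 1
            misses += miss
            since_switch += 1
    return misses / accesses if accesses else 0.0

# Example with a tiny synthetic trace alternating between two processes.
trace = [(0, a * 32) for a in range(500)] + [(1, 100000 + a * 32) for a in range(500)]
print(post_switch_miss_rate(trace, window=100))
```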


Proceedings ArticleDOI
01 Apr 1991
TL;DR: Multi-port, nonblocking (MPNB) L1 caches introduced in this paper for the top of the data memory hierarchy appear to be capable of supporting the bandwidth demands of future-generation superscalar processors.
Abstract: This paper considers the design of a data memory hierarchy, with a level 1 (L1) data cache at the top, to support the data bandwidth demands of a future-generation superscalar processor capable of issuing about ten instructions per clock cycle. It introduces the notion of cache bandwidth (the bandwidth with which a cache can accept requests from the processor) and shows how the bandwidth of a standard, blocking cache can degrade greatly because of its inability to overlap the service of misses. Non-blocking or lockup-free caches are discussed as a way of reducing the bandwidth degradation due to misses. To improve the data bandwidth to greater than 1 request per cycle, multi-port, interleaved caches are introduced. Simulation results from a cycle-by-cycle simulator, using the MIPS R2000 instruction set, suggest that memory hierarchies with blocking L1 caches will be unable to support the bandwidth demands of future-generation superscalar processors. Multi-port, nonblocking (MPNB) L1 caches introduced in this paper for the top of the data memory hierarchy appear to be capable of supporting such data bandwidth demands.

215 citations
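A rough first-order model (an assumption of this note, not a formula from the paper) makes the blocking-cache limitation concrete: if the cache accepts at most one request per cycle and stalls for the full miss penalty on every miss, its sustainable request bandwidth is bounded as sketched below.

```python
# First-order model of the request bandwidth of a blocking cache: every miss
# stalls the cache for the full miss penalty, so requests per cycle are bounded
# by 1 / (1 + miss_ratio * miss_penalty). Non-blocking and multi-port designs
# aim to push past this bound by overlapping miss service.
def blocking_cache_bandwidth(miss_ratio, miss_penalty_cycles):
    return 1.0 / (1.0 + miss_ratio * miss_penalty_cycles)

# Example: a 5% miss ratio with a 20-cycle penalty halves the accepted bandwidth.
print(blocking_cache_bandwidth(0.05, 20))   # -> 0.5
```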


Patent
16 May 1991
TL;DR: In this paper, a microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache, each storage location in the instruction cache includes two slots for decoded instructions.
Abstract: A microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache. Each storage location in the instruction cache includes two slots for decoded instructions. One slot controls one of the microprocessor's integer pipelines and a port to the microprocessor's data cache. A second slot controls the second integer pipeline or one of the microprocessor's floating point units. The instructions retrieved from main memory are decoded by a loader unit which decodes the instructions from the compact form as stored in main memory and places them into the two slots of the instruction cache entry according to their functions. In addition, auxiliary information is placed in the cache entry along with the instruction to control parallel execution as well as emulation of complex instructions. A bit in each instruction cache entry indicates whether the instructions in the two slots are independent, so that they can be executed in parallel, or dependent, so that they must be executed sequentially. Using a single bit for this purpose allows two dependent instructions to be stored in the slots of the single cache entry.

203 citations


Patent
16 May 1991
TL;DR: In this article, a microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache, each storage location in the instruction cache includes two slots for decoded instructions.
Abstract: (of EP0459232) A microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache. Each storage location in the instruction cache includes two slots for decoded instructions. One slot controls one of the microprocessor's integer pipelines and a port to the microprocessor's data cache. A second slot controls the second integer pipeline or one of the microprocessor's floating point units. The instructions retrieved from main memory are decoded by a loader unit which decodes the instructions from the compact form as stored in main memory and places them into the two slots of the instruction cache entry according to their functions. In addition, auxiliary information is placed in the cache entry along with the instruction to control parallel execution as well as emulation of complex instructions. A bit in each instruction cache entry indicates whether the instructions in the two slots are independent, so that they can be executed in parallel, or dependent, so that they must be executed sequentially. Using a single bit for this purpose allows two dependent instructions to be stored in the slots of the single cache entry.

169 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks and describes two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches.

Abstract: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks. Using a low cost trace driven simulation technique, we show how a non-prefetching vector cache can result in unpredictable performance and how this unpredictability makes it difficult to find a good block size. We describe two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches. These two schemes are shown to have better performance than a non-prefetching cache.

163 citations


Patent
Richard Lewis Mattson1
20 May 1991
TL;DR: In this paper, the cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches, where each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType.
Abstract: A method for managing a cache hierarchy having a fixed total storage capacity is disclosed. The cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches. The global cache stores objects of all types and maintains them in LRU order. In contrast, each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType. Read and write accesses by referencing processors or central processing units (CPU's) are made to the global cache. Data not available in the global cache is staged thereto either from one of the local caches or from external storage. When a cache full condition is reached, placement of the most recently used (MRU) data element to the top of the global cache results in an LRU data element of type T(i) being destaged from the global cache to a corresponding one of the local caches storing type T(i) data. Likewise, when a cache full condition is reached in any one or more of the local caches, the local caches in turn will destage their LRU data elements to external storage. The parameters defining the partitions are externally supplied.

132 citations
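A hedged sketch of the partitioning idea in this patent (capacities, key types, and the OrderedDict representation are illustrative assumptions): one LRU-ordered global cache holds objects of all types; when it overflows, its LRU element is destaged to the local cache for that element's data type, and a full local cache in turn destages its own LRU element to external storage.

```python
# Sketch of a global LRU cache backed by per-data-type local LRU caches.
from collections import OrderedDict

class PartitionedCache:
    def __init__(self, global_cap, local_caps):
        self.global_cache = OrderedDict()           # key -> (dtype, value); MRU at the end
        self.locals = {t: OrderedDict() for t in local_caps}
        self.global_cap = global_cap
        self.local_caps = local_caps

    def reference(self, key, dtype, value=None):
        if key in self.global_cache:                # global hit: make it MRU
            self.global_cache.move_to_end(key)
            return
        local = self.locals[dtype]
        if key in local:                            # stage up from the local cache
            value = local.pop(key)
        self.global_cache[key] = (dtype, value)     # insert as MRU
        if len(self.global_cache) > self.global_cap:
            # Global cache full: destage its LRU element to that type's local cache.
            old_key, (old_type, old_value) = self.global_cache.popitem(last=False)
            dest = self.locals[old_type]
            dest[old_key] = old_value
            if len(dest) > self.local_caps[old_type]:
                dest.popitem(last=False)            # destage LRU to external storage

cache = PartitionedCache(global_cap=2, local_caps={"index": 2, "data": 2})
for k, t in [("a", "index"), ("b", "data"), ("c", "data")]:
    cache.reference(k, t, value=k.upper())
# "a" has been destaged from the global cache into the "index" local cache.
print(list(cache.global_cache), list(cache.locals["index"]))   # -> ['b', 'c'] ['a']
```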


Patent
20 Mar 1991
TL;DR: In this article, a coherent coupled memory multiprocessor computer system with a plurality of processor modules (11a, 11b... ), a global interconnect (13), an optional global memory (15), and an input/output subsystem (17, 19) is disclosed.
Abstract: A coherent coupled memory multiprocessor computer system that includes a plurality of processor modules (11a, 11b . . . ), a global interconnect (13), an optional global memory (15) and an input/output subsystem (17,19) is disclosed. Each processor module (11a, 11b . . . ) includes: a processor (21); cache memory (23); cache memory controller logic (22); coupled memory (25); coupled memory control logic (24); and a global interconnect interface (27). Coupled memory (25) associated with a specific processor (21), like global memory (15), is available to other processors (21). Coherency between data stored in coupled (or global) memory and similar data replicated in cache memory is maintained by either a write-through or a write-back cache coherency management protocol. The selected protocol is implemented in hardware, i.e., logic, form, preferably incorporated in the coupled memory control logic (24) and in the cache memory controller logic (22). In the write-through protocol, processor writes are propagated directly to coupled memory while invalidating corresponding data in cache memory. In contrast, the write-back protocol allows data owned by a cache to be continuously updated until requested by another processor, at which time the coupled memory is updated and other cache blocks containing the same data are invalidated.

130 citations


Patent
19 Aug 1991
TL;DR: In this paper, a scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability, which increases computing system performance and reduces bus traffic.
Abstract: A computing system (50) includes N number of symmetrical computing engines having N number of cache memories joined by a system bus (12). The computing system includes a global run queue (54), an FPA global run queue, and N number of affinity run queues (58). Each engine is associated with one affinity run queue, which includes multiple slots. When a process first becomes runnable, it is typically attached to one of the global run queues. A scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability. An engine typically stops running a process before it is complete. When the process becomes runnable again, the scheduler estimates the remaining cache context for the process in the cache of the engine. The scheduler uses the estimated amount of cache context in deciding in which run queue a process is to be enqueued. The process is enqueued to the affinity run queue of the engine when the estimated cache context of the process is sufficiently high, and is enqueued onto the global run queue when the cache context is sufficiently low. The procedure increases computing system performance and reduces bus traffic because processes will run on engines having sufficient cache affinity, but will also run on the best available engine when there is insufficient cache context.

127 citations
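The enqueue decision itself is simple to state in code. The sketch below is only an illustration of that decision (the threshold value and the way the remaining cache context is estimated are assumptions of this note, not the patent's mechanism).

```python
# Sketch of affinity-aware enqueueing: a newly runnable process goes to the
# affinity run queue of its last engine if enough of its working set is
# estimated to still be in that engine's cache, otherwise to the global queue.
def choose_run_queue(remaining_cache_bytes, affinity_queue, global_queue,
                     threshold_bytes=64 * 1024):
    if remaining_cache_bytes >= threshold_bytes:
        return affinity_queue       # enough cache context: stay on that engine
    return global_queue             # too little context: run on the best available engine

# Example: only 16 KB estimated resident with a 64 KB threshold -> global queue.
print(choose_run_queue(16 * 1024, "affinity", "global"))   # -> "global"
```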


Patent
20 Aug 1991
TL;DR: In this paper, a multilevel cache buffer for a multiprocessor system is described, where each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors.
Abstract: A multilevel cache buffer for a multiprocessor system in which each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors. The multiprocessors share the level two cache according to a priority algorithm. When data in the level two cache is updated, corresponding data in level one caches is invalidated until it is updated.

Patent
30 Aug 1991
TL;DR: In this article, a method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system is presented, which can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each way to distinguish between least recently used groups.
Abstract: A method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system. In a 2 way set-associative cache, one bit in each way's tag RAM is reserved for LRU information, and the bits are manipulated such that the Exclusive-OR of each way's bits points to the actual LRU cache way. Since all of these bits must be read when the cache controller determines whether a hit or miss has occurred, the bits are available when a cache miss occurs and a cache line replacement is required. The method can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each of the ways to distinguish between least recently used groups. Cache write policy information is stored in the tag RAM's to designate various memory areas as write-back or write-through. In this manner, system memory situated on an I/O bus which does not recognize inhibit cycles can have its data cached.
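The LRU portion of this scheme (the write-policy bits are not modeled here) can be sketched directly. In the sketch below, each way keeps one bit per set, the XOR of the two bits names the LRU way, and a hit only rewrites the bit of the way that was accessed, since both bits were already read during the tag compare.

```python
# Sketch of the XOR LRU trick for a 2-way set-associative cache.
class TwoWayLRUBits:
    def __init__(self, num_sets):
        self.bits = [[0] * num_sets, [0] * num_sets]   # one LRU bit array per way

    def lru_way(self, set_index):
        # The XOR of the two ways' bits points at the way to replace.
        return self.bits[0][set_index] ^ self.bits[1][set_index]

    def touch(self, set_index, way):
        # After accessing `way`, rewrite only that way's bit so the XOR
        # points at the other way.
        other = 1 - way
        self.bits[way][set_index] = other ^ self.bits[other][set_index]

# Example: in set 3, touching way 0 makes way 1 the replacement victim.
lru = TwoWayLRUBits(num_sets=8)
lru.touch(3, 0)
print(lru.lru_way(3))   # -> 1
```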

Patent
09 Jul 1991
TL;DR: In this paper, a method for ensuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner.
Abstract: A method for ensuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of (1) detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and (2) correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner. In particular, the method is adapted to address two kinds of data inconsistency states: (1) A request for a write operation from a system unit to main memory when the location to be written to is present in the cache of some processor unit; in such a case, data in the cache is "stale" and the data inconsistency is avoided by preventing the associated processor from using the "stale" data; and (2) when a read operation is requested of main memory by a system unit and the location to be read may be written or has already been written in the cache of some processor; in this case, the data in main memory is "stale" and the data inconsistency is avoided by ensuring that the data returned to the requesting unit is the updated data in the cache. The presence of one of the above-described data inconsistency states is detected in an SCU-based multi-processing system by providing the SCU with means for maintaining a copy of the cache directories for each of the processor caches. The SCU continually compares address data accompanying memory access requests with what is stored in the SCU cache directories in order to determine the presence of predefined conditions indicative of data inconsistencies, and subsequently executes corresponding predefined fix-up sequences.

Patent
27 Jun 1991
TL;DR: In this paper, a branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table.
Abstract: (Error Transition Mode for Multi-Processor System) A pipelined CPU executing instructions of variable length, and referencing memory using various data widths. Macroinstruction pipelining is employed (instead of microinstruction pipelining), with queueing between units of the CPU to allow flexibility in instruction execution times. A wide bandwidth is available for memory access, fetching 64-bit data blocks on each cycle. A hierarchical cache arrangement has an improved method of cache set selection, increasing the likelihood of a cache hit. A writeback cache is used (instead of writethrough) and writeback is allowed to proceed even though other accesses are suppressed due to queues being full. A branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table. A floating point processor function is integrated on-chip, with enhanced speed due to a bypass technique; a trial mini-rounding is done on low-order bits of the result, and if correct, the last stage of the floating point processor can be bypassed, saving one cycle of latency. For CALL-type instructions, a method for determining which registers need to be saved is executed in a minimum number of cycles, examining groups of register mask bits at one time. Internal processor registers are accessed with short (byte width) addresses instead of full physical addresses as used for memory and I/O references, but off-chip processor registers are memory-mapped and accessed by the same busses using the same controls as the memory and I/O. If a non-recoverable error is detected by ECC circuits in the cache, an error transition mode is entered wherein the cache operates under limited access rules, allowing a maximum of access by the system for data blocks owned by the cache, yet minimizing changes to the cache data so that diagnostics may be run. Separate queues are provided for the return data from memory and cache invalidates, yet the order of bus transactions is maintained by a pointer arrangement. The bus protocol used by the CPU to communicate with the system bus is of the pended type, with transactions on the bus identified by an ID field specifying the originator, and arbitration for bus grant goes on simultaneously with address/data transactions on the bus.
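The branch history table mentioned above can be illustrated with one common realization (an assumption of this note, not necessarily the patent's "empirical algorithm"): a small direct-mapped table of 2-bit saturating counters indexed by branch address.

```python
# Sketch of a branch history table with 2-bit saturating counters.
# Counter >= 2 predicts "taken"; each resolved branch nudges its counter
# toward the actual outcome.
class BranchHistoryTable:
    def __init__(self, entries=256):
        self.entries = entries
        self.counters = [1] * entries          # start weakly not-taken

    def predict(self, branch_addr):
        return self.counters[branch_addr % self.entries] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
bht.update(0x400, taken=True)
bht.update(0x400, taken=True)
print(bht.predict(0x400))   # -> True (this branch has recently been taken)
```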

Patent
Igal Megory-Cohen1
01 Nov 1991
TL;DR: In this paper, a modified steepest descent method is proposed to handle unpredictable local cache activities prior to cache repartitioning to avoid readjustments which would result in unacceptably small or negative cache sizes in cases where a local cache is extremely underutilized.
Abstract: Dynamic partitioning of cache storage into a plurality of local caches for respective classes of competing processes is performed by a step of dynamically determining adjustments to the cache partitioning using a steepest descent method. A modified steepest descent method allows unpredictable local cache activities prior to cache repartitioning to be taken into account to avoid readjustments which would result in unacceptably small or, even worse, negative cache sizes in cases where a local cache is extremely underutilized. The method presupposes a unimodal distribution of cache misses.
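A heavily simplified sketch of the repartitioning step follows; the gradient estimate, step size, minimum size, and renormalization below are assumptions of this note and do not reproduce the patent's modified method.

```python
# Sketch of one steepest-descent repartitioning step over local cache sizes:
# move space toward caches whose extra capacity is estimated to save the most
# misses (most negative gradient of misses with respect to size), clamp so no
# cache becomes too small or negative, and rescale to the fixed total capacity.
def repartition(sizes, miss_gradients, step=0.1, min_size=1.0):
    total = sum(sizes)
    proposed = [s - step * g for s, g in zip(sizes, miss_gradients)]
    clamped = [max(min_size, p) for p in proposed]      # avoid tiny/negative caches
    scale = total / sum(clamped)                        # keep the fixed total capacity
    return [c * scale for c in clamped]

# Example: cache 0 benefits strongly from more space, cache 2 is underutilized.
print(repartition([40.0, 30.0, 30.0], [-50.0, -5.0, 40.0]))
```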

Patent
23 May 1991
TL;DR: In this paper, a directory-based protocol for maintaining data coherency in a multiprocessing (MP) system having a number of processors with associated write-back caches, a multistage interconnection network (MIN) leading to a shared memory, and a global directory associated with the main memory to keep track of state and control information of cache lines.
Abstract: A directory-based protocol is provided for maintaining data coherency in a multiprocessing (MP) system having a number of processors with associated write-back caches, a multistage interconnection network (MIN) leading to a shared memory, and a global directory associated with the main memory to keep track of state and control information of cache lines. Upon a request by a requesting cache for a cache line which has been exclusively modified by a source cache, two buffers are situated in the global directory to collectively intercept modified data words of the modified cache line during the write-back to memory. A modified word buffer is used to capture modified words within the modified cache line. Moreover, a line buffer stores an old cache line transferred from the memory, during the write back operation. Finally, the line buffer and the modified word buffer, together, provide the entire modified line to a requesting cache.
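The final merge performed by the two buffers is easy to picture in isolation. The sketch below (data layout is an illustrative assumption) combines the old line captured in the line buffer with the words captured in the modified word buffer to reconstruct the up-to-date line for the requesting cache.

```python
# Sketch of merging a write-back: the line buffer holds the old copy of the
# cache line from memory, the modified word buffer holds only the words the
# owning cache changed, and together they yield the full modified line.
def merge_line(old_line, modified_words):
    """old_line: list of words; modified_words: dict {word_index: new_value}."""
    return [modified_words.get(i, w) for i, w in enumerate(old_line)]

# Example: words 1 and 3 were modified during the write-back.
print(merge_line([10, 11, 12, 13], {1: 99, 3: 77}))   # -> [10, 99, 12, 77]
```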

Patent
Jamshed H. Mirza1
15 Apr 1991
TL;DR: In this paper, a cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio, and this record is used to decide whether its future references should be cached or not.
Abstract: A cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio. The mechanism keeps a record of an instruction's behavior in the immediate past, and this record is used to decide whether its future references should be cached or not. If an instruction is experiencing bad cache hit ratio, it is marked as non-cacheable, and its data references are made to bypass the cache. This avoids the additional penalty of unnecessarily fetching the remaining words in the line, reduces the demand on the memory bandwidth, avoids flushing the cache of useful data and, in parallel processing environments, prevents line thrashing. The cache management scheme is automatic and requires no compiler or user intervention.
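An illustrative sketch of the decision logic follows (the history length and threshold are assumptions of this note, not values from the patent): keep a short hit/miss record per load/store instruction and mark an instruction non-cacheable once its recent hit ratio is poor, so its references bypass the cache instead of polluting it.

```python
# Sketch of a per-instruction cache bypass predictor.
from collections import defaultdict, deque

class BypassPredictor:
    def __init__(self, history=32, threshold=0.25):
        self.history = defaultdict(lambda: deque(maxlen=history))
        self.threshold = threshold

    def record(self, instr_addr, hit):
        # Remember whether this instruction's latest data reference hit.
        self.history[instr_addr].append(1 if hit else 0)

    def should_bypass(self, instr_addr):
        h = self.history[instr_addr]
        if len(h) < h.maxlen:
            return False                    # not enough evidence yet: cache normally
        return sum(h) / len(h) < self.threshold

pred = BypassPredictor(history=4, threshold=0.5)
for hit in [False, False, True, False]:
    pred.record(0x8000, hit)
print(pred.should_bypass(0x8000))   # -> True (only 1 hit in the last 4 references)
```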

Patent
26 Dec 1991
TL;DR: In this paper, a reconfigurable set associative cache memory can be reconfigured from 2x way to 2y way associative memory by merging a predetermined number of least significant bits of the tag field of the main memory address with the line field.
Abstract: A reconfigurable set associative cache memory can be reconfigured from 2x way to 2y way set associative cache memory by effectively merging a predetermined number of least significant bits of the tag field of the main memory address with the line field of the main memory address. The effective merging is provided by logically merging least significant bits of the tag field with a reconfiguration designation. As a result, Y-X+1 different configurations of cache memory can be obtained using the Y-X least significant bits of the tag field merged with the cache memory address.
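The address split under different configurations can be sketched as below (the field widths in the example are illustrative assumptions, not values from the patent): lowering the associativity raises the number of sets, so the extra index bits are borrowed from the least significant bits of the tag.

```python
# Sketch of the index computation for a reconfigurable set-associative cache.
def split_address(addr, offset_bits, base_index_bits, extra_index_bits):
    """Return (tag, set_index, offset) for one configuration.

    extra_index_bits is how many tag LSBs are merged into the index;
    0 keeps the base (most associative) configuration.
    """
    index_bits = base_index_bits + extra_index_bits
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset

# Example: 32-byte lines and 64 base sets; borrowing 2 tag bits quadruples the
# number of sets (and divides the associativity by four).
print(split_address(0x12345678, offset_bits=5, base_index_bits=6, extra_index_bits=2))
```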

Patent
30 Aug 1991
TL;DR: In this article, the cache controller includes a set of latches coupled to the host bus which it uses to latch the state of host bus during a snoop cycle if the cache is unable to immediately snoop that cycle.
Abstract: A method and apparatus for enabling a dual ported cache system in a multiprocessor system to guarantee snoop access to all host bus cycles which require snooping. The cache controller includes a set of latches coupled to the host bus which it uses to latch the state of the host bus during a snoop cycle if the cache controller is unable to immediately snoop that cycle. The cache controller latches that state of the host bus in the beginning of a cycle and preserves this state throughout the cycle due to the effects of pipelining on the host bus. In addition, the cache controller is able to delay host bus cycles to guarantee snoop access to host bus cycles which require snooping. The cache controller generally only delays a host bus cycle when it is already performing other tasks, such as servicing its local processor, and cannot snoop the host bus cycle immediately. When the cache controller latches the state of the bus during a write cycle, it only begins to delay the host bus after a subsequent cycle begins. In this manner, one write cycle can complete on the host bus before the cache controller delays any cycles, thereby reducing the impact of snooping on host bus bandwidth. Read cycles are always delayed until the cache controller can complete the snooping operation because the cache may be the owner of the data and a write back cycle may be necessary.

Patent
27 Jun 1991
TL;DR: In this paper, a branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table.
Abstract: (of EP0465320) A pipelined CPU executing instructions of variable length, and referencing memory using various data widths. Macroinstruction pipelining is employed (instead of microinstruction pipelining), with queueing between units of the CPU to allow flexibility in instruction execution times. A wide bandwidth is available for memory access, fetching 64-bit data blocks on each cycle. A hierarchical cache arrangement has an improved method of cache set selection, increasing the likelihood of a cache hit. A writeback cache is used (instead of writethrough) and writeback is allowed to proceed even though other accesses are suppressed due to queues being full. A branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table. A floating point processor function is integrated on-chip, with enhanced speed due to a bypass technique; a trial mini-rounding is done on low-order bits of the result, and if correct, the last stage of the floating point processor can be bypassed, saving one cycle of latency. For CALL-type instructions, a method for determining which registers need to be saved is executed in a minimum number of cycles, examining groups of register mask bits at one time. Internal processor registers are accessed with short (byte width) addresses instead of full physical addresses as used for memory and I/O references, but off-chip processor registers are memory-mapped and accessed by the same busses using the same controls as the memory and I/O. If a non-recoverable error is detected by ECC circuits in the cache, an error transition mode is entered wherein the cache operates under limited access rules, allowing a maximum of access by the system for data blocks owned by the cache, yet minimizing changes to the cache data so that diagnostics may be run. Separate queues are provided for the return data from memory and cache invalidates, yet the order of bus transactions is maintained by a pointer arrangement. The bus protocol used by the CPU to communicate with the system bus is of the pended type, with transactions on the bus identified by an ID field specifying the originator, and arbitration for bus grant goes on simultaneously with address/data transactions on the bus.

Patent
16 May 1991
TL;DR: In this article, a microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference is presented.
Abstract: A microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference. The microprocessor also includes the capability for locking instruction cache entries without requiring that the instructions be executed during the locking process.

Patent
Hiroaki Suzuki1
30 Apr 1991
TL;DR: In this paper, a hit discriminator receives first tag information included in an address given when the cache memory is accessed and second tag information read from cache memory in accordance with the given address, in order to discriminate a cache-hitting and a cache missing.
Abstract: A cache memory apparatus to be coupled to a main memory, comprises a cache memory having a plurality of ports and capable of being independently accessed through the plurality of ports. The cache memory stores a portion of data stored in the main memory and tag information indicating memory locations within the main memory of the data portion stored in the cache memory. A hit discriminator receives first tag information included in an address given when the cache memory is accessed and second tag information read from the cache memory in accordance with the given address, in order to discriminate a cache-hitting and a cache-missing on the basis of the first and second tag information. A replacement control circuit operates for replacing data and corresponding tag information in the cache memory when the cache-missing is discriminated by the hit discriminating circuit. A replacement limiting circuit operates for limiting the replacement of the cache memory to one time when a plurality of accesses to the same address are generated and all the plurality of accesses to the same address are missed.

Patent
Gregory S. Mathews1, Edward Zager1
16 Dec 1991
TL;DR: In this article, a cache memory hierarchy with a first level write through cache memory and a second level write back cache memory is provided to a computer system having a CPU, a main memory, and a number of DMA devices.
Abstract: A cache memory hierarchy having a first level write through cache memory and a second level write back cache memory is provided to a computer system having a CPU, a main memory, and a number of DMA devices. The first level write through cache memory responds to read and write accesses by the CPU, and snoop accesses by the DMA devices, whereas the second level write back cache memory responds to read and write accesses by the CPU as well as the DMA devices. Additionally, the first level write through cache memory is a relatively large cache memory designed to provide a high cache hit rate, whereas the second level write back cache memory is a relatively small cache memory designed to reduce accesses to the main memory. Furthermore, the first level write through cache memory reallocates its cache lines in response to CPU read misses only, whereas the second level write back cache memory reallocates its cache lines in response to CPU write misses only.

Patent
25 Feb 1991
TL;DR: In this paper, the processor accesses the data in the cache memory with one set of clocks while the remainder of the line of data is transferred to the inpage buffer with another set of clock cycles.
Abstract: A computer system comprises a data processor, a main memory, a cache memory and an inpage buffer. The cache memory is coupled to the main memory to receive data therefrom and is coupled to the processor to transfer data thereto. The inpage buffer is coupled to the main memory to receive data therefrom, coupled to the cache memory to transfer data thereto, and coupled to the processor to transfer data thereto. Part of a line of data is originally transferred to the cache memory bypassing the inpage buffer to give the processor immediate access to the data which it needs. The remainder of the line of data is subsequently transferred to the inpage buffer, and then the processor is given access to the contents of the inpage buffer. The processor accesses the data in the cache memory with one set of clocks while the remainder of the line of data is transferred to the inpage buffer with another set of clocks. The two sets of clocks optimize the operation of the processor and the main memory. Subsequently, the contents of the inpage buffer are transferred to the cache memory at the start of another inpage operation while the next line of data is fetched from the main memory.

Patent
19 Apr 1991
TL;DR: In this article, the data cache is logically partitioned into two separate sections, demand and prefetch, and a cache directory table and a least recently used table are used to maintain the cache.
Abstract: A least recently used cache replacement system in which the data cache is logically partitioned into two separate sections, demand and prefetch. A cache directory table and a least recently used table are used to maintain the cache. When a new demand data page is added to the cache, a most recently used (MRU) pointer is updated and points to this new page. When a prefetch page is added to the cache, the least recently used pointer of the demand section is updated with its backward pointer pointing to this new page. A cache hit on a demand or prefetch page moves that page to the top of the least recently used table. When a free page is needed in the cache, it is selected from the demand or prefetch sections of the memory based on a comparison of the demand hit density and the prefetch hit density so as to maintain a balance between these two hit densities.
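The victim-selection policy can be sketched on its own (the hit-density bookkeeping below is a simplification assumed by this note): take the free page from whichever section currently earns fewer hits per resident page, so the two densities stay roughly balanced.

```python
# Sketch of choosing which section (demand or prefetch) gives up a page.
def pick_victim_section(demand_hits, demand_pages, prefetch_hits, prefetch_pages):
    demand_density = demand_hits / demand_pages if demand_pages else 0.0
    prefetch_density = prefetch_hits / prefetch_pages if prefetch_pages else 0.0
    return "prefetch" if prefetch_density <= demand_density else "demand"

# Example: prefetched pages are earning fewer hits per page, so evict there.
print(pick_victim_section(demand_hits=90, demand_pages=100,
                          prefetch_hits=20, prefetch_pages=50))   # -> "prefetch"
```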

Patent
31 Dec 1991
TL;DR: In this paper, a data processing system which includes a microprocessor fabricated on an integrated circuit chip, a main memory external to the integrated circuit (IC) chip, and a backside cache external to an IC chip is described.
Abstract: A data processing system which includes a microprocessor fabricated on an integrated circuit chip, a main memory external to the integrated circuit chip, and a backside cache external to the integrated circuit chip. The backside cache includes a directory RAM for storing cache address tag and encoded cache state bits. A first bus connects the microprocessor to the cache, the first bus including backside bus cache directory tag signals comprised of address bits used for a cache hit comparison in the directory RAM and backside bus cache directory state bits for determining a state encoding of a set in the directory RAM. A second bus connects the microprocessor to the main memory. The directory includes means for comparing the cache directory tags on the first bus with the tags stored in the directory and for asserting a Bmiss signal upon the condition that the directory tags stored in the backside bus cache directory do not match the backside bus cache directory tag signals. The microprocessor responds to the Bmiss signal by issuing the access onto the second bus in the event of a cache miss.

Patent
Ikumi Noriyuki1
25 Jan 1991
TL;DR: In this paper, a multiport cache memory control unit includes a central processing unit having N arithmetic units for executing arithmetic processing, a tag memory having N address ports for storing addresses, and a multi-port cache memory with N data ports to store pieces of data at addresses which agree with the addresses stored in the tag memory.
Abstract: A multiport cache memory control unit includes a central processing unit having N arithmetic units for executing arithmetic processing, a tag memory having N address ports for storing addresses, a multiport cache memory having N data ports for storing pieces of data at addresses which agree with the addresses stored in the tag memory, and a snoop address port through which a snoop operation is executed to detect an address signal. Arithmetic processing is executed in each of the arithmetic units by reading a piece of data from the cache memory after providing an address signal to the tag memory to check whether or not the data is stored in the cache memory. In cases where a cache miss occurs, a piece of data stored in a main memory unit is fetched through the snoop address port without halting the arithmetic processing. In cases where a snoop hit occurs, an address signal provided from another control unit is transmitted to the tag memory through the snoop address port without halting the arithmetic processing.

Patent
11 Dec 1991
TL;DR: In this article, a cache memory functioning as a circular buffer for use as a part historical, part predictive cache memory is provided, where the difference between the values in the first and second pointer registers exceeds a predetermined amount.
Abstract: A cache memory functioning as a circular buffer for use as a part historical, part predictive cache memory is provided. A first register contains data having a value corresponding to a cache memory location of a last instruction executed by a processor and a second register contains data having a value corresponding to a memory location in the cache memory of a last prefetched instruction. Prefetching of instructions from a main memory to the cache memory is disabled if the difference between the values in the first and second pointer registers exceeds a predetermined amount.
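The throttling rule reduces to a pointer comparison on the circular buffer, as sketched below (the pointer arithmetic and the example sizes are illustrative; the patent describes the rule in terms of the two register values).

```python
# Sketch of the prefetch throttle: one pointer tracks the last instruction
# executed and another the last instruction prefetched; prefetching pauses
# when the prefetcher has run too far ahead of execution.
def prefetch_enabled(exec_ptr, prefetch_ptr, buffer_size, max_lead):
    lead = (prefetch_ptr - exec_ptr) % buffer_size    # circular distance ahead
    return lead <= max_lead

# Example: 200 entries ahead in a 1024-entry buffer with a 128-entry limit.
print(prefetch_enabled(exec_ptr=100, prefetch_ptr=300,
                       buffer_size=1024, max_lead=128))   # -> False
```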

ReportDOI
01 May 1991
TL;DR: Results suggest that garbage collection algorithms will play an important part in improving cache performance as processor speeds increase and two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms often by a factor of two and sometimes by almost five over direct-mapped caches.
Abstract: Cache performance is an important part of total performance in modern computer systems. This paper describes the use of trace-driven simulation to estimate the effect of garbage collection algorithms on cache performance. Traces from four large Common Lisp programs have been collected and analyzed with an all-associativity cache simulator. While previous work has focused on the effect of garbage collection on page reference locality, this evaluation unambiguously shows that garbage collection algorithms can have a profound effect on cache performance as well. On processors with a direct-mapped cache, a generation stop-and-copy algorithm exhibits a miss rate up to four times higher than a comparable generation mark-and-sweep algorithm. Furthermore, two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms, often by a factor of two and sometimes by a factor of almost five, over direct-mapped caches. As processor speeds increase, cache performance will play an increasing role in total performance. These results suggest that garbage collection algorithms will play an important part in improving that performance.

Journal ArticleDOI
TL;DR: Two distributed directory schemes, the tree directory and the hierarchical full-map directory, to deal with the storage overhead problem are presented and should lend themselves to the design and implementation of large-scale cache coherent multiprocessors.
Abstract: The cache coherence problem is a major issue in the design of shared-memory multiprocessors. As the number of processors grows, traditional bus-based snoopy schemes for cache coherence are no longer adequate. Instead, the directory-based scheme is a promising alternative for the large-scale cache coherence problem. However, the storage overhead of the (full-map) directory scheme may become prohibitive as the system size goes up. This paper presents two distributed directory schemes, the tree directory and the hierarchical full-map directory, to deal with the storage overhead problem. Preliminary trace-driven evaluations show that the performance of our schemes compares favorably to the full-map directory scheme, while reducing the storage overhead by over 90%. These two schemes should lend themselves to the design and implementation of large-scale cache coherent multiprocessors.