
Showing papers on "Cache pollution published in 1991"


Proceedings ArticleDOI
01 Apr 1991
TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.

982 citations
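As a concrete illustration of the blocking technique the paper studies, here is a minimal sketch (not taken from the paper) of a blocked matrix multiply in Python; the block size is the tunable parameter whose sensitivity to cache and matrix size the paper analyzes.

```python
# Minimal sketch of a blocked (tiled) matrix multiply. The block size is a
# tunable parameter; the paper's point is that the best choice depends on the
# cache size, the matrix dimensions, and interference between tiles, and is
# often much smaller than "whatever fills the cache".
import numpy as np

def blocked_matmul(A, B, block=32):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for jj in range(0, m, block):
                # Each small tile of A and B is reused while it is still
                # resident in the faster levels of the memory hierarchy.
                C[ii:ii+block, jj:jj+block] += (
                    A[ii:ii+block, kk:kk+block] @ B[kk:kk+block, jj:jj+block]
                )
    return C
```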


Patent
06 Feb 1991
TL;DR: The branch prediction cache (BPC) as mentioned in this paper provides a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address.
Abstract: The present invention provides for the updating of both the instructions in a branch prediction cache and instructions recently provided to an instruction pipeline from the cache when an instruction being executed attempts to change such instructions ("Store-Into-Instruction-Stream"). The branch prediction cache (BPC) includes a tag identifying the address of instructions causing a branch, a record of the target address which was branched to on the last occurrence of each branch instruction, and a copy of the first several instructions beginning at this target address. A separate instruction cache is provided for normal execution of instructions, and all of the instructions written into the branch prediction cache from the system bus must also be stored in the instruction cache. The instruction cache monitors the system bus for attempts to write to the address of an instruction contained in the instruction cache. Upon such a detection, that entry in the instruction cache is invalidated, and the corresponding entry in the branch prediction cache is invalidated. A subsequent attempt to use an instruction in the branch prediction cache which has been invalidated will detect that it is not valid, and will instead go to main memory to fetch the instruction, where it has been modified.

279 citations
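A hedged sketch of the invalidation path described in this patent (field names and the flat dictionary layout are illustrative assumptions, not the patent's structures): each branch prediction cache entry records the branch address, its last target, and a copy of the first target instructions, and a detected store into a cached instruction address invalidates the matching entries so a later lookup refetches the modified instructions from memory.

```python
# Sketch of a branch prediction cache (BPC) with store-into-instruction-stream
# invalidation. Entries whose cached instruction copy covers a stored-to
# address are marked invalid; a subsequent lookup then misses.
class BranchPredictionCache:
    def __init__(self):
        self.entries = {}    # branch_addr -> {"target": int, "insts": list, "valid": bool}

    def install(self, branch_addr, target, first_insts):
        self.entries[branch_addr] = {"target": target,
                                     "insts": list(first_insts),
                                     "valid": True}

    def snoop_store(self, store_addr, inst_size=4):
        # Invalidate any entry whose cached instruction copy covers store_addr.
        for entry in self.entries.values():
            span = len(entry["insts"]) * inst_size
            if entry["target"] <= store_addr < entry["target"] + span:
                entry["valid"] = False

    def lookup(self, branch_addr):
        entry = self.entries.get(branch_addr)
        return entry if entry and entry["valid"] else None

bpc = BranchPredictionCache()
bpc.install(0x1000, target=0x2000, first_insts=[0xA, 0xB, 0xC])
bpc.snoop_store(0x2004)              # write into the cached target instructions
print(bpc.lookup(0x1000))            # -> None (entry invalidated)
```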


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This work fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals, and estimated the cache performance reduction caused by a context switch.
Abstract: The sustained performance of fast processors is critically dependent on cache performance. Cache performance in turn depends on locality of reference. When an operating system switches contexts, the assumption of locality may be violated because the instructions and data of the newly-scheduled process may no longer be in the cache(s). Context-switching thus has a cost beyond that of the operations performed by the kernel. We fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals. By marking the output of such a simulation whenever a context switch occurs, and then aggregating the post-context-switch results of a large number of context switches, it is possible to estimate the cache performance reduction caused by a switch. Depending on cache parameters the net cost of a context switch appears to be in the thousands of cycles, or tens to hundreds of microseconds.

272 citations
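The measurement idea can be illustrated with a much simpler setup than the paper's simulator. The sketch below (parameters, the trace format, and the fixed post-switch window are assumptions of this note) runs a direct-mapped cache over a trace of (process id, address) pairs and aggregates the miss rate observed in a window of accesses after each context switch.

```python
# Illustrative sketch, not the paper's methodology: estimate the post-switch
# miss rate of a direct-mapped cache from a multi-process address trace.
def post_switch_miss_rate(trace, num_lines=1024, line_size=32, window=1000):
    tags = [None] * num_lines
    last_pid = None
    since_switch = None               # accesses seen since the last context switch
    accesses, misses = 0, 0           # counts inside the post-switch windows
    for pid, addr in trace:
        if pid != last_pid:           # a context switch in the trace
            last_pid, since_switch = pid, 0
        block = addr // line_size
        idx, tag = block % num_lines, block // num_lines
        miss = tags[idx] != tag
        if miss:
            tags[idx] = tag
        if since_switch is not None and since_switch < window:
            accesses += 1
            misses += miss
            since_switch += 1
    return misses / accesses if accesses else 0.0

# Example with a tiny synthetic trace alternating between two processes.
trace = [(0, a * 32) for a in range(500)] + [(1, 100000 + a * 32) for a in range(500)]
print(post_switch_miss_rate(trace, window=100))
```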


Proceedings ArticleDOI
01 Apr 1991
TL;DR: Multi-port, nonblocking (MPNB) L1 caches introduced in this paper for the top of the data memory hierarchy appear to be capable of supporting the bandwidth demands of future-generation superscalar processors.
Abstract: This paper considers the design of a data memory hierarchy, with a level 1 (L1) data cache at the top, to support the data bandwidth demands of a future-generation superscalar processor capable of issuing about ten instructions per clock cycle. It introduces the notion of cache bandwidth (the bandwidth with which a cache can accept requests from the processor) and shows how the bandwidth of a standard, blocking cache can degrade greatly because of its inability to overlap the service of misses. Non-blocking or lockup-free caches are discussed as a way of reducing the bandwidth degradation due to misses. To improve the data bandwidth to greater than 1 request per cycle, multi-port, interleaved caches are introduced. Simulation results from a cycle-by-cycle simulator, using the MIPS R2000 instruction set, suggest that memory hierarchies with blocking L1 caches will be unable to support the bandwidth demands of future-generation superscalar processors. Multi-port, nonblocking (MPNB) L1 caches introduced in this paper for the top of the data memory hierarchy appear to be capable of supporting such data bandwidth demands.

215 citations
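A rough first-order model (an assumption of this note, not a formula from the paper) makes the blocking-cache limitation concrete: if the cache accepts at most one request per cycle and stalls for the full miss penalty on every miss, its sustainable request bandwidth is bounded as sketched below.

```python
# First-order model of the request bandwidth of a blocking cache: every miss
# stalls the cache for the full miss penalty, so requests per cycle are bounded
# by 1 / (1 + miss_ratio * miss_penalty). Non-blocking and multi-port designs
# aim to push past this bound by overlapping miss service.
def blocking_cache_bandwidth(miss_ratio, miss_penalty_cycles):
    return 1.0 / (1.0 + miss_ratio * miss_penalty_cycles)

# Example: a 5% miss ratio with a 20-cycle penalty halves the accepted bandwidth.
print(blocking_cache_bandwidth(0.05, 20))   # -> 0.5
```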


Patent
16 May 1991
TL;DR: In this paper, a microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache, each storage location in the instruction cache includes two slots for decoded instructions.
Abstract: A microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache. Each storage location in the instruction cache includes two slots for decoded instructions. One slot controls one of the microprocessor's integer pipelines and a port to the microprocessor's data cache. A second slot controls the second integer pipeline or one of the microprocessor's floating point units. The instructions retrieved from main memory are decoded by a loader unit which decodes the instructions from the compact form as stored in main memory and places them into the two slots of the instruction cache entry according to their functions. In addition, auxiliary information is placed in the cache entry along with the instruction to control parallel execution as well as emulation of complex instructions. A bit in each instruction cache entry indicates whether the instructions in the two slots are independent, so that they can be executed in parallel, or dependent, so that they must be executed sequentially. Using a single bit for this purpose allows two dependent instructions to be stored in the slots of the single cache entry.

203 citations


Patent
16 May 1991
TL;DR: In this article, a microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache, each storage location in the instruction cache includes two slots for decoded instructions.
Abstract: (of EP0459232) A microprocessor partially decodes instructions retrieved from main memory before placing them into the microprocessor's integrated instruction cache. Each storage location in the instruction cache includes two slots for decoded instructions. One slot controls one of the microprocessor's integer pipelines and a port to the microprocessor's data cache. A second slot controls the second integer pipeline or one of the microprocessor's floating point units. The instructions retrieved from main memory are decoded by a loader unit which decodes the instructions from the compact form as stored in main memory and places them into the two slots of the instruction cache entry according to their functions. In addition, auxiliary information is placed in the cache entry along with the instruction to control parallel execution as well as emulation of complex instructions. A bit in each instruction cache entry indicates whether the instructions in the two slots are independent, so that they can be executed in parallel, or dependent, so that they must be executed sequentially. Using a single bit for this purpose allows two dependent instructions to be stored in the slots of the single cache entry.

169 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks and describes two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches.

Abstract: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks. Using a low cost trace driven simulation technique, we show how a non-prefetching vector cache can result in unpredictable performance and how this unpredictability makes it difficult to find a good block size. We describe two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches. These two schemes are shown to have better performance than a non-prefetching cache.

163 citations


Patent
Richard Lewis Mattson1
20 May 1991
TL;DR: In this paper, the cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches, where each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType.
Abstract: A method for managing a cache hierarchy having a fixed total storage capacity is disclosed. The cache hierarchy is logically partitioned to form a least recently used (LRU) global cache and a plurality of LRU destaging local caches. The global cache stores objects of all types and maintains them in LRU order. In contrast, each local cache is bound to objects having a unique data type T(i), where i is indicative of a DataType. Read and write accesses by referencing processors or central processing units (CPU's) are made to the global cache. Data not available in the global cache is staged thereto either from one of the local caches or from external storage. When a cache full condition is reached, placement of the most recently used (MRU) data element to the top of the global cache results in an LRU data element of type T(i) being destaged from the global cache to a corresponding one of the local caches storing type T(i) data. Likewise, when a cache full condition is reached in any one or more of the local caches, the local caches in turn will destage their LRU data elements to external storage. The parameters defining the partitions are externally supplied.

132 citations
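A hedged sketch of the partitioning idea in this patent (capacities, key types, and the OrderedDict representation are illustrative assumptions): one LRU-ordered global cache holds objects of all types; when it overflows, its LRU element is destaged to the local cache for that element's data type, and a full local cache in turn destages its own LRU element to external storage.

```python
# Sketch of a global LRU cache backed by per-data-type local LRU caches.
from collections import OrderedDict

class PartitionedCache:
    def __init__(self, global_cap, local_caps):
        self.global_cache = OrderedDict()           # key -> (dtype, value); MRU at the end
        self.locals = {t: OrderedDict() for t in local_caps}
        self.global_cap = global_cap
        self.local_caps = local_caps

    def reference(self, key, dtype, value=None):
        if key in self.global_cache:                # global hit: make it MRU
            self.global_cache.move_to_end(key)
            return
        local = self.locals[dtype]
        if key in local:                            # stage up from the local cache
            value = local.pop(key)
        self.global_cache[key] = (dtype, value)     # insert as MRU
        if len(self.global_cache) > self.global_cap:
            # Global cache full: destage its LRU element to that type's local cache.
            old_key, (old_type, old_value) = self.global_cache.popitem(last=False)
            dest = self.locals[old_type]
            dest[old_key] = old_value
            if len(dest) > self.local_caps[old_type]:
                dest.popitem(last=False)            # destage LRU to external storage

cache = PartitionedCache(global_cap=2, local_caps={"index": 2, "data": 2})
for k, t in [("a", "index"), ("b", "data"), ("c", "data")]:
    cache.reference(k, t, value=k.upper())
# "a" has been destaged from the global cache into the "index" local cache.
print(list(cache.global_cache), list(cache.locals["index"]))   # -> ['b', 'c'] ['a']
```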


Patent
20 Mar 1991
TL;DR: In this article, a coherent coupled memory multiprocessor computer system with a plurality of processor modules (11a, 11b... ), a global interconnect (13), an optional global memory (15), and an input/output subsystem (17, 19) is disclosed.
Abstract: A coherent coupled memory multiprocessor computer system that includes a plurality of processor modules (11a, 11b . . . ), a global interconnect (13), an optional global memory (15) and an input/output subsystem (17,19) is disclosed. Each processor module (11a, 11b . . . ) includes: a processor (21); cache memory (23); cache memory controller logic (22); coupled memory (25); coupled memory control logic (24); and a global interconnect interface (27). Coupled memory (25) associated with a specific processor (21), like global memory (15), is available to other processors (21). Coherency between data stored in coupled (or global) memory and similar data replicated in cache memory is maintained by either a write-through or a write-back cache coherency management protocol. The selected protocol is implemented in hardware, i.e., logic, form, preferably incorporated in the coupled memory control logic (24) and in the cache memory controller logic (22). In the write-through protocol, processor writes are propagated directly to coupled memory while invalidating corresponding data in cache memory. In contrast, the write-back protocol allows data owned by a cache to be continuously updated until requested by another processor, at which time the coupled memory is updated and other cache blocks containing the same data are invalidated.

130 citations


Patent
19 Aug 1991
TL;DR: In this paper, a scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability, which increases computing system performance and reduces bus traffic.
Abstract: A computing system (50) includes N number of symmetrical computing engines having N number of cache memories joined by a system bus (12). The computing system includes a global run queue (54), an FPA global run queue, and N number of affinity run queues (58). Each engine is associated with one affinity run queue, which includes multiple slots. When a process first becomes runnable, it is typically attached to one of the global run queues. A scheduler allocates engines to processes and schedules the processes to run on the basis of priority and engine availability. An engine typically stops running a process before it is complete. When the process becomes runnable again, the scheduler estimates the remaining cache context for the process in the cache of the engine. The scheduler uses the estimated amount of cache context in deciding in which run queue a process is to be enqueued. The process is enqueued to the affinity run queue of the engine when the estimated cache context of the process is sufficiently high, and is enqueued onto the global run queue when the cache context is sufficiently low. The procedure increases computing system performance and reduces bus traffic because processes will run on engines having sufficient cache affinity, but will also run on the best available engine when there is insufficient cache context.

127 citations
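The enqueue decision itself is simple to state in code. The sketch below is only an illustration of that decision (the threshold value and the way the remaining cache context is estimated are assumptions of this note, not the patent's mechanism).

```python
# Sketch of affinity-aware enqueueing: a newly runnable process goes to the
# affinity run queue of its last engine if enough of its working set is
# estimated to still be in that engine's cache, otherwise to the global queue.
def choose_run_queue(remaining_cache_bytes, affinity_queue, global_queue,
                     threshold_bytes=64 * 1024):
    if remaining_cache_bytes >= threshold_bytes:
        return affinity_queue       # enough cache context: stay on that engine
    return global_queue             # too little context: run on the best available engine

# Example: only 16 KB estimated resident with a 64 KB threshold -> global queue.
print(choose_run_queue(16 * 1024, "affinity", "global"))   # -> "global"
```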


Patent
20 Aug 1991
TL;DR: In this paper, a multilevel cache buffer for a multiprocessor system is described, where each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors.
Abstract: A multilevel cache buffer for a multiprocessor system in which each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors. The multiprocessors share the level two cache according to a priority algorithm. When data in the level two cache is updated, corresponding data in level one caches is invalidated until it is updated.

Patent
30 Aug 1991
TL;DR: In this article, a method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system is presented, which can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each way to distinguish between least recently used groups.
Abstract: A method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system. In a 2 way set-associative cache, one bit in each way's tag RAM is reserved for LRU information, and the bits are manipulated such that the Exclusive-OR of each way's bits points to the actual LRU cache way. Since all of these bits must be read when the cache controller determines whether a hit or miss has occurred, the bits are available when a cache miss occurs and a cache line replacement is required. The method can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each of the ways to distinguish between least recently used groups. Cache write policy information is stored in the tag RAM's to designate various memory areas as write-back or write-through. In this manner, system memory situated on an I/O bus which does not recognize inhibit cycles can have its data cached.
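The LRU portion of this scheme (the write-policy bits are not modeled here) can be sketched directly. In the sketch below, each way keeps one bit per set, the XOR of the two bits names the LRU way, and a hit only rewrites the bit of the way that was accessed, since both bits were already read during the tag compare.

```python
# Sketch of the XOR LRU trick for a 2-way set-associative cache.
class TwoWayLRUBits:
    def __init__(self, num_sets):
        self.bits = [[0] * num_sets, [0] * num_sets]   # one LRU bit array per way

    def lru_way(self, set_index):
        # The XOR of the two ways' bits points at the way to replace.
        return self.bits[0][set_index] ^ self.bits[1][set_index]

    def touch(self, set_index, way):
        # After accessing `way`, rewrite only that way's bit so the XOR
        # points at the other way.
        other = 1 - way
        self.bits[way][set_index] = other ^ self.bits[other][set_index]

# Example: in set 3, touching way 0 makes way 1 the replacement victim.
lru = TwoWayLRUBits(num_sets=8)
lru.touch(3, 0)
print(lru.lru_way(3))   # -> 1
```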

Patent
09 Jul 1991
TL;DR: In this paper, a method for ensuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner.
Abstract: A method for ensuring data consistency between a plurality of individual processor cache memories and the main memory in a multi-processor computer system is provided which is capable of (1) detecting when one of a set of predefined data inconsistency states occurs as a data transaction request is being processed, and (2) correcting the data inconsistency states so that the operation may be executed in a correct and consistent manner. In particular, the method is adapted to address two kinds of data inconsistency states: (1) A request for a write operation from a system unit to main memory when the location to be written to is present in the cache of some processor unit; in such a case, data in the cache is "stale" and the data inconsistency is avoided by preventing the associated processor from using the "stale" data; and (2) when a read operation is requested of main memory by a system unit and the location to be read may be written or has already been written in the cache of some processor; in this case, the data in main memory is "stale" and the data inconsistency is avoided by ensuring that the data returned to the requesting unit is the updated data in the cache. The presence of one of the above-described data inconsistency states is detected in an SCU-based multi-processing system by providing the SCU with means for maintaining a copy of the cache directories for each of the processor caches. The SCU continually compares address data accompanying memory access requests with what is stored in the SCU cache directories in order to determine the presence of predefined conditions indicative of data inconsistencies, and subsequently executes corresponding predefined fix-up sequences.

Patent
27 Jun 1991
TL;DR: In this paper, a branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table.
Abstract: (Error Transition Mode for Multi-Processor System) A pipelined CPU executing instructions of variable length, and referencing memory using various data widths. Macroinstruction pipelining is employed (instead of microinstruction pipelining), with queueing between units of the CPU to allow flexibility in instruction execution times. A wide bandwidth is available for memory access, fetching 64-bit data blocks on each cycle. A hierarchical cache arrangement has an improved method of cache set selection, increasing the likelihood of a cache hit. A writeback cache is used (instead of writethrough) and writeback is allowed to proceed even though other accesses are suppressed due to queues being full. A branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table. A floating point processor function is integrated on-chip, with enhanced speed due to a bypass technique; a trial mini-rounding is done on low-order bits of the result, and if correct, the last stage of the floating point processor can be bypassed, saving one cycle of latency. For CALL-type instructions, a method for determining which registers need to be saved is executed in a minimum number of cycles, examining groups of register mask bits at one time. Internal processor registers are accessed with short (byte width) addresses instead of full physical addresses as used for memory and I/O references, but off-chip processor registers are memory-mapped and accessed by the same busses using the same controls as the memory and I/O. If a non-recoverable error is detected by ECC circuits in the cache, an error transition mode is entered wherein the cache operates under limited access rules, allowing a maximum of access by the system for data blocks owned by the cache, yet minimizing changes to the cache data so that diagnostics may be run. Separate queues are provided for the return data from memory and cache invalidates, yet the order of bus transactions is maintained by a pointer arrangement. The bus protocol used by the CPU to communicate with the system bus is of the pended type, with transactions on the bus identified by an ID field specifying the originator, and arbitration for bus grant goes on simultaneously with address/data transactions on the bus.
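The branch history table mentioned above can be illustrated with one common realization (an assumption of this note, not necessarily the patent's "empirical algorithm"): a small direct-mapped table of 2-bit saturating counters indexed by branch address.

```python
# Sketch of a branch history table with 2-bit saturating counters.
# Counter >= 2 predicts "taken"; each resolved branch nudges its counter
# toward the actual outcome.
class BranchHistoryTable:
    def __init__(self, entries=256):
        self.entries = entries
        self.counters = [1] * entries          # start weakly not-taken

    def predict(self, branch_addr):
        return self.counters[branch_addr % self.entries] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
bht.update(0x400, taken=True)
bht.update(0x400, taken=True)
print(bht.predict(0x400))   # -> True (this branch has recently been taken)
```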

Patent
Igal Megory-Cohen1
01 Nov 1991
TL;DR: In this paper, a modified steepest descent method is proposed to handle unpredictable local cache activities prior to cache repartitioning to avoid readjustments which would result in unacceptably small or negative cache sizes in cases where a local cache is extremely underutilized.
Abstract: Dynamic partitioning of cache storage into a plurality of local caches for respective classes of competing processes is performed by a step of dynamically determining adjustments to the cache partitioning using a steepest descent method. A modified steepest descent method allows unpredictable local cache activities prior to cache repartitioning to be taken into account to avoid readjustments which would result in unacceptably small or, even worse, negative cache sizes in cases where a local cache is extremely underutilized. The method presupposes a unimodal distribution of cache misses.
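A heavily simplified sketch of the repartitioning step follows; the gradient estimate, step size, minimum size, and renormalization below are assumptions of this note and do not reproduce the patent's modified method.

```python
# Sketch of one steepest-descent repartitioning step over local cache sizes:
# move space toward caches whose extra capacity is estimated to save the most
# misses (most negative gradient of misses with respect to size), clamp so no
# cache becomes too small or negative, and rescale to the fixed total capacity.
def repartition(sizes, miss_gradients, step=0.1, min_size=1.0):
    total = sum(sizes)
    proposed = [s - step * g for s, g in zip(sizes, miss_gradients)]
    clamped = [max(min_size, p) for p in proposed]      # avoid tiny/negative caches
    scale = total / sum(clamped)                        # keep the fixed total capacity
    return [c * scale for c in clamped]

# Example: cache 0 benefits strongly from more space, cache 2 is underutilized.
print(repartition([40.0, 30.0, 30.0], [-50.0, -5.0, 40.0]))
```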

Patent
23 May 1991
TL;DR: In this paper, a directory-based protocol for maintaining data coherency in a multiprocessing (MP) system having a number of processors with associated write-back caches, a multistage interconnection network (MIN) leading to a shared memory, and a global directory associated with the main memory to keep track of state and control information of cache lines.
Abstract: A directory-based protocol is provided for maintaining data coherency in a multiprocessing (MP) system having a number of processors with associated write-back caches, a multistage interconnection network (MIN) leading to a shared memory, and a global directory associated with the main memory to keep track of state and control information of cache lines. Upon a request by a requesting cache for a cache line which has been exclusively modified by a source cache, two buffers are situated in the global directory to collectively intercept modified data words of the modified cache line during the write-back to memory. A modified word buffer is used to capture modified words within the modified cache line. Moreover, a line buffer stores an old cache line transferred from the memory, during the write back operation. Finally, the line buffer and the modified word buffer, together, provide the entire modified line to a requesting cache.
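The final merge performed by the two buffers is easy to picture in isolation. The sketch below (data layout is an illustrative assumption) combines the old line captured in the line buffer with the words captured in the modified word buffer to reconstruct the up-to-date line for the requesting cache.

```python
# Sketch of merging a write-back: the line buffer holds the old copy of the
# cache line from memory, the modified word buffer holds only the words the
# owning cache changed, and together they yield the full modified line.
def merge_line(old_line, modified_words):
    """old_line: list of words; modified_words: dict {word_index: new_value}."""
    return [modified_words.get(i, w) for i, w in enumerate(old_line)]

# Example: words 1 and 3 were modified during the write-back.
print(merge_line([10, 11, 12, 13], {1: 99, 3: 77}))   # -> [10, 99, 12, 77]
```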

Patent
Jamshed H. Mirza1
15 Apr 1991
TL;DR: In this paper, a cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio, and this record is used to decide whether its future references should be cached or not.
Abstract: A cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio. The mechanism keeps a record of an instruction's behavior in the immediate past, and this record is used to decide whether its future references should be cached or not. If an instruction is experiencing bad cache hit ratio, it is marked as non-cacheable, and its data references are made to bypass the cache. This avoids the additional penalty of unnecessarily fetching the remaining words in the line, reduces the demand on the memory bandwidth, avoids flushing the cache of useful data and, in parallel processing environments, prevents line thrashing. The cache management scheme is automatic and requires no compiler or user intervention.
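An illustrative sketch of the decision logic follows (the history length and threshold are assumptions of this note, not values from the patent): keep a short hit/miss record per load/store instruction and mark an instruction non-cacheable once its recent hit ratio is poor, so its references bypass the cache instead of polluting it.

```python
# Sketch of a per-instruction cache bypass predictor.
from collections import defaultdict, deque

class BypassPredictor:
    def __init__(self, history=32, threshold=0.25):
        self.history = defaultdict(lambda: deque(maxlen=history))
        self.threshold = threshold

    def record(self, instr_addr, hit):
        # Remember whether this instruction's latest data reference hit.
        self.history[instr_addr].append(1 if hit else 0)

    def should_bypass(self, instr_addr):
        h = self.history[instr_addr]
        if len(h) < h.maxlen:
            return False                    # not enough evidence yet: cache normally
        return sum(h) / len(h) < self.threshold

pred = BypassPredictor(history=4, threshold=0.5)
for hit in [False, False, True, False]:
    pred.record(0x8000, hit)
print(pred.should_bypass(0x8000))   # -> True (only 1 hit in the last 4 references)
```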

Patent
26 Dec 1991
TL;DR: In this paper, a reconfigurable set associative cache memory can be reconfigured from 2x way to 2y way associative memory by merging a predetermined number of least significant bits of the tag field of the main memory address with the line field.
Abstract: A reconfigurable set associative cache memory can be reconfigured from 2x way to 2y way set associative cache memory by effectively merging a predetermined number of least significant bits of the tag field of the main memory address with the line field of the main memory address. The effective merging is provided by logically merging least significant bits of the tag field with a reconfiguration designation. As a result, Y-X+1 different configurations of cache memory can be obtained using the Y-X least significant bits of the tag field merged with the cache memory address.
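The address split under different configurations can be sketched as below (the field widths in the example are illustrative assumptions, not values from the patent): lowering the associativity raises the number of sets, so the extra index bits are borrowed from the least significant bits of the tag.

```python
# Sketch of the index computation for a reconfigurable set-associative cache.
def split_address(addr, offset_bits, base_index_bits, extra_index_bits):
    """Return (tag, set_index, offset) for one configuration.

    extra_index_bits is how many tag LSBs are merged into the index;
    0 keeps the base (most associative) configuration.
    """
    index_bits = base_index_bits + extra_index_bits
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset

# Example: 32-byte lines and 64 base sets; borrowing 2 tag bits quadruples the
# number of sets (and divides the associativity by four).
print(split_address(0x12345678, offset_bits=5, base_index_bits=6, extra_index_bits=2))
```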

Patent
30 Aug 1991
TL;DR: In this article, the cache controller includes a set of latches coupled to the host bus which it uses to latch the state of host bus during a snoop cycle if the cache is unable to immediately snoop that cycle.
Abstract: A method and apparatus for enabling a dual ported cache system in a multiprocessor system to guarantee snoop access to all host bus cycles which require snooping. The cache controller includes a set of latches coupled to the host bus which it uses to latch the state of the host bus during a snoop cycle if the cache controller is unable to immediately snoop that cycle. The cache controller latches that state of the host bus in the beginning of a cycle and preserves this state throughout the cycle due to the effects of pipelining on the host bus. In addition, the cache controller is able to delay host bus cycles to guarantee snoop access to host bus cycles which require snooping. The cache controller generally only delays a host bus cycle when it is already performing other tasks, such as servicing its local processor, and cannot snoop the host bus cycle immediately. When the cache controller latches the state of the bus during a write cycle, it only begins to delay the host bus after a subsequent cycle begins. In this manner, one write cycle can complete on the host bus before the cache controller delays any cycles, thereby reducing the impact of snooping on host bus bandwidth. Read cycles are always delayed until the cache controller can complete the snooping operation because the cache may be the owner of the data and a write back cycle may be necessary.

Patent
27 Jun 1991
TL;DR: In this paper, a branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table.
Abstract: (of EP0465320) A pipelined CPU executing instructions of variable length, and referencing memory using various data widths. Macroinstruction pipelining is employed (instead of microinstruction pipelining), with queueing between units of the CPU to allow flexibility in instruction execution times. A wide bandwidth is available for memory access, fetching 64-bit data blocks on each cycle. A hierarchical cache arrangement has an improved method of cache set selection, increasing the likelihood of a cache hit. A writeback cache is used (instead of writethrough) and writeback is allowed to proceed even though other accesses are suppressed due to queues being full. A branch prediction method employs a branch history table which records the taken vs. not-taken history of branch opcodes recently used, and uses an empirical algorithm to predict which way the next occurrence of this branch will go, based upon the history table. A floating point processor function is integrated on-chip, with enhanced speed due to a bypass technique; a trial mini-rounding is done on low-order bits of the result, and if correct, the last stage of the floating point processor can be bypassed, saving one cycle of latency. For CALL-type instructions, a method for determining which registers need to be saved is executed in a minimum number of cycles, examining groups of register mask bits at one time. Internal processor registers are accessed with short (byte width) addresses instead of full physical addresses as used for memory and I/O references, but off-chip processor registers are memory-mapped and accessed by the same busses using the same controls as the memory and I/O. If a non-recoverable error is detected by ECC circuits in the cache, an error transition mode is entered wherein the cache operates under limited access rules, allowing a maximum of access by the system for data blocks owned by the cache, yet minimizing changes to the cache data so that diagnostics may be run. Separate queues are provided for the return data from memory and cache invalidates, yet the order of bus transactions is maintained by a pointer arrangement. The bus protocol used by the CPU to communicate with the system bus is of the pended type, with transactions on the bus identified by an ID field specifying the originator, and arbitration for bus grant goes on simultaneously with address/data transactions on the bus.

Patent
16 May 1991
TL;DR: In this article, a microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference is presented.
Abstract: A microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference. The microprocessor also includes the capability for locking instruction cache entries without requiring that the instructions be executed during the locking process.

Patent
Hiroaki Suzuki1
30 Apr 1991
TL;DR: In this paper, a hit discriminator receives first tag information included in an address given when the cache memory is accessed and second tag information read from cache memory in accordance with the given address, in order to discriminate a cache-hitting and a cache missing.
Abstract: A cache memory apparatus to be coupled to a main memory, comprises a cache memory having a plurality of ports and capable of being independently accessed through the plurality of ports. The cache memory stores a portion of data stored in the main memory and tag information indicating memory locations within the main memory of the data portion stored in the cache memory. A hit discriminator receives first tag information included in an address given when the cache memory is accessed and second tag information read from the cache memory in accordance with the given address, in order to discriminate a cache-hitting and a cache-missing on the basis of the first and second tag information. A replacement control circuit operates for replacing data and corresponding tag information in the cache memory when the cache-missing is discriminated by the hit discriminating circuit. A replacement limiting circuit operates for limiting the replacement of the cache memory to one time when a plurality of accesses to the same address are generated and all the plurality of accesses to the same address are missed.

Patent
Gregory S. Mathews1, Edward Zager1
16 Dec 1991
TL;DR: In this article, a cache memory hierarchy with a first level write through cache memory and a second level write back cache memory is provided to a computer system having a CPU, a main memory, and a number of DMA devices.
Abstract: A cache memory hierarchy having a first level write through cache memory and a second level write back cache memory is provided to a computer system having a CPU, a main memory, and a number of DMA devices. The first level write through cache memory responds to read and write accesses by the CPU, and snoop accesses by the DMA devices, whereas the second level write back cache memory responds to read and write accesses by the CPU as well as the DMA devices. Additionally, the first level write through cache memory is a relatively large cache memory designed to provide a high cache hit rate, whereas the second level write back cache memory is a relatively small cache memory designed to reduce accesses to the main memory. Furthermore, the first level write through cache memory reallocates its cache lines in response to CPU read misses only, whereas the second level write back cache memory reallocates its cache lines in response to CPU write misses only.

Patent
25 Feb 1991
TL;DR: In this paper, the processor accesses the data in the cache memory with one set of clocks while the remainder of the line of data is transferred to the inpage buffer with another set of clock cycles.
Abstract: A computer system comprises a data processor, a main memory, a cache memory and an inpage buffer. The cache memory is coupled to the main memory to receive data therefrom and is coupled to the processor to transfer data thereto. The inpage buffer is coupled to the main memory to receive data therefrom, coupled to the cache memory to transfer data thereto, and coupled to the processor to transfer data thereto. Part of a line of data is originally transferred to the cache memory bypassing the inpage buffer to give the processor immediate access to the data which it needs. The remainder of the line of data is subsequently transferred to the inpage buffer, and then the processor is given access to the contents of the inpage buffer. The processor accesses the data in the cache memory with one set of clocks while the remainder of the line of data is transferred to the inpage buffer with another set of clocks. The two sets of clocks optimize the operation of the processor and the main memory. Subsequently, the contents of the inpage buffer are transferred to the cache memory at the start of another inpage operation while the next line of data is fetched from the main memory.

Patent
19 Apr 1991
TL;DR: In this article, the data cache is logically partitioned into two separate sections, demand and prefetch, and a cache directory table and a least recently used table are used to maintain the cache.
Abstract: A least recently used cache replacement system in which the data cache is logically partitioned into two separate sections, demand and prefetch. A cache directory table and a least recently used table are used to maintain the cache. When a new demand data page is added to the cache, a most recently used (MRU) pointer is updated and points to this new page. When a prefetch page is added to the cache, the least recently used pointer of the demand section is updated with its backward pointer pointing to this new page. A cache hit on a demand or prefetch page moves that page to the top of the least recently used table. When a free page is needed in the cache, it is selected from the demand or prefetch sections of the memory based on a comparison of the demand hit density and the prefetch hit density so as to maintain a balance between these two hit densities.
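The victim-selection policy can be sketched on its own (the hit-density bookkeeping below is a simplification assumed by this note): take the free page from whichever section currently earns fewer hits per resident page, so the two densities stay roughly balanced.

```python
# Sketch of choosing which section (demand or prefetch) gives up a page.
def pick_victim_section(demand_hits, demand_pages, prefetch_hits, prefetch_pages):
    demand_density = demand_hits / demand_pages if demand_pages else 0.0
    prefetch_density = prefetch_hits / prefetch_pages if prefetch_pages else 0.0
    return "prefetch" if prefetch_density <= demand_density else "demand"

# Example: prefetched pages are earning fewer hits per page, so evict there.
print(pick_victim_section(demand_hits=90, demand_pages=100,
                          prefetch_hits=20, prefetch_pages=50))   # -> "prefetch"
```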

Patent
31 Dec 1991
TL;DR: In this paper, a data processing system which includes a microprocessor fabricated on an integrated circuit chip, a main memory external to the integrated circuit (IC) chip, and a backside cache external to an IC chip is described.
Abstract: A data processing system which includes a microprocessor fabricated on an integrated circuit chip, a main memory external to the integrated circuit chip, and a backside cache external to the integrated circuit chip. The backside cache includes a directory RAM for storing cache address tag and encoded cache state bits. A first bus connects the microprocessor to the cache, the first bus including backside bus cache directory tag signals comprised of address bits used for a cache hit comparison in the directory RAM and backside bus cache directory state bits for determining a state encoding of a set in the directory RAM. A second bus connects the microprocessor to the main memory. The directory includes means for comparing the cache directory tags on the first bus with the tags stored in the directory and for asserting a Bmiss signal upon the condition that the directory tags stored in the backside bus cache directory do not match the backside bus cache directory tag signals. The microprocessor responds to the Bmiss signal by issuing the access onto the second bus in the event of a cache miss.

Patent
Ikumi Noriyuki1
25 Jan 1991
TL;DR: In this paper, a multiport cache memory control unit includes a central processing unit having N arithmetic units for executing arithmetic processing, a tag memory having N address ports for storing addresses, and a multi-port cache memory with N data ports to store pieces of data at addresses which agree with the addresses stored in the tag memory.
Abstract: A multiport cache memory control unit includes a central processing unit having N arithmetic units for executing arithmetic processing, a tag memory having N address ports for storing addresses, a multiport cache memory having N data ports for storing pieces of data at addresses which agree with the addresses stored in the tag memory, and a snoop address port through which a snoop operation is executed to detect an address signal. Arithmetic processing is executed in each of the arithmetic units by reading a piece of data from the cache memory after providing an address signal to the tag memory to check whether or not the data is stored in the cache memory. In cases where a cache miss occurs, a piece of data stored in a main memory unit is fetched through the snoop address port without halting the arithmetic processing. In cases where a snoop hit occurs, an address signal provided from another control unit is transmitted to the tag memory through the snoop address port without halting the arithmetic processing.

Patent
11 Dec 1991
TL;DR: In this article, a cache memory functioning as a circular buffer for use as a part historical, part predictive cache memory is provided, where the difference between the values in the first and second pointer registers exceeds a predetermined amount.
Abstract: A cache memory functioning as a circular buffer for use as a part historical, part predictive cache memory is provided. A first register contains data having a value corresponding to a cache memory location of a last instruction executed by a processor and a second register contains data having a value corresponding to a memory location in the cache memory of a last prefetched instruction. Prefetching of instructions from a main memory to the cache memory is disabled if the difference between the values in the first and second pointer registers exceeds a predetermined amount.
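The throttling rule reduces to a pointer comparison on the circular buffer, as sketched below (the pointer arithmetic and the example sizes are illustrative; the patent describes the rule in terms of the two register values).

```python
# Sketch of the prefetch throttle: one pointer tracks the last instruction
# executed and another the last instruction prefetched; prefetching pauses
# when the prefetcher has run too far ahead of execution.
def prefetch_enabled(exec_ptr, prefetch_ptr, buffer_size, max_lead):
    lead = (prefetch_ptr - exec_ptr) % buffer_size    # circular distance ahead
    return lead <= max_lead

# Example: 200 entries ahead in a 1024-entry buffer with a 128-entry limit.
print(prefetch_enabled(exec_ptr=100, prefetch_ptr=300,
                       buffer_size=1024, max_lead=128))   # -> False
```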

ReportDOI
01 May 1991
TL;DR: Results suggest that garbage collection algorithms will play an important part in improving cache performance as processor speeds increase and two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms often by a factor of two and sometimes by almost five over direct-mapped caches.
Abstract: Cache performance is an important part of total performance in modern computer systems. This paper describes the use of trace-driven simulation to estimate the effect of garbage collection algorithms on cache performance. Traces from four large Common Lisp programs have been collected and analyzed with an all-associativity cache simulator. While previous work has focused on the effect of garbage collection on page reference locality, this evaluation unambiguously shows that garbage collection algorithms can have a profound effect on cache performance as well. On processors with a direct-mapped cache, a generation stop-and-copy algorithm exhibits a miss rate up to four times higher than a comparable generation mark-and-sweep algorithm. Furthermore, two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms, often by a factor of two and sometimes by a factor of almost five, over direct-mapped caches. As processor speeds increase, cache performance will play an increasing role in total performance. These results suggest that garbage collection algorithms will play an important part in improving that performance.

Journal ArticleDOI
TL;DR: Two distributed directory schemes, the tree directory and the hierarchical full-map directory, to deal with the storage overhead problem are presented and should lend themselves to the design and implementation of large-scale cache coherent multiprocessors.
Abstract: The cache coherence problem is a major issue in the design of shared-memory multiprocessors. As the number of processors grows, traditional bus-based snoopy schemes for cache coherence are no longer adequate. Instead, the directory-based scheme is a promising alternative for the large-scale cache coherence problem. However, the storage overhead of the (full-map) directory scheme may become prohibitive as the system size goes up. This paper presents two distributed directory schemes, the tree directory and the hierarchical full-map directory, to deal with the storage overhead problem. Preliminary trace-driven evaluations show that the performance of our schemes compares favorably to the full-map directory scheme, while reducing the storage overhead by over 90%. These two schemes should lend themselves to the design and implementation of large-scale cache coherent multiprocessors.