
Showing papers on "Cache published in 1986"


Journal ArticleDOI
TL;DR: In highly-pipelined machines, instructions and data are prefetched and buffered in both the processor and the cache, as discussed by the authors; this is done to reduce the average memory access latency and to take advantage of memory interleaving.
Abstract: In highly-pipelined machines, instructions and data are prefetched and buffered in both the processor and the cache. This is done to reduce the average memory access latency and to take advantage of memory interleaving.

398 citations


Proceedings ArticleDOI
27 Oct 1986
TL;DR: This work presents new on-line algorithms which decide, for each cache, which blocks to retain and which to drop in order to minimize communication over the bus in a snoopy cache multiprocessor system.
Abstract: In a snoopy cache multiprocessor system, each processor has a cache in which it stores blocks of data. Each cache is connected to a bus used to communicate with the other caches and with main memory. For several of the proposed models of snoopy caching, we present new on-line algorithms which decide, for each cache, which blocks to retain and which to drop in order to minimize communication over the bus. We prove that, for any sequence of operations, our algorithms' communication costs are within a constant factor of the minimum required for that sequence; for some of our algorithms we prove that no on-line algorithm has this property with a smaller constant.

268 citations
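
A minimal sketch of one countdown-style block-retention policy in the spirit of the on-line algorithms described above; the constants, field names, and the exact decrement rule are illustrative assumptions, not taken from the paper.

/* Illustrative countdown-based block retention for snoopy caching.
 * The block keeps a "credit" equal to the bus cost of re-fetching it;
 * each snooped remote write spends one unit, and the block is dropped
 * when the credit is exhausted. */

#include <stdbool.h>

#define BLOCK_WORDS 8          /* assumed cost (in bus words) of re-fetching a block */

struct block_state {
    bool valid;
    int  credit;               /* remaining budget before the block is dropped */
};

/* Local processor touches the block: (re)load it and reset its credit. */
static void on_local_access(struct block_state *b)
{
    if (!b->valid) {
        /* a bus read of BLOCK_WORDS words would happen here */
        b->valid = true;
    }
    b->credit = BLOCK_WORDS;
}

/* Snooped bus write by another cache to a word of this block: each such
 * write costs one bus word, so spend one unit of credit.  When the credit
 * is gone, drop the block so further remote writes need not be broadcast
 * to this cache. */
static void on_remote_write(struct block_state *b)
{
    if (!b->valid)
        return;
    if (--b->credit == 0)
        b->valid = false;      /* invalidate: stop paying for updates */
}

The point of such rules is that the total bus traffic spent keeping a block is bounded by a constant multiple of what any off-line strategy would pay for the same access sequence.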


Patent
06 Jun 1986
TL;DR: Memory integrity is maintained in a system with a hierarchical memory using a set of explicit cache control instructions, as mentioned in this paper; the caches store two status flags, a valid bit and a dirty bit, with each block of information.
Abstract: Memory integrity is maintained in a system with a hierarchical memory using a set of explicit cache control instructions. The caches in the system have two status flags, a valid bit and a dirty bit, with each block of information stored. The operating system executes selected cache control instructions to ensure memory integrity whenever there is a possibility that integrity could be compromised.

134 citations
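
A small model of the two per-block status flags and the kind of explicit cache-control operations the abstract describes; the operation names (cache_flush, cache_invalidate) and the block layout are illustrative assumptions, not the patent's mnemonics.

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 32

struct cache_block {
    bool     valid;            /* block holds a usable copy            */
    bool     dirty;            /* copy has been modified since loading */
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

void write_block_to_memory(const struct cache_block *b);   /* provided elsewhere */

/* Flush: make memory consistent with the cache, then drop the copy.
 * The operating system would issue this before, e.g., handing a buffer
 * to an I/O device. */
void cache_flush(struct cache_block *b)
{
    if (b->valid && b->dirty)
        write_block_to_memory(b);
    b->valid = false;
    b->dirty = false;
}

/* Invalidate: drop the copy without writing it back, used when memory
 * has been changed behind the cache's back. */
void cache_invalidate(struct cache_block *b)
{
    b->valid = false;
    b->dirty = false;
}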


Patent
James Gerald Brenza1
01 May 1986
TL;DR: In this paper, a common directory and an L1 control array (L1CA) are provided for the CPU to access both the L1 and L2 caches; the common directory is addressed by the CPU's requested logical addresses, each of which is either a real/absolute address or a virtual address, according to whichever address mode the CPU is in.
Abstract: A data processing system which contains a multi-level storage hierarchy, in which the two highest hierarchy levels (e.g. L1 and L2) are private (not shared) to a single CPU, in order to be in close proximity to each other and to the CPU. Each cache has a data line length convenient to the respective cache. A common directory and an L1 control array (L1CA) are provided for the CPU to access both the L1 and L2 caches. The common directory contains and is addressed by the CPU requesting logical addresses, each of which is either a real/absolute address or a virtual address, according to whichever address mode the CPU is in. Each entry in the directory contains a logical address representation derived from a logical address that previously missed in the directory. A CPU request "hits" in the directory if its requested address is in any private cache (e.g. in L1 or L2). A line presence field (LPF) is included in each directory entry to aid in determining a hit in the L1 cache. The L1CA contains L1 cache information to supplement the corresponding common directory entry; the L1CA is used during a L1 LRU castout, but is not the critical path of an L1 or L2 hit. A translation lookaside buffer (TLB) is not used to determine cache hits. The TLB output is used only during the infrequent times that a CPU request misses in the cache directory, and the translated address (i.e. absolute address) is then used to access the data in a synonym location in the same cache, or in main storage, or in the L1 or L2 cache in another CPU in a multiprocessor system using synonym/cross-interrogate directories.

134 citations


Patent
17 Jun 1986
TL;DR: A cache memory capable of concurrently accepting and working on completion of more than one cache access from a plurality of processors connected in parallel is discussed in this paper; accesses that cannot be completed immediately are transferred to pending-access-completion circuitry.
Abstract: A cache memory capable of concurrently accepting and working on completion of more than one cache access from a plurality of processors connected in parallel. Current accesses to the cache are handled by current-access-completion circuitry which determines whether the current access is capable of immediate completion and either completes the access immediately if so capable or transfers the access to pending-access-completion circuitry if not so capable. The latter circuitry works on completion of pending accesses; it determines and stores for each pending access status information prescribing the steps required to complete the access and redetermines that status information as conditions change. In working on completion of current and pending accesses, the addresses of the accesses are compared to those of memory accesses in progress on the system.

120 citations


Journal ArticleDOI
01 May 1986
TL;DR: In the design of SPUR, a high-performance multiprocessor workstation, the use of large caches and hardware-supported cache consistency suggests a new approach to virtual address translation, which substantially reduces the hardware cost and complexity of the translation mechanism and eliminates the translation consistency problem.
Abstract: In the design of SPUR, a high-performance multiprocessor workstation, the use of large caches and hardware-supported cache consistency suggests a new approach to virtual address translation. By performing translation in each processor's virtually-tagged cache, the need for separate translation lookaside buffers (TLBs) is eliminated. Eliminating the TLB substantially reduces the hardware cost and complexity of the translation mechanism and eliminates the translation consistency problem. Trace-driven simulations show that normal cache behavior is only minimally affected by caching page table entries, and that in many cases, using a separate device would actually reduce system performance.

115 citations


Proceedings ArticleDOI
01 May 1986
TL;DR: An analytical model for a cache-reload transient is developed and it is shown that the reload transient is related to the area in the tail of a normal distribution whose mean is a function of the footprints of the programs that compete for the cache.
Abstract: This paper develops an analytical model for a cache-reload transient. When an interrupt program or system program runs periodically in a cache-based computer, a short cache-reload transient occurs each time the interrupt program is invoked. That transient depends on the size of the cache, the fraction of the cache used by the interrupt program, and the fraction of the cache used by background programs that run between interrupts. We call the portion of a cache used by a program its footprint in the cache, and we show that the reload transient is related to the area in the tail of a normal distribution whose mean is a function of the footprints of the programs that compete for the cache. We believe that the model may be useful as well for predicting paging behavior in virtual-memory systems with round-robin scheduling.

112 citations
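
As a rough illustration of the footprint argument the abstract describes (the notation below is ours, not the paper's): suppose the cache has $S$ frames, the interrupt program's footprint occupies $F_I$ frames, the background programs' footprints occupy $F_B$ frames, and placements are treated as independent and roughly uniform. Then the number $X$ of interrupt-program frames displaced between invocations has

$$E[X] \approx F_I \cdot \frac{F_B}{S},$$

and, being a sum of many weakly dependent indicator variables, $X$ is approximately normally distributed, so the probability that the reload transient exceeds $x$ lines is a normal tail area:

$$\Pr[X > x] \approx 1 - \Phi\!\left(\frac{x - E[X]}{\sigma_X}\right).$$

This matches the abstract's statement that the reload transient is governed by the tail of a normal distribution whose mean is a function of the competing footprints; the exact mean and variance used in the paper's model differ in detail.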


Patent
12 Nov 1986
TL;DR: In this paper, the authors propose a system for maintaining data consistency among distributed processors, each having its associated cache memory, where a processor addresses data in its cache by specifying the virtual address.
Abstract: A system for maintaining data consistency among distributed processors, each having its associated cache memory. A processor addresses data in its cache by specifying the virtual address. The cache will search its cells for the data associatively. Each cell has a virtual address, a real address, flags and a plurality of associated data words. If there is no hit on the virtual address supplied by the processor, a map processor supplies the equivalent real address which the cache uses to access the data from another cache if one has it, or else from real memory. When a processor writes into a data word in the cache, the cache will update all other caches that share the data before allowing the write to the local cache.

106 citations
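
A schematic of the per-cell lookup and update path described above; the structures and function names are illustrative assumptions, and the real system performs the search associatively in hardware rather than in a loop.

#include <stdbool.h>
#include <stdint.h>

#define CELLS 256

struct cell {
    bool     valid;
    bool     shared;                 /* other caches also hold this data */
    uint32_t va;                     /* virtual address tag              */
    uint32_t ra;                     /* real address tag                 */
    uint32_t data;
};

uint32_t map_virtual_to_real(uint32_t va);                 /* map processor */
bool     fetch_from_other_cache(uint32_t ra, uint32_t *out);
uint32_t fetch_from_real_memory(uint32_t ra);
void     update_other_sharers(uint32_t ra, uint32_t value);

/* Associative search of the cells on the virtual address. */
static struct cell *lookup(struct cell cache[], uint32_t va)
{
    for (int i = 0; i < CELLS; i++)
        if (cache[i].valid && cache[i].va == va)
            return &cache[i];
    return NULL;
}

uint32_t read_word(struct cell cache[], uint32_t va, int victim)
{
    struct cell *c = lookup(cache, va);
    if (c == NULL) {                                       /* miss on the VA      */
        uint32_t ra = map_virtual_to_real(va);             /* ask the map         */
        c = &cache[victim];
        c->shared = fetch_from_other_cache(ra, &c->data);  /* prefer another cache */
        if (!c->shared)
            c->data = fetch_from_real_memory(ra);
        c->va = va;
        c->ra = ra;
        c->valid = true;
    }
    return c->data;
}

/* Writes update all sharing caches before the local copy is changed. */
void write_word(struct cell cache[], uint32_t va, int victim, uint32_t value)
{
    struct cell *c = lookup(cache, va);
    if (c == NULL) {
        (void)read_word(cache, va, victim);                /* bring the cell in   */
        c = lookup(cache, va);
    }
    if (c->shared)
        update_other_sharers(c->ra, value);
    c->data = value;
}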


Journal ArticleDOI
01 May 1986
TL;DR: This paper shows how the VMP design provides the high memory bandwidth required by modern high-performance processors with a minimum of hardware complexity and cost, and describes simple solutions to the consistency problems associated with virtually addressed caches.
Abstract: VMP is an experimental multiprocessor that follows the familiar basic design of multiple processors, each with a cache, connected by a shared bus to global memory. Each processor has a synchronous, virtually addressed, single master connection to its cache, providing very high memory bandwidth. An unusually large cache page size and fast sequential memory copy hardware make it feasible for cache misses to be handled in software, analogously to the handling of virtual memory page faults. Hardware support for cache consistency is limited to a simple state machine that monitors the bus and interrupts the processor when a cache consistency action is required.In this paper, we show how the VMP design provides the high memory bandwidth required by modern high-performance processors with a minimum of hardware complexity and cost. We also describe simple solutions to the consistency problems associated with virtually addressed caches. Simulation results indicate that the design achieves good performance providing data contention is not excessive.

99 citations


Patent
Edmund J. Kelly1
24 Jul 1986
TL;DR: In this article, row and column addresses are used to access data stored in a dynamic random access memory (DRAM), where the high order bits represent a virtual row address and the low order bits represent a real column address.
Abstract: A memory architecture having particular application for use in computer systems employing virtual memory techniques. A processor provides row and column addresses to access data stored in a dynamic random access memory (DRAM). The virtual address supplied by the processor includes high and low order bits. In the present embodiment, the high order bits represent a virtual row address and the low order bits represent a real column address. The virtual row address is applied to a memory management unit (MMU) for translation into a real row address. The real column address need not be translated. A comparator compares the current virtual row address to the previous row address stored in a latch. If the current row and previous row addresses match, a cycle control circuit couples the real column address to the DRAM, and applies a strobe signal such that the desired data is accessed in the memory without the need to reapply the row address. If the row addresses do not match, the cycle control circuit initiates a complete memory fetch cycle and applies both row and column addresses to the DRAM, along with the respective strobe signals. By properly organizing data in the memory, the probability that sequential memory operations access the same row in the DRAM may be significantly increased. By using such an organization, the present invention provides data retrieval at speeds on the order of a cache based memory system for a subset of data stored.

94 citations
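
A minimal sketch of the row-comparison logic the abstract describes; the function names and the translation interface are illustrative assumptions, not taken from the patent.

#include <stdbool.h>
#include <stdint.h>

uint32_t mmu_translate_row(uint32_t virtual_row);     /* MMU: virtual row -> real row   */
void     dram_full_cycle(uint32_t row, uint32_t col); /* apply RAS + CAS                */
void     dram_column_only(uint32_t col);              /* row already latched: CAS only  */

static uint32_t latched_virtual_row;
static bool     row_latch_valid = false;

/* Access one word given a virtual address split into a virtual row
 * (high-order bits) and a real column (low-order bits). */
void memory_access(uint32_t virtual_row, uint32_t real_column)
{
    if (row_latch_valid && virtual_row == latched_virtual_row) {
        /* Same DRAM row as the previous access: skip the row setup and
         * strobe only the column, giving a fast, cache-like access time. */
        dram_column_only(real_column);
    } else {
        /* Different row: translate it, run a full RAS/CAS cycle,
         * and remember the row for subsequent accesses. */
        uint32_t real_row = mmu_translate_row(virtual_row);
        dram_full_cycle(real_row, real_column);
        latched_virtual_row = virtual_row;
        row_latch_valid = true;
    }
}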


Patent
25 Jul 1986
TL;DR: In this article, a method and apparatus for marking data that is temporarily cacheable to facilitate the efficient management of said data is presented, where a bit in the segment and/or page descriptor of the data called the marked data bit (MDB) is generated by the compiler and included in a request for data from memory by the processor in the form of a memory address and will be stored in the cache directory at a location related to the particular line of data involved.
Abstract: A method and apparatus for marking data that is temporarily cacheable to facilitate the efficient management of said data. A bit in the segment and/or page descriptor of the data called the marked data bit (MDB) is generated by the compiler and included in a request for data from memory by the processor in the form of a memory address and will be stored in the cache directory at a location related to the particular line of data involved. The bit is passed to the cache together with the associated real address after address translation (in the case of a real cache). when the cache controls load the address of the data in the directory it is also stored the marked data bit (MDB) in the directory with the address. When the cacheability of the temporarily cacheable data changes from cacheable to non-cacheable, a single instruction is issued to cause the cache to invalidate all marked data. When an "invalidate marked data" instruction is received, the cache controls sweep through the entire cache directory and invalidate any cache line which has the "marked data bit" set in a single pass. An extension of the invention involves using a multi-bit field rather than a single bit to provide a more versatile control of the temporary cacheability of data.
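
A small sketch of the "invalidate marked data" sweep described above; the directory layout and names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define DIRECTORY_LINES 1024

struct dir_entry {
    bool     valid;
    bool     marked;        /* marked data bit (MDB): line is only temporarily cacheable */
    uint32_t address_tag;
};

static struct dir_entry directory[DIRECTORY_LINES];

/* On a line fill, store the MDB that arrived with the translated address. */
void directory_load(int line, uint32_t tag, bool marked_data_bit)
{
    directory[line].valid       = true;
    directory[line].marked      = marked_data_bit;
    directory[line].address_tag = tag;
}

/* Single "invalidate marked data" operation: one pass over the directory,
 * dropping every line whose MDB is set. */
void invalidate_marked_data(void)
{
    for (int line = 0; line < DIRECTORY_LINES; line++)
        if (directory[line].valid && directory[line].marked)
            directory[line].valid = false;
}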

Patent
27 Mar 1986
TL;DR: In this paper, the authors propose a computer having a cache memory and a main memory, with a transformation unit between the main memory and the cache memory, so that at least a portion of an information unit retrieved from the main memory may be transformed during retrieval of the information (fetch) from the main memory and prior to storage in the cache memory (cache).
Abstract: A computer having a cache memory and a main memory is provided with a transformation unit between the main memory and the cache memory so that at least a portion of an information unit retrieved from the main memory may be transformed during retrieval of the information (fetch) from a main memory and prior to storage in the cache memory (cache). In a specific embodiment, an instruction may be predecoded prior to storage in the cache memory. In another embodiment involving a branch instruction, the address of the target of the branch is calculated prior to storing in the instruction cache. The invention has advantages where a particular instruction is repetitively executed since a needed decode operation which has been partially performed previously need not be repeated with each execution of an instruction. Consequently, the latency time of each machine cycle may be reduced, and the overall efficiency of the computing system can be improved. If the architecture defines delayed branch instructions, such branch instructions may be executed in effectively zero machine cycles. This requires a wider bus and an additional register in the processor to allow the fetching of two instructions from the cache memory in the same cycle.

Patent
Lishing Liu1
12 Sep 1986
TL;DR: In this article, a method and apparatus is provided for associating in cache directories the Control Domain Identifications (CDIDs) of software covered by each cache line; through the use of such provision and/or the addition of identifications of users actively using lines, cache coherence of certain data is controlled without performing conventional Cross-Interrogates (XIs), provided the accesses to such objects are properly synchronized with locking-type concurrency controls.
Abstract: A method and apparatus is provided for associating in cache directories the Control Domain Identifications (CDIDs) of software covered by each cache line. Through the use of such provision and/or the addition of Identifications of users actively using lines, cache coherence of certain data is controlled without performing conventional Cross-Interrogates (XIs), if the accesses to such objects are properly synchronized with locking type concurrency controls. Software protocols to caches are provided for the resource kernel to control the flushing of released cache lines. The parameters of these protocols are high level Domain Identifications and Task Identifications.

Patent
03 Oct 1986
TL;DR: In this paper, a cache memory system with multiple-word boundary registers, multiple-word line registers, and a multiple-word boundary detector system is presented, where the cache memory stores four words per addressable line of cache storage.
Abstract: In a cache memory system, multiple-word boundary registers, multiple-word line registers, and a multiple-word boundary detector system provide accelerated access of data contained within the cache memory within the multiple-word boundaries, and provides for effective prefetch of sequentially ascending locations of stored data from the cache memory. In an illustrated embodiment, the cache memory stores four words per addressable line of cache storage, and accordingly quad-word boundary registers determine boundary limits on quad-words, quad-word line registers store, in parallel, a selected line from the cache memory, and a quad-word boundary detector system determines when to prefetch the next set of quad-words from the cache memory for storage in the quad-word line registers.
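
A schematic of the quad-word line-register and boundary-detection idea; the field widths, the prefetch trigger point, and the function names are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 4                          /* quad-word lines               */

struct line_register {
    bool     valid;
    uint32_t line_address;                        /* address of the quad-word line */
    uint32_t words[WORDS_PER_LINE];               /* whole line held in parallel   */
};

void cache_read_line(uint32_t line_address, uint32_t words[WORDS_PER_LINE]);

static struct line_register line_reg;

/* Read one word; serve it from the line register when the address falls
 * inside the currently held quad-word, and prefetch the next line once the
 * last word of the current line is consumed. */
uint32_t read_word(uint32_t address)
{
    uint32_t line   = address / WORDS_PER_LINE;
    uint32_t offset = address % WORDS_PER_LINE;

    if (!line_reg.valid || line_reg.line_address != line) {
        cache_read_line(line, line_reg.words);    /* crossed a boundary: refill    */
        line_reg.line_address = line;
        line_reg.valid = true;
    }

    uint32_t value = line_reg.words[offset];

    if (offset == WORDS_PER_LINE - 1) {
        /* Boundary detector: sequential access is about to leave this
         * quad-word, so prefetch the next line from the cache. */
        cache_read_line(line + 1, line_reg.words);
        line_reg.line_address = line + 1;
    }
    return value;
}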

Patent
21 Feb 1986
TL;DR: In this article, a cache and memory management system architecture and associated protocol is described, comprising a set associative memory cache subsystem, a set associative translation logic memory subsystem, hardwired page translation, selectable access mode logic, and selectively enableable instruction prefetch logic.
Abstract: A cache and memory management system architecture and associated protocol is disclosed. The cache and memory management system is comprised of a set associative memory cache subsystem, a set associative translation logic memory subsystem, hardwired page translation, selectable access mode logic, and selectively enableable instruction prefetch logic. The cache and memory management system includes a system interface for coupling to a system bus to which a main memory is coupled, and is also comprised of a processor/cache bus interface for coupling to an external CPU. As disclosed, the cache memory management system can function either as an instruction cache, with instruction prefetch capability and on-chip program counter capabilities, or as a data cache memory management system which has an address register for receiving addresses from the CPU, to initiate a transfer of defined numbers of words of data commencing at the transmitted address. Another novel feature disclosed is the quadword boundary registers, quadword line registers, and quadword boundary detector subsystem, which accelerates access of data within quadword boundaries, and provides for effective prefetch of sequentially ascending locations of storage instructions or data from the cache memory subsystem.

Journal ArticleDOI
01 May 1986
TL;DR: This work describes a protocol that the authors are currently exploring for cache synchronization in a broadcast system, and analyzes the evolution of options that have been proposed under a write-in (or write-back) policy.
Abstract: Many options are possible in a cache synchronization (or consistency) scheme for a broadcast system. We clarify basic concepts, analyze the handling of shared data, and then describe a protocol that we are currently exploring. Finally, we analyze the evolution of options that have been proposed under write-in (or write-back) policy. We show how our protocol extends this evolution with new methods for efficient busy-wait locking, waiting, and unlocking. The lock scheme allows locking and unlocking to occur in zero time, eliminating the need for test-and-set. The scheme also integrates processor atomic read-modify-write instructions and programmer/compiler busy-wait-synchronized operations under the same mechanism. The wait scheme eliminates all unsuccessful retries from the bus, and allows a process to work while waiting.

Patent
16 Oct 1986
TL;DR: In this article, a pipelined digital computer processor system is provided comprising an instruction prefetch unit (IPU,2) for prefetching instructions and an arithmetic logic processing unit (ALPU, 4) for executing instructions.
Abstract: A pipelined digital computer processor system (10, FIG. 1) is provided comprising an instruction prefetch unit (IPU,2) for prefetching instructions and an arithmetic logic processing unit (ALPU, 4) for executing instructions. The IPU (2) has associated with it a high speed instruction cache (6), and the ALPU (4) has associated with it a high speed operand cache (8). Each cache comprises a data store (84, 94, FIG. 3) for storing frequently accessed data, and a tag store (82, 92, FIG. 3) for indicating which main memory locations are contained in the respective cache. The IPU and ALPU processing units (2, 4) may access their associated caches independently under most conditions. When the ALPU performs a write operation to main memory, it also updates the corresponding data in the operand cache and, if contained therein, in the instruction cache permitting the use of self-modifying code. The IPU does not write to either cache. Provision is made for clearing the caches on certain conditions when their contents become invalid.
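
A sketch of the write path described above, where an ALPU write to main memory also updates the operand cache and, when the location is present there, the instruction cache, which is what permits self-modifying code; the structures and names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

struct cache;                                       /* tag store + data store */

bool cache_contains(struct cache *c, uint32_t addr);
void cache_update(struct cache *c, uint32_t addr, uint32_t data);
void main_memory_write(uint32_t addr, uint32_t data);

/* ALPU write: main memory is always written; the corresponding data in the
 * operand cache is updated, and the instruction cache is updated only if it
 * already holds the location. The IPU itself never writes to either cache. */
void alpu_write(struct cache *operand_cache, struct cache *instruction_cache,
                uint32_t addr, uint32_t data)
{
    main_memory_write(addr, data);
    if (cache_contains(operand_cache, addr))
        cache_update(operand_cache, addr, data);
    if (cache_contains(instruction_cache, addr))
        cache_update(instruction_cache, addr, data);
}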

Patent
06 Feb 1986
TL;DR: In this article, an address translation unit is included on the same chip as, and logically between, the address generating unit and the tag comparator logic, and interleaved access to more than one cache may be accomplished on the external address, data and tag busses.
Abstract: A cache-based computer architecture has the address generating unit and the tag comparator packaged together and separately from the cache RAMS. If the architecture supports virtual memory, an address translation unit may be included on the same chip as, and logically between, the address generating unit and the tag comparator logic. Further, interleaved access to more than one cache may be accomplished on the external address, data and tag busses.



Patent
29 Jul 1986
TL;DR: In this paper, a prefetch buffer is provided along with a prefetch control register and control logic for splitting the prefetch buffer into two logical channels, a first channel for handling prefetches associated with requests from the first processor, and a second channel for handling prefetches associated with requests from the second processor.
Abstract: Control logic for controlling references to a cache (24) including a cache directory (62) which is capable of being configured into a plurality of ways, each way including tag and valid-bit storage for associatively searching the directory (62) for cache data-array addresses. A cache-configuration register and control logic (64) splits the cache directory (62) into two logical directories, one directory for controlling requests from a first processor and the other directory for controlling requests from a second processor. A prefetch buffer (63) is provided along with a prefetch control register for splitting the prefetch buffer into two logical channels, a first channel for handling prefetches associated with requests from the first processor, and a second channel for handling prefetches associated with requests from the second processor.

Patent
James W. Keeley1
27 Jun 1986
TL;DR: In this paper, a cache memory subsystem has multilevel directory memory and buffer memory pipeline stages shared by at least a pair of independently operated central processing units; each processing unit is allocated one-half of the total available cache memory space by separate accounting replacement apparatus included within the buffer memory stage.
Abstract: A cache memory subsystem has multilevel directory memory and buffer memory pipeline stages shared by at least a pair of independently operated central processing units. For completely independent operation, each processing unit is allocated one-half of the total available cache memory space by separate accounting replacement apparatus included within the buffer memory stage. A multiple allocation memory (MAM) is also included in the buffer memory stage. During each directory allocation cycle performed for a processing unit, the allocated space of the other processing unit is checked for the presence of a multiple allocation. The address of the multiple allocated location associated with the processing unit having the lower priority is stored in the MAM allowing for earliest data replacement thereby maintaining data coherency between both independently operated processing units.

Proceedings Article
01 Jan 1986
TL;DR: The goal is for the cache behavior to dominate performance but the large storage facility to dominate costs, thus giving the illusion of a large, fast, inexpensive storage system.
Abstract: ... a relatively slow and inexpensive source of information and a much faster consumer of that information. The cache capacity is relatively small and expensive, but quickly accessible. The goal is for the cache behavior to dominate performance but the large storage facility to dominate costs, thus giving the illusion of a large, fast, inexpensive storage system. Successful operation depends both on a substantial level of locality being exhibited by the consumer, and careful strategies being chosen for cache operation disciplines for replacement of contents, update synchronization, ...

Patent
03 Oct 1986
TL;DR: In this article, a microprocessor architecture is described having separate very high speed instruction and data interface circuitry for coupling, via respective separate very high speed instruction and data interface buses, to respective external instruction cache and data cache circuitry.
Abstract: A microprocessor architecture is disclosed having separate very high speed instruction and data interface circuitry for coupling via respective separate very high speed instruction and data interface buses to respective external instruction cache and data cache circuitry. The microprocessor is comprised of an instruction interface, a data interface, and an execution unit. The instruction interface controls communications with the external instruction cache and couples the instructions from the instruction cache to the microprocessor at very high speed. The data interface controls communications with the external data cache and communicates data bidirectionally at very high speed between the data cache and the microprocessor. The execution unit selectively processes the data received via the data interface from the data cache responsive to the execution unit decoding and executing a respective one of the instructions received via the instruction interface from the instruction cache. In one embodiment, the external instruction cache is comprised of a program counter and addressable memory for outputting stored instructions responsive to its program counter and to an instruction cache advance signal output from the instruction interface. Circuitry in the instruction interface selectively outputs an initial instruction address for storage in the instruction cache program counter responsive to a context switch or branch, such that the instruction interface repetitively couples a plurality of instructions from the instruction cache to the microprocessor responsive to the cache advance signal, independent of and without the need for any intermediate or further address output from the instruction interface to the instruction cache except upon the occurrence of another context switch or branch.

Patent
06 Aug 1986
TL;DR: Cache memory includes a dual or two-part cache with one part of the cache being primarily designated for instruction data while the other part is designated for operand data, but not exclusively.
Abstract: Cache memory includes a dual or two-part cache, with one part of the cache being primarily, but not exclusively, designated for instruction data while the other is primarily designated for operand data. For maximum speed of operation, the two parts of the cache are equal in capacity. The two parts of the cache, designated I-Cache and O-Cache, are semi-independent in their operation and include arrangements for effecting synchronized searches; together they can accommodate up to three separate operations substantially simultaneously. Each cache unit has a directory and a data array, with the directory and data array being separately addressable. Each cache unit may be subjected to a primary and to one or more secondary concurrent uses, with the secondary uses prioritized. Data is stored in the cache unit on a so-called store-into basis, wherein data obtained from the main memory is operated upon and stored in the cache without returning the operated-upon data to the main memory unit until subsequent transactions require such return.

Patent
20 Oct 1986
TL;DR: In this article, main memory is not accessible other than from the cache memory, so no main memory access delay is caused by requests from other system modules such as the I/O controller.
Abstract: A computer system in which only the cache memory is permitted to communicate with main memory, and in which the address being used in the cache is also sent at the same time to the main memory. Thus, as soon as it is discovered that the desired main memory address is not presently in the cache, the main memory RAMs can be read to the cache without being delayed by the main memory address set-up time. In addition, since the main memory is not accessible other than from the cache memory, there is also no main memory access delay caused by requests from other system modules such as the I/O controller. Likewise, since the contents of the cache memory are written into a temporary register before being sent to the main memory, a main memory read can be performed before doing a writeback of the cache to the main memory, so that data can be returned to the cache in approximately the same amount of time required for a normal main memory access. The result is a significant reduction in the overhead time normally associated with cache memories.
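
An illustrative sketch of the read flow described above: the address is presented to main memory at the same time as the cache lookup, and on a miss the dirty victim is parked in a temporary register so the memory read need not wait for the writeback. The names and exact sequencing are assumptions.

#include <stdbool.h>
#include <stdint.h>

struct block { bool valid, dirty; uint32_t tag, data; };

void     main_memory_begin_read(uint32_t addr);   /* start RAM access (address setup) */
uint32_t main_memory_finish_read(void);           /* collect the data once ready      */
void     main_memory_write(uint32_t addr, uint32_t data);

static struct block temp_register;                /* dirty victim awaiting writeback  */
static uint32_t     temp_register_addr;
static bool         temp_register_full = false;

uint32_t cached_read(struct block *victim, uint32_t addr, bool hit)
{
    /* The same address goes to main memory and to the cache in parallel,
     * so the RAM setup time is already under way if this turns out to miss. */
    main_memory_begin_read(addr);

    if (hit)
        return victim->data;                      /* cache supplies the data          */

    if (victim->valid && victim->dirty) {
        /* Park the victim instead of writing it back first. */
        temp_register      = *victim;
        temp_register_addr = victim->tag;         /* illustrative: tag stands in for the address */
        temp_register_full = true;
    }

    victim->data  = main_memory_finish_read();    /* refill without waiting for the writeback */
    victim->tag   = addr;
    victim->valid = true;
    victim->dirty = false;

    if (temp_register_full) {                     /* writeback happens after the refill */
        main_memory_write(temp_register_addr, temp_register.data);
        temp_register_full = false;
    }
    return victim->data;
}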

Patent
28 May 1986
TL;DR: In this article, a cache system is described that employs an LRU (Least Recently Used) scheme in its cache-block replacement algorithm and comprises a directory memory whose entries have LRU counter fields, a host system issuing a read/write command to which an arbitrary LRU setting value is appended, a directory search circuit, and a microprocessor.
Abstract: A cache system employing an LRU (Least Recently Used) scheme in a replacement algorithm of cache blocks and comprising a directory memory whose entries have LRU counter fields, a host system issuing a read/write command to which an arbitrary LRU setting value is appended, a directory search circuit, and a microprocessor. The directory search circuit searches the directory memory in response to the read/write command issued from the host system. The microprocessor stores the LRU setting value appended to the read/write command in the LRU counter field of the hit entry of the directory memory, or of the entry corresponding to the replacement-target cache block, in response to the search result of the directory search circuit.
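
A small sketch of how a host-supplied LRU setting value might be stored in the directory entry that hits, or in the entry chosen for replacement; the structures, the victim-selection rule, and the names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define DIR_ENTRIES 256

struct dir_entry {
    bool     valid;
    uint32_t block_address;
    uint8_t  lru_counter;           /* LRU counter field of this entry */
};

static struct dir_entry directory[DIR_ENTRIES];

/* Directory search: return the hit entry, or otherwise the current
 * replacement target (here, simply the entry with the largest counter). */
static struct dir_entry *search_or_replace(uint32_t block_address)
{
    struct dir_entry *target = &directory[0];
    for (int i = 0; i < DIR_ENTRIES; i++) {
        if (directory[i].valid && directory[i].block_address == block_address)
            return &directory[i];                       /* hit              */
        if (directory[i].lru_counter > target->lru_counter)
            target = &directory[i];                     /* candidate victim */
    }
    return target;
}

/* Handle a read/write command carrying an arbitrary LRU setting value:
 * whichever entry serves the command receives the host's value, letting
 * the host bias how long the block survives in the cache. */
void handle_command(uint32_t block_address, uint8_t lru_setting_value)
{
    struct dir_entry *e = search_or_replace(block_address);
    e->valid         = true;
    e->block_address = block_address;
    e->lru_counter   = lru_setting_value;
}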

Patent
31 Jul 1986
TL;DR: In this paper, each entry in the cache memory has two valid bits, one associated with the user execution space and the other with the supervisor or operating system execution space; each collection of valid bits can be cleared in unison independently of the other.
Abstract: Each entry in a cache memory located between a processor and an MMU has two valid bits. One valid bit is associated with the user execution space and the other with the supervisor or operating system execution space. Each collection of valid bits can be cleared in unison independently of the other. This allows supervisor entries in the cache to survive context changes without being purged along with the user entries.
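
A minimal sketch of the dual valid bits described above; the entry layout and function names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define CACHE_ENTRIES 512

struct cache_entry {
    bool     user_valid;        /* valid in the user execution space       */
    bool     supervisor_valid;  /* valid in the supervisor execution space */
    uint32_t tag;
    uint32_t data;
};

static struct cache_entry cache[CACHE_ENTRIES];

/* Clear all user-space valid bits in unison, e.g. on a context change.
 * Supervisor entries survive untouched. */
void flush_user_entries(void)
{
    for (int i = 0; i < CACHE_ENTRIES; i++)
        cache[i].user_valid = false;
}

/* The supervisor entries can be cleared independently when needed. */
void flush_supervisor_entries(void)
{
    for (int i = 0; i < CACHE_ENTRIES; i++)
        cache[i].supervisor_valid = false;
}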

Proceedings ArticleDOI
01 May 1986
TL;DR: It is shown that the logical problem of buffering is directly related to the problem of synchronization, and a simple model is presented to evaluate the performance improvement resulting from buffering.
Abstract: In highly-pipelined machines, instructions and data are prefetched and buffered in both the processor and the cache. This is done to reduce the average memory access latency and to take advantage of memory interleaving. Lock-up free caches are designed to avoid processor blocking on a cache miss. Write buffers are often included in a pipelined machine to avoid processor waiting on writes. In a shared memory multiprocessor, there are more advantages in buffering memory requests, since each memory access has to traverse the memory-processor interconnection and has to compete with memory requests issued by different processors. Buffering, however, can cause logical problems in multiprocessors. These problems are aggravated if each processor has a private memory in which shared writable data may be present, such as in a cache-based system or in a system with a distributed global memory. In this paper, we analyze the benefits and problems associated with the buffering of memory requests in shared memory multiprocessors. We show that the logical problem of buffering is directly related to the problem of synchronization. A simple model is presented to evaluate the performance improvement resulting from buffering.
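
A classic two-processor example of the kind of logical problem the paper analyzes: with write buffers, each processor's flag write can still be sitting in its buffer when the other processor reads, so both can enter the critical section. This fragment only illustrates the hazard; the paper's analysis and model are more general.

/* Dekker-style flags on two processors with write buffering.  Without
 * forcing the buffered writes to memory before the reads, both processors
 * can observe the other's flag as 0 and proceed. */

#include <stdint.h>

volatile int flag0 = 0, flag1 = 0;

void processor0(void)
{
    flag0 = 1;                 /* may linger in P0's write buffer             */
    /* a synchronization/fence here would force the write to global memory   */
    if (flag1 == 0) {
        /* critical section: P1 may be here at the same time                 */
    }
}

void processor1(void)
{
    flag1 = 1;                 /* may linger in P1's write buffer             */
    if (flag0 == 0) {
        /* critical section */
    }
}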