
Showing papers on "Cache pollution published in 1988"


Proceedings ArticleDOI
17 May 1988
TL;DR: In this article, cache coherence in shared-memory multiprocessors is addressed by comparing its two basic approaches, directory schemes and snoopy cache schemes; directory schemes, which have received less attention in recent years, are shown to be potentially attractive for large systems beyond the scaling limits of snoopy caches.
Abstract: The problem of cache coherence in shared-memory multiprocessors has been addressed using two basic approaches: directory schemes and snoopy cache schemes. Directory schemes have been given less attention in the past several years, while snoopy cache methods have become extremely popular. Directory schemes for cache coherence are potentially attractive in large multiprocessor systems that are beyond the scaling limits of the snoopy cache schemes. Slight modifications to directory schemes can make them competitive in performance with snoopy cache schemes for small multiprocessors. Trace-driven simulation, using data collected from several real multiprocessor applications, is used to compare the performance of standard directory schemes, modifications to these schemes, and snoopy cache protocols.

525 citations


Journal ArticleDOI
TL;DR: A program tracing technique called ATUM (Address Tracing Using Microcode) is developed that captures realistic traces of multitasking workloads including the operating system; the traces show that both the operating system and multiprogramming activity significantly degrade cache performance, with an even greater proportional impact on large caches.
Abstract: Large caches are necessary in current high-performance computer systems to provide the required high memory bandwidth. Because a small decrease in cache performance can result in significant system performance degradation, accurately characterizing the performance of large caches is important. Although measurements on actual systems have shown that operating systems and multiprogramming can affect cache performance, previous studies have not focused on these effects. We have developed a program tracing technique called ATUM (Address Tracing Using Microcode) that captures realistic traces of multitasking workloads including the operating system. Examining cache behavior using these traces from a VAX processor shows that both the operating system and multiprogramming activity significantly degrade cache performance, with an even greater proportional impact on large caches. From a careful analysis of the causes of this degradation, we explore various techniques to reduce this loss. While seemingly little can be done to mitigate the effect of system references, multitasking cache miss activity can be substantially reduced with small hardware additions.

244 citations


Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.

236 citations


Patent
27 Jun 1988
TL;DR: In this article, a pipelined central processor capable of executing both single-cycle instructions and multicycle instructions is provided, which includes an instruction cache memory and a prediction cache memory that are commonly addressed by a program counter register.
Abstract: A pipelined central processor capable of executing both single-cycle instructions and multicycle instructions is provided. An instruction fetch stage of the processor includes an instruction cache memory and a prediction cache memory that are commonly addressed by a program counter register. The instruction cache memory stores instructions of a program being executed and microinstructions of a multicycle instruction interpreter. The prediction cache memory stores interpreter call predictions and interpreter entry addresses at the addresses of the multicycle instructions. When a call prediction occurs, the entry address of the instruction interpreter is loaded into the program counter register on the processing cycle immediately following the call prediction, and a return address is pushed onto a stack. The microinstructions of the interpreter are fetched sequentially from the instruction cache memory. When the interpreter is completed, the prediction cache memory makes a return prediction. The return address is transferred from the stack to the program counter register on the processing cycle immediately following the return prediction, and normal program flow is resumed. The prediction cache memory also stores branch instruction predictions and branch target addresses.

140 citations


Journal ArticleDOI
K. So, R.N. Rechtschaffen
TL;DR: In this paper, the performance of set-associative caches is analyzed by grouping the cache lines into regions according to their positions in the replacement stacks of a cache, and observing how the memory accesses of a CPU are distributed over these regions.
Abstract: The performance of set-associative caches is analyzed. The method used is to group the cache lines into regions according to their positions in the replacement stacks of a cache, and then to observe how the memory accesses of a CPU are distributed over these regions. Results from the preserved CPU traces show that the memory accesses are heavily concentrated on the most recently used (MRU) region in the cache. The concept of MRU change is introduced; the idea is to use the event that the CPU accesses a non-MRU line to approximate the time the CPU is changing its working set. The concept is shown to be useful in many aspects of cache design and performance evaluation, such as comparison of various replacement algorithms, improvement of prefetch algorithms, and speedup of cache simulation.
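
As a rough illustration of the replacement-stack analysis described above, the following sketch (a toy model with invented cache parameters and a synthetic trace, not the authors' simulator) runs an LRU-managed set-associative cache and tallies how many accesses hit each stack position; with a reasonably local trace, the MRU position dominates.

```python
from collections import defaultdict
import random

WAYS, SETS, LINE = 4, 64, 32               # hypothetical 8 KB, 4-way, 32-byte lines

sets = [[] for _ in range(SETS)]            # each set is an LRU stack, index 0 = MRU
hits_by_position = defaultdict(int)         # stack position -> hit count
misses = 0

def access(addr):
    global misses
    line = addr // LINE
    s = sets[line % SETS]
    if line in s:
        pos = s.index(line)                 # 0 = MRU, WAYS-1 = LRU
        hits_by_position[pos] += 1
        s.remove(line)
    else:
        misses += 1
        if len(s) == WAYS:
            s.pop()                         # evict the LRU line
    s.insert(0, line)                       # accessed line becomes MRU

# Synthetic trace with strong locality: mostly re-references of recent addresses.
random.seed(0)
recent, trace = [0], []
for _ in range(100_000):
    if random.random() < 0.9:
        trace.append(random.choice(recent[-8:]))
    else:
        a = random.randrange(0, 1 << 20)
        recent.append(a)
        trace.append(a)

for a in trace:
    access(a)

total = sum(hits_by_position.values()) + misses
for pos in range(WAYS):
    print(f"stack position {pos}: {hits_by_position[pos] / total:.3f} of accesses")
print(f"misses: {misses / total:.3f}")
```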

138 citations


Journal ArticleDOI
17 May 1988
TL;DR: This paper presents a series of simulations that explore the interactions between various organizational decisions and program execution time, and investigates the tradeoffs between cache size and CPU/cache cycle time, between set associativity and cycle time, and between block size and main memory speed.
Abstract: Cache memories have become common across a wide range of computer implementations. To date, most analyses of cache performance have concentrated on time independent metrics, such as miss rate and traffic ratio. This paper presents a series of simulations that explore the interactions between various organizational decisions and program execution time. We investigate the tradeoffs between cache size and CPU/Cache cycle time, set associativity and cycle time, and between block size and main memory speed. The results indicate that neither cycle time nor cache size dominates the other across the entire design space. For common implementation technologies, performance is maximized when the size is increased to the 32KB to 128KB range with modest penalties to the cycle time. If set associativity impacts the cycle time by more than a few nanoseconds, it increases overall execution time. Since the block size and memory transfer rate combine to affect the cache miss penalty, the optimum block size is substantially smaller than that which minimizes the miss rate. Finally, the interdependence between optimal cache configuration and the main memory speed necessitates multi-level cache hierarchies for high performance uniprocessors.
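
The trade-offs discussed above can be pictured with a first-order execution-time model; the sketch below uses invented miss rates, cycle times, and penalties purely as placeholders, not figures from the paper.

```python
# First-order model: time per instruction =
#   cycle_time * (base_cpi + refs_per_instr * miss_rate * miss_penalty_cycles)
def exec_time_per_instr(cycle_ns, miss_rate, miss_penalty_cycles,
                        base_cpi=1.5, refs_per_instr=1.3):
    return cycle_ns * (base_cpi + refs_per_instr * miss_rate * miss_penalty_cycles)

# Hypothetical design points: a bigger cache lowers the miss rate but
# stretches the cycle time slightly.
designs = [
    ("16KB, fast cycle",  10.0, 0.070),
    ("64KB, +1ns cycle",  11.0, 0.040),
    ("256KB, +3ns cycle", 13.0, 0.025),
]
for name, cycle_ns, miss_rate in designs:
    t = exec_time_per_instr(cycle_ns, miss_rate, miss_penalty_cycles=12)
    print(f"{name:20s}  {t:.1f} ns per instruction")
```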

134 citations


Patent
11 Jan 1988
TL;DR: In this paper, a vector processor is added to a digital computing system including a scalar processor, a virtual address translation buffer, a main memory and a cache, and a prefetch request is issued for each block containing an element of a vector being loaded from memory.
Abstract: A main memory and cache suitable for scalar processing are used in connection with a vector processor by issuing prefetch requests in response to the recognition of a vector load instruction. A respective prefetch request is issued for each block containing an element of the vector to be loaded from memory. In response to a prefetch request, the cache is checked for a "miss" and, if the cache does not include the required block, a refill request is sent to the main memory. The main memory is configured into a plurality of banks and has a capability of processing multiple references. Therefore, the different banks can be referenced simultaneously to prefetch multiple blocks of vector data. Preferably, a cache bypass is provided to transmit data directly to the vector processor as the data from the main memory are being stored in the cache. In a preferred embodiment, a vector processor is added to a digital computing system including a scalar processor, a virtual address translation buffer, a main memory and a cache. The scalar processor includes a microcode interpreter which sends a vector load command to the vector processing unit and which also generates vector prefetch requests. The addresses for the data blocks to be prefetched are computed based upon the vector address, the length of the vector and the "stride" or spacing between the addresses of the elements of the vector.
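
The prefetch-address computation described in the abstract amounts to enumerating the cache blocks touched by a strided vector; the following sketch (function and parameter names are invented, and it is not the patent's microcode) generates one block-level request per distinct block.

```python
def vector_prefetch_blocks(base_addr, stride_bytes, length, block_size=64):
    """Return the distinct cache-block addresses covered by a strided vector,
    in the order first touched, so one prefetch (refill) request can be issued
    per block rather than per element."""
    blocks, seen = [], set()
    for i in range(length):
        element_addr = base_addr + i * stride_bytes
        block = (element_addr // block_size) * block_size
        if block not in seen:
            seen.add(block)
            blocks.append(block)
    return blocks

# Example: 100 eight-byte elements spaced 24 bytes apart.
reqs = vector_prefetch_blocks(base_addr=0x1000, stride_bytes=24, length=100)
print(f"{len(reqs)} prefetch requests, first few: {[hex(b) for b in reqs[:4]]}")
```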

119 citations


Patent
02 Sep 1988
TL;DR: In this paper, a page-mapped I/O cache structure is proposed to reduce cache coherency in a multiprocessing system, which ensures that every access to a line of data is the most up-to-date copy of that line without storing cache-coherency status bits in a global memory and any reference thereto.
Abstract: A multiprocessing system includes a cache coherency technique that ensures that every access to a line of data obtains the most up-to-date copy of that line without storing cache-coherency status bits in a global memory or any reference thereto. An operand cache includes a first directory which directly, on a one-to-one basis, maps a range of physical address bits into a first section of the operand cache storage. An associative directory multiply maps physical addresses outside of the range into a second section of the operand cache storage section. All stack frames of user programs to be executed on a time-shared basis are stored in the first section, so cache misses due to stack operations are avoided. An instruction cache having various categories of instructions stores a group of status bits identifying the instruction category with each instruction. When a context switch occurs, only instructions of the category least likely to be used in the near future are cleared, decreasing delays due to clearing of the instruction cache as a result of context switches. A page-mapped I/O cache structure interfaces with a large number of I/O channels, each of which regards a single I/O cache as an exclusive buffer. System operating delays due to maintaining cache coherency, operand cache misses, instruction cache misses, and I/O cache misses are substantially reduced.

118 citations


Proceedings ArticleDOI
27 Mar 1988
TL;DR: A simple, conservative analysis of the performance results of simulated caches for a gateway at MIT shows that current gateway routing-table lookup time could be reduced by up to 65%.
Abstract: A way to increase gateway throughput is to reduce the routing-table lookup time per packet. A routing-table cache can be used to reduce the average lookup time per packet, and the purpose of this study is to determine the best management policies for this cache as well as its measured performance. The performance results of simulated caches for a gateway at MIT are presented. These results include the probability of reference versus previous access time, cache hit ratios, and the number of packets between cache misses. A simple, conservative analysis using the presented measurements shows that current gateway routing-table lookup time could be reduced by up to 65%.
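
A routing-table cache of the kind measured in this study can be approximated with a small LRU simulation; the destination trace, cache size, and class names below are invented for illustration and are not taken from the paper's measurements.

```python
from collections import OrderedDict
import random

class RouteCache:
    """LRU cache of destination -> next-hop lookups."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, dest, full_table):
        if dest in self.entries:
            self.hits += 1
            self.entries.move_to_end(dest)      # refresh LRU position
            return self.entries[dest]
        self.misses += 1
        next_hop = full_table(dest)             # slow full routing-table lookup
        self.entries[dest] = next_hop
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        return next_hop

# Synthetic packet trace: a few hundred destinations with heavily skewed reuse.
random.seed(1)
popular = [f"net{i}" for i in range(20)]
others = [f"net{i}" for i in range(20, 400)]
trace = [random.choice(popular) if random.random() < 0.8 else random.choice(others)
         for _ in range(50_000)]

cache = RouteCache(capacity=64)
for dest in trace:
    cache.lookup(dest, full_table=lambda d: f"gw-for-{d}")
print(f"hit ratio: {cache.hits / (cache.hits + cache.misses):.2%}")
```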

118 citations


Patent
27 May 1988
TL;DR: In this paper, a message-driven concurrent computer system stores incoming messages in a row buffer and then in a queue in main memory, and output from the cache is through a set of comparators.
Abstract: A message-driven concurrent computer system stores incoming messages in a row buffer and then in a queue in main memory. A translator cache is also located in main memory, and output from the cache is through a set of comparators. Both the queue and cache are addressed in a wraparound fashion by hardware. An instruction buffer holds an entire row of instructions from memory. Translate, suspend and send instructions are available to the user. Tags provide for synchronization when objects are retrieved from remote processors and identify addresses as being physical addresses of a local processor or a node address of a remote processor.

102 citations


Proceedings ArticleDOI
01 Oct 1988
TL;DR: A method is described for using data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes; the estimates take the form of a family of “reference” windows for each variable that reflects the current set of elements that should be kept in cache.
Abstract: In this paper we describe a method for using data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes. The estimates take the form of a family of “reference” windows for each variable that reflects the current set of elements that should be kept in cache. It is shown that, in important special cases, we can estimate the size of the window and predict a lower bound on the number of cache hits. If the machine has local memory or cache that can be managed by the compiler, these estimates can be used to guide the management of this resource. It is also shown that these estimates can be used to guide program transformations in an attempt to optimize cache performance.

Journal ArticleDOI
17 May 1988
TL;DR: This work proposes a new solution that offers the fast operation of the indiscriminate invalidation approach yet can selectively invalidate cache items without extensive run-time book-keeping and checking; it relies on combining compile-time reference tagging with individual invalidation of potentially stale cache lines only when they are referenced.
Abstract: Software-assisted cache coherence enforcement schemes for large multiprocessor systems with shared global memory and interconnection network have gained increasing attention. Proposed software-assisted approaches rely on either indiscriminate invalidation or selective invalidation to invalidate stale cache lines. The indiscriminate approach combined with advanced memory hardware can quickly invalidate the entire cache but may result in lower hit ratios. The selective approach may achieve a better hit ratio. However, sequential selection and invalidation of cache or TLB entries is time consuming. We propose a new solution that offers the fast operation of the indiscriminate invalidation approach and can selectively invalidate cache items without extensive run-time book-keeping and checking. The solution relies on the combination of compile-time reference tagging and individual invalidation of potentially stale cache lines only when referenced. Performance improvement over an indiscriminate invalidation approach is presented.
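
One way to picture the "invalidate only when referenced" idea is a per-line epoch check, sketched below. This is a loose software analogue under invented names: it captures only the check-at-reference-time aspect (no sequential sweep of cache or TLB entries), while the paper's compile-time reference tags, which restrict re-validation to potentially stale lines, are not modeled.

```python
class Cache:
    """Toy cache where 'invalidation' is a single epoch-counter bump; a line is
    treated as stale only when it is referenced with an old epoch, so nothing
    needs to be swept sequentially at invalidation time."""
    def __init__(self):
        self.lines = {}          # address -> (data, epoch_when_filled)
        self.epoch = 0

    def invalidate_all_potentially_stale(self):
        # Fast, indiscriminate-style operation: O(1), no per-line work.
        self.epoch += 1

    def read(self, addr, memory):
        entry = self.lines.get(addr)
        if entry is not None and entry[1] == self.epoch:
            return entry[0]                      # hit on an up-to-date line
        data = memory[addr]                      # stale or absent: refetch
        self.lines[addr] = (data, self.epoch)
        return data

memory = {0x10: "old"}
c = Cache()
print(c.read(0x10, memory))                      # miss, fills the line
memory[0x10] = "new"                             # another processor writes
c.invalidate_all_potentially_stale()             # e.g. at a synchronization point
print(c.read(0x10, memory))                      # line is re-validated only when referenced
```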

Patent
Lishing Liu
10 Jun 1988
TL;DR: In this paper, a simple sequential prefetching algorithm based on simple histories is proposed, where each memory line in cache memory is associated with a bit in an S-vector, which is called the S-bit for the line.
Abstract: A computer memory management method for cache memory (10) uses a deconfirmation technique to provide a simple sequential prefetching algorithm. Access sequentiality is predicted based on simple histories. Each memory line in cache memory is associated with a bit in an S-vector (20), which is called the S-bit for the line. When the S-bit is on, sequentiality is predicted, meaning that the sequentially next line is regarded as a good candidate for prefetching if that line is not already in the cache memory. The key to the operation of the memory management method is the manipulation (turning on and off) of the S-bits.
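
A minimal software sketch of the S-bit mechanism follows, under assumptions of my own (a simple confirmation/deconfirmation rule and unbounded cache capacity); the patent describes hardware manipulation of the bits, not this code.

```python
class SequentialPrefetcher:
    """Per-line S-bit: when line i is accessed and its S-bit is on, line i+1 is
    prefetched if absent. The bit is deconfirmed (turned off) when the
    prediction proves useless, and turned back on when sequential use is seen."""
    def __init__(self):
        self.cache = set()        # resident line numbers
        self.s_bit = {}           # line number -> predicted-sequential flag
        self.prefetched = set()   # lines brought in by prefetch, not yet used

    def access(self, line):
        if line in self.prefetched:
            self.prefetched.discard(line)
            self.s_bit[line - 1] = True       # confirmation: the prediction was useful
        self.cache.add(line)
        if self.s_bit.get(line, True):        # default: predict sequentiality
            nxt = line + 1
            if nxt not in self.cache:
                self.cache.add(nxt)           # issue a prefetch for the next line
                self.prefetched.add(nxt)

    def evict(self, line):
        self.cache.discard(line)
        if line in self.prefetched:           # prefetched but never used:
            self.prefetched.discard(line)
            self.s_bit[line - 1] = False      # deconfirm the predictor

p = SequentialPrefetcher()
for line in (100, 101, 102, 500):             # a sequential run, then a jump
    p.access(line)
print(sorted(p.cache))
```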


Patent
01 Apr 1988
TL;DR: In this paper, the authors present a memory system that determines which blocks of a set of associative blocks in cache memory are unavailable for replacement by maintaining a duplicate set of tags which track block ownership for this cache pursuant to a "snoopy" protocol.
Abstract: This invention is directed to a memory system that determines which blocks of a set of associative blocks in cache memory are unavailable for replacement. This is accomplished by operating the memory system to maintain a duplicate set of tags which track block ownership for this cache pursuant to a "snoopy" protocol. In addition, the cache system maintains a bit associated with each memory address to indicate whether any data blocks resident in it have been locked. The interlock status of the data blocks in the cache is not communicated to the memory system. Once a block is locked, it cannot be allocated for replacement until it is unlocked. When the cache system encounters a locked block, it skips over that block and allocates the next block of the associative blocks. From this, the memory system infers, by means of a replacement algorithm, that the block is locked and, therefore, cannot be replaced. This enables the memory system to implement an irregular replacement policy for this cache when the block to be replaced is owned and locked.
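
The replacement behavior described above reduces to victim selection that skips locked ways; the sketch below is an illustrative software analogue with invented data structures, not the patented hardware mechanism.

```python
def choose_victim(ways, lru_order):
    """Pick a replacement victim among the ways of one set, skipping any way
    whose block is locked; returns the least recently used unlocked way, or
    None if every way is locked (an invented fallback for this sketch)."""
    for way in lru_order:                  # lru_order: way indices, LRU first
        if not ways[way]["locked"]:
            return way
    return None

ways = [
    {"tag": 0x1A, "locked": True},         # interlocked: must not be replaced
    {"tag": 0x2B, "locked": False},
    {"tag": 0x3C, "locked": False},
    {"tag": 0x4D, "locked": True},
]
print(choose_victim(ways, lru_order=[0, 3, 1, 2]))   # skips ways 0 and 3 -> 1
```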

Patent
22 Feb 1988
TL;DR: In this article, an on-chip VLSI cache architecture including a single-port, last-select, cache array organized as an n-way set-associative cache (with n congruence classes) is presented.
Abstract: An on-chip VLSI cache architecture including a single-port, last-select, cache array organized as an n-way set-associative cache (having n congruence classes) including a plurality of functionally integrated units on-chip in addition to the cache array and including a normal read/write CPU access function which provides an architectural organization for allowing the chip to be used in (1) a fast, "late-select" operation which may be provided with any desired degree of set-associativity while achieving an effective one-cycle write operation, and (2) a cache reload function which provides a highly parallel store-back and reload operation to substantially reduce the reload time, particularly for a store-in cache organization. The cache chip organization and architecture provide a late-select cache having a nearly transparent, multiple word reload by incorporating a Cache-Reload Buffer, a store-back buffer and a load-through function all included on the cache array chip for reloading, and a delayed write-enable for achieving an effective one-cycle write operation. Two separate decoder functions are integrated on the chip, one for cache access for normal read/write operations to and from the CPU and one for cache reload which also provides interim access to data which has been transferred out of main memory to the chip but not yet reloaded into the cache array. These two decoders provide for different accessing modes as required of the CPU or main memory operations.

Proceedings ArticleDOI
01 Jun 1988
TL;DR: A solution to the cache coherence problem specifically for shared-bus multiprocessors that adapts dynamically to the reference pattern is presented, and a hierarchical protocol for large systems organized around a hierarchy of buses, one of the first solutions of its kind, is derived as an extension of this adaptive shared-bus approach.
Abstract: This paper explores the architecture of high-performance large scale multiprocessors using private caches for each processor. The caches reduce the average memory access time, but they also result in the well known cache coherence problem. Multiple copies of each memory location are allowed to exist but they must be kept consistent with each other. In this paper, we present a solution to the cache coherence problem specifically for shared bus multiprocessors that adapts dynamically to the reference pattern. Simulation results are presented that demonstrate the high level of performance relative to other protocols, particularly during intervals with high levels of sharing. The paper then presents a coherence solution for large multiprocessor systems organized around a hierarchy of buses. One of the first solutions of this kind, the hierarchical protocol is an extension of the adaptive shared bus approach described in this paper.

Patent
12 Apr 1988
TL;DR: A vector processing computer (20) as mentioned in this paper includes a memory control unit (22), main memory (99), a central processor (156), a service processing unit (42), and a plurality of input/output processors (54, 68).
Abstract: A vector processing computer (20) includes a memory control unit (22), main memory (99), a central processor (156), a service processing unit (42) and a plurality of input/output processors (54, 68). The central processor (156) includes a physical cache unit (100), an address translation unit (118), an instruction processing unit (126), an address scalar unit (142), a vector control unit (144), an odd pipe vector processing unit (148) and an even pipe vector processing unit (150). Vector elements are transmitted from memory, either main memory (99), a physical cache unit (100) or a logical cache (326) through a source bus (114) where the elements are alternately loaded into the vector processing units (148, 150). The resulting vectors are transmitted through a destination bus (114) to either the physical cache unit (100), the main memory (99), the logical cache (326) or to an input/output processor (54). In a still further aspect of the computer (20) there is included the logical data cache (326) which stores data at logical addresses such that the central processor (156) can store and retrieve data without the necessity of first making a translation from logical to physical address.

Patent
25 Jul 1988
TL;DR: In this paper, the authors propose a load/store pipeline in a computer processor for loading data to registers and storing data from the registers has a cache memory within the pipeline for storing data.
Abstract: A load/store pipeline in a computer processor for loading data to registers and storing data from the registers has a cache memory within the pipeline for storing data. The pipeline includes buffers which support multiple outstanding read request misses. Data from out of the pipeline is obtained independently of the operation of the pipeline, this data corresponding to the request misses. The cache memory can then be filled with the data that has been requested. The provision of a cache memory within the pipeline, and the buffers for supporting the cache memory, speed up loading operations for the computer processor.

Journal ArticleDOI
17 May 1988
TL;DR: Second level caches are shown to be particularly effective when used behind small on-chip caches; adding an 8K second-level to a 1K first-level increases performance by 26 percent, assuming similar parameters.
Abstract: We report on a trace-driven simulation study to examine the effect of a two-level cache hierarchy in uniprocessors. A simulation model of a multiple-cycle-per-instruction processor was constructed to estimate the total cycles required to execute a synthetic benchmark. Results show that a second-level cache can be used to increase system performance when main memory access times are large relative to CPU cycle time. For example, the addition of a 4-cycle, 64K second-level cache following a 1-cycle, 8K first-level cache increases performance by 15 percent when used in a system with a 15-cycle primary memory. Second level caches are shown to be particularly effective when used behind small on-chip caches; adding an 8K second-level to a 1K first-level increases performance by 26 percent, assuming similar parameters. We also evaluate the performance impact of different write strategies and separate I and D caches.
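
The kind of benefit reported here can be approximated with a simple average-cycles-per-reference model. The latencies below echo the example quoted in the abstract (1-cycle first level, 4-cycle second level, 15-cycle primary memory), but the miss rates are invented placeholders rather than results from the simulations.

```python
def cycles_per_reference(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_cycles):
    """Average cycles per memory reference for a two-level hierarchy:
    every reference pays the L1 hit time; L1 misses add the L2 hit time;
    misses in both levels add the main-memory access time."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_cycles)

# Cycle counts follow the abstract's example; miss rates are placeholders.
without_l2 = cycles_per_reference(1, 0.06, 0, 1.0, 15)
with_l2    = cycles_per_reference(1, 0.06, 4, 0.30, 15)
print(f"no L2:   {without_l2:.2f} cycles/reference")
print(f"with L2: {with_l2:.2f} cycles/reference")
```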

Patent
23 Jun 1988
TL;DR: In this article, a high speed buffer store arrangement for use in a data processing system having multiple cache buffer storage units in a hierarchial arrangement is presented, which enables fast transfer of wide data blocks.
Abstract: A high speed buffer store arrangement for use in a data processing system having multiple cache buffer storage units in a hierarchical arrangement permits fast transfer of wide data blocks. On each cache chip, input and output latches are integrated, thus avoiding separate intermediate buffering. Input and output latches are interconnected by 64-byte wide data buses so that data blocks can be shifted rapidly from one cache hierarchy level to another and back. Chip-internal feedback connections from output to input latches allow data blocks to be selectively reentered into a cache after reading. An additional register array is provided so that data blocks can be furnished again after transfer from cache to main memory or CPU without accessing the respective cache. Wide data blocks can be transferred within one cycle, thus tying up caches much less in transfer operations, so that they have increased availability.

Patent
29 Jul 1988
TL;DR: In this article, a single chip cache address comparator with an on-chip static RAM for storing and checking the cache tags of an external cache memory is presented, which has a built-in incrementing counter which controls the burst fill of the internal cache of a 68020/68030 microprocessor from the associated external cache.
Abstract: A single chip cache address comparator with an on-chip static RAM for storing and checking the cache tags of an external cache memory. This cache address comparator has a built-in incrementing counter which controls the burst fill of the internal cache of a 68020/68030 microprocessor from the associated external cache memory within the required five processor clock cycles. Further, additional on-chip control logic is provided to control the 68020/68030 system buses to coordinate a burst fill operation.

Patent
09 May 1988
TL;DR: In this paper, when tie-breaker circuits detect conditions relating to a request which could result in cache incoherency, they initiate uninterrupted sequences of cycles within the corresponding cache main or duplicate directory to complete the processing of that same request.
Abstract: A multiprocessor data processing system includes a processing unit which, together with other processing units, including input/output units, connects in common to an asynchronous bus network for sharing a main memory. At least one processing unit includes a synchronous private write through cache memory system which includes a main directory and data store in addition to a bus watcher and a duplicate directory. The bus watcher connects to the asynchronous bus network and captures all main memory requests while the duplicate directory maintains a copy of the cache unit's main directory. Independently and autonomously synchronously operated tie-breaker circuits apply requests to the main and duplicate directories. When the tie-breaker circuits detect conditions relating to a request which could result in cache incoherency, they initiate uninterrupted sequences of cycles within the corresponding cache main or duplicate directory to complete the processing of that same request.

Patent
11 Oct 1988
TL;DR: A processor controlled interface between a processor, instruction cache, and main memory provides for simultaneously refilling the cache with an instruction block from main memory and processing the instructions in the block while they are being written to the cache.
Abstract: A processor controlled interface between a processor, instruction cache, and main memory provides for simultaneously refilling the cache with an instruction block from main memory and processing the instructions in the block while they are being written to the cache.

Patent
Gerald Parks Bozman
15 Sep 1988
TL;DR: A data cache in a computer operating system that dynamically adapts its size in response to competing demands for processor storage, and exploits the storage cooperatively with other operating system components is discussed in this article.
Abstract: A data cache in a computer operating system dynamically adapts its size in response to competing demands for processor storage, and exploits the storage cooperatively with other operating system components. An arbiter is used to determine the appropriate size of the cache based upon competing demands for memory. The arbiter is entered cyclically and samples users' wait states. The arbiter then makes a decision to decrease or increase the size of the cache in accordance with predetermined parameters.
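
A caricature of the arbiter's cyclic decision is sketched below; the thresholds, step size, and sampling interface are invented, not the patented parameters.

```python
def arbiter_step(cache_size, sampled_wait_ratio, cache_hit_ratio,
                 step=256, min_size=512, max_size=16384):
    """One cyclic arbiter decision (sizes in pages, thresholds invented):
    shrink the data cache when users are frequently waiting for storage,
    grow it when storage pressure is low and the cache is earning its keep."""
    if sampled_wait_ratio > 0.10:                 # memory pressure: give pages back
        cache_size = max(min_size, cache_size - step)
    elif sampled_wait_ratio < 0.02 and cache_hit_ratio > 0.60:
        cache_size = min(max_size, cache_size + step)
    return cache_size

size = 4096
for wait, hit in [(0.01, 0.8), (0.01, 0.8), (0.15, 0.7), (0.15, 0.7)]:
    size = arbiter_step(size, wait, hit)
    print(size)
```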

Patent
Takayuki Watanabe
09 Feb 1988
TL;DR: In this paper, a cache memory is provided with a dual ported storage section so as to be independently accessible by a processor allocated to the cache memory and by another cache memory, in order to increase a multiprocessing speed.
Abstract: In order to increase a multiprocessing speed, a cache memory is provided with a dual ported storage section so as to be independently accessible by a processor allocated to the cache memory and by another cache memory. The dual ported storage section saves tag addresses and valid tag address information. Each of the tag addresses corresponds to data stored in a data storage section which forms part of the cache. One of two comparators coupled to the dual ported storage section checks to see if an address updated by another cache is in the cache. When this happens, the valid tag address information of the address is invalidated.

Patent
17 Oct 1988
TL;DR: A multi-cache data storage system has a number of cache units and a main memory as discussed by the authors, where each cache continuously monitors the bus for updates from other caches and checks whether it holds a data item corresponding to the physical address.
Abstract: A multi-cache data storage system has a number of cache units and a main memory. The caches are addressed by a virtual address. When data is updated in one of the caches, the virtual address is translated into a physical address and sent to the main memory over a bus, along with the updated data value. Each cache continuously monitors the bus for updates from other caches and checks whether it holds a data item corresponding to the physical address. If so, the data item is updated or invalidated, so as to ensure cache coherency.
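
The monitoring behavior described above can be sketched as a toy snooping cache that indexes its entries by both virtual and physical address; the class layout and the update-versus-invalidate switch are illustrative assumptions, not the patent's implementation.

```python
class SnoopingCache:
    """Toy virtually addressed cache that also records the physical address of
    each entry so it can match update traffic snooped from the shared bus.
    Whether a matching entry is updated or invalidated is a parameter here,
    mirroring the two options mentioned in the abstract."""
    def __init__(self, invalidate_on_snoop=True):
        self.by_virtual = {}          # virtual address -> entry
        self.by_physical = {}         # physical address -> same entry object
        self.invalidate_on_snoop = invalidate_on_snoop

    def fill(self, vaddr, paddr, value):
        entry = {"phys": paddr, "value": value, "valid": True}
        self.by_virtual[vaddr] = entry
        self.by_physical[paddr] = entry

    def snoop(self, paddr, new_value):
        entry = self.by_physical.get(paddr)
        if entry is None or not entry["valid"]:
            return                            # no copy held: nothing to do
        if self.invalidate_on_snoop:
            entry["valid"] = False            # next access will refetch
        else:
            entry["value"] = new_value        # update the copy in place

caches = [SnoopingCache(), SnoopingCache(invalidate_on_snoop=False)]
for c in caches:
    c.fill(vaddr=0x4000, paddr=0x9000, value=1)

# An update in some other cache appears on the bus as (physical address, value):
for c in caches:
    c.snoop(paddr=0x9000, new_value=2)

print(caches[0].by_virtual[0x4000]["valid"])   # False: entry invalidated
print(caches[1].by_virtual[0x4000]["value"])   # 2: entry updated in place
```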

Patent
27 Dec 1988
TL;DR: In this paper, look-ahead logic is provided to generate the next address of the memory location which is to be written into or read from, with simultaneous transfer of information to or from the cache register which is connected to the memory ports.
Abstract: A memory having at least a pair of cache registers between the data input and output ports and the read and write ports of a memory matrix and controls to alternate the interconnection between the cache registers and the data input/output ports and the read/write terminals of the memory matrix, such that while one cache register is connected to the data input/output port, the other cache register is connected to a port of the memory. Look-ahead logic is provided to generate the next address of the memory location which is to be written into or read from with simultaneous transfer of information to or from the cache register which is connected to the memory ports.

Patent
Edward O. K. Kam
30 Nov 1988
TL;DR: In this paper, a method for minimizing cache misses in a compiled computer program having loop instructions is presented, where the set of compiled loop instructions may straddle two blocks of main memory, which would cause cache misses when the program is executed.
Abstract: A method is provided for minimizing cache misses in a compiled computer program having loop instructions. The compiled computer program is examined to identify a set of compiled loop instructions which is smaller than a cache memory block. The set of compiled loop instructions may straddle two blocks of main memory, which would cause cache misses when the program is executed. The identified set of compiled loop instructions is therefore positioned to fall entirely within the boundaries of a block of main memory so that cache misses are avoided when the set of compiled loop instructions is executed. Loop-invariant instructions are removed from the set of compiled loop instructions. When blocks of the main memory unit are mapped into the cache memory in a set-associative manner, external-call locations are mapped into different rows of the main memory than the corresponding loop instructions. As a result, when blocks of main memory are transferred to the cache memory unit, cache misses are avoided.
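
The straddle check and repositioning described above can be sketched as follows; the block size, function names, and padding strategy are invented for illustration and do not reproduce the patented compiler method.

```python
BLOCK = 64  # hypothetical cache/main-memory block size in bytes

def straddles_block(start, size, block=BLOCK):
    """True if code occupying [start, start+size) crosses a block boundary."""
    return (start // block) != ((start + size - 1) // block)

def reposition_loop(start, size, block=BLOCK):
    """If a compiled loop body smaller than one block straddles a boundary,
    move it forward to the next block boundary (e.g. by padding with no-ops)
    so the whole loop lands in a single block and repeated iterations cannot
    keep missing on two different blocks."""
    if size <= block and straddles_block(start, size, block):
        return ((start // block) + 1) * block
    return start

loop_start, loop_size = 0x103C, 24                   # 24-byte loop near a boundary
print(hex(reposition_loop(loop_start, loop_size)))   # -> 0x1040, block-aligned
```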

Journal ArticleDOI
TL;DR: This study indicates that lessons from mainframe and minicomputer design practice should be critically examined to benefit the design of microprocessors.
Abstract: Design trade-offs for integrated microprocessor caches are examined. A model of cache utilization is introduced to evaluate the effects on cache performance of varying the block size. By considering the overhead cost of storing address tags and replacement information along with data, it is found that large block sizes lead to more cost-effective cache designs than predicted by previous studies. When the overhead cost is high, caches that fetch only partial blocks on a miss perform better than similar caches that fetch entire blocks. This study indicates that lessons from mainframe and minicomputer design practice should be critically examined to benefit the design of microprocessors.
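
The overhead-cost argument can be illustrated with a toy model: for a fixed on-chip bit budget, larger blocks spend proportionally fewer bits on tags and replacement state, so more of the budget holds data. The tag and state widths and the bit budget below are invented, not figures from the study.

```python
def usable_data_bytes(total_bits, block_bytes, tag_bits=20, state_bits=4):
    """For a fixed on-chip bit budget, estimate how many bytes of data fit once
    every block also pays for its address tag and replacement/state bits.
    Larger blocks amortize that overhead over more data (figures invented)."""
    bits_per_block = block_bytes * 8 + tag_bits + state_bits
    return (total_bits // bits_per_block) * block_bytes

budget = 8 * 1024 * 8 + 6000          # roughly an "8KB" silicon budget, in bits
for block in (4, 8, 16, 32, 64):
    print(f"{block:3d}-byte blocks: {usable_data_bytes(budget, block)} data bytes")
```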