
Showing papers on "Cache invalidation published in 1985"


Journal ArticleDOI
01 Jun 1985
TL;DR: The protocol and its VLSI realization are described in some detail, to emphasize the important implementation issues, in particular, the controller critical sections and the inter- and intra-cache interlocks needed to maintain cache consistency.
Abstract: We present an ownership-based multiprocessor cache consistency protocol, designed for implementation by a single chip VLSI cache controller. The protocol and its VLSI realization are described in some detail, to emphasize the important implementation issues, in particular, the controller critical sections and the inter- and intra-cache interlocks needed to maintain cache consistency. The design has been carried through to layout in a P-Well CMOS technology.

286 citations
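The abstract above describes an ownership-based, write-invalidate consistency scheme. As a rough illustration of the general idea (not the paper's actual protocol or its VLSI controller), the Python sketch below models lines that are invalid or shared in each cache, with at most one cache owning a line; the state names, the Bus class, and its serialization of requests are illustrative assumptions.

```python
# A minimal sketch (not the paper's exact protocol) of an ownership-based,
# write-invalidate consistency scheme: each line is INVALID or SHARED in a
# cache, and at most one cache OWNS a (possibly dirty) copy.

INVALID, SHARED, OWNED = "I", "S", "O"

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}          # address -> state

    def state(self, addr):
        return self.lines.get(addr, INVALID)

class Bus:
    """Serializes requests, standing in for the controller critical sections."""
    def __init__(self, caches):
        self.caches = caches

    def read(self, requester, addr):
        # An owner, if any, supplies the line and downgrades to SHARED.
        for c in self.caches:
            if c is not requester and c.state(addr) == OWNED:
                c.lines[addr] = SHARED
        requester.lines[addr] = SHARED

    def write(self, requester, addr):
        # The writer must obtain ownership; all other copies are invalidated.
        for c in self.caches:
            if c is not requester:
                c.lines.pop(addr, None)
        requester.lines[addr] = OWNED

if __name__ == "__main__":
    a, b = Cache("A"), Cache("B")
    bus = Bus([a, b])
    bus.read(a, 0x40); bus.read(b, 0x40)    # both caches hold the line SHARED
    bus.write(b, 0x40)                      # B takes ownership, A is invalidated
    print(a.state(0x40), b.state(0x40))     # -> I O
```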


Journal ArticleDOI
TL;DR: It is found that disk cache is a powerful means of extending the performance limits of high-end computer systems.
Abstract: The current trend of computer system technology is toward CPUs with rapidly increasing processing power and toward disk drives of rapidly increasing density, but with disk performance increasing very slowly if at all. The implication of these trends is that at some point the processing power of computer systems will be limited by the throughput of the input/output (I/O) system.A solution to this problem, which is described and evaluated in this paper, is disk cache. The idea is to buffer recently used portions of the disk address space in electronic storage. Empirically, it is shown that a large (e.g., 80-90 percent) fraction of all I/O requests are captured by a cache of an 8-Mbyte order-of-magnitude size for our workload sample. This paper considers a number of design parameters for such a cache (called cache disk or disk cache), including those that can be examined experimentally (cache location, cache size, migration algorithms, block sizes, etc.) and others (access time, bandwidth, multipathing, technology, consistency, error recovery, etc.) for which we have no relevant data or experiments. Consideration is given to both caches located in the I/O system, as with the storage controller, and those located in the CPU main memory. Experimental results are based on extensive trace-driven simulations using traces taken from three large IBM or IBM-compatible mainframe data processing installations. We find that disk cache is a powerful means of extending the performance limits of high-end computer systems.

233 citations
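To make the "capture ratio" idea above concrete, here is a toy trace-driven disk-cache model: recently used disk blocks sit in an LRU-managed buffer and the fraction of I/O requests it captures is measured. The block granularity, cache size, and synthetic trace are illustrative assumptions, not the paper's IBM mainframe traces.

```python
# A toy disk cache: recently used disk blocks are kept in an LRU-managed
# electronic buffer and the fraction of I/O requests it captures is reported.

import random
from collections import OrderedDict

def hit_ratio(trace, cache_blocks):
    cache = OrderedDict()               # block number -> True, in LRU order
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)    # refresh LRU position on a hit
        else:
            if len(cache) >= cache_blocks:
                cache.popitem(last=False)   # evict the least recently used block
            cache[block] = True
    return hits / len(trace)

if __name__ == "__main__":
    # Synthetic workload: 90% of references go to a small "hot" region.
    random.seed(1)
    trace = [random.randrange(200) if random.random() < 0.9 else random.randrange(100_000)
             for _ in range(50_000)]
    print(f"hit ratio: {hit_ratio(trace, cache_blocks=1024):.2%}")
```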


01 Jan 1985
TL;DR: The model shows that the majority of the cache misses that OPT avoids over LRU come from the most-recently-discarded lines of the LRU cache, which leads to three realizable near-optimal replacement algorithms that try to duplicate the replacement decisions made by OPT.
Abstract: This thesis describes a model used to analyze the replacement decisions made by LRU and OPT (Least-Recently-Used and an optimal replacement-algorithm). The model identifies a set of lines in the LRU cache that are dead, that is, lines that must leave the cache before they can be rereferenced. The model shows that the majority of the cache misses that OPT avoids over LRU come from the most-recently-discarded lines of the LRU cache. Also shown is that a very small set of lines account for the majority of the misses that OPT avoids over LRU. OPT requires perfect knowledge of the future and is not realizable, but our results lead to three realizable near-optimal replacement algorithms. These new algorithms try to duplicate the replacement decisions made by OPT. Simulation results, using a trace-tape and cache simulator, show that these new algorithms achieve up to eight percent fewer misses than LRU and obtain about 20 percent of the miss reduction that OPT obtains. Also presented in the thesis are two new trace-tape reduction techniques. Simulation results show that reductions in trace-tape length of two orders of magnitude are possible with little or no simulation error introduced.

133 citations
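The thesis compares LRU against the clairvoyant optimal policy OPT; the following sketch runs both on the same reference string so the gap in misses can be seen directly. The cache size and the looping trace are illustrative assumptions, and the realizable near-optimal algorithms themselves are not reproduced here.

```python
# LRU versus Belady's OPT on one reference string, both fully associative
# with `size` lines. OPT uses perfect knowledge of future references.

from collections import OrderedDict

def lru_misses(trace, size):
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)
            cache[x] = True
    return misses

def opt_misses(trace, size):
    misses, cache = 0, set()
    for i, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) >= size:
            # Evict the line whose next use is farthest in the future (or never).
            def next_use(line):
                try:
                    return trace.index(line, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return misses

if __name__ == "__main__":
    trace = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4] * 50
    print("LRU misses:", lru_misses(trace, 4))
    print("OPT misses:", opt_misses(trace, 4))
```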


Journal ArticleDOI
01 Jun 1985
TL;DR: In this article, the authors present measurements from a very wide variety of traces: there are 49 traces, taken from 6 machine architectures (370, 360, VAX, M68000, Z8000, CDC 6400), coded in 7 source languages.
Abstract: The selection of the "best" parameters for a cache design, such as size, mapping algorithm, fetch algorithm, line size, etc., is dependent on the expected workload. Similarly, the machine performance is sensitive to the cache performance, which itself depends on the workload. Most cache designers have been greatly handicapped in their designs by the lack of realistic cache performance estimates. Published research generally presents data which is unrealistic in some respects, and available traces are often not representative. In this paper, we present measurements from a very wide variety of traces: there are 49 traces, taken from 6 machine architectures (370, 360, VAX, M68000, Z8000, CDC 6400), coded in 7 source languages. Statistics are shown for miss ratios, the effectiveness of prefetching in terms of both miss ratio and its effect on bus traffic, the frequency of writes, reads and instruction fetches, and the frequency of branches. Some general observations are made and a "design estimate" set of miss ratios is proposed. Some "fudge" factors are proposed by which statistics for workloads for one machine architecture can be used to estimate corresponding parameters for another (as yet unrealized) architecture. 1. Introduction Almost all medium and high performance machines and most high performance microprocessors now being designed will include cache memories to be used for instructions, for data or for both. There are a number of choices to be made regarding the cache, including size, line size (block size), mapping algorithm, replacement algorithm, writeback algorithm, split (instructions/data) vs. unified, fetch algorithm, et cetera; see [Smit82] for a detailed discussion of these issues. Making the "best" choices and selecting the "best" parameters (with respect to cost and performance) depends greatly on the workload to be expected [Macd84]. For example, a cache which achieves a 99% hit ratio may cost 80% more than one which achieves 98%, may increase the CPU cost by 25% and may only boost overall CPU performance by 8%; that suggests that the higher performing cache is not cost effective. However, if the same two designs yield hit ratios of 90% and 80% respectively, and if the performance increase would be 50%, then different conclusions might well be reached. Computer architects have been handicapped by the lack of generally available realistic cache workload estimates. While there are hundreds of published papers on cache memories (see [Smit82] for a partial bibliography), only a few present usable data. A large fraction contain no measurements at all. Almost all of the papers that do present measurements rely on trace driven simulation using a small set of traces, and for reasons explained further below, those traces are likely to be unrepresentative of the results to be expected in practice. There do exist some realistic numbers, as we note below, but they are hardly enough to constitute a design database. The purpose of this paper is to discuss and explain workload selection as it relates to cache memory design, and to present data from which the designer can work.
We have used 49 program address traces taken from 6 (or 5, if the 360 and 370 are the same) machine architectures (VAX, 370, 360/91, Z8000, CDC 6400, M68000), derived from 7 programming languages (Fortran, 370 Assembler, APL, C, LISP, AlgolW, Cobol) to compute overall, instruction and data miss ratios and bus traffic rates for various cache designs; these experiments show the variety of workload behavior possible. Characteristics of the traces are tabulated and the effects of some design choices are evaluated. Finally, we present what we consider to be a "reasonable" set of numbers with which we believe designers can comfortably work. In that discussion, we also suggest some "fudge" factors, which indicate how realistic (or available) numbers for machine architecture M1 under workload conditions W1 can be used to estimate similar parameters for architecture M2 under workload W2. In the remainder of this section, we discuss additional background for our measurement results. First we consider the advantages and disadvantages of trace driven simulation. Then we review some (possible) cases of performance misprediction and also discuss some published and valid miss ratio figures. The second section discusses the traces used. The measurement results and analysis are in section 3, and in section 4 we propose target workload values and factors by which one workload can be used to estimate another. Section 5 summarizes our findings. 1.1. Trace Driven Simulation A program address trace is a trace of the sequence of (virtual) addresses accessed by a computer program or programs. Trace driven simulation involves driving a simulation model of a system with a trace of external stimuli rather than with a random number generator. Trace driven simulation is a very good way to study many aspects of cache design and performance, for a number of reasons. First, it is superior to either pure mathematical models or random number driven simulation because there do not currently exist any generally accepted or believable models for those characteristics of program behavior that determine cache performance; thus it is not possible to specify a realistic model nor to drive a simulator with a good representation of a program. A trace properly represents at least one real program, and in certain respects can be expected to drive the simulator correctly. It is important to note that a trace reflects not only the program traced and the functional architecture of the machine (instruction set) but also the design architecture (higher level implementation). In particular, the number of memory references is affected by the width of the data path to memory: fetching two four-byte instructions requires 4, 2 or 1 memory references, depending on whether the memory interface is 2, 4 or 8 bytes wide. It also depends on how much "memory" the interface itself has; if one request is for 4 bytes, the next request is for the next four bytes, and the interface is 8 bytes wide, then fewer fetches will result if the interface "remembers" that it has the target four bytes of the second fetch rather than redoing the fetch. The interface can be quite complex, as with the ifetch buffer in the VAX 11/780 [Clar83], and can behave differently for instructions and data. (A trace should reflect, to the greatest possible extent, only the functional architecture; the design architecture should and usually can be emulated in the simulator.)
A simulator is also much better in many ways than the construction of prototype designs. It is far faster to build a simulator, and the design being simulated can be varied easily, sometimes by just changing an input parameter. Conversely, a hardware prototype can require man-years to build and can be varied little if at all. Also, the results of a live workload tend to yield slightly different results (e.g. 1% to 3%) from run to run, depending on the random setting of initial conditions such as the angular position of the disks [Curt75]. For the reasons given above, trace driven simulation has been used for almost every research paper which presents cache measurements, with a few exceptions discussed below. There are, however, several reasons why the results of trace driven simulations should be taken with a grain of salt. (1) A trace driven simulation of a million memory addresses, which is fairly long, represents about 1/30 of a second for a machine such as the IBM 3081, and only about one second for an M68000; thus a trace is only a very small sample of a real workload. (2) Traces seldom are taken from the "messiest" parts of large programs; more often they are traces of the initial portions of small programs. (3) It is very difficult to trace the operating system (OS) and few OS traces are available. On many machines, however, the OS dominates the workload. (4) Most real machines task switch every few thousand instructions and are constantly taking interrupts. It is difficult to include this effect accurately in a trace driven simulation and many simulators don't try. (5) The sequence of memory addresses presented to the cache can vary with hardware buffers such as prefetch buffers and loop buffers, and is certainly sensitive to the data path width. Thus the trace itself may not be completely accurate with respect to the implementation of the architecture. (6) In running machines, a certain (usually small) fraction of the cache activity is due to input/output; this effect is seldom included in trace driven simulations. In this paper we are primarily concerned with items 1-3 immediately above. By presenting the results of a very large number of simulations, one can get an idea of the range of program behavior. Included are two traces of IBM's MVS operating system, which should have performance that is close to the worst likely to be observed. 1.2. Real Workloads and Questionable Estimates There are only a small number of papers which provide measurements taken by hardware monitors from running machines. In [Mila75] it is reported that a 16K cache on an IBM 370/165-2 running VS2 had a 0.94 hit ratio, with 1.6 fetches per instruction and .22 stores/instruction; it is also found that 73% of the CPU cycles were used in supervisor state. Merrill [Merr74] found cache hit ratios for a 16K cache in the 370/168 of 0.932 to 0.997 for six applications programs, and also reports that the performance (MI

119 citations
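The paper's measurements come from trace-driven simulation; a minimal version of such a simulator is sketched below, assuming a set-associative, LRU-replaced cache driven by a synthetic address trace. The geometry (16 KB, 32-byte lines, 4-way) and the trace are illustrative assumptions, not the paper's configurations or workloads.

```python
# A minimal trace-driven cache simulator: a set-associative cache with LRU
# replacement is driven by a sequence of addresses and the miss ratio reported.

from collections import OrderedDict

def miss_ratio(trace, cache_bytes=16384, line_bytes=32, ways=4):
    n_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in trace:
        tag = addr // line_bytes
        s = sets[tag % n_sets]
        if tag in s:
            s.move_to_end(tag)          # LRU update on a hit
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)   # evict the LRU way of this set
            s[tag] = True
    return misses / len(trace)

if __name__ == "__main__":
    # Sequential instruction fetches followed by cyclic data references.
    trace = [0x1000 + 4 * i for i in range(20_000)]
    trace += [0x80000 + 64 * (i % 700) for i in range(20_000)]
    print(f"miss ratio: {miss_ratio(trace):.3f}")
```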


Journal ArticleDOI
TL;DR: Instruction cache replacement policies and organizations are analyzed both theoretically and experimentally, and it is concluded theoretically that random replacement is better than LRU and FIFO, and that under certain circumstances, a direct-mapped or set-associative cache may perform better than a fully associative cache organization.
Abstract: Instruction cache replacement policies and organizations are analyzed both theoretically and experimentally. Theoretical analyses are based on a new model for cache references, the loop model. First the loop model is used to study replacement policies and cache organizations. It is concluded theoretically that random replacement is better than LRU and FIFO, and that under certain circumstances, a direct-mapped or set-associative cache may perform better than a fully associative cache organization. Experimental results using instruction trace data are then given and analyzed. The experimental results indicate that the loop model provides a good explanation for observed cache performance.

77 citations
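The loop model's counter-intuitive conclusion (random replacement beating LRU and FIFO) is easy to reproduce: a loop slightly larger than a fully associative LRU cache misses on every reference after warm-up, while random replacement keeps part of the loop resident. The sizes below are illustrative assumptions.

```python
# Loop-model illustration: cycling over 68 lines in a 64-line fully
# associative cache makes LRU miss on everything, while random replacement
# retains some of the loop and misses far less often.

import random
from collections import OrderedDict

def lru(trace, size):
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)
            cache[x] = True
    return misses

def rand(trace, size):
    cache, misses = set(), 0
    for x in trace:
        if x not in cache:
            misses += 1
            if len(cache) >= size:
                cache.remove(random.choice(list(cache)))
            cache.add(x)
    return misses

if __name__ == "__main__":
    random.seed(0)
    loop = list(range(68)) * 200     # loop slightly larger than the cache
    print("LRU misses:   ", lru(loop, 64))
    print("random misses:", rand(loop, 64))
```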


Patent
31 Jul 1985
TL;DR: In this article, a cache hierarchy to be managed by a memory management unit (MMU) combines the advantages of logical and virtual address caches by providing a cache hierarchy having a logical address cache backed up by a virtual address cache.
Abstract: A cache hierarchy to be managed by a memory management unit (MMU) combines the advantages of logical and virtual address caches by providing a cache hierarchy having a logical address cache backed up by a virtual address cache to achieve the performance advantage of a large logical address cache, and the flexibility and efficient use of cache capacity of a large virtual address cache. A physically small logical address cache is combined with a large virtual address cache. The provision of a logical address cache enables reference count management to be done completely by the controller of the virtual address cache and the memory management processor in the MMU. Since the controller of the logical address cache is not involved in the overhead associated with reference counting, higher performance is accomplished as the CPU-MMU interface is released as soon as the access to the logical address cache is completed.

45 citations
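A rough sketch of the lookup path the patent describes: a small logical-address cache is probed first, and only on a miss is the larger virtual-address cache (and then memory) consulted, so the CPU-MMU interface can be released as soon as the logical cache answers. Class names, capacities, and the eviction rule are illustrative assumptions; reference-count management is omitted.

```python
# Two-level lookup: small, fast logical-address cache backed by a larger
# virtual-address cache, backed in turn by memory.

class SimpleCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        if len(self.data) >= self.capacity:
            self.data.pop(next(iter(self.data)))   # crude oldest-entry eviction
        self.data[key] = value

class CacheHierarchy:
    def __init__(self):
        self.logical = SimpleCache(64)      # small; a hit here releases the CPU-MMU interface
        self.virtual = SimpleCache(4096)    # large; would also handle reference counting

    def read(self, logical_addr, virtual_addr, memory):
        value = self.logical.get(logical_addr)
        if value is not None:
            return value                    # fast path: logical cache hit
        value = self.virtual.get(virtual_addr)
        if value is None:
            value = memory[virtual_addr]    # miss in both levels
            self.virtual.put(virtual_addr, value)
        self.logical.put(logical_addr, value)
        return value

if __name__ == "__main__":
    memory = {0x2000: "value"}
    h = CacheHierarchy()
    print(h.read(logical_addr=0x10, virtual_addr=0x2000, memory=memory))  # filled from memory
    print(h.read(logical_addr=0x10, virtual_addr=0x2000, memory=memory))  # logical-cache hit
```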


Journal ArticleDOI
TL;DR: It is shown that a cache of size h, applied optimally to a uniformly random sequence on an alphabet of size d, is able to avoid faults with probability of order h/d.

33 citations
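As a quick empirical companion (not the paper's analysis), the sketch below measures hit rates on uniformly random references: a cache of h items managed without lookahead (LRU here) hits with probability h/d, while the clairvoyant optimal policy avoids faults noticeably more often. Parameters and trace length are illustrative assumptions.

```python
# Hit rates of LRU and of the clairvoyant optimal policy (Belady's OPT)
# on a uniformly random reference string over d items with a cache of h items.

import random
from collections import OrderedDict

def lru_hit_rate(trace, h):
    cache, hits = OrderedDict(), 0
    for x in trace:
        if x in cache:
            hits += 1
            cache.move_to_end(x)
        else:
            if len(cache) >= h:
                cache.popitem(last=False)
            cache[x] = True
    return hits / len(trace)

def opt_hit_rate(trace, h):
    # Precompute, for each position, where the same item next occurs.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i
    cache, hits = {}, 0            # item -> position of its next use
    for i, x in enumerate(trace):
        if x in cache:
            hits += 1
        elif len(cache) >= h:
            victim = max(cache, key=cache.get)   # evict the item used farthest in the future
            del cache[victim]
        cache[x] = next_use[i]
    return hits / len(trace)

if __name__ == "__main__":
    random.seed(0)
    h, d = 16, 1024
    trace = [random.randrange(d) for _ in range(200_000)]
    print(f"h/d          = {h/d:.4f}")
    print(f"LRU hit rate = {lru_hit_rate(trace, h):.4f}")
    print(f"OPT hit rate = {opt_hit_rate(trace, h):.4f}")
```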


Patent
01 May 1985
TL;DR: In this paper, a cache memory control system has a segment descriptor with a 1-bit cache memory unit designation field, and a register for storing data representing the cache memory unit designation field.
Abstract: A cache memory control system has a segment descriptor with a 1-bit cache memory unit designation field, and a register for storing data representing the cache memory unit designation field. An output from the register is supplied to one cache memory unit, whereas inverted data of the output from the register is supplied to the other cache memory unit.

18 citations


Patent
21 May 1985
TL;DR: In this paper, the authors propose a data processing system comprising multiple cache buffer stores (17, 19) in a hierarchical arrangement, in which fast transfer of wide data blocks is enabled by particular cache configurations and cache interconnections.
Abstract: In a data processing system comprising multiple cache buffer stores (17, 19) in a hierarchical arrangement, fast transfer of wide data blocks is enabled by particular cache configurations and cache interconnections. On each cache chip, input and output latches (39, 45) are integrated, thus avoiding separate intermediate buffering. Input and output latches are interconnected by 64-byte wide data buses (B, A'; D, A") so that data blocks can be shifted rapidly from one cache hierarchy level to another and back. Chip-internal feedback connections from output to input latches allow data blocks to be selectively reentered into a cache after reading. An additional register array (47) is provided so that data blocks, after transfer from a cache to main memory or CPU, can subsequently be furnished again without accessing the respective cache. The disclosed system allows wide data blocks to be transferred within one cycle, thus tying up caches much less in transfer operations, so that their availability is increased.

15 citations


Patent
07 Nov 1985
TL;DR: In this paper, a new use for an LRU-managed cache coupled to the main memory of a CPU for sort string generation of m records, while minimizing the number of reference misses per record to said cache, is described.
Abstract: A new use for an LRU-managed cache coupled to the main memory of a CPU for sort string generation of m records while minimizing the number of reference misses per record to said cache is described. During a first pass, a partially nested ordering or sort is effectuated using the cache, and then during a second pass a replacement selection merge upon the nested order, constrained to fit within the cache, is brought about.

15 citations
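The second pass described above is a replacement-selection merge sized to the cache; the sketch below shows plain replacement-selection run (sort string) generation, with a heap standing in for the cache-resident working set whose runs average about twice the heap size on random input. The heap size and record count are illustrative assumptions, and the patent's first-pass partial ordering and LRU-specific layout are not modelled.

```python
# Replacement-selection run generation: a heap of at most heap_size records
# emits ascending runs ("sort strings"); records smaller than the last output
# are deferred to the next run.

import heapq
import itertools
import random

def replacement_selection(records, heap_size):
    """Yield sorted runs using a working set of at most heap_size records."""
    it = iter(records)
    heap = list(itertools.islice(it, heap_size))
    heapq.heapify(heap)
    current, deferred = [], []
    for rec in it:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        if rec >= smallest:
            heapq.heappush(heap, rec)       # still fits in the current run
        else:
            deferred.append(rec)            # must wait for the next run
        if not heap:
            yield current
            current, heap, deferred = [], deferred, []
            heapq.heapify(heap)
    # Drain the remaining records into final runs.
    current.extend(sorted(heap))
    yield current
    if deferred:
        yield sorted(deferred)

if __name__ == "__main__":
    random.seed(0)
    data = [random.randrange(10_000) for _ in range(1_000)]
    runs = list(replacement_selection(data, heap_size=100))
    print("runs:", len(runs), "average run length:", sum(map(len, runs)) / len(runs))
```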


Patent
08 Feb 1985
TL;DR: In this paper, the cache coherence system detects when the contents of storage locations in the cache memories of one or more of the data processors have been modified in conjunction with the activity of those data processors, and is responsive to such detections to generate and store in its cache invalidate table (CIT) memory a multiple element linked list.
Abstract: A cache coherence system for a multiprocessor system including a plurality of data processors coupled to a common main memory. Each of the data processors includes an associated cache memory having storage locations therein corresponding to storage locations in the main memory. The cache coherence system for a data processor includes a cache invalidate table (CIT) memory having internal storage locations corresponding to locations in the cache memory of the data processor. The cache coherence system detects when the contents of storage locations in the cache memories of one or more of the data processors have been modified in conjunction with the activity of those data processors, and is responsive to such detections to generate and store in its CIT memory a multiple element linked list defining the locations in the cache memories of the data processors having modified contents. Each element of the list defines one of those cache storage locations and also identifies the location in the CIT memory of the next element in the list.
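A simplified model of the cache invalidate table (CIT) described above: each recorded modification becomes a list element naming a stale cache location and pointing at the CIT location of the next element. The class and method names are illustrative assumptions.

```python
# A cache invalidate table kept as a linked list of stale cache locations.

class CITEntry:
    def __init__(self, cache_location, next_index=None):
        self.cache_location = cache_location   # cache slot now holding stale data
        self.next_index = next_index           # CIT location of the next list element

class CacheInvalidateTable:
    def __init__(self):
        self.entries = {}       # CIT location -> CITEntry
        self.head = None        # first element of the linked list
        self.next_free = 0

    def record_modification(self, cache_location):
        """Another processor wrote this location: link it onto the invalidation list."""
        index = self.next_free
        self.next_free += 1
        self.entries[index] = CITEntry(cache_location, next_index=self.head)
        self.head = index

    def invalidate_all(self, cache):
        """Walk the list and invalidate every recorded cache location."""
        index = self.head
        while index is not None:
            entry = self.entries.pop(index)
            cache.pop(entry.cache_location, None)
            index = entry.next_index
        self.head = None

if __name__ == "__main__":
    cache = {0x10: "old", 0x20: "old", 0x30: "old"}
    cit = CacheInvalidateTable()
    cit.record_modification(0x10)
    cit.record_modification(0x30)
    cit.invalidate_all(cache)
    print(cache)    # -> {0x20: 'old'}
```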

Patent
Philip Lewis Rosenfeld1, Kimming So1
13 Aug 1985
TL;DR: In this paper, a working set history table is included which keeps a record of which lines in an L2 block were utilised when resident in the L2 cache, through the use of tags.
Abstract: In a computing system including a three level memory hierarchy comprised of a first level cache (L1), a second level cache (L2) and a main memory (L3), a working set history table is included which keeps a record of which lines in an L2 block were utilised when resident in the L2 cache, through the use of tags. When this L2 block, or the material part thereof, is returned to main memory and is subsequently requested, only the lines which were utilised in the last residency are transferred to the L2 cache. That is, there is a tag for future use of a line based on its prior use during its last residency in the L2 cache.
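The mechanism above amounts to remembering, per L2 block, which lines were touched during the last residency and transferring only those on the next fetch. A minimal sketch follows, assuming a 16-line block and dictionary-based tags; the patent's actual table organization is not reproduced.

```python
# Working-set history table: tag the lines of an L2 block that were actually
# used while resident, and on re-fetch transfer only those lines.

LINES_PER_BLOCK = 16

class WorkingSetHistoryTable:
    def __init__(self):
        self.tags = {}    # block number -> set of line indices used last residency

    def note_use(self, block, line):
        """Called on each L1 reference while `block` is resident in L2.
        (A fuller model would clear the tags at the start of each new residency.)"""
        self.tags.setdefault(block, set()).add(line)

    def lines_to_fetch(self, block):
        """On re-fetch of `block`, transfer only the lines used last time."""
        used = self.tags.get(block)
        if not used:
            return list(range(LINES_PER_BLOCK))   # no history: fetch the whole block
        return sorted(used)

if __name__ == "__main__":
    wsht = WorkingSetHistoryTable()
    for line in (0, 1, 7):
        wsht.note_use(block=42, line=line)
    # ... block 42 is cast out to main memory, then requested again ...
    print(wsht.lines_to_fetch(42))   # -> [0, 1, 7]
```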

Patent
21 Mar 1985
TL;DR: In this paper, a pipelined digital computer processor system is provided comprising an instruction prefetch unit (IPU, 2) for prefetching instructions and an arithmetic logic processing unit (ALPU, 4) for executing instructions.
Abstract: A pipelined digital computer processor system (10, Figure 1) is provided comprising an instruction prefetch unit (IPU, 2) for prefetching instructions and an arithmetic logic processing unit (ALPU, 4) for executing instructions. The IPU (2) has associated with it a high speed instruction cache (6), and the ALPU (4) has associated with it a high speed operand cache (8). Each cache comprises a data store (84, 94, Figure 3) for storing frequently accessed data, and a tag store (82, 92, Figure 3) for indicating which main memory locations are contained in the respective cache. The IPU and ALPU processing units (2, 4) may access their associated caches independently under most conditions. When the ALPU performs a write operation to main memory, it also updates the corresponding data in the operand cache and, if contained therein, in the instruction cache. The IPU does not write to either cache. Provision is made for clearing the caches on certain conditions when their contents become invalid.
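The write policy described above (the ALPU's writes go to main memory, always update the operand cache, and update the instruction cache only when the address happens to be cached there, while the IPU never writes) can be summarized in a few lines. Names and structure are illustrative assumptions.

```python
# Split instruction/operand caches with the write behavior described above.

class SplitCacheSystem:
    def __init__(self):
        self.memory = {}
        self.instruction_cache = {}   # used by the IPU (prefetch); never written by it
        self.operand_cache = {}       # used by the ALPU (execution)

    def ipu_fetch(self, addr):
        if addr not in self.instruction_cache:
            self.instruction_cache[addr] = self.memory.get(addr)
        return self.instruction_cache[addr]

    def alpu_read(self, addr):
        if addr not in self.operand_cache:
            self.operand_cache[addr] = self.memory.get(addr)
        return self.operand_cache[addr]

    def alpu_write(self, addr, value):
        self.memory[addr] = value
        self.operand_cache[addr] = value          # always update the operand cache
        if addr in self.instruction_cache:        # update the I-cache only if present
            self.instruction_cache[addr] = value

    def clear_caches(self):
        """For the conditions under which cache contents become invalid."""
        self.instruction_cache.clear()
        self.operand_cache.clear()

if __name__ == "__main__":
    machine = SplitCacheSystem()
    machine.memory[0x100] = "add r1,r2"
    machine.ipu_fetch(0x100)                 # line enters the instruction cache
    machine.alpu_write(0x100, "nop")         # store updates memory and both caches
    print(machine.instruction_cache[0x100])  # -> nop
```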

Patent
25 Apr 1985
TL;DR: A register unit includes means for storing pertinent data relative to a plurality of cache transactions, identifying the zones of an addressed word block which is the subject of the individual transactions.
Abstract: A register unit includes means for storing pertinent data relative to a plurality of cache transactions, identifying the zones of an addressed word block which is the subject of the individual transactions. These data are selectively extracted from the register to control the merging of the identified zone or zones of the associated word with the remainder of the data in the addressed word block.

Patent
15 Oct 1985
TL;DR: In this paper, a directory converts logical data addresses to physical addresses in the cache, where the data is stored in blocks, and the blocks are expanded to include redundant addressing information such as the logical data address and the physical cache address.
Abstract: A redundant error-detecting addressing method and system for use in a cache memory. A directory converts logical data addresses to physical addresses in the cache where the data is stored in blocks. The blocks are expanded to include redundant addressing information such as the logical data address and the physical cache address. When a block is accessed from the cache, the redundant addressing is compared to the directory addressing information to confirm that the correct data has been accessed.
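A small sketch of the redundant-addressing check: each block carries copies of its logical address and its own cache address, and both are compared against the directory on every access, so addressing errors are detected rather than silently returning the wrong data. Field and class names are illustrative assumptions.

```python
# Redundant addressing in a cache: store the logical address and the physical
# cache address with each block and verify them on every access.

class AddressingError(Exception):
    pass

class Block:
    def __init__(self, data, logical_addr, cache_addr):
        self.data = data
        self.logical_addr = logical_addr   # redundant copy of the data's logical address
        self.cache_addr = cache_addr       # redundant copy of the block's own slot

class CheckedCache:
    def __init__(self):
        self.directory = {}   # logical address -> physical cache address
        self.blocks = {}      # physical cache address -> Block

    def store(self, logical_addr, cache_addr, data):
        self.directory[logical_addr] = cache_addr
        self.blocks[cache_addr] = Block(data, logical_addr, cache_addr)

    def load(self, logical_addr):
        cache_addr = self.directory[logical_addr]
        block = self.blocks[cache_addr]
        # Confirm the redundant addressing before trusting the data.
        if block.logical_addr != logical_addr or block.cache_addr != cache_addr:
            raise AddressingError(f"directory/block mismatch at {hex(cache_addr)}")
        return block.data

if __name__ == "__main__":
    c = CheckedCache()
    c.store(logical_addr=0x1234, cache_addr=0x40, data="payload")
    print(c.load(0x1234))               # -> payload
    c.directory[0x1234] = 0x80          # simulate a directory addressing error
    c.blocks[0x80] = Block("junk", 0x9999, 0x80)
    try:
        c.load(0x1234)
    except AddressingError as e:
        print("detected:", e)
```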

Patent
26 Jan 1985
TL;DR: In this paper, a cache invalidation controller contains a base register 1, a distance resistance register 2, an element number register 3, a block size register 4, an address forming circuit 5, element number checking circuit 6, a vector address degeneration designating circuit 7, a directory 100, an invalidating circuit 150 and transfer buses 201-207.
Abstract: PURPOSE: To perform the cache invalidation processing for every block address and so improve the processing efficiency, by checking the number of vector store elements contained in one block size of a cache memory. CONSTITUTION: A cache invalidation controller contains a base register 1, a distance register 2, an element number register 3, a block size register 4, an address forming circuit 5, an element number checking circuit 6, a vector address degeneration designating circuit 7, a directory 100, an invalidating circuit 150 and transfer buses 201-207. The directory 100 is provided with a set address register 101, a block address register 102, memory circuits 110 and 111, comparators 121 and 122, gates 131-133 and registers 140-143. The circuit 150 is provided with a V-bit read address register 151, V-bit memory circuits 153 and 154, a V-bit invalidation writing address register 155 and an invalidation control circuit 156.
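Reading the machine-translated abstract above, the point is to invalidate once per cache block rather than once per stored vector element. The sketch below computes the distinct block addresses covered by a strided vector store and clears a valid bit per block; the element size, block size, and register roles are illustrative assumptions, not the patent's circuit.

```python
# Per-block cache invalidation for a strided vector store: work out which
# cache blocks the store touches and invalidate each block once.

def blocks_touched(base, distance, n_elements, element_bytes=8, block_bytes=256):
    """Return the distinct block addresses covered by the vector store."""
    blocks = set()
    for i in range(n_elements):
        addr = base + i * distance * element_bytes
        blocks.add(addr // block_bytes * block_bytes)
    return sorted(blocks)

def invalidate_vector_store(cache_valid_bits, base, distance, n_elements):
    """Clear the valid (V) bit once for each block the store covers."""
    for block in blocks_touched(base, distance, n_elements):
        cache_valid_bits.pop(block, None)

if __name__ == "__main__":
    # A unit-stride store of 64 eight-byte elements starting at 0x1000 covers
    # 512 bytes, i.e. only two 256-byte blocks, so two invalidations suffice.
    print(blocks_touched(0x1000, distance=1, n_elements=64))
```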