
Showing papers on "Cache" published in 1985


Journal ArticleDOI
01 Jun 1985
TL;DR: The protocol and its VLSI realization are described in some detail, to emphasize the important implementation issues, in particular, the controller critical sections and the inter- and intra-cache interlocks needed to maintain cache consistency.
Abstract: We present an ownership-based multiprocessor cache consistency protocol, designed for implementation by a single chip VLSI cache controller. The protocol and its VLSI realization are described in some detail, to emphasize the important implementation issues, in particular, the controller critical sections and the inter- and intra-cache interlocks needed to maintain cache consistency. The design has been carried through to layout in a P-Well CMOS technology.

286 citations


Journal ArticleDOI
TL;DR: It is found that disk cache is a powerful means of extending the performance limits of high-end computer systems.
Abstract: The current trend of computer system technology is toward CPUs with rapidly increasing processing power and toward disk drives of rapidly increasing density, but with disk performance increasing very slowly if at all. The implication of these trends is that at some point the processing power of computer systems will be limited by the throughput of the input/output (I/O) system. A solution to this problem, which is described and evaluated in this paper, is disk cache. The idea is to buffer recently used portions of the disk address space in electronic storage. Empirically, it is shown that a large (e.g., 80-90 percent) fraction of all I/O requests are captured by a cache of an 8-Mbyte order-of-magnitude size for our workload sample. This paper considers a number of design parameters for such a cache (called cache disk or disk cache), including those that can be examined experimentally (cache location, cache size, migration algorithms, block sizes, etc.) and others (access time, bandwidth, multipathing, technology, consistency, error recovery, etc.) for which we have no relevant data or experiments. Consideration is given to both caches located in the I/O system, as with the storage controller, and those located in the CPU main memory. Experimental results are based on extensive trace-driven simulations using traces taken from three large IBM or IBM-compatible mainframe data processing installations. We find that disk cache is a powerful means of extending the performance limits of high-end computer systems.
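The trace-driven evaluation described above can be approximated in a few lines. The sketch below is a minimal LRU disk-cache simulator that replays a trace of block numbers and reports the hit ratio; the block size, cache capacity, and synthetic trace are illustrative assumptions, not the paper's parameters or workloads.

```python
from collections import OrderedDict

def simulate_disk_cache(trace, capacity_blocks):
    """Replay a trace of disk block numbers through an LRU cache
    and return the fraction of requests that hit in the cache."""
    cache = OrderedDict()              # block number -> None, ordered by recency
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)   # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(trace)

# Illustrative only: an 8-Mbyte cache of 4-Kbyte blocks holds 2048 blocks.
if __name__ == "__main__":
    import random
    random.seed(1)
    # Synthetic trace with locality: 90% of requests go to a small hot set.
    hot, cold = range(1000), range(1000, 100000)
    trace = [random.choice(hot) if random.random() < 0.9 else random.choice(cold)
             for _ in range(50000)]
    print("hit ratio:", simulate_disk_cache(trace, capacity_blocks=2048))
```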

233 citations


Patent
27 Jun 1985
TL;DR: In this paper, the authors propose a cache coherency management scheme for a shared bus multiprocessor which includes several processors, each having its own private cache memory; each private cache is connected to a first bus to which a second, higher level cache memory is also connected.
Abstract: A caching system for a shared bus multiprocessor which includes several processors each having its own private cache memory. Each private cache is connected to a first bus to which a second, higher level cache memory is also connected. The second, higher level cache in turn is connected either to another bus and higher level cache memory or to main system memory through a global bus. Each higher level cache includes enough memory space so as to enable the higher level cache to have a copy of every memory location in the caches on the level immediately below it. In turn, main memory includes enough space for a copy of each memory location of the highest level of cache memories. The caching can be used with either write-through or write-deferred cache coherency management schemes.
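The key invariant in this hierarchy is what is often called inclusion: every line held by a lower-level private cache is also present in the cache above it. The toy model below, with assumed class and method names rather than the patent's structures, shows how a fill into a private cache also fills the higher-level cache, and how an eviction from the higher level must invalidate the matching lines below so the invariant survives.

```python
class Cache:
    """Toy fully-associative cache used to illustrate the inclusion property."""
    def __init__(self, capacity, parent=None):
        self.capacity = capacity
        self.lines = set()        # block addresses currently cached
        self.parent = parent      # next higher-level cache (None at the top)
        self.children = []        # lower-level caches backed by this cache
        if parent:
            parent.children.append(self)

    def fill(self, addr):
        # Filling a lower-level cache also fills every level above it,
        # so a higher level always holds a superset of the levels below.
        if self.parent:
            self.parent.fill(addr)
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self._evict(next(iter(self.lines)))   # arbitrary victim
            self.lines.add(addr)

    def _evict(self, addr):
        # Evicting from a higher level must invalidate the line in all
        # children, otherwise inclusion would be violated.
        self.lines.discard(addr)
        for child in self.children:
            if addr in child.lines:
                child._evict(addr)

# Usage: a shared second-level cache backing two private caches.
l2 = Cache(capacity=8)
p0, p1 = Cache(4, parent=l2), Cache(4, parent=l2)
p0.fill(0x100)
assert 0x100 in l2.lines            # inclusion holds after the fill
```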

190 citations


01 Jan 1985
TL;DR: The model shows that the majority of the cache misses that OPT avoids over LRU come from the most-recently-discarded lines of the LRU cache, which leads to three realizable near-optimal replacement algorithms that try to duplicate the replacement decisions made by OPT.
Abstract: This thesis describes a model used to analyze the replacement decisions made by LRU and OPT (Least-Recently-Used and an optimal replacement-algorithm). The model identifies a set of lines in the LRU cache that are dead, that is, lines that must leave the cache before they can be rereferenced. The model shows that the majority of the cache misses that OPT avoids over LRU come from the most-recently-discarded lines of the LRU cache. Also shown is that a very small set of lines account for the majority of the misses that OPT avoids over LRU. OPT requires perfect knowledge of the future and is not realizable, but our results lead to three realizable near-optimal replacement algorithms. These new algorithms try to duplicate the replacement decisions made by OPT. Simulation results, using a trace-tape and cache simulator, show that these new algorithms achieve up to eight percent fewer misses than LRU and obtain about 20 percent of the miss reduction that OPT obtains. Also presented in the thesis are two new trace-tape reduction techniques. Simulation results show that reductions in trace-tape length of two orders of magnitude are possible with little or no simulation error introduced.
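A minimal experiment in the spirit of this analysis is sketched below: an LRU cache simulator that, on every miss, checks whether the missed line is the one most recently discarded. The fully associative organization, line granularity, and synthetic reference string are assumptions made for illustration, not the thesis's actual configuration.

```python
from collections import OrderedDict

def lru_with_discard_tracking(trace, capacity):
    """Simulate an LRU cache and count how many misses are re-references
    to the most recently discarded line, the kind of line the thesis
    identifies as the main source of OPT's advantage over LRU."""
    cache = OrderedDict()
    last_discarded = None
    misses = misses_to_last_discarded = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)
            continue
        misses += 1
        if line == last_discarded:
            misses_to_last_discarded += 1
        cache[line] = None
        if len(cache) > capacity:
            last_discarded, _ = cache.popitem(last=False)
    return misses, misses_to_last_discarded

# Example with a synthetic reference string.
trace = [0, 1, 2, 3, 0, 1, 2, 3, 4, 3, 4, 0]
print(lru_with_discard_tracking(trace, capacity=4))
```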

133 citations


Journal ArticleDOI
TL;DR: In this article, the authors present the results of a set of measurements and simulations of translation buffer performance in the VAX-11/780; measurements were made under normal time-sharing use and while running reproducible synthetic time-sharing workloads.
Abstract: A virtual-address translation buffer (TB) is a hardware cache of recently used virtual-to-physical address mappings. The authors present the results of a set of measurements and simulations of translation buffer performance in the VAX-11/780. Two different hardware monitors were attached to VAX-11/780 computers, and translation buffer behavior was measured. Measurements were made under normal time-sharing use and while running reproducible synthetic time-sharing work loads. Reported measurements include the miss ratios of data and instruction references, the rate of TB invalidations due to context switches, and the amount of time taken to service TB misses. Additional hardware measurements were made with half the TB disabled. Trace-driven simulations of several programs were also run; the traces captured system activity as well as user-mode execution. Several variants of the 11/780 TB structure were simulated.

128 citations


Journal ArticleDOI
01 Jun 1985
TL;DR: In this article, the authors present measurements from a very wide variety of traces: there are 49 traces, taken from 6 machine architectures (370, 360, VAX, M68000, Z8000, CDC 6400), coded in 7 source languages.
Abstract: The selection of the "best" parameters for a cache design, such as size, mapping algorithm, fetch algorithm, line size, etc., is dependent on the expected workload. Similarly, the machine performance is sensitive to the cache performance, which itself depends on the workload. Most cache designers have been greatly handicapped in their designs by the lack of realistic cache performance estimates. Published research generally presents data which is unrealistic in some respects, and available traces are often not representative. In this paper, we present measurements from a very wide variety of traces: there are 49 traces, taken from 6 machine architectures (370, 360, VAX, M68000, Z8000, CDC 6400), coded in 7 source languages. Statistics are shown for miss ratios, the effectiveness of prefetching in terms of both miss ratio and its effect on bus traffic, the frequency of writes, reads and instruction fetches, and the frequency of branches. Some general observations are made and a "design estimate" set of miss ratios is proposed. Some "fudge" factors are proposed by which statistics for workloads for one machine architecture can be used to estimate corresponding parameters for another (as yet unrealized) architecture.

1. Introduction

Almost all medium and high performance machines and most high performance microprocessors now being designed will include cache memories to be used for instructions, for data or for both. There are a number of choices to be made regarding the cache, including size, line size (block size), mapping algorithm, replacement algorithm, writeback algorithm, split (instructions/data) vs. unified, fetch algorithm, et cetera; see [Smit82] for a detailed discussion of these issues. Making the "best" choices and selecting the "best" parameters (with respect to cost and performance) depends greatly on the workload to be expected [Macd84]. For example, a cache which achieves a 99% hit ratio may cost 80% more than one which achieves 98%, may increase the CPU cost by 25% and may only boost overall CPU performance by 8%; that suggests that the higher performing cache is not cost effective. However, if the same two designs yield hit ratios of 90% and 80% respectively, and if the performance increase would be 50%, then different conclusions might well be reached. Computer architects have been handicapped by the lack of generally available realistic cache workload estimates. While there are hundreds of published papers on cache memories (see [Smit82] for a partial bibliography), only a few present usable data. A large fraction contain no measurements at all. Almost all of the papers that do present measurements rely on trace driven simulation using a small set of traces, and for reasons explained further below, those traces are likely to be unrepresentative of the results to be expected in practice. There do exist some realistic numbers, as we note below, but they are hardly enough to constitute a design database. The purpose of this paper is to discuss and explain workload selection as it relates to cache memory design, and to present data from which the designer can work.
We have used 49 program address traces taken from 6 (or 5, if the 360 and 370 are the same) machine architectures (VAX, 370, 360/91, Z8000, CDC 6400, M68000), derived from 7 programming languages (Fortran, 370 Assembler, APL, C, LISP, AlgolW, Cobol) to compute overall, instruction and data miss ratios and bus traffic rates for various cache designs; these experiments show the variety of workload behavior possible. Characteristics of the traces are tabulated and the effects of some design choices are evaluated. Finally, we present what we consider to be a "reasonable" set of numbers with which we believe designers can comfortably work. In that discussion, we also suggest some "fudge" factors, which indicate how realistic (or available) numbers for machine architecture M1 under workload conditions W1 can be used to estimate similar parameters for architecture M2 under workload W1. In the remainder of this section, we discuss additional background for our measurement results. First we consider the advantages and disadvantages of trace driven simulation. Then we review some (possible) cases of performance misprediction and also discuss some published and valid miss ratio figures. The second section discusses the traces used. The measurement results and analysis are in section 3, and in section 4 we propose target workload values and factors by which one workload can be used to estimate another. Section 5 summarizes our findings.

1.1. Trace Driven Simulation

A program address trace is a trace of the sequence of (virtual) addresses accessed by a computer program or programs. Trace driven simulation involves driving a simulation model of a system with a trace of external stimuli rather than with a random number generator. Trace driven simulation is a very good way to study many aspects of cache design and performance, for a number of reasons. First, it is superior to either pure mathematical models or random number driven simulation because there do not currently exist any generally accepted or believable models for those characteristics of program behavior that determine cache performance; thus it is not possible to specify a realistic model nor to drive a simulator with a good representation of a program. A trace properly represents at least one real program, and in certain respects can be expected to drive the simulator correctly. It is important to note that a trace reflects not only the program traced and the functional architecture of the machine (instruction set) but also the design architecture (higher level implementation). In particular, the number of memory references is affected by the width of the data path to memory: fetching two four-byte instructions requires 4, 2 or 1 memory references, depending on whether the memory interface is 2, 4 or 8 bytes wide. It also depends on how much "memory" the interface itself has; if one request is for 4 bytes, the next request is for the next four bytes, and the interface is 8 bytes wide, then fewer fetches will result if the interface "remembers" that it has the target four bytes of the second fetch rather than redoing the fetch. The interface can be quite complex, as with the Ifetch buffer in the VAX 11/780 [Clar83], and can behave differently for instructions and data. (A trace should reflect, to the greatest possible extent, only the functional architecture; the design architecture should and usually can be emulated in the simulator.)
A simulator is also much better in many ways than the construction of prototype designs. It is far faster to build a simulator, and the design being simulated can be varied easily, sometimes by just changing an input parameter. Conversely, a hardware prototype can require man-years to build and can be varied little if at all. Also, the results of a live workload tend to yield slightly different results (e.g. 1% to 3%) from run to run, depending on the random setting of initial conditions such as the angular position of the disks [Curt75]. For the reasons given above, trace driven simulation has been used for almost every research paper which presents cache measurements, with a few exceptions discussed below. There are, however, several reasons why the results of trace driven simulations should be taken with a grain of salt. (1) A trace driven simulation of a million memory addresses, which is fairly long, represents about 1/30 of a second for a machine such as the IBM 3081, and only about one second for an M68000; thus a trace is only a very small sample of a real workload. (2) Traces seldom are taken from the "messiest" parts of large programs; more often they are traces of the initial portions of small programs. (3) It is very difficult to trace the operating system (OS) and few OS traces are available. On many machines, however, the OS dominates the workload. (4) Most real machines task switch every few thousand instructions and are constantly taking interrupts. It is difficult to include this effect accurately in a trace driven simulation and many simulators don't try. (5) The sequence of memory addresses presented to the cache can vary with hardware buffers such as prefetch buffers and loop buffers, and is certainly sensitive to the data path width. Thus the trace itself may not be completely accurate with respect to the implementation of the architecture. (6) In running machines, a certain (usually small) fraction of the cache activity is due to input/output; this effect is seldom included in trace driven simulations. In this paper we are primarily concerned with items 1-3 immediately above. By presenting the results of a very large number of simulations, one can get an idea of the range of program behavior. Included are two traces of IBM's MVS operating system, which should have performance that is close to the worst likely to be observed.

1.2. Real Workloads and Questionable Estimates

There are only a small number of papers which provide measurements taken by hardware monitors from running machines. In [Mila75] it is reported that a 16K cache on an IBM 370/165-2 running VS2 had a 0.94 hit ratio, with 1.6 fetches per instruction and .22 stores/instruction; it is also found that 73% of the CPU cycles were used in supervisor state. Merrill [Merr74] found cache hit ratios for a 16K cache in the 370/168 of 0.932 to 0.997 for six applications programs, and also reports that the performance (MI
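The data-path-width arithmetic quoted in the excerpt above (two adjacent four-byte instruction fetches over a 2-, 4-, or 8-byte interface) can be checked with a short sketch. This is an illustrative model only, not code from the paper; the "remembers" flag stands in for an interface that retains the last word it fetched.

```python
def references_for_two_instructions(interface_bytes, remembers=False):
    """Count memory references needed to fetch two adjacent 4-byte
    instructions starting at address 0, for a given interface width."""
    needed = [(0, 4), (4, 8)]            # byte ranges of the two instructions
    refs, last_fetched = 0, None
    for start, end in needed:
        addr = start
        while addr < end:
            chunk = addr - (addr % interface_bytes)   # aligned fetch unit
            if chunk != last_fetched:
                refs += 1
                if remembers:
                    last_fetched = chunk              # interface keeps this word
            addr = chunk + interface_bytes
    return refs

for width in (2, 4, 8):
    print(width, references_for_two_instructions(width),
          references_for_two_instructions(width, remembers=True))
# Prints 4, 2 and 2 references for 2-, 4- and 8-byte interfaces,
# dropping to 1 for the 8-byte interface that remembers its last fetch.
```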

119 citations


Patent
28 Jun 1985
TL;DR: In this paper, the authors propose a method and apparatus for reducing initialization time in a peripheral storage system including a large number of subsystems, each including a cache store in a control unit and a back store having a plurality of DASD storage devices.
Abstract: Method and apparatus for reducing initialization time in a peripheral storage system (10) including a large number of subsystems, each including a cache store in a control unit (16-24) and a back store having a plurality of DASD storage devices, (26-38) includes means for receiving an initialization command from a host system (12); means for generating a first signal indicating host command accepted; first means for allocating locations in the cache store for an index table; second means for allocating additional locations in the cache store for subsystem control structures; third means for allocating locations in the cache store for track buffers; fourth means for allocating a limited number of record slots in cache, said number determined by a maximum allowable system initialization time, a total number of subsystem devices to be initialized and a time required to initialize each record slot; and means for generating a second signal indicating that a limited initialization has been completed.

90 citations


Patent
28 Jun 1985
TL;DR: In this article, the most recently updated version of one or more records stored in the subsystem is transferred, where one version of a record may be in a back store and a modified version of the record may be in a cache storage, by a method and apparatus including table lookup means for indicating, for each record of each track of each device in said back store, whether record data stored in cache is modified with respect to a version of the same record stored in back store.
Abstract: Data is accessed by record identification in a peripheral sub-system having cache and back store to transfer the most recently updated version of one or more records stored in the subsystem where one version of a record may be in a back store and a modified version of the record may be in a cache storage, by a method and apparatus including table lookup means for indicating for each record of each track of each device in said back store whether record data stored in cache is modified with respect to a version of the same record stored in back store; and means responsive to the indicating means for transferring the modified or backing store version of a record in accordance with predetermined criteria.
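A hedged sketch of the table-lookup idea: a map keyed by (device, track, record) notes whether the cached copy is modified relative to the back store, and the read path uses it to choose which version to return. The dictionary layout and function names are illustrative assumptions, not the patent's actual structures.

```python
# Illustrative only: modified-record tracking for a cache/back-store subsystem.
modified = {}          # (device, track, record) -> True if cache copy is newer
cache_store = {}       # (device, track, record) -> cached record data
back_store = {}        # (device, track, record) -> backing-store record data

def write_record(key, data):
    """A host write lands in the cache and marks the record as modified."""
    cache_store[key] = data
    modified[key] = True

def read_record(key):
    """Return the most recently updated version of the record."""
    if modified.get(key):
        return cache_store[key]        # cache holds the newer version
    return back_store.get(key)         # otherwise use the back-store version

back_store[("dev0", 3, 7)] = "old"
write_record(("dev0", 3, 7), "new")
assert read_record(("dev0", 3, 7)) == "new"
```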

81 citations


Journal ArticleDOI
TL;DR: Instruction cache replacement policies and organizations are analyzed both theoretically and experimentally, and it is concluded theoretically that random replacement is better than LRU and FIFO, and that under certain circumstances, a direct-mapped or set-associative cache may perform better than a fully associative cache organization.
Abstract: Instruction cache replacement policies and organizations are analyzed both theoretically and experimentally. Theoretical analyses are based on a new model for cache references, the loop model. First the loop model is used to study replacement policies and cache organizations. It is concluded theoretically that random replacement is better than LRU and FIFO, and that under certain circumstances, a direct-mapped or set-associative cache may perform better than a fully associative cache organization. Experimental results using instruction trace data are then given and analyzed. The experimental results indicate that the loop model provides a good explanation for observed cache performance.
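The loop model's counterintuitive conclusion is easy to reproduce: a loop that is one line longer than the cache misses on every reference under LRU (and FIFO), while random replacement keeps some loop lines resident. The sketch below is a small illustration under these assumptions, not the paper's analysis.

```python
import random

def miss_ratio(trace, capacity, policy):
    """Fully associative cache with LRU or random replacement."""
    cache, misses = [], 0
    for line in trace:
        if line in cache:
            if policy == "lru":
                cache.remove(line)
                cache.append(line)      # most recently used goes to the tail
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = 0 if policy == "lru" else random.randrange(len(cache))
            cache.pop(victim)           # LRU evicts the head; random picks any line
        cache.append(line)
    return misses / len(trace)

random.seed(0)
loop = list(range(9)) * 200            # a 9-line loop, repeated
print("LRU   :", miss_ratio(loop, capacity=8, policy="lru"))     # 1.0
print("random:", miss_ratio(loop, capacity=8, policy="random"))  # well below 1.0
```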

77 citations


Patent
John Hrustich, Earl Whitney Jackson
20 Aug 1985
TL;DR: In this paper, the first processor will query the cache of the second processor in an attempt to locate the specific data, and if the data is located, an indication of the existence of the data in the cache is sent and stored in a status register associated with the second Processor.
Abstract: In a multiprocessor system, including a first processor, a second processor, a main memory (otherwise termed a Basic Storage Module - BSM), and a control circuit (termed BSM controls), the first processor may attempt to locate specific data in its cache and fail to locate this data. The first processor will query the cache of the second processor in an attempt to locate the specific data. If the data is located, an indication of the existence of the data in the cache of the second processor is sent and stored in a status register associated with the second processor. As a result, preliminary steps have been taken to "flush" or move the data from the cache of the second processor to the BSM in order for the first processor to utilize the data. However, prior to the flush operation, it is necessary for the second processor to synchronize its clocks with the clocks of the BSM controls. When this synchronization is complete, the data in the cache of the second processor is flushed or moved to the BSM. The status register associated with the second processor is cleared. The data is then utilized by the first processor in the execution of an instruction. The present invention is directed to the synchronization of the clocks of the second processor with the clocks of the BSM controls prior to flushing desired data from the cache of the second processor to the BSM.

74 citations


Patent
17 Jul 1985
TL;DR: Reference Counting as discussed by the authors is a method and apparatus for managing a block oriented memory of the type in which each memory block has an associated reference count representing the number of pointers to it from other memory blocks and itself.
Abstract: A method and apparatus for managing a block oriented memory of the type in which each memory block has an associated reference count representing the number of pointers to it from other memory blocks and itself. Efficient and cost-effective implementation of reference counting alleviates the need for frequent garbage collection, which is an expensive operation. The apparatus includes a hash table in which the virtual addresses of blocks of memory whose reference counts equal zero are maintained. When the reference count of a block increases from zero, its virtual address is removed from the table. When the reference count of a block decreases to zero, its virtual address is inserted into the table. When the table is full, a reconciliation operation is performed to identify those addresses which are contained in a set of binding registers associated with the CPU, and any addresses not contained in the binding registers are evacuated into a garbage buffer for subsequent garbage collection operations. The apparatus can be implemented by a cache augmented by the hash table, providing a back-up store for the cache.
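A software analogue of the hash-table mechanism, with assumed names: a zero-count table holds the addresses of blocks whose reference count is zero, increments remove an address from the table, decrements to zero insert it, and a full table triggers reconciliation against a set of "binding registers", with unreferenced addresses moved to a garbage buffer. This is a toy model, not the patent's hardware.

```python
class ZeroCountTable:
    """Toy model of a hash table of zero-reference-count block addresses."""
    def __init__(self, table_size):
        self.table_size = table_size
        self.zero_count = set()        # virtual addresses with refcount == 0
        self.refcount = {}             # virtual address -> reference count
        self.garbage_buffer = []       # candidates for later garbage collection

    def increment(self, addr):
        self.refcount[addr] = self.refcount.get(addr, 0) + 1
        self.zero_count.discard(addr)      # no longer zero, remove from table

    def decrement(self, addr, binding_registers):
        self.refcount[addr] -= 1
        if self.refcount[addr] == 0:
            if len(self.zero_count) >= self.table_size:
                self.reconcile(binding_registers)
            self.zero_count.add(addr)

    def reconcile(self, binding_registers):
        # Addresses still held in binding registers stay; the rest are
        # evacuated to the garbage buffer for a later collection pass.
        live = self.zero_count & set(binding_registers)
        self.garbage_buffer.extend(self.zero_count - live)
        self.zero_count = live

zct = ZeroCountTable(table_size=2)
for a in (0x10, 0x20, 0x30):
    zct.increment(a)
for a in (0x10, 0x20, 0x30):
    zct.decrement(a, binding_registers={0x30})
print(zct.zero_count, zct.garbage_buffer)
```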

Patent
12 Nov 1985
TL;DR: In this article, a physical cache unit (100) is used within a computer (20) whose central processor (156) includes an address translation unit (118), an instruction processing unit (126), an address scalar unit (142), a vector control unit (144), and vector processing units (148, 150).
Abstract: A physical cache unit (100) is used within a computer (20). The computer (20) further includes a main memory (99), a memory control unit (22), input/output processors (54, 68) and a central processor (156). The central processor includes an address translation unit (118), an instruction processing unit (126), an address scalar unit (142), a vector control unit (144) and vector processing units (148, 150). The physical cache unit (100) stores operands in a data cache (180), the operands for delivery to and receipt from the central processor (156). Addresses for requested operands are received from the central processor (156) and are examined concurrently during one clock cycle in tag stores (190 and 192). The tag stores (190 and 192) produce tags which are compared in comparators (198 and 200) to the tag of physical addresses received from the central processor (156). If a comparison succeeds (a hit), both of the requested operands are read, during one clock period, from the data cache (180) and transmitted to the central processor (156). If the requested operands are not in the data cache (180) they are fetched from the main memory (99). The operands requested from the main memory (99) within a block are placed in a buffer (188) and/or transmitted directly through a bypass bus (179) to the central processor (156). Concurrently, the block of operands fetched from main memory (99) may be stored in the data cache (180) for subsequent delivery to the central processor (156) upon request. Further, a block of operands from the central processor (156) can be transmitted directly to the memory control unit (22), bypassing the data cache (180).

Patent
11 Mar 1985
TL;DR: In this paper, a direct-execution microprogrammable microprocessor system uses an emulatory microprogrammable microprocessor for direct execution of microinstructions in main memory through a microinstruction port.
Abstract: A direct-execution microprogrammable microprocessor system uses an emulatory microprogrammable microprocessor for direct execution of microinstructions in main memory through a microinstruction port. A microinstruction cache with a microinstruction address extension unit serves to communicate microinstructions from the main memory to the microprogrammable microprocessor. Virtual main memory accesses occur through a system multiplexer. A virtual address extension unit and a virtual address bus provide extension and redefinition of the main memory address space of the microprogrammable microprocessor. The system also uses a context-switching stack cache and an expanded address translation cache, with the microprogrammable microprocessor having a reduced and redefined microinstruction set with a variable microinstruction cycle.

Journal ArticleDOI
TL;DR: The architecture of a very-high-speed logic simulation machine (HAL), which can simulate up to one-half million gates and 2M-byte memory chips at a 5 ms clock speed, is described and is now in use as a tool for large mainframe computer development.
Abstract: The architecture of a very-high-speed logic simulation machine (HAL), which can simulate up to one-half million gates and 2M-byte memory chips at a 5 ms clock speed, is described. This machine makes it possible to debug the total system (CPU, main memory, cache memory and control storage) before the actual machine is fabricated. HAL employs parallel and pipeline processing, and event-driven, block-level logic simulation. The prototype system for a 32-processor system has been constructed and is now in use as a tool for large mainframe computer development. HAL is more than a thousand times faster than existing software logic simulators.

Patent
31 Jul 1985
TL;DR: In this article, a cache hierarchy to be managed by a memory management unit (MMU) combines the advantages of logical and virtual address caches by providing a cache hierarchy having a logical address cache backed up by a virtual address cache.
Abstract: A cache hierarchy to be managed by a memory management unit (MMU) combines the advantages of logical and virtual address caches by providing a cache hierarchy having a logical address cache backed up by a virtual address cache to achieve the performance advantage of a large logical address cache, and the flexibility and efficient use of cache capacity of a large virtual address cache. A physically small logical address cache is combined with a large virtual address cache. The provision of a logical address cache enables reference count management to be done completely by the controller of the virtual address cache and the memory management processor in the MMU. Since the controller of the logical address cache is not involved in the overhead associated with reference counting, higher performance is accomplished as the CPU-MMU interface is released as soon as the access to the logical address cache is completed.

Proceedings Article
18 Aug 1985
TL;DR: The analysis extends previous work on caching by considering side effects, shared data structures, program edits, and the acceptability of behavior changes caused by caching.
Abstract: A common program optimization strategy is to eliminate recomputation by caching and reusing results. We analyze the problems involved in automating this strategy: deciding which computations are safe to cache, transforming the rest of the program to make them safe, choosing the most cost-effective ones to cache, and maintaining the optimized code. The analysis extends previous work on caching by considering side effects, shared data structures, program edits, and the acceptability of behavior changes caused by caching. The paper explores various techniques for solving these problems and attempts to make explicit the assumptions on which they depend. An experimental prototype incorporates many of these techniques.
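A tiny illustration of the caching strategy being automated, in Python: reuse of results is only safe when the computation has no observable side effects and its inputs are not shared mutable structures. The helper below memoizes a pure function and is an assumption-laden sketch, not the prototype described in the paper.

```python
import functools

def cache_results(fn):
    """Reuse previously computed results for repeated calls.
    Safe only if fn is deterministic and free of side effects, and its
    arguments are immutable (hence the hashability requirement)."""
    results = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in results:
            results[args] = fn(*args)
        return results[args]
    return wrapper

@cache_results
def path_cost(edges, start, goal):
    # A stand-in for an expensive pure computation over immutable inputs.
    return sum(w for (a, b, w) in edges if a >= start and b <= goal)

edges = ((1, 2, 5), (2, 3, 7), (3, 4, 2))
print(path_cost(edges, 1, 4))   # computed
print(path_cost(edges, 1, 4))   # reused from the cache
```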

Patent
22 Feb 1985
TL;DR: In this article, a simplified cache with automatic updating for use in a memory system is presented. But the cache and the main memory receive data from a common input, and when a memory write operation is performed on data stored at a memory location for which there is a corresponding cache location, the data is written simultaneously to the cache.
Abstract: A simplified cache with automatic updating for use in a memory system. The cache and the main memory receive data from a common input, and when a memory write operation is performed on data stored at a memory location for which there is a corresponding cache location, the data is written simultaneously to the cache and to the main memory. Since a cache location corresponding to a memory location always contains a copy of the data at the memory location, there is no need for dirty bits or valid bits in the cache registers and the associated logic in the cache control. The main memory used with the invention may receive data either from a CPU or from I/O devices, and the cache includes apparatus permitting the CPU to perform cache read operations while the main memory is receiving data from an I/O device.
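A minimal model of the update rule, under assumed names: every write goes to main memory and, if the addressed location is currently mapped by the cache, to the cache in the same step, so the cached copy never diverges from memory and no dirty or valid bookkeeping is needed. This is an illustrative sketch, not the patent's hardware.

```python
class WriteThroughCache:
    """Toy direct-mapped cache that is updated on every memory write,
    so it never needs dirty or valid bits (illustrative model only)."""
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = {}                 # slot index -> (address, data)

    def write(self, memory, address, data):
        memory[address] = data          # write main memory...
        slot = address % self.num_slots
        if slot in self.slots and self.slots[slot][0] == address:
            self.slots[slot] = (address, data)   # ...and the cache copy, together

    def read(self, memory, address):
        slot = address % self.num_slots
        if slot in self.slots and self.slots[slot][0] == address:
            return self.slots[slot][1]           # hit: cache copy is always current
        data = memory.get(address)
        self.slots[slot] = (address, data)       # fill on miss
        return data

memory = {}
cache = WriteThroughCache(num_slots=16)
cache.write(memory, 0x2A, "v1")
assert cache.read(memory, 0x2A) == "v1"
cache.write(memory, 0x2A, "v2")          # both copies updated in one step
assert cache.read(memory, 0x2A) == "v2"
```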

Patent
Michael Howard Hartung
20 Nov 1985
TL;DR: In this paper, a data storage hierarchy environment with a volatile cache and a magnetic recorder as a backing store is described, where the exempted use data need to be stored only in or primarily in the cache while the retentive data is primarily stored in the Retentive store and selectively in cache.
Abstract: Data supplied to a data storage system by a host processor has one of two use status. A first use status is that the supplied data is to be retentively stored in the data storage system. A second use status is that the supplied data is exempted from the retentive storage requirement. An example of exempted use status is that data only temporarily stored in the data storage system, i.e. is transitory. A second example is data that is being manipulated prior to retentive storage, data that is temporarily volatile. Termination of the exempted use status results in either discard or a retentive storage of the exempted use data. Data integrity controls for the exempted use status data are described. The invention is described for a data storage hierarchy environment having a volatile cache and a magnetic recorder as a backing store. The exempted use data need be stored only in or primarily in the cache while retentive data is primarily stored in the retentive store and selectively in the cache.

01 Jan 1985
TL;DR: In this article, a new approach, structure-free name management, separates three activities: choosing names, selecting the storage sites for object attributes, and resolving an object's name to its attributes.
Abstract: Name services facilitate sharing in distributed environments by allowing objects to be named unambiguously and maintaining a set of application-defined attributes for each named object. Existing distributed name services, which manage names based on their syntactic structure, may lack the flexibility needed by large, diverse, and evolving computing communities. A new approach, structure-free name management, separates three activities: choosing names, selecting the storage sites for object attributes, and resolving an object's name to its attributes. Administrative entities apportion the responsibility for managing various names, while the name service's information needed to locate an object's attributes can be independently reconfigured to improve performance or meet changing demands. An analytical performance model for distributed name services provides assessments of the effect of various design and configuration choices on the cost of name service operations. Measurements of Xerox's Grapevine registration service are used as inputs to the model to demonstrate the benefits of replicating an object's attributes to coincide with sizeable localities of interest. Additional performance benefits result from clients' acquiring local caches of name service data treated as hints. A cache management strategy that maintains a minimum level of cache accuracy is shown to be more effective than the usual technique of maximizing the hit ratio; cache managers can guarantee reduced overall response times, even though clients must occasionally recover from outdated cache data.

01 Dec 1985
TL;DR: This paper presents a specification of SPUR and the results of some early architectural experiments, which include a large virtually-tagged cache, address translation without a translation buffer, LISP support with datatype tags but without microcode, multiple cache consistency in hardware, and an IEEE floating-point coprocessor without micro code.
Abstract: SPUR (Symbolic Processing Using RISCs) is a workstation for conducting parallel processing research. SPUR contains 6 to 12 high-performance homogeneous processors connected with a shared bus. The number of processors is large enough to permit parallel processing experiments, but small enough to allow packaging as a personal workstation. The restricted processor count also allows us to build powerful RISC processors, which include support for Lisp and IEEE floating-point, at reasonable cost. This paper presents a specification of SPUR and the results of some early architectural experiments. SPUR features include a large virtually-tagged cache, address translation without a translation buffer, LISP support with datatype tags but without microcode, multiple cache consistency in hardware, and an IEEE floating-point coprocessor without microcode.

Journal ArticleDOI
TL;DR: It is shown that a cache of size h, applied optimally to a uniformly random sequence on an alphabet of size d, is able to avoid faults with probability of order h/d.
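The stated bound follows from a short calculation. Under an independent, uniformly random reference model the next request is equally likely to be any of the d symbols, so whichever h symbols a policy keeps resident, the request is served from the cache with probability h/d. The derivation below is an interpretation of the one-line summary above, not the paper's proof.

```latex
% Probability that a size-h cache avoids a fault on a uniformly random request
% over an alphabet of d symbols (independent reference model).
\[
  \Pr[\text{no fault}]
    = \Pr\bigl[X \in S\bigr]
    = \sum_{s \in S} \Pr[X = s]
    = \frac{|S|}{d}
    = \frac{h}{d},
  \qquad |S| = h,
\]
where $X$ is the requested symbol and $S$ is the set of symbols currently held
in the cache; the value is the same for every replacement policy, so in
particular for the optimal one.
```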

Journal ArticleDOI
C. P. Grossman
TL;DR: It is shown that a cache as a high-speed intermediary between the processor and DASD is a major and effective step toward matching processor speed and DASD speed.
Abstract: This paper discusses three examples of a cache-DASD storage design. Precursors and developments leading up to the IBM 3880 Storage Control Subsystems are presented. The development of storage hierarchies is discussed, and the role of cache control units in the storage hierarchy is reviewed. Design and implementations are presented. Other topics discussed are cache management, performance of the subsystem, and experience using the subsystem. It is shown that a cache as a high-speed intermediary between the processor and DASD is a major and effective step toward matching processor speed and DASD speed.

Patent
12 Aug 1985
TL;DR: In this paper, a dual cache memory system employs a search cache addressed by virtual addresses, the search cache containing a plurality of recently used, pre-translated physical addresses, and a map cache contains virtual address bound data and relocation data to map the virtual address space to physical address space.
Abstract: A dual cache memory system employs a search cache addressed by virtual addresses, the search cache containing a plurality of recently used, pre-translated physical addresses. A map cache contains virtual address bound data and relocation data to map the virtual address space to physical address space. Upon receipt of a virtual address, a search of the search cache is first conducted to retrieve the associated physical address if it has been already translated. If not, a binary search of the memory map using the map cache is conducted to find that map entry whose virtual address bound identifies the region of virtual addresses containing the virtual address being mapped. The physical address is constructed from the retrieved relocation data and virtual address, and is written into the search cache for future use.
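The lookup path can be sketched as follows, with all names and structures assumed for illustration: a small dictionary plays the role of the search cache, and a sorted list of (virtual bound, relocation) entries plays the role of the memory map that is binary-searched on a search-cache miss.

```python
import bisect

search_cache = {}                       # virtual address -> physical address
# Memory map: entry i covers virtual addresses below bounds[i] (and at or
# above the previous bound), mapping them to physical = virtual + relocation.
bounds      = [0x1000, 0x4000, 0x9000]  # upper virtual-address bounds, sorted
relocations = [0x7000, 0x2000, 0xF000]

def translate(vaddr):
    if vaddr in search_cache:               # fast path: already translated
        return search_cache[vaddr]
    i = bisect.bisect_right(bounds, vaddr)  # binary search of the memory map
    if i == len(bounds):
        raise ValueError("virtual address outside mapped regions")
    paddr = vaddr + relocations[i]          # construct the physical address
    search_cache[vaddr] = paddr             # remember it for future use
    return paddr

print(hex(translate(0x2345)))   # miss: binary search of the map, then cached
print(hex(translate(0x2345)))   # hit: served from the search cache
```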

ReportDOI
01 Jan 1985
TL;DR: An analytical performance model for distributed name services provides assessments of the effect of various design and configuration choices on the cost of name service operations, and a cache management strategy that maintains a minimum level of cache accuracy is shown to be more effective than the usual technique of maximizing the hit ratio.
Abstract: Name services facilitate sharing in distributed environments by allowing objects to be named unambiguously and maintaining a set of application-defined attributes for each named object. Existing distributed name services, which manage names based on their syntactic structure, may lack the flexibility needed by large, diverse, and evolving computing communities. A new approach, structure-free name management, separates three activities: choosing names, selecting the storage sites for object attributes, and resolving an object's name to its attributes. Administrative entities apportion the responsibility for managing various names, while the name service's information needed to locate an object's attributes can be independently reconfigured to improve performance or meet changing demands. An analytical performance model for distributed name services provides assessments of the effect of various design and configuration choices on the cost of name service operations. Measurements of Xerox's Grapevine registration service are used as inputs to the model to demonstrate the benefits of replicating an object's attributes to coincide with sizeable localities of interest. Additional performance benefits result from clients' acquiring local caches of name service data treated as hints. A cache management strategy that maintains a minimum level of cache accuracy is shown to be more effective than the usual technique of maximizing the hit ratio.

Patent
01 Apr 1985
TL;DR: In this article, access to the cache table is enhanced by associating MSB portions of a virtual address with corresponding upper and lower portions of an associated cache address in a processor system with a virtual memory organization and a cache memory table.
Abstract: In a processor system with a virtual memory organization and a cache memory table storing the physical addresses corresponding to the most-recently used virtual addresses, access to the cache table is enhanced by associating upper and lower MSB portions of a virtual address with corresponding upper and lower portions of an associated cache address. The separate cache address portions are placed in separate cache address storage devices. Each cache address storage device is addressed by respective virtual address MSB portions. A physical address storage device stores physical addresses translated from virtual addresses in storage locations addressed by cache addresses associated with the respective virtual addresses from which the physical addresses were translated.

Proceedings ArticleDOI
08 Jul 1985
TL;DR: A cognitive basis for anaphora resolution and focusing is provided, and the use of definite noun phrases to refer to antecedents not in focus and of pronouns to refer to antecedents in focus is demonstrated.
Abstract: Anaphora resolution is the process of determining the referent of anaphors, such as definite noun phrases and pronouns, in a discourse. Computational linguists, in modeling the process of anaphora resolution, have proposed the notion of focusing. Focusing is the process, engaged in by a reader, of selecting a subset of the discourse items and making them highly available for further computations. This paper provides a cognitive basis for anaphora resolution and focusing. Human memory is divided into a short-term, an operating, and a long-term memory. Short-term memory can only contain a small number of meaning units and its retrieval time is fast. Short-term memory is divided into a cache and a buffer. The cache contains a subset of meaning units expressed in the previous sentences and the buffer holds a representation of the incoming sentence. Focusing is realized in the cache that contains a subset of the most topical units and a subset of the most recent units in the text. The information stored in the cache is used to integrate the incoming sentence with the preceding discourse. Pronouns should be used to refer to units in focus. Operating memory contains a very large number of units but its retrieval time is slow. It contains the previous text units that are not in the cache. It comprises the text units not in focus. Definite noun phrases should be used to refer to units not in focus. Two empirical studies are described that demonstrate the cognitive basis for focusing, the use of definite noun phrases to refer to antecedents not in focus, and the use of pronouns to refer to antecedents in focus.

Patent
01 May 1985
TL;DR: In this paper, a cache memory control system has a segment descriptor with a 1-bit cache memory unit designation field, and a register for storing data representing the cache memory units designation field.
Abstract: A cache memory control system has a segment descriptor with a 1-bit cache memory unit designation field, and a register for storing data representing the cache memory unit designation field. An output from the register is supplied to one cache memory unit, whereas inverted data of the output from the register is supplied to the other cache memory unit.

01 Jan 1985
TL;DR: The advantages and disadvantages of a large number of locations of processors, caches, buses, MMUs, and main memories were discussed and a solution to the MMU coherency problem was proposed.
Abstract: Virtually addressed caches offer advantages of improved performance and simplicity of design over real addressed caches. They have not been generally used because their implementation presents some difficulties. A technique was devised to allow the use of virtually addressed cache by multiple processes sharing global memory without cache coherency problems. When the question of how to best combine I/O subsystems with virtually addressed cache using that technique was raised, several more problems were discovered. These included the MMU coherency problem and the question of whether the MMU should be associated with the processor or with main memory. The advantages and disadvantages of a large number of locations of processors, caches, buses, MMUs, and main memories were discussed. Associating the MMU(s) with main memory rather than with the cache or the processor has a number of advantages. These advantages include a solution to the MMU coherency problem, better performance, virtual addresses for I/O which yields uniform addresses for all references, and simplicity of design. An implementation of the ideas developed in this dissertation is proposed. The system to be implemented is a multiprocessor workstation using shared global memory for multiprocessing and multiprogramming tasks. Operating system and system software issues are discussed. In the uniprocessor case, the expected performance gain due to using virtually addressed cache is significant, primarily because it allows non-paged address translation units to be used. Comparisons were made between real address cache architectures and virtual address cache architectures. In the multiprocessor case, there is also a gain in performance for all of the reasons which apply with uniprocessors, plus a reduction in bus contention. There is also a considerable reduction in the complexity of the system. All of the processors, including I/O processors, can be treated in a uniform fashion with respect to the protocol for memory access. Each processor deals with virtual addresses only. Translation of virtual addresses is deferred until a main memory reference occurs. The main memory translates the virtual address to a real address, maintains cache coherency between the various processors, detects page faults, and transfers data.

Patent
23 Aug 1985
TL;DR: In this article, the cache memory control circuit detects whether the access operation of the processor is directed to a particular region of the memory; when data is to be read from, or written to, that particular region it is copied into the cache memory, and when data is to be read from other regions the memory operation is executed immediately without waiting for the cache memory reference.
Abstract: A cache memory contained in a processor features a high efficiency in spite of its small capacity. In the cache memory control circuit, it is detected whether the access operation of the processor is directed to a particular region of the memory, and when the data is to be read out from, or is to be written onto, the particular region, the data is copied onto the cache memory and when the data is to be read out from other regions, operation of the memory is executed immediately without waiting for the reference of cache memory. By assigning the particular region for the data that is to be used repeatedly, it is possible to provide a cache memory having good efficiency in spite of its small capacity. A representative example of such data is the data in a stack.
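A minimal sketch of the control rule, with assumed names and a software flavor: accesses are cached only when the address falls inside a designated region (for example, the stack), and all other accesses go straight to memory without consulting the cache.

```python
# Illustrative model: only a designated region (e.g., the stack) is cached.
REGION_START, REGION_END = 0x8000, 0x9000     # assumed bounds of the region
cache, memory = {}, {}

def read(address):
    if REGION_START <= address < REGION_END:  # access targets the cached region
        if address not in cache:
            cache[address] = memory.get(address)   # copy into the small cache
        return cache[address]
    return memory.get(address)                # other regions bypass the cache

def write(address, value):
    memory[address] = value
    if REGION_START <= address < REGION_END:
        cache[address] = value                # keep the cached copy consistent

write(0x8010, "stack data")     # cached: reused repeatedly at low cost
write(0x0040, "bulk data")      # not cached: memory is accessed immediately
print(read(0x8010), read(0x0040))
```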

Patent
27 Sep 1985
TL;DR: In this paper, a cache memory unit is constructed to have a two-stage pipeline shareable by a plurality of sources which include two independently operated central processing units (CPUs).
Abstract: A cache memory unit is constructed to have a two-stage pipeline shareable by a plurality of sources which include two independently operated central processing units (CPUs). Apparatus included within the cache memory unit operates to allocate alternate time slots to the two CPUs which offset their operations by a pipeline stage. This permits one pipeline stage of the cache memory unit to perform a directory search for one CPU while the other pipeline stage performs a data buffer read for the other CPU. Each CPU is programmed to use less than all of the time slots allocated to it. Thus, the processing units operate conflict-free while pipeline stages are freed up for processing requests from other sources, such as replacement data from main memory or cache updates.