Showing papers on "Cache published in 1987"


Journal ArticleDOI
01 Oct 1987
TL;DR: The goals, architecture, implementation and performance analysis of the Firefly are described, some measurements of hardware performance are presented, and the degree to which SRC has been successful in producing software to take advantage of multiprocessing is discussed.
Abstract: Firefly is a shared-memory multiprocessor workstation that contains from one to seven MicroVAX 78032 processors, each with a floating point unit and a sixteen kilobyte cache. The caches are coherent, so that all processors see a consistent view of main memory. A system may contain from four to sixteen megabytes of storage. Input-output is done via a standard DEC QBus. Input-output devices are an Ethernet controller, fixed disks, and a monochrome 1024 x 768 display with keyboard and mouse. Optional hardware includes a high resolution color display and a controller for high capacity disks. Figure 1 is a system block diagram.The Firefly runs a software system that emulates the Ultrix system call interface. It also supports medium- and coarse-grained multiprocessing through multiple threads of control in a single address space. Communications are implemented uniformly through the use of remote procedure calls.This paper describes the goals, architecture, implementation and performance analysis of the Firefly. It then presents some measurements of hardware performance, and discusses the degree to which SRC has been successful in producing software to take advantage of multiprocessing.

292 citations


Proceedings ArticleDOI
01 Jun 1987
TL;DR: The model indicates that a system of over 1000 usable MIPS can be constructed using high performance microprocessors and that the additional coherency protocol overhead introduced by the clustered approach is small.
Abstract: A new, large scale multiprocessor architecture is presented in this paper. The architecture consists of hierarchies of shared buses and caches. Extended versions of shared bus multicache coherency protocols are used to maintain coherency among all caches in the system. After explaining the basic operation of the strict hierarchical approach, a clustered system is introduced which distributes the memory among groups of processors. Results of simulations are presented which demonstrate that the additional coherency protocol overhead introduced by the clustered approach is small. The simulations also show that a 128 processor multiprocessor can be constructed using this architecture which will achieve a substantial fraction of its peak performance. Finally, an analytic model is used to explore systems too large to simulate (with available hardware). The model indicates that a system of over 1000 usable MIPS can be constructed using high performance microprocessors.

238 citations


ReportDOI
01 Jan 1987
TL;DR: Techniques are developed in this dissertation to efficiently evaluate direct-mapped and set-associative caches and to examine instruction caches for single-chip RISC microprocessors, and it is demonstrated that instruction buffers will be preferred to target instruction buffers in future RISC microprocessors implemented on single CMOS chips.
Abstract: Techniques are developed in this dissertation to efficiently evaluate direct-mapped and set-associative caches. These techniques are used to study associativity in CPU caches and to examine instruction caches for single-chip RISC microprocessors. This research is motivated in general by the importance of cache memories to computer performance, and more specifically by work done to design the caches in SPUR, a multiprocessor workstation designed at UC Berkeley. The studies focus not only on abstract measures of performance such as miss ratios, but also include, when appropriate, detailed implementation factors, such as access times and gate delays. The simulation algorithms developed compute miss ratios for numerous alternative caches with one pass through an address trace, provided all caches have the same block size and use demand fetching and LRU replacement. One algorithm (forest simulation) simulates direct-mapped caches by relying on inclusion, the property that larger caches contain a superset of the data in smaller caches. The other algorithm (all-associativity simulation) simulates a broader class of direct-mapped and set-associative caches than could previously be studied with a one-pass algorithm, although somewhat less efficiently than forest simulation, since inclusion does not hold. The analysis of set-associative caches yields two major results. First, constant factors are obtained which relate the miss ratios of set-associative caches to the miss ratios of other set-associative caches. Those results are then combined with sample cache implementations to show that above certain cache sizes, direct-mapped caches have lower effective access times than set-associative caches, despite having higher miss ratios. Finally, instruction buffers and target instruction buffers are examined as organizations for instruction memory on single-chip microprocessors. The analysis focuses closely on implementation considerations, including the interaction between instruction fetches, instruction prefetches and data references, and uses the SPUR RISC design as the case study. Results show the effects of varying numerous design parameters, suggest some superior designs, and demonstrate that instruction buffers will be preferred to target instruction buffers in future RISC microprocessors implemented on single CMOS chips.
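For intuition, the sketch below evaluates several direct-mapped caches in a single pass over an address trace in the naive way, by updating one tag array per candidate size at each reference; the dissertation's forest and all-associativity algorithms reach the same miss ratios far more efficiently. The block size, cache sizes, and trace are made-up values for illustration, not taken from the dissertation.

```python
# Naive one-pass evaluation of several direct-mapped caches (illustrative only).
BLOCK = 16  # assumed block size in bytes

def miss_ratios(trace, cache_sizes):
    """Return {cache_size_bytes: miss_ratio} for direct-mapped caches."""
    tags = {size: [None] * (size // BLOCK) for size in cache_sizes}  # one tag array per size
    misses = {size: 0 for size in cache_sizes}
    for addr in trace:
        block = addr // BLOCK
        for size, tag_array in tags.items():
            index = block % (size // BLOCK)
            if tag_array[index] != block:      # miss: fetch the block on demand
                misses[size] += 1
                tag_array[index] = block
    n = len(trace)
    return {size: misses[size] / n for size in cache_sizes}

# Two addresses that conflict in the 1 KB cache but not in the 4 KB cache.
trace = [0x100, 0x500, 0x100, 0x500, 0x100, 0x500]
print(miss_ratios(trace, [1024, 4096]))
```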

236 citations


Journal ArticleDOI
TL;DR: In this article, the authors examined the cache miss ratio as a function of line size, and found that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat.
Abstract: The line (block) size of a cache memory is one of the parameters that most strongly affects cache performance. In this paper, we study the factors that relate to the selection of a cache line size. Our primary focus is on the cache miss ratio, but we also consider influences such as logic complexity, address tags, line crossers, I/O overruns, etc. The behavior of the cache miss ratio as a function of line size is examined carefully through the use of trace driven simulation, using 27 traces from five different machine architectures. The change in cache miss ratio as the line size varies is found to be relatively stable across workloads, and tables of this function are presented for instruction caches, data caches, and unified caches. An empirical mathematical fit is obtained. This function is used to extend previously published design target miss ratios to cover line sizes from 4 to 128 bytes and cache sizes from 32 bytes to 32K bytes; design target miss ratios are to be used to guide new machine designs. Mean delays per memory reference and memory (bus) traffic rates are computed as a function of line and cache size, and memory access time parameters. We find that for high performance microprocessor designs, line sizes in the range 16-64 bytes seem best; shorter line sizes yield high delays due to memory latency, although they reduce memory traffic somewhat. Longer line sizes are suitable for mainframes because of the higher bandwidth to main memory.
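As a rough illustration of the delay and traffic computation described above, the sketch below combines a hypothetical miss-ratio-versus-line-size table with assumed memory latency and bandwidth figures; none of the numbers are the paper's measured values or design target miss ratios.

```python
# Illustrative tradeoff: longer lines lower the miss ratio but cost more per transfer.
latency = 200e-9      # assumed memory latency per miss (seconds)
bandwidth = 16e6      # assumed transfer bandwidth (bytes/second)

miss_ratio = {8: 0.12, 16: 0.09, 32: 0.07, 64: 0.06, 128: 0.055}  # hypothetical values

for line in sorted(miss_ratio):
    transfer = line / bandwidth                        # time to move one line
    delay = miss_ratio[line] * (latency + transfer)    # mean delay per memory reference
    traffic = miss_ratio[line] * line                  # bytes moved per reference
    print(f"line={line:4d}B  delay/ref={delay * 1e9:6.1f} ns  traffic/ref={traffic:5.2f} B")
```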

180 citations


Proceedings ArticleDOI
01 Jun 1987
TL;DR: It is shown that cache coherence protocols can implement indivisible synchronization primitives reliably and can also enforce sequential consistency, and it is shown how such protocols can implement atomic READ&MODIFY operations for synchronization purposes.
Abstract: This paper shows that cache coherence protocols can implement indivisible synchronization primitives reliably and can also enforce sequential consistency. Sequential consistency provides a commonly accepted model of behavior of multiprocessors. We derive a simple set of conditions needed to enforce sequential consistency in multiprocessors. These conditions are easily applied to prove the correctness of existing cache coherence protocols that rely on one or multiple broadcast buses to enforce atomicity of updates; in these protocols, all processing elements must be connected to the broadcast buses. The conditions are also used in this paper to establish new protocols which do not rely on the atomicity of updates and therefore do not require single access buses to propagate invalidations or to perform distributed WRITEs. It is also shown how such protocols can implement atomic READ&MODIFY operations for synchronization purposes.
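A toy model of the underlying idea, that exclusive ownership of a cache line is enough to make a read-modify-write indivisible, is sketched below. The lock merely stands in for holding the line in an exclusive state; this is an analogy, not the protocols derived in the paper.

```python
# Exclusive ownership modeled with a lock: a conceptual sketch, not a coherence protocol.
import threading

class Line:
    def __init__(self, value=0):
        self.value = value
        self.owner_lock = threading.Lock()  # stands in for the "exclusive (modified)" state

def fetch_and_add(line, delta):
    # Acquire the line exclusively (analogous to invalidating other copies),
    # then read and modify it before any other processor can observe it.
    with line.owner_lock:
        old = line.value
        line.value = old + delta
        return old

line = Line()
threads = [threading.Thread(target=lambda: [fetch_and_add(line, 1) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(line.value)  # 4000: every update was indivisible
```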

174 citations


Patent
31 Aug 1987
TL;DR: A technique is described for performing a fast write operation: a host write request, which would normally be serviced by an immediate physical write to a data storage device, is instead written to cache and nonvolatile storage in the data storage device controller.

Abstract: A technique is described for performing a fast write operation. A host write request, which would normally be serviced by an immediate physical write to a data storage device, is instead written to cache and nonvolatile storage in the data storage device controller. Then, the controller signals the host that the write operation is complete and does not update the physical data storage device until later. A journal log is also used to provide recovery capability in the event of system failure. This technique provides high performance for the unit's operation while assuring integrity by keeping two copies of the write operation until the physical update transpires.
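A minimal sketch of the fast-write flow described above follows; the class and method names are illustrative, not taken from the patent.

```python
# Fast write: store in cache and nonvolatile storage (NVS), acknowledge, destage later.
class Controller:
    def __init__(self, device):
        self.cache = {}       # volatile copy
        self.nvs = {}         # nonvolatile copy kept until destaging completes
        self.journal = []     # log used for recovery after a failure
        self.device = device  # dict standing in for the data storage device

    def host_write(self, block, data):
        self.cache[block] = data
        self.nvs[block] = data
        self.journal.append(("write", block))
        return "complete"     # the host is acknowledged before any physical I/O

    def destage(self):
        # Later: write modified blocks to the device, then drop the second copy.
        for block, data in list(self.nvs.items()):
            self.device[block] = data
            del self.nvs[block]
            self.journal.append(("destaged", block))

disk = {}
ctl = Controller(disk)
print(ctl.host_write(7, b"payload"))  # 'complete' returned without touching the disk
ctl.destage()
print(disk[7])
```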

166 citations


Proceedings ArticleDOI
01 Jun 1987
TL;DR: This paper derives several properties of checkpoint repair mechanisms and provides algorithms for performing checkpoint repair that incur very little overhead in time and modest cost in hardware.
Abstract: Out-of-order execution and branch prediction are two mechanisms that can be used profitably in the design of supercomputers to increase performance. Unfortunately, this means there must be some kind of repair mechanism, since situations do occur that require the computing engine to repair to a known previous state. One way to handle this is by checkpoint repair. In this paper we derive several properties of checkpoint repair mechanisms. In addition, we provide algorithms for performing checkpoint repair that incur very little overhead in time and modest cost in hardware. We also note that our algorithms require no additional complexity or time for use with write-back cache memory systems than they do with write-through cache memory systems, contrary to statements made by previous researchers.
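The sketch below illustrates the general checkpoint-repair idea in miniature: state is saved before speculative work and restored when the engine must return to a known previous state. It is a toy model under assumed state (a register file only), not the hardware algorithms given in the paper.

```python
# Checkpoint and roll back a tiny machine state (illustrative only).
import copy

class Machine:
    def __init__(self):
        self.regs = [0] * 8
        self.checkpoints = []

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.regs))  # save a known good state

    def commit(self):
        self.checkpoints.pop()                  # speculation resolved correctly

    def repair(self):
        self.regs = self.checkpoints.pop()      # return to the saved state

m = Machine()
m.regs[0] = 1
m.checkpoint()        # e.g. at a predicted branch
m.regs[0] = 99        # speculative work
m.repair()            # misprediction or exception: restore the previous state
print(m.regs[0])      # 1
```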

166 citations


Journal ArticleDOI
TL;DR: This paper develops an analytical model for cache-reload transients and compares the model to observations based on several address traces and shows that the size of the transient is related to the normal distribution function.
Abstract: This paper develops an analytical model for cache-reload transients and compares the model to observations based on several address traces. The cache-reload transient is the set of cache misses that occur when a process is reinitiated after being suspended temporarily. For example, an interrupt program that runs periodically experiences a reload transient at each initiation. The reload transient depends on the cache size and on the sizes of the footprints in the cache of the competing programs, where a program footprint is defined to be the set of lines in the cache in active use by the program. The model shows that the size of the transient is related to the normal distribution function. A simulation based on program-address traces shows excellent agreement between the model and the observations.
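A simple Monte Carlo experiment conveys the reload-transient phenomenon the paper models analytically: two programs' footprints are placed in a direct-mapped cache, and the overlap, the lines that must be refetched when the first program resumes, is counted. The parameters and the assumption that each footprint line occupies a distinct cache set are illustrative; this is not the paper's model.

```python
# Monte Carlo estimate of a cache-reload transient (illustrative assumptions throughout).
import random

def reload_transient(cache_lines=1024, footprint_a=300, footprint_b=400, trials=1000):
    total = 0
    for _ in range(trials):
        sets_a = set(random.sample(range(cache_lines), footprint_a))  # program A's footprint
        sets_b = set(random.sample(range(cache_lines), footprint_b))  # competing program B
        # Lines of A that B mapped onto must be refetched when A is reinitiated.
        total += len(sets_a & sets_b)
    return total / trials

print(f"mean reload transient: {reload_transient():.1f} misses")
```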

131 citations


BookDOI
01 Jan 1987
TL;DR: This work develops an analytical cache model and efficient trace-collection and trace-driven analysis techniques for accurately characterizing large cache performance, including the effects of multiprogramming and multiprocessing.
Abstract: 1 Introduction.- 1.1 Overview of Cache Design.- 1.1.1 Cache Parameters.- 1.1.2 Cache Performance Evaluation Methodology.- 1.2 Review of Past Work.- 1.3 Then, Why This Research?.- 1.3.1 Accurately Characterizing Large Cache Performance.- 1.3.2 Obtaining Trace Data for Cache Analysis.- 1.3.3 Developing Efficient and Accurate Cache Analysis Methods.- 1.4 Contributions.- 1.5 Organization.- 2 Obtaining Accurate Trace Data.- 2.1 Current Tracing Techniques.- 2.2 Tracing Using Microcode.- 2.3 An Experimental Implementation.- 2.3.1 Storage of Trace Data.- 2.3.2 Recording Memory References.- 2.3.3 Tracing Control.- 2.4 Trace Description.- 2.5 Applications in Performance Evaluation.- 2.6 Extensions and Summary.- 3 Cache Analyses Techniques - An Analytical Cache Model.- 3.1 Motivation and Overview.- 3.1.1 The Case for the Analytical Cache Model.- 3.1.2 Overview of the Model.- 3.2 A Basic Cache Model.- 3.2.1 Start-Up Effects.- 3.2.2 Non-Stationary Effects.- 3.2.3 Intrinsic Interference.- 3.3 A Comprehensive Cache Model.- 3.3.1 Set Size.- 3.3.2 Modeling Spatial Locality and the Effect of Block Size.- 3.3.3 Multiprogramming.- 3.4 Model Validation and Applications.- 3.5 Summary.- 4 Transient Cache Analysis - Trace Sampling and Trace Stitching.- 4.1 Introduction.- 4.2 Transient Behavior Analysis and Trace Sampling.- 4.2.1 Definitions.- 4.2.2 Analysis of Start-up Effects in Single Process Traces.- 4.2.3 Start-up Effects in Multiprocess Traces.- 4.3 Obtaining Longer Samples Using Trace Stitching.- 4.4 Trace Compaction - Cache Filtering with Blocking.- 4.4.1 Cache Filter.- 4.4.2 Block Filter.- 4.4.3 Implementation of the Cache and Block Filters.- 4.4.4 Miss Rate Estimation.- 4.4.5 Compaction Results.- 5 Cache Performance Analysis for System References.- 5.1 Motivation.- 5.2 Analysis of the Miss Rate Components due to System References.- 5.3 Analysis of System Miss Rate.- 5.4 Associativity.- 5.5 Block Size.- 5.6 Evaluation of Split Caches.- 6 Impact of Multiprogramming on Cache Performance.- 6.1 Relative Performance of Multiprogramming Cache Techniques.- 6.2 More on Warm Start versus Cold Start.- 6.3 Impact of Shared System Code on Multitasking Cache Performance.- 6.4 Process Switch Statistics and Their Effects on Cache Modeling.- 6.5 Associativity.- 6.6 Block Size.- 6.7 Improving the Multiprogramming Performance of Caches.- 6.7.1 Hashing.- 6.7.2 A Hash-Rehash Cache.- 6.7.3 Split Caches.- 7 Multiprocessor Cache Analysis.- 7.1 Tracing Multiprocessors.- 7.2 Characteristics of Traces.- 7.3 Analysis.- 7.3.1 General Methodology.- 7.3.2 Multiprocess Interference in Large Virtual and Physical Caches.- 7.3.3 Analysis of Interference Between Multiple Processors.- 7.3.4 Blocks Containing Semaphores.- 8 Conclusions and Suggestions for Future Work.- 8.1 Concluding Remarks.- 8.2 Suggestions for Future Work.- Appendices.- B.1 On the Stability of the Collision Rate.- B.2 Estimating Variations in the Collision Rate.- C Inter-Run Intervals and Spatial Locality.- D Summary of Benchmark Characteristics.- E Features of ATUM-2.- E.1 Distributing Trace Control to All Processors.- E.2 Provision of Atomic Accesses to Trace Memory.- E.3 Instruction Stream Compaction Using a Cache Simulated in Microcode.- E.4 Microcode Patch Space Conservation.

118 citations


Patent
08 Sep 1987
TL;DR: A symbolic language data processing system is presented, comprising a sequencer unit, a data path unit, and a memory control unit connected on a common Lbus, with architectural support for symbolic language processing.
Abstract: A symbolic language data processing system comprises a sequencer unit, a data path unit, a memory control unit, a front-end processor, an I/O and a main memory connected on a common Lbus to which other peripherals and data units can be connected for intercommunication. The system architecture includes a novel bus network, a synergistic combination of the Lbus, microtasking, centralized error correction circuitry and a synchronous pipelined memory including processor mediated direct memory access, stack cache windows with two segment addressing, a page hash table and page hash table cache, garbage collection and pointer control, a close connection of the macrocode and microcode which enables one to take interrupts in and out of the macrocode instruction sequences, parallel data type checking with tagged architecture, procedure call and microcode support, a generic bus and a unique instruction set to support symbolic language processing.

117 citations


Journal ArticleDOI
TL;DR: The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.
Abstract: MIPS-X is a 32-b RISC microprocessor implemented in a conservative 2-μm, two-level-metal, n-well CMOS technology. High performance is achieved by using a nonoverlapping two-phase 20-MHz clock and executing one instruction every cycle. To reduce its memory bandwidth requirements, MIPS-X includes a 2-kbyte on-chip instruction cache. The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.

Journal ArticleDOI
Douglas B. Terry1
TL;DR: A new approach to managing caches of hints suggests maintaining a minimum level of cache accuracy, rather than maximizing the cache hit ratio, in order to guarantee performance improvements.
Abstract: Caching reduces the average cost of retrieving data by amortizing the lookup cost over several references to the data. Problems with maintaining strong cache consistency in a distributed system can be avoided by treating cached information as hints. A new approach to managing caches of hints suggests maintaining a minimum level of cache accuracy, rather than maximizing the cache hit ratio, in order to guarantee performance improvements. The desired accuracy is based on the ratio of lookup costs to the costs of detecting and recovering from invalid cache entries. Cache entries are aged so that they get purged when their estimated accuracy falls below the desired level. The age thresholds are dictated solely by clients' accuracy requirements instead of being suggested by data storage servers or system administrators.
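A back-of-the-envelope version of this accuracy argument is sketched below; the break-even formula is an assumption made for illustration, not a formula quoted from the paper.

```python
# Minimum hint accuracy under a simple cost model (illustrative assumption):
# expected cost using the hint  ~ (1 - a) * detect_recover_cost
# expected cost without the hint = lookup_cost
# so the hint helps when (1 - a) * detect_recover_cost <= lookup_cost.
def minimum_accuracy(lookup_cost, detect_recover_cost):
    return max(0.0, 1.0 - lookup_cost / detect_recover_cost)

print(minimum_accuracy(lookup_cost=1.0, detect_recover_cost=5.0))   # 0.8
print(minimum_accuracy(lookup_cost=1.0, detect_recover_cost=20.0))  # 0.95
```

Entries whose estimated accuracy, decayed with age, falls below such a level would then be purged, which matches the aging behavior the paper describes.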

Journal ArticleDOI
01 Oct 1987
TL;DR: In this article, a multiprocessor cache memory system is described that supplies data to the processor based on virtual addresses, but maintains consistency in the main memory, both across caches and across virtual address spaces.
Abstract: A multiprocessor cache memory system is described that supplies data to the processor based on virtual addresses, but maintains consistency in the main memory, both across caches and across virtual address spaces. Pages in the same or different address spaces may be mapped to share a single physical page. The same hardware is used for maintaining consistency both among caches and among virtual addresses. Three different notions of a cache "block" are defined: (1) the unit for transferring data to/from main storage, (2) the unit over which tag information is maintained, and (3) the unit over which consistency is maintained. The relation among these block sizes is explored, and it is shown that they can be optimized independently. It is shown that the use of large address blocks results in low overhead for the virtual address cache.

Patent
18 Aug 1987
TL;DR: In this article, a pixel data/partial address muliplexing method based on programmable tile size is proposed to reduce the number of interconnections between a pixel interpolator and the frame buffer without significantly increasing the bus cycles needed to transfer the information.
Abstract: A graphics system uses a programmable tile size and shape supported by a frame buffer memory organization wherein (X, Y) pixel addresses map into regularly offset permutations on groups of RAM address and data line assignments. This allows one RAM in eac group to be accessed with a memory cycle in unison with one RAM in each other group, up to the number of groups. During such a memory cycle each RAM can receive a different address. A tile is the collection of pixel locations associated with a collection of addresses sent to the RAM's. Because of the regular nature of the permutations these locations may be regions bounded by a single boundary that may be rectangular and of varying size and shape. Changing the mapping of (X, Y) pixel addresses to RAM addresses for the groups changes the size and shape of the tiles. A pixel data/partial address muliplexing method based on programmable tile size reduces the number of interconnections between a pixel interpolator and the frame buffer without significantly increasing the number of bus cycles needed to transfer the information. Tiles are cached. Tiles for RGB pixel values are cached in an RGB cache, while Z values are cached in a separate cache. Caching allows the principle of locality to substitute shorter bit-cycles to the cache for memory cycles to the frame buffer, resulting in improved memory throughput.

Proceedings ArticleDOI
01 Jun 1987
TL;DR: This paper describes the design of a second generation VLSI RISC processor, MIPS-X, and examines several key areas, including the organization of the on-chip instruction cache, the coprocessor interface, branches and the resulting branch delay, and exception handling.
Abstract: The design of a RISC processor requires a careful analysis of the tradeoffs that can be made between hardware complexity and software. As new generations of processors are built to take advantage of more advanced technologies, new and different tradeoffs must be considered. We examine the design of a second generation VLSI RISC processor, MIPS-X. MIPS-X is the successor to the MIPS project at Stanford University and, like MIPS, it is a single-chip 32-bit VLSI processor that uses a simplified instruction set, pipelining and a software code reorganizer. However, in the quest for higher performance, MIPS-X uses a deeper pipeline, a much simpler instruction set and achieves the goal of single cycle execution using a 2-phase, 20 MHz clock. This has necessitated the inclusion of an on-chip instruction cache and careful consideration of the control of the machine. Many tradeoffs were made during the design of MIPS-X and this paper examines several key areas. They are: the organization of the on-chip instruction cache, the coprocessor interface, branches and the resulting branch delay, and exception handling. For each issue we present the most promising alternatives considered for MIPS-X and the approach finally selected. Working parts have been received and this gives us a firm basis upon which to evaluate the success of our design.

Patent
23 Oct 1987
TL;DR: In this paper, a method for monitoring and collecting data in a multi-tier computer system, referred to as an "open" message is transmitted to a database cache computer with a list of data items in the database cache to be monitored on a change-of-state basis.
Abstract: In a method for monitoring and collecting data in a multi-tier computer system, a database operation message, referred to as an "open" message is transmitted to a database cache computer with a list of data items in the database cache computer to be monitored on a change-of-state basis. The database cache computer responds by monitoring the data items and returning unsolicited "change data" messages containing only states for data items which have changed over the monitoring period. The change data messages are sent back periodically without the need for polling by a higher-level computer. The monitoring process is terminated by closing data records in the higher-level computer which generates a "close" message to the database computer to terminate the transmission of the change data messages. Also disclosed is a database cache computer and a user interface computer for carrying out the method.

Patent
02 Dec 1987
TL;DR: A broadband branch history table organized by cache line determines from the history of branches the next cache line to be referenced and uses that information for prefetching lines into the cache.
Abstract: Apparatus for fetching instructions in a computing system. A broadband branch history table is organized by cache line. The broadband branch history table determines from the history of branches the next cache line to be referenced and uses that information for prefetching lines into the cache.
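A toy software model of the idea, remembering for each cache line which line was fetched next and prefetching accordingly, is sketched below; the structure, line size, and addresses are assumptions for illustration, not details from the patent.

```python
# Line-granular next-line predictor used to drive prefetching (illustrative only).
LINE = 64  # assumed cache line size in bytes

class LinePrefetcher:
    def __init__(self):
        self.next_line = {}   # current line address -> last observed next line address

    def on_fetch(self, prev_line_addr, line_addr):
        if prev_line_addr is not None:
            self.next_line[prev_line_addr] = line_addr    # learn the transition
        return self.next_line.get(line_addr)              # line to prefetch, if known

pf = LinePrefetcher()
prev = None
for pc in [0x1000, 0x1004, 0x2040, 0x1000, 0x1004, 0x2040]:
    line = pc - (pc % LINE)
    if line != prev:
        prediction = pf.on_fetch(prev, line)
        if prediction is not None:
            print(f"prefetch line {prediction:#x} after fetching {line:#x}")
        prev = line
```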

Patent
27 Mar 1987
TL;DR: The cache coherence system detects when the contents of storage locations in the cache memories of one or more of the data processors have been modified in conjunction with the activity of those data processors, and is responsive to such detections to generate and store in its cache invalidate table (CIT) memory a multiple element linked list.

Abstract: A cache coherence system for a multiprocessor system including a plurality of data processors coupled to a common main memory. Each of the data processors includes an associated cache memory having storage locations therein corresponding to storage locations in the main memory. The cache coherence system for a data processor includes a cache invalidate table (CIT) memory having internal storage locations corresponding to locations in the cache memory of the data processor. The cache coherence system detects when the contents of storage locations in the cache memories of one or more of the data processors have been modified in conjunction with the activity of those data processors and is responsive to such detections to generate and store in its CIT memory a multiple element linked list defining the locations in the cache memories of the data processors having modified contents. Each element of the list defines one of those cache storage locations and also identifies the location in the CIT memory of the next element in the list.

Proceedings ArticleDOI
J. H. Chang1, H. Chao1, K. So1
01 Jun 1987
TL;DR: An innovative cache accessing scheme based on a high MRU (most recently used) hit ratio is proposed for the design of a one-cycle cache in a CMOS implementation of System/370, and it is shown that with this scheme the cache access time is reduced and the performance is within 4% of a true one-cycle cache.
Abstract: An innovative cache accessing scheme based on high MRU (most recently used) hit ratio [1] is proposed for the design of a one-cycle cache in a CMOS implementation of System/370. It is shown that with this scheme the cache access time is reduced by 30 ~ 35% and the performance is within 4% of a true one-cycle cache. This cache scheme is proposed to be used in a VLSI System/370, which is organized to achieve high performance by taking advantage of the performance and integration level of an advanced CMOS technology with half-micron channel length [2]. Decisions on the system partition are based on technology limitations, performance considerations and future extendability. Design decisions on various aspects of the cache organization are based on trace simulations for both UP (uniprocessor) and MP (multiprocessor) configurations.

Patent
18 Aug 1987
TL;DR: In this article, a programmable pipelined shifter allows the dynamic alteration of the mapping between bits of the RGB intensity values and the planes of the frame buffer into which those bits are stored, as well as allowing those values to be truncated to specified lengths.
Abstract: A graphics system uses a programmable tile size and shape supported by a frame buffer memory organization wherein (X, Y) pixel addresses map into regularly offset permutations on groups of RAM address and data line assignments. Changing the mapping of (X, Y) pixel addresses to RAM addresses for the groups changes the size and shape of the tiles. A pixel data/partial address multiplexing method based on programmable tile size reduces the number of interconnections between a pixel interpolator and the frame buffer. A programmable pipelined shifter allows the dynamic alteration of the mapping between bits of the RGB intensity values and the planes of the frame buffer into which those bits are stored, as well as allowing those values to be truncated to specified lengths. Tiles are cached. Tiles for RGB pixel values are cached in an RGB cache, while Z values are cached in a separate cache. The Z buffer for hidden surface removal need not be a full size frame buffer, as a lesser portion of frame buffer is, if need be, used repeatedly. Updates to the color map are performed from a separate shadow RAM during vertical retrace. The shadow RAM is large enough to accommodate two copies of the color map, and can load them in automatic alternation, producing a blinking effect without the use of an additional plane of frame buffer memory.

Patent
Jr Thomas Henry Holman1
27 Jul 1987
TL;DR: A write-shared cache circuit for multiprocessor systems maintains data consistency throughout the system and eliminates non-essential bus accesses by utilizing additional bus lines between caches of the system as mentioned in this paper.
Abstract: A "write-shared" cache circuit for multiprocessor systems maintains data consistency throughout the system and eliminates non-essential bus accesses by utilizing additional bus lines between caches of the system and by utilizing additional logic in order to enhance the intercache communication. Data is only written through to the system bus when the data is labeled "shared". A write-miss is read only once on the system bus in an "invalidate" cycle, and then it is written only to the requesting cache.

01 Jan 1987
TL;DR: These techniques are significant extensions to the stack analysis technique (Mattson et al., 1970) which computes the read miss ratio for all cache sizes in a single trace-driven simulation, and are used to study caching in a network file system.
Abstract: This dissertation describes innovative techniques for efficiently analyzing a wide variety of cache designs, and uses these techniques to study caching in a network file system. The techniques are significant extensions to the stack analysis technique (Mattson et al., 1970) which computes the read miss ratio for all cache sizes in a single trace-driven simulation. Stack analysis is extended to allow the one-pass analysis of: (1) writes in a write-back cache, including periodic write-back and deletions, important factors in file system cache performance. (2) sub-block or sector caches, including load-forward prefetching. (3) multi-processor caches in a shared-memory system, for an entire class of consistency protocols, including all of the well-known protocols. (4) client caches in a network file system, using a new class of consistency protocols. The techniques are completely general and apply to all levels of the memory hierarchy, from processor caches to disk and file system caches. The dissertation also discusses the use of hash tables and binary trees within the simulator to further improve performance for some types of traces. Using these techniques, the performance of all cache sizes can be computed in little more than twice the time required to simulate a single cache size, and often in just 10% more time. In addition to presenting techniques, this dissertation also demonstrates their use by studying client caching in a network file system. It first reports the extent of file sharing in a UNIX environment, showing that a few shared files account for two-thirds of all accesses, and nearly half of these are to files which are both read and written. It then studies different cache consistency protocols, write policies, and fetch policies, reporting the miss ratio and file server utilization for each. Four cache consistency protocols are considered: a polling protocol that uses the server for all consistency controls; a protocol designed for single-user files; one designed for read-only files; and one using write-broadcast to maintain consistency. It finds that the choice of consistency protocol has a substantial effect on performance; both the read-only and write-broadcast protocols showed half the misses and server load of the polling protocol. The choice of write or fetch policy made a much smaller difference.
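For context, the sketch below shows the basic stack analysis idea that the dissertation extends: one pass over a trace yields stack (reuse) distances, from which the read miss ratio of every fully associative LRU cache size follows. This is a simplified illustration; the extensions for writes, sectors, multiprocessor consistency, and client caches are not shown.

```python
# One-pass stack-distance computation for fully associative LRU caches (illustrative).
from collections import Counter

def stack_distances(trace):
    stack, dist = [], Counter()
    for block in trace:
        if block in stack:
            d = stack.index(block)        # 0 = most recently used
            stack.remove(block)
            dist[d] += 1
        else:
            dist["inf"] += 1              # cold miss at every cache size
        stack.insert(0, block)
    return dist

def miss_ratio(dist, cache_size_blocks):
    n = sum(dist.values())
    hits = sum(c for d, c in dist.items() if d != "inf" and d < cache_size_blocks)
    return (n - hits) / n

dist = stack_distances([1, 2, 3, 1, 2, 4, 1, 2, 3, 4])
for size in (1, 2, 3, 4):
    print(size, round(miss_ratio(dist, size), 2))
```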

Patent
Jerry Duane Dixon1, Guy G. Sotomayor1
15 Dec 1987
TL;DR: A DASD caching system is described in which pages of sectors of data are stored by reading in a desired sector and prefetching a plurality of adjacent sectors for later access.
Abstract: In a DASD caching system, in which pages of sectors of data are stored by reading in a desired sector and prefetching a plurality of adjacent sectors for later access, errors in disk storage media cause error signals to be generated. Such errors are handled by storing indications of which sectors have errors and which do not, and accessing such indications in response to later requests for such sectors. Such indications are stored in each page in the cache. Further, a history is maintained of which pages and sectors therein, were placed in the cache in the past.

Patent
15 Sep 1987
TL;DR: A mechanism is described for determining when the contents of a block in a cache memory have been rendered stale by DMA activity external to a processor and for marking the block stale in response to a positive determination.
Abstract: A mechanism for determining when the contents of a block in a cache memory have been rendered stale by DMA activity external to a processor and for marking the block stale in response to a positive determination. The commanding unit in the DMA transfer, prior to transmitting an address, asserts a cache control signal which conditions the processor to receive the address and determine whether there is a correspondence to the contents of the cache. If there is a correspondence, the processor marks the contents of that cache location for which there is a correspondence stale.

Patent
31 Jul 1987
TL;DR: A look ahead fetch system for a pipelined digital computer predicts, in advance of decoding, the outcome of a branch instruction; the system includes a branch cache having a plurality of associative sets for storing branch target addresses indexed by the lowest significant bits of the corresponding branch instruction's address.

Abstract: A look ahead fetch system for a pipelined digital computer is provided for predicting in advance of decoding the outcome of a branch instruction. The system includes a branch cache having a plurality of associative sets for storing branch target addresses indexed by the lowest significant bits of the corresponding branch instruction's address. A memory stores a coupling bit vector indicating, for each branch cache set, whether the set contains a corresponding branch target address. The coupling bit vector is used to guide prediction logic to the correct branch cache sets for qualifying the entry contained there having an index corresponding to a fetched instruction's address, for formulating a prediction of the next instruction to be processed.

Proceedings Article
01 Jan 1987
TL;DR: Using the multiprocessor cache model for comparison, data prefetching is found to be more effective than caches in addressing the memory access bottleneck.
Abstract: The trace driven simulation of 16 numerical subroutines is used to compare instruction lookahead and data prefetching with private caches in shared memory multiprocessors with hundreds or thousands of processors and memory modules interconnected with a pipelined network. These multiprocessors are characterized by long memory access delays that create a memory access bottleneck. Using the multiprocessor cache model for comparison, data prefetching is found to be more effective than caches in addressing the memory access bottleneck. 5 refs., 6 figs.

Patent
10 Aug 1987
TL;DR: An indexed sequential file is made accessible for random or sequential reading of records while allowing concurrent modification to the file; each ordered group of records in the file is associated with timestamps referencing a deletion time of the group and the time that the group was last modified.

Abstract: An indexed sequential file is made accessible for random or sequential reading of records while allowing concurrent modification to the file. Each ordered group of records in the file is associated with timestamps referencing a deletion time of the group and the time that the group was last modified. During a current search in a group for a desired record, the timestamp referencing a deletion time of the group is compared to a search time established at the beginning of the search. For a sequential reading the timestamp referencing a last modification time of a group containing the desired record is compared to a respective timestamp corresponding to the reading of the preceding record. The comparisons provide indications of whether the group to which the desired record belongs is currently the group to be searched. The most recently modified and deleted groups are stored in a cache memory. When the cache memory is full, an incoming group and respective timestamps replaces the least recent or least likely to be used group and respective timestamps. The most recent timestamps of replaced groups' timestamps are saved in local memory and are used in the comparisons for groups not currently in the cache.

Proceedings ArticleDOI
01 Jun 1987
TL;DR: In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multi-stage network, and it is shown that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.
Abstract: In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipe-lined multi-stage network. The majority of the multiprocessor cache studies in the literature exclusively focus on the issue of cache coherence enforcement. However, there are other characteristics unique to such multiprocessors which create an environment for cache performance that is very different from that of many uniprocessors. Multiprocessor conditions are identified and modeled, including: 1) the cost of a cache coherence enforcement scheme, 2) the effect of a high degree of overlap between cache miss services, 3) the cost of a pin limited data path between shared memory and caches, 4) the effect of a high degree of data prefetching, 5) the program behavior of a scientific workload as represented by 23 numerical subroutines, and 6) the parallel execution of programs. This model is used to show that the cache miss ratio is not a suitable performance measure in the multiprocessors of interest and to show that the optimal cache block size in such multiprocessors is much smaller than in many uniprocessors.

Journal ArticleDOI
TL;DR: An approach to fast image generation that uses a high-speed serial scan converter, a somewhat slower frame buffer, and a pixel cache to match the bandwidth between the two to improve the performance of the z-buffer hidden-surface algorithm.
Abstract: This article describes an approach to fast image generation that uses a high-speed serial scan converter, a somewhat slower frame buffer, and a pixel cache to match the bandwidth between the two. Cache hit rates are improved by configuring the cache to buffer either 4 × 4 or 16 × 1 tiles of frame memory, depending on the type of operation being performed. For line drawing, the implementation described can process 300,000 30-pixel vectors per second. For shaded polygons, the system can fill 16,000 900-pixel polygons per second. In addition to buffering pixel intensity data, the pixel cache also buffers z (depth) values, improving the performance of the z-buffer hidden-surface algorithm. By utilizing z-value caching, the system can process 5800 900-pixel shaded polygons per second with hidden surfaces removed.
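The benefit of switching tile shapes can be illustrated with a toy single-entry pixel cache, as sketched below; the shapes, access patterns, and hit counts are illustrative assumptions, not measurements from the article.

```python
# Hit counts of a single-entry tile cache under two tile shapes (illustrative only).
def tile_id(x, y, shape):
    if shape == "4x4":
        return (x // 4, y // 4)
    if shape == "16x1":
        return (x // 16, y)
    raise ValueError(shape)

def hits(pixels, shape):
    cached, hit = None, 0
    for x, y in pixels:
        t = tile_id(x, y, shape)
        if t == cached:
            hit += 1
        cached = t                  # single-entry cache: a miss fetches the new tile
    return hit

horizontal = [(x, 10) for x in range(32)]   # e.g. a horizontal span
diagonal = [(i, i) for i in range(32)]      # e.g. a 45-degree vector

for shape in ("4x4", "16x1"):
    print(shape, "horizontal hits:", hits(horizontal, shape),
          "diagonal hits:", hits(diagonal, shape))
```

The horizontal run favors the 16 × 1 shape, while the diagonal run favors 4 × 4, which is the kind of access-pattern dependence that motivates making the tile shape configurable.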

Patent
23 Oct 1987
TL;DR: A database configuration message is transmitted from a cell controlling computer to a database cache computer to designate certain data items to be monitored at one or more station-level computers.
Abstract: In a multi-tier computer system, a database configuration message is transmitted from a cell controlling computer to a database cache computer to designate certain data items to be monitored at one or more station-level computers. The database cache computer is connected via a local area network to the station-level computers. The station-level computers monitor the data items and generate unsolicited messages containing changed states for data items which have changed over the monitoring period. The database cache computer receives the unsolicited message and interprets the data therein to update the relevant data items. The unsolicited messages are sent back periodically without the need for polling by the database cache computer. If desired, the data in the unsolicited messages can be limited to data which has changed since the last update of the relevant data items.