
Showing papers on "Cache published in 1989"


Proceedings ArticleDOI
01 Nov 1989
TL;DR: An analytic model and an evaluation for file access in the V system show that leases of short duration provide good performance and the impact of leases on performance grows more significant in systems of larger scale and higher processor performance.
Abstract: Caching introduces the overhead and complexity of ensuring consistency, reducing some of its performance benefits. In a distributed system, caching must deal with the additional complications of communication and host failures. Leases are proposed as a time-based mechanism that provides efficient consistent access to cached data in distributed systems. Non-Byzantine failures affect performance, not correctness, with their effect minimized by short leases. An analytic model and an evaluation for file access in the V system show that leases of short duration provide good performance. The impact of leases on performance grows more significant in systems of larger scale and higher processor performance.

655 citations
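A minimal sketch of the lease idea described in the abstract, with hypothetical names (lease_t, lease_valid, lease_renew) and a simplified single-writer picture: a client serves reads from its cached copy only while its lease has not expired, and because a crashed holder's lease simply times out, short lease terms bound how long a non-Byzantine failure can delay a write.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical lease record attached to a cached file block. */
typedef struct {
    time_t granted_at;   /* server time when the lease was issued          */
    int    term_secs;    /* lease duration; short terms limit failure cost */
} lease_t;

/* Client side: the cached copy may be used only while the lease holds. */
bool lease_valid(const lease_t *l, time_t now)
{
    return now < l->granted_at + l->term_secs;
}

/* Renewal on contact with the server; a writer that cannot reach the
 * holder simply waits one full term for every outstanding lease to expire. */
void lease_renew(lease_t *l, time_t now)
{
    l->granted_at = now;
}
```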


Journal ArticleDOI
TL;DR: This paper discusses several techniques for improving I/O performance, including caches, battery-backed-up caches, and cache logging, and examines in particular detail an approach called log-structured file systems, where the file system's only representation on disk is in the form of an append-only log.
Abstract: CPU speeds are improving at a dramatic rate, while disk speeds are not. This technology shift suggests that many engineering and office applications may become so I/O-limited that they cannot benefit from further CPU improvements. This paper discusses several techniques for improving I/O performance, including caches, battery-backed-up caches, and cache logging. We then examine in particular detail an approach called log-structured file systems, where the file system's only representation on disk is in the form of an append-only log. Log-structured file systems potentially provide order-of-magnitude improvements in write performance. When log-structured file systems are combined with arrays of small disks (which provide high bandwidth) and large main-memory file caches (which satisfy most read accesses), we believe it will be possible to achieve 1000-fold improvements in I/O performance over today's systems.

409 citations
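A rough sketch of the append-only-log idea from the abstract, assuming a hypothetical in-memory index that maps each logical file block to its most recent position in the log; every write is appended, so the disk sees sequential traffic regardless of which blocks change (segment cleaning and crash recovery are omitted).

```c
#include <string.h>

#define BLOCK_SIZE 512
#define MAX_BLOCKS 1024

/* The on-disk log, modeled here as an in-memory array for illustration. */
static char log_area[MAX_BLOCKS][BLOCK_SIZE];
static long log_tail = 0;                 /* next free slot at the end of the log */
static long index_map[MAX_BLOCKS];        /* logical block number -> log position */

/* Writes always append; the index is updated so the newest copy wins. */
void lfs_write(long block_no, const char *data)
{
    memcpy(log_area[log_tail], data, BLOCK_SIZE);
    index_map[block_no] = log_tail++;     /* no cleaning or bounds handling shown */
}

/* Reads follow the index to wherever the block was last written. */
const char *lfs_read(long block_no)
{
    return log_area[index_map[block_no]];
}
```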


Journal ArticleDOI
TL;DR: Black-capped chickadees and other food-storing birds recover their scattered caches by remembering the spatial locations of cache sites, and hippocampal aspiration reduced the accuracy of cache recovery by chickadees to the chance rate, but it did not reduce the amount of caching or the number of attempts to recover caches.
Abstract: Black-capped chickadees and other food-storing birds recover their scattered caches by remembering the spatial locations of cache sites. Bilateral hippocampal aspiration reduced the accuracy of cache recovery by chickadees to the chance rate, but it did not reduce the amount of caching or the number of attempts to recover caches. In a second experiment, hippocampal aspiration dissociated performance of a task requiring memory for places from performance of a task requiring memory for cues associated with food, disrupting the former but not the latter.

367 citations


Journal ArticleDOI
TL;DR: An analytical cache model is developed that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval.
Abstract: Trace-driven simulation and hardware measurement are the techniques most often used to obtain accurate performance figures for caches. The former requires a large amount of simulation time to evaluate each cache configuration while the latter is restricted to measurements of existing caches. An analytical cache model that uses parameters extracted from address traces of programs can efficiently provide estimates of cache performance and show the effects of varying cache parameters. By representing the factors that affect cache performance, we develop an analytical model that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval. The predicted values closely approximate the results of trace-driven simulations, while requiring only a small fraction of the computation cost.

345 citations


Patent
17 Nov 1989
TL;DR: A multi-processor system and method, arranged in one embodiment as an image and graphics processor, is presented, although it does not address the problem of multiple processors sharing the same memory.
Abstract: A multi-processor system and method arranged, in one embodiment, as an image and graphics processor. The multiprocessor system includes several individual processors all having communication links to several memories. Additional instruction memories are dedicated individually as cache memories to particular processors so that the processors can function in the multiple instruction, multiple data (MIMD) mode. When the processors function in the single instruction, multiple data mode (SIMD) the dedicated memories are reassigned for access by all of the processors for data. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

288 citations


Journal ArticleDOI
01 Apr 1989
TL;DR: A set of efficient primitives for process synchronization in multiprocessors that make use of synchronization bits to provide a simple mechanism for mutual exclusion and to implement Fetch and Add with combining in software rather than hardware is proposed.
Abstract: This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the interconnect, and (in one case) that the interconnect supports broadcast. The primitives make use of synchronization bits (syncbits) to provide a simple mechanism for mutual exclusion. The proposed implementation of the primitives includes efficient (i.e. local) busy-waiting for syncbits. In addition, a hardware-supported mechanism for maintaining a first-come first-serve queue of requests for a syncbit is proposed. This queueing mechanism allows for a very efficient implementation of, as well as fair access to, binary semaphores. We also propose to implement Fetch and Add with combining in software rather than hardware. This allows an architecture to scale to a large number of processors while avoiding the cost of hardware combining. Scenarios for common synchronization events such as work queues and barriers are presented to demonstrate the generality and ease of use of the proposed primitives. The efficient implementation of the primitives is simpler if the multiprocessor has a hardware cache-consistency protocol. To illustrate this point, we outline how the primitives would be implemented in the Multicube multiprocessor [GoWo88].

277 citations
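A hedged sketch of the mutual-exclusion flavor of the proposal: one synchronization bit acquired with an atomic exchange, with the busy-wait done on an ordinary cacheable read so spinning stays local until release. The C11 atomics are a modern software stand-in for the paper's hardware syncbits; names are illustrative, and the queueing and software-combining mechanisms are not shown.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One syncbit protecting a critical section (illustrative only). */
static atomic_bool syncbit = false;

void syncbit_acquire(void)
{
    for (;;) {
        /* Local busy-wait: this read spins on the processor's cached copy
         * and generates no interconnect traffic until the bit changes.   */
        while (atomic_load_explicit(&syncbit, memory_order_relaxed))
            ;
        /* Attempt the real test-and-set only once the bit looks free. */
        if (!atomic_exchange_explicit(&syncbit, true, memory_order_acquire))
            return;
    }
}

void syncbit_release(void)
{
    atomic_store_explicit(&syncbit, false, memory_order_release);
}
```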


Patent
06 Jun 1989
TL;DR: In this article, a super-scaler processor with branch-prediction information is described, where each instruction cache block stored in the instruction cache memory includes branch prediction information fields in addition to instruction fields, which indicate the address of the instruction block's successor and information indicating the location of a branch instruction within an instruction block.
Abstract: A super-scaler processor is disclosed wherein branch-prediction information is provided within an instruction cache memory. Each instruction cache block stored in the instruction cache memory includes branch-prediction information fields in addition to instruction fields, which indicate the address of the instruction block's successor and information indicating the location of a branch instruction within the instruction block. Thus, the next cache block can be easily fetched without waiting on a decoder or execution unit to indicate the proper fetch action to be taken for correctly predicted branching.

254 citations
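A rough data-layout sketch of what the patent describes: each instruction-cache block carries, next to the instruction words, the predicted address of its successor block and the position of the branch inside the block, so the fetch unit can chase blocks without waiting for decode. Field names and sizes below are assumptions.

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 4

/* Hypothetical instruction-cache block with branch-prediction fields. */
typedef struct {
    uint32_t tag;                     /* block address tag                     */
    uint32_t instr[WORDS_PER_BLOCK];  /* the cached instruction words          */
    uint32_t successor_addr;          /* predicted address of the next block   */
    uint8_t  branch_index;            /* which word in the block is the branch */
    uint8_t  valid;
} icache_block_t;

/* Fetch sketch: follow the stored successor instead of waiting on decode;
 * a misprediction would later repair successor_addr and refetch.          */
uint32_t next_fetch_address(const icache_block_t *blk, uint32_t fall_through)
{
    return blk->valid ? blk->successor_addr : fall_through;
}
```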


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding density, and this approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead.
Abstract: Increasing processor execution power requires a high instruction issue bandwidth, while decreased instruction encoding density and code-improving transformations cause code expansion. Therefore, the performance of the instruction memory hierarchy has become an important factor in overall system performance. An instruction placement algorithm has been implemented in the IMPACT-I (Illinois Microarchitecture Project using Advanced Compiler Technology - Stage I) C compiler to maximize the sequential and spatial localities, and to minimize mapping conflicts. This approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead. For ten realistic UNIX programs, we report low miss ratios (average 0.5%) and low memory traffic ratios (average 8%) for a 2048-byte, direct-mapped instruction cache using 64-byte blocks. This result compares favorably with the fully associative cache results reported by other researchers. We also present the effect of cache size, block size, block sectoring, and partial loading on the cache performance. The code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding density.

227 citations
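The placement algorithm itself is too involved for a short sketch, but the conflict it minimizes is simple to state: in a direct-mapped cache, two code blocks collide exactly when their addresses map to the same index, so the compiler lays frequently interleaved blocks at addresses whose indices differ. The check below is only an illustration of that mapping, using the 2048-byte, 64-byte-block configuration reported above; it is not the IMPACT-I algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE 2048                      /* bytes, as in the abstract */
#define BLOCK_SIZE 64                        /* bytes per cache block     */
#define NUM_SETS   (CACHE_SIZE / BLOCK_SIZE)

/* Index of a code address in a direct-mapped instruction cache. */
unsigned cache_index(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_SETS;
}

/* Two code blocks that alternate in the dynamic trace conflict iff they share
 * an index; the placement pass would relocate one so the indices differ.      */
bool mapping_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return cache_index(addr_a) == cache_index(addr_b);
}
```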


Journal ArticleDOI
S. McFarling1
01 Apr 1989
TL;DR: This paper presents an optimization algorithm for reducing instruction cache misses that uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future.
Abstract: This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of 10 programs for various cache sizes. The improvement depends on cache size. For a 512 word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as increasing the cache size by a factor of three. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.

217 citations


Dissertation
01 Jan 1989
TL;DR: Measurements of actual supercomputer cache performance have not previously been undertaken; PFC-Sim, a program-driven event tracing facility that can simulate data cache performance of very long programs, is used to measure the performance of various cache structures.
Abstract: Measurements of actual supercomputer cache performance have not previously been undertaken. PFC-Sim is a program-driven event tracing facility that can simulate data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, allowing very long traces to be used. Programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures. PFC-Sim was used to measure the cache performance of array references in a benchmark set of supercomputer applications, RiCEPS. Data cache hit ratios varied on average between 70% for a 16K cache and 91% for a 256K cache. Programs with very large working sets generate poor cache performance even with large caches. The hit ratios of individual references are measured to be either 0% or 100%. By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the number of loop iterations which can execute without filling the cache, the overflow iteration. The overflow iteration combined with the dependence graph can be used to determine at each reference whether execution will result in hits or misses. Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often do this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked when the components of dependence vectors are bounded. When the cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction to preload values into the cache. A cache load resembles a register load in structure, but it does not block computation and only moves the addressed data into the cache, where a later register load will find it. The compiler can inform the cache (on average) over 100 cycles before a load is required. Cache misses can be serviced in parallel with computation.

210 citations
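A hedged sketch of the dissertation's "cache load" idea in a vector loop: a non-blocking prefetch is issued far enough ahead of the use that the miss latency overlaps with computation. The GCC/Clang builtin __builtin_prefetch is used here as a modern stand-in for the special instruction the text describes, and the distance of 100 iterations merely echoes the "100 cycles ahead" figure.

```c
#define PREFETCH_DISTANCE 100   /* issue the cache load well before the use */

void scaled_sum(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Non-blocking "cache load": starts moving the line toward the
             * cache without stalling; the real register load happens later. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE]);
        }
        c[i] = 2.0 * a[i] + b[i];
    }
}
```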


Patent
Gregor Stephen Lee1
17 Jan 1989
TL;DR: In this paper, a hierarchical first-level and second-level memory system includes a first level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), and a second level store queue (26A2).
Abstract: A multiprocessor system includes a system of store queues and write buffers in a hierarchical first level and second level memory system including a first level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), a second level store queue (26A2) for storing the instructions and/or data from the first level store queue (18B1) and a plurality of write buffers (26A2(A); 26A2(B)) for storing the instructions and/or data from the second level store queue prior to storage in a second level of cache. The multiprocessor system includes hierarchical levels of caches and write buffers. When stored in the second level write buffers, access to the shared second level cache is requested; and, when access is granted, the data and/or instructions are moved from the second level write buffers to the shared second level cache. When stored in the shared second level cache, corresponding obsolete entries in the first level of cache are invalidated before any other processor "sees" the obsolete data and the new data and/or instructions are over-written in the first level of cache.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: The results indicate that the benefits of the extensions to the protocols are limited, and read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache.
Abstract: Write-invalidate and write-broadcast coherency protocols have been criticized for being unable to achieve good bus performance across all cache configurations. In particular, write-invalidate performance can suffer as block size increases, and large cache sizes will hurt write-broadcast. Read-broadcast and competitive snooping extensions to the protocols have been proposed to solve each problem. Our results indicate that the benefits of the extensions are limited. Read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache. The net effect can be an increase in total execution cycles. Competitive snooping benefits only those programs with high per-processor locality of reference to shared data. For programs characterized by inter-processor contention for shared addresses, competitive snooping can degrade performance by causing a slight increase in bus utilization and total execution time.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: The extent to which multiple hardware contexts per processor can help to mitigate the negative effects of high latency is explored and it is shown that two or four contexts can achieve substantial performance gains over a single context.
Abstract: A fundamental problem that any scalable multiprocessor must address is the ability to tolerate high latency memory operations. This paper explores the extent to which multiple hardware contexts per processor can help to mitigate the negative effects of high latency. In particular, we evaluate the performance of a directory-based cache coherent multiprocessor using memory reference traces obtained from three parallel applications. We explore the case where there are a small fixed number (2-4) of hardware contexts per processor and the context switch overhead is low. In contrast to previously proposed approaches, we also use a very simple context switch criterion, namely a cache miss or a write-hit to shared data. Our results show that the effectiveness of multiple contexts depends on the nature of the applications, the context switch overhead, and the inherent latency of the machine architecture. Given reasonably low overhead hardware context switches, we show that two or four contexts can achieve substantial performance gains over a single context. For one application, the processor utilization increased by about 46% with two contexts and by about 80% with four contexts.

Patent
15 May 1989
TL;DR: In this paper, a bus snoop control method for maintaining coherency between a write-back cache and main memory during memory accesses by an alternate bus master is proposed.
Abstract: A bus snoop control method for maintaining coherency between a write-back cache and main memory during memory accesses by an alternate bus master. The method and apparatus incorporates an option to source `dirty` or altered data from the write-back cache to the alternate bus master during a memory read operation, and simultaneously invalidate `dirty` or altered data from the write-back cache. The method minimizes the number of cache accesses required to maintain coherency between the cache and main memory during page-out/page-in sequences initiated by the alternate bus master, thereby improving system performance.

Patent
18 Jan 1989
TL;DR: In this article, the authors present a user-oriented approach to flexible cache system design by specifying desired cache features through the setting of appropriate cache option bits, which allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.
Abstract: Methods and apparatus are disclosed for realizing an integrated cache unit which may be flexibly used for cache system design. The preferred embodiment of the invention comprises both a cache memory and a cache controller on a single chip. In accordance with an alternative embodiment of the invention, the cache memory may be externally located. Flexible cache system design is achieved by the specification of desired cache features through the setting of appropriate cache option bits. The disclosed methods and apparatus support this user-oriented approach to flexible system design. The actual setting of option bits may be performed under software control and allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.

Patent
06 Nov 1989
TL;DR: In this paper, a multiprocessor data processing system is implemented with processors, each of which may request for a temporary time the exclusive lock on an object which is stored on a data base.
Abstract: A multiprocessor data processing system is implemented with processors, each of which may request for a temporary time the exclusive lock on an object which is stored on a data base. To achieve this a lock processor synchronizes the locking and unlocking of the objects. The requesting processor directs the storage of the object from the data base into a selected high performance storage unit, where it has exclusive rights to modify or write into the object until the object is unlocked by the processor. An audit tape or disk records all modifications made to any object during a transaction. A non-volatile cache memory is inserted in the audit trail to store a before-look image of the object that resides in the high performance storage unit. Data compaction occurs by comparison of the before-look image with an after-look image to provide a difference image, which is supplied to an audit buffer that is coupled to the audit tape. The locking processor may unlock the secured object once the after-look image has been committed from either a stored version in the non-volatile cache or from a high performance main memory unit to the data base disk. The difference image and the after-look image associated with the difference image may then be stored in the non-volatile cache, and provided to the audit tape or disk and the data base disk in a sequence which is independent of the operating sequence of the requesting processor.

Journal ArticleDOI
01 Apr 1989
TL;DR: Traces of parallel programs are used to evaluate the cache and bus performance of shared memory multiprocessors in which coherency is maintained by a write-invalidate protocol, and show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs.
Abstract: Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: This work proposes a class of adaptive back-off methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization variables and shows that when the number of processors participating in a barrier synchronization is small, reductions of 20 percent to over 95 percent in synchronization traffic can be achieved at no extra cost.
Abstract: Shared-memory multiprocessors commonly use shared variables for synchronization. Our simulations of real parallel applications show that large-scale cache-coherent multiprocessors suffer significant amounts of invalidation traffic due to synchronization. Large multiprocessors that do not cache synchronization variables are often more severely impacted. If this synchronization traffic is not reduced or managed adequately, synchronization references can cause severe congestion in the network. We propose a class of adaptive back-off methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization variables. These methods use synchronization state to reduce polling of synchronization variables. Our simulations show that when the number of processors participating in a barrier synchronization is small compared to the time of arrival of the processors, reductions of 20 percent to over 95 percent in synchronization traffic can be achieved at no extra cost. In other situations adaptive backoff techniques result in a tradeoff between reduced network accesses and increased processor idle time.
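A minimal sketch of adaptive back-off applied to a barrier: each unsuccessful poll of the synchronization variable lengthens the local delay before the next poll, cutting traffic to the shared location at the cost of some added idle time. The exponential policy, names, and single-use barrier below are illustrative; the paper studies a class of such back-off methods.

```c
#include <stdatomic.h>

static atomic_int arrived = 0;   /* processors that have reached the barrier */

/* Single-use barrier with adaptive (here, exponential) back-off between polls. */
void barrier_wait(int nprocs)
{
    atomic_fetch_add(&arrived, 1);

    int delay = 1;
    while (atomic_load(&arrived) < nprocs) {
        for (volatile int spin = 0; spin < delay; spin++)
            ;                        /* wait locally instead of re-polling      */
        if (delay < (1 << 16))
            delay *= 2;              /* back off further after each failed poll */
    }
}
```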

Patent
30 Oct 1989
TL;DR: In this article, the authors describe a coherency protocol for a multiprocessor network where every processor has its own private cache and bus interface and the network is connected via a common system bus.
Abstract: This disclosure describes a snooping coherency protocol for a multiprocessor network wherein every processor has its own private cache and bus interface means and the network is connected via a common system bus. Each processor has its own cache directory and image directory that duplicate each other non-atomically. The snooping protocol utilizes the duality of directories coupled with the non-atomicity of directory updates to maximize processor-cache availability and minimize processor-cache access times thus supporting high performance architectures.

Patent
18 Jan 1989
TL;DR: In this paper, a cache block status field is provided for each cache block to indicate the block's state, such as shared or exclusive, and to control whether a write hit to the block is handled in a write-through or copy-back mode; the field may be updated either by a TLB write policy field contained within a translation look-aside buffer entry, or by a second input, independent of the TLB entry, which may be provided from the system on a line basis.
Abstract: A computer system having a cache memory subsystem which allows flexible setting of caching policies on a page basis and a line basis. A cache block status field is provided for each cache block to indicate the cache block's state, such as shared or exclusive. The cache block status field controls whether the cache control unit operates in a write-through write mode or in a copy-back write mode when a write hit access to the block occurs. The cache block status field may be updated by either a TLB write policy field contained within a translation look-aside buffer entry which corresponds to the page of the access, or by a second input independent of the TLB entry which may be provided from the system on a line basis.

Patent
John T. Robinson1
08 Aug 1989
TL;DR: In this paper, a cache directory keeps track of which blocks are in the cache, the number of times each block in cache has been referenced after aging at least a predetermined amount (reference count), and the age of each block since the last reference to that block, for use in determining which of the cache blocks is replaced when there is a cache miss.
Abstract: A cache directory keeps track of which blocks are in the cache, the number of times each block in the cache has been referenced after aging at least a predetermined amount (reference count), and the age of each block since the last reference to that block, for use in determining which of the cache blocks is replaced when there is a cache miss. At least one preselected age boundary threshold is utilized to determine when to adjust the reference count for a given block on a cache hit and to select a cache block for replacement as a function of reference count value and block age.
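A rough sketch of the bookkeeping the patent describes: each directory entry carries a reference count that is bumped only when the block has aged past a threshold since its last reference, plus the block's current age, and the victim on a miss is chosen from both. The age threshold, the scoring function, and all names below are assumptions used purely for illustration.

```c
#include <limits.h>

#define NUM_BLOCKS    64
#define AGE_THRESHOLD 16    /* a block must age this much before a hit counts */

typedef struct {
    int ref_count;   /* references seen after aging at least AGE_THRESHOLD */
    int age;         /* time since the last reference to this block        */
} dir_entry_t;

static dir_entry_t directory[NUM_BLOCKS];

/* On a hit: only re-references to sufficiently aged blocks raise the count. */
void on_cache_hit(dir_entry_t *e)
{
    if (e->age >= AGE_THRESHOLD)
        e->ref_count++;
    e->age = 0;
}

/* On a miss: pick a victim as a function of count and age; this particular
 * score (old and rarely re-referenced loses) is just one plausible choice.  */
int pick_victim(void)
{
    int victim = 0, worst = INT_MIN;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        int score = directory[i].age - 8 * directory[i].ref_count;
        if (score > worst) { worst = score; victim = i; }
    }
    return victim;
}
```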

Proceedings ArticleDOI
01 Apr 1989
TL;DR: Alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache are examined; they are expected to suit caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
Abstract: The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache. One approach scans tags serially from most-recently used to least-recently used. Another uses a partial compare of a few bits from each tag to reduce the number of tags that must be examined serially. The drawback of both approaches is that they increase cache access time by a factor of two or more over the traditional implementation of set-associativity, making them inappropriate for cache designs in which a fast access time is crucial (e.g. level one caches, caches directly servicing processor requests). These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower access time of these approaches can be tolerated. We expect these conditions to be true for caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
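A hedged sketch of the first alternative above: the set's tags are kept in MRU-to-LRU order and probed one at a time with direct-mapped-style hardware, so a hit on the most recently used way costs one probe and a miss costs a full scan. The structure and names are illustrative; the partial-compare variant is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    uint32_t tag[WAYS];    /* kept in MRU (index 0) to LRU (index WAYS-1) order */
    bool     valid[WAYS];
} cache_set_t;

/* Probe tags serially, most recently used first; on a hit, rotate the matching
 * way to the MRU slot. Returns the number of probes, which models access time. */
int serial_lookup(cache_set_t *s, uint32_t tag, bool *hit)
{
    for (int i = 0; i < WAYS; i++) {
        if (s->valid[i] && s->tag[i] == tag) {
            for (int j = i; j > 0; j--) {                 /* maintain MRU order */
                uint32_t t = s->tag[j]; s->tag[j] = s->tag[j - 1]; s->tag[j - 1] = t;
                bool v = s->valid[j]; s->valid[j] = s->valid[j - 1]; s->valid[j - 1] = v;
            }
            *hit = true;
            return i + 1;
        }
    }
    *hit = false;
    return WAYS;    /* every tag was examined before declaring a miss */
}
```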

Proceedings ArticleDOI
01 Apr 1989
TL;DR: It is shown how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level and how this organization has a performance advantage over a hierarchy of physically-add addressed caches in a multiprocessor environment.
Abstract: We propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to match the processor speed. The virtually-addressed cache is backed up by a large physically-addressed cache; this second-level cache provides a high hit ratio and greatly reduces memory traffic. We show how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level. Moreover, the second-level cache can be used to shield the virtually-addressed first-level cache from irrelevant cache coherence interference. Finally, simulation results show that this organization has a performance advantage over a hierarchy of physically-addressed caches in a multiprocessor environment.

Patent
15 May 1989
TL;DR: In this article, the authors propose a data cache controller that uses the multiple dirty bits to determine the quantity and type of accesses required to write the dirty data to memory, so as to minimize the number of memory accesses used to unload a dirty entry.
Abstract: A data cache capable of operation in a write-back (copyback) mode. The data cache design provides a mechanism for making the data cache coherent with memory, without writing the entire cache entry to memory, thereby reducing bus utilization. Each data cache entry is comprised of three items: data, a tag address, and a mixed size status field. The mixed size status fields provide one bit to indicate the validity of the data cache entry and multiple bits to indicate if the entry contains data that has not been written to memory (dirtiness). Multiple dirty bits provide a data cache controller with sufficient information to minimize the number of memory accesses used to unload a dirty entry. The data cache controller uses the multiple dirty bits to determine the quantity and type of accesses required to write the dirty data to memory. The portions of the entry being replaced that are clean (unmodified) are not written to memory.
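A minimal sketch of the status layout the abstract describes: one valid bit for the entry plus a dirty bit per sub-block, so that when a dirty entry is unloaded only the modified sub-blocks are written back. Sub-block count, sizes, and names here are assumptions, not the patent's encoding.

```c
#include <stdint.h>
#include <string.h>

#define SUBBLOCKS      4
#define SUBBLOCK_BYTES 8

typedef struct {
    uint32_t tag;
    uint8_t  valid;                           /* single validity bit for the entry    */
    uint8_t  dirty;                           /* one dirty bit per sub-block (4 used) */
    uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
} dcache_entry_t;

/* Unload a replaced entry: clean sub-blocks are never written to memory. */
void unload_entry(dcache_entry_t *e, uint8_t *memory_block)
{
    for (int i = 0; i < SUBBLOCKS; i++) {
        if (e->dirty & (1u << i))
            memcpy(memory_block + i * SUBBLOCK_BYTES, e->data[i], SUBBLOCK_BYTES);
    }
    e->dirty = 0;                             /* entry is coherent with memory again */
}
```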

Patent
14 Sep 1989
TL;DR: In this paper, the authors propose a scheme to optimize the amount of data to be promoted to the cache from a backing store in anticipation of future host processor references, based on the examination of a group of the tracks in a cache.
Abstract: The disclosure relates to sequential performance of a cached data storage subsystem with minimal control signal processing. Sequential access is first detected by monitoring and examining the quantity of data accessed per unit of data storage (track) across a set of contiguously addressable tracks. Since the occupancy of the data in the cache is usually time limited, this examination provides an indication of the rate of sequential processing for a data set, i.e., a data set is usually being processed in contiguously addressable data storage units of a data storage system. Based upon the examination of a group of the tracks in a cache, the amount of data to be promoted to the cache from a backing store in anticipation of future host processor references is optimized. A promotion factor is calculated by combining the access extents monitored in the individual data storage areas and is expressed as a number of track units to be promoted. The examination of the group of track units and the implementation of the data promotion and demotion (early cast-out) are synchronized, which results in a synergistic effect for increasing throughput of the cache for sequentially-processed data. A limit of promotion is determined to create a window of sequential data processing.

Journal ArticleDOI
Dominique Thiebaut1
TL;DR: Fractal geometry is proposed as a powerful measure of program behavior, and its application to the prediction of the miss ratio of programs in fully associative caches is presented.
Abstract: Fractal geometry is proposed as a powerful measure of program behavior, and its application to the prediction of the miss ratio of programs in fully associative caches is presented. Programs are modeled as one-dimensional fractal random-walks. The fractal cache model is based on the parameterization of a program trace by a small number of constants, one of which is the fractal dimension of the program. The model is validated by trace-driven simulations of several program traces. With this model, it is possible to read the trace of a program once, and then predict the behavior of the miss ratio curve of that program in fully associative caches of varying sizes.
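The paper's parameterization is not reproduced here, but the underlying picture is easy to simulate: treat the reference stream as a one-dimensional random walk with heavy-tailed (power-law) step lengths and measure the miss ratio of a fully associative LRU cache on it. The sketch below is only an illustration of that picture under assumed constants (theta, cache size); it is not the published fractal model or its prediction formula.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINES 256
#define REFERENCES  100000

/* Fully associative LRU cache over line addresses (linear search suffices here). */
static long lru[CACHE_LINES];
static int  used = 0;

static int access_line(long line)
{
    for (int i = 0; i < used; i++) {
        if (lru[i] == line) {                       /* hit: move to MRU position */
            for (int j = i; j > 0; j--) lru[j] = lru[j - 1];
            lru[0] = line;
            return 1;
        }
    }
    if (used < CACHE_LINES) used++;                 /* miss: insert, evict LRU */
    for (int j = used - 1; j > 0; j--) lru[j] = lru[j - 1];
    lru[0] = line;
    return 0;
}

int main(void)
{
    double addr = 0.0, theta = 1.5;                 /* assumed tail exponent       */
    long hits = 0;
    for (long n = 0; n < REFERENCES; n++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double step = pow(u, -1.0 / theta);         /* heavy-tailed jump length    */
        addr += (rand() & 1) ? step : -step;        /* one-dimensional random walk */
        hits += access_line((long)addr);
    }
    printf("simulated miss ratio = %.3f\n", 1.0 - (double)hits / REFERENCES);
    return 0;
}
```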

Patent
26 May 1989
TL;DR: In this paper, an instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested, and one of a plurality of replacement schemes is selected for swapping a data block out of the cache.
Abstract: An instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested. Based on the cache control specifier, one of a plurality of replacement schemes is selected for swapping a data block out of the cache.

Patent
19 Jun 1989
TL;DR: In this article, the linker inserts a very short stylized subroutine call to a routine that logs each data memory reference in a large trace buffer; when the trace buffer fills up with recorded memory references, the contents of the buffer are processed, either by dumping them to an output device to empty the buffer or by running a cache simulation routine to analyze the data.
Abstract: The present invention utilizes link time code modification to instrument the code which is to be executed, typically comprising a plurality of kernel operations and user programs. When the code is instrumented, wherever a data memory reference appears, the linker inserts a very short stylized subroutine call to a routine that logs the reference in a large trace buffer. The same call is inserted at the beginning of each basic block to record instruction references. When the trace buffer fills up with recorded memory references, the contents of the buffer are processed, either by dumping the contents to an output device to empty the trace buffer, or by running a cache simulation routine to analyze the data. The results of the analysis are stored, rather than the entire output of the tracing program.
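A hedged sketch of the logging side of this scheme: the short routine the linker calls at each reference appends the address to a large trace buffer and, when the buffer fills, hands it to a consumer that either dumps it or runs a cache simulation over it. All names are illustrative, and the consumer here is a stub.

```c
#include <stdint.h>

#define TRACE_ENTRIES (1 << 20)

static uint32_t trace_buf[TRACE_ENTRIES];
static long     trace_len = 0;

/* Consumer invoked when the buffer is full: dump the entries to an output
 * device or feed them to a cache-simulation pass (both omitted in this stub). */
static void process_trace(const uint32_t *buf, long len)
{
    (void)buf;
    (void)len;
}

/* The short stylized routine called at every data reference and basic block. */
void log_reference(uint32_t addr)
{
    trace_buf[trace_len++] = addr;
    if (trace_len == TRACE_ENTRIES) {
        process_trace(trace_buf, trace_len);   /* analyze or dump, then reuse */
        trace_len = 0;
    }
}
```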

Patent
03 Feb 1989
TL;DR: In this paper, lock granularity is defined at the level of individual cache blocks for the CPUs, with the cache blocks also representing the unit of memory allocation in the computer system; a lock directory is defined by a plurality of lock bits so that addresses in the same block of memory are mapped to the same location in the lock directory.
Abstract: All monitoring and control of locked memory access requests in a multiprocessing computer system is handled by a system control unit (SCU) which controls the parallel operation of a plurality of central processing units (CPUs) and I/O units relative to a common main memory. Locking granularity is defined at the level of individual cache blocks for the CPUs, and the cache blocks also represent the unit of memory allocation in the computer system. The SCU is provided with a lock directory defined by a plurality of lock bits so that addresses in the same block of memory are mapped to the same location in the lock directory. Incoming lock requests for a given memory location are processed by interrogating the corresponding lock bit in the lock directory in the SCU by using the associated memory address as an index into the directory. If the lock bit is not set, the lock request is granted. The lock bit is subsequently set and maintained in that state until the unit requesting the lock has completed its memory access operation and sends an "unlock" request. If the interrogated lock bit is found to be set, the lock request is denied and the requesting port is notified of the denial. Fairness for the processing of denied lock requests is ensured by a reserve list onto which denied requests are sequentially positioned on a first-come-first-served basis.
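A small sketch of the lock directory as described: a memory address maps, via its cache-block address, to a single lock bit; a lock request is granted only if the bit is clear, and a denied requester would be placed on the reserve list (not shown). Directory size, the index function, and names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BLOCK_BYTES 64
#define LOCK_DIR_ENTRIES  4096

static bool lock_dir[LOCK_DIR_ENTRIES];     /* one lock bit per directory slot */

/* Addresses in the same cache block map to the same lock bit. */
static unsigned lock_index(uint32_t addr)
{
    return (addr / CACHE_BLOCK_BYTES) % LOCK_DIR_ENTRIES;
}

/* Returns true if the lock is granted; a denial would queue the requester on
 * the first-come-first-served reserve list, which is omitted from this sketch. */
bool request_lock(uint32_t addr)
{
    unsigned i = lock_index(addr);
    if (lock_dir[i])
        return false;                        /* bit already set: deny the request */
    lock_dir[i] = true;                      /* grant and record the lock         */
    return true;
}

void release_lock(uint32_t addr)
{
    lock_dir[lock_index(addr)] = false;      /* the "unlock" request clears the bit */
}
```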

Patent
05 Jun 1989
TL;DR: In this article, a block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache, and each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information.
Abstract: A controller (10) for use with a hard disk (38) or other mass storage medium provides a memory cache (36). A block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache (36). Each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information. Upon a read command, an index is generated corresponding to the address requested by the host computer, and the tag information is matched with a tag generated from the address. Each set is checked until a hit occurs or a miss occurs in every set. After each miss, the usage information (50) corresponding to the memory group (44) is decremented. When reading information from the storage device (32) to the memory cache (36), the controller (10) may selectively read additional sectors. The number of sectors read from the storage device may be selectively controlled by the user or the host processor. Further, a cap may be provided to set a maximum number of sectors to be read.
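A rough sketch of the lookup path the patent outlines: the requested disk block yields an index and a tag, each set's descriptor at that index is checked for a tag match, and on a miss in a set that entry's usage value is decremented so it becomes a better replacement candidate. Field names and the particular tag/index split are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS       4
#define GROUPS_PER_SET 256

typedef struct {
    uint32_t tag;
    uint32_t cache_address;   /* where the cached sectors live in the memory cache */
    int      usage;           /* decremented on misses; guides replacement          */
    bool     valid;
} block_descriptor_t;

static block_descriptor_t bdt[NUM_SETS][GROUPS_PER_SET];

/* Check every set at the index derived from the requested disk block. */
bool cache_lookup(uint32_t disk_block, uint32_t *cache_address)
{
    uint32_t index = disk_block % GROUPS_PER_SET;
    uint32_t tag   = disk_block / GROUPS_PER_SET;

    for (int s = 0; s < NUM_SETS; s++) {
        block_descriptor_t *g = &bdt[s][index];
        if (g->valid && g->tag == tag) {
            *cache_address = g->cache_address;      /* hit in this set */
            return true;
        }
        if (g->valid && g->usage > 0)
            g->usage--;                             /* miss here: age the entry */
    }
    return false;                                   /* miss in every set */
}
```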