
Showing papers on "Cache published in 1989"


Proceedings ArticleDOI
01 Nov 1989
TL;DR: An analytic model and an evaluation for file access in the V system show that leases of short duration provide good performance and the impact of leases on performance grows more significant in systems of larger scale and higher processor performance.
Abstract: Caching introduces the overhead and complexity of ensuring consistency, reducing some of its performance benefits. In a distributed system, caching must deal with the additional complications of communication and host failures. Leases are proposed as a time-based mechanism that provides efficient consistent access to cached data in distributed systems. Non-Byzantine failures affect performance, not correctness, with their effect minimized by short leases. An analytic model and an evaluation for file access in the V system show that leases of short duration provide good performance. The impact of leases on performance grows more significant in systems of larger scale and higher processor performance.

655 citations
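A minimal sketch of the lease idea described in the abstract, with hypothetical names (lease_t, lease_valid, lease_renew) and a simplified single-writer picture: a client serves reads from its cached copy only while its lease has not expired, and because a crashed holder's lease simply times out, short lease terms bound how long a non-Byzantine failure can delay a write.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical lease record attached to a cached file block. */
typedef struct {
    time_t granted_at;   /* server time when the lease was issued          */
    int    term_secs;    /* lease duration; short terms limit failure cost */
} lease_t;

/* Client side: the cached copy may be used only while the lease holds. */
bool lease_valid(const lease_t *l, time_t now)
{
    return now < l->granted_at + l->term_secs;
}

/* Renewal on contact with the server; a writer that cannot reach the
 * holder simply waits one full term for every outstanding lease to expire. */
void lease_renew(lease_t *l, time_t now)
{
    l->granted_at = now;
}
```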


Journal ArticleDOI
TL;DR: This paper discusses several techniques for improving I/O performance, including caches, battery-backed-up caches, and cache logging, and examines in particular detail an approach called log-structured file systems, where the file system's only representation on disk is in the form of an append-only log.
Abstract: CPU speeds are improving at a dramatic rate, while disk speeds are not. This technology shift suggests that many engineering and office applications may become so I/O-limited that they cannot benefit from further CPU improvements. This paper discusses several techniques for improving I/O performance, including caches, battery-backed-up caches, and cache logging. We then examine in particular detail an approach called log-structured file systems, where the file system's only representation on disk is in the form of an append-only log. Log-structured file systems potentially provide order-of-magnitude improvements in write performance. When log-structured file systems are combined with arrays of small disks (which provide high bandwidth) and large main-memory file caches (which satisfy most read accesses), we believe it will be possible to achieve 1000-fold improvements in I/O performance over today's systems.

409 citations
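A rough sketch of the append-only-log idea from the abstract, assuming a hypothetical in-memory index that maps each logical file block to its most recent position in the log; every write is appended, so the disk sees sequential traffic regardless of which blocks change (segment cleaning and crash recovery are omitted).

```c
#include <string.h>

#define BLOCK_SIZE 512
#define MAX_BLOCKS 1024

/* The on-disk log, modeled here as an in-memory array for illustration. */
static char log_area[MAX_BLOCKS][BLOCK_SIZE];
static long log_tail = 0;                 /* next free slot at the end of the log */
static long index_map[MAX_BLOCKS];        /* logical block number -> log position */

/* Writes always append; the index is updated so the newest copy wins. */
void lfs_write(long block_no, const char *data)
{
    memcpy(log_area[log_tail], data, BLOCK_SIZE);
    index_map[block_no] = log_tail++;     /* no cleaning or bounds handling shown */
}

/* Reads follow the index to wherever the block was last written. */
const char *lfs_read(long block_no)
{
    return log_area[index_map[block_no]];
}
```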


Journal ArticleDOI
TL;DR: Black-capped chickadees and other food-storing birds recover their scattered caches by remembering the spatial locations of cache sites, and hippocampal aspiration reduced the accuracy of cache recovery by chickadees to the chance rate, but it did not reduce the amount of caching or the number of attempts to recover caches.
Abstract: Black-capped chickadees and other food-storing birds recover their scattered caches by remembering the spatial locations of cache sites. Bilateral hippocampal aspiration reduced the accuracy of cache recovery by chickadees to the chance rate, but it did not reduce the amount of caching or the number of attempts to recover caches. In a second experiment, hippocampal aspiration dissociated performance of a task requiring memory for places from performance of a task requiring memory for cues associated with food, disrupting the former but not the latter.

367 citations


Journal ArticleDOI
TL;DR: An analytical cache model is developed that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval.
Abstract: Trace-driven simulation and hardware measurement are the techniques most often used to obtain accurate performance figures for caches. The former requires a large amount of simulation time to evaluate each cache configuration while the latter is restricted to measurements of existing caches. An analytical cache model that uses parameters extracted from address traces of programs can efficiently provide estimates of cache performance and show the effects of varying cache parameters. By representing the factors that affect cache performance, we develop an analytical model that gives miss rates for a given trace as a function of cache size, degree of associativity, block size, subblock size, multiprogramming level, task switch interval, and observation interval. The predicted values closely approximate the results of trace-driven simulations, while requiring only a small fraction of the computation cost.

345 citations


Patent
17 Nov 1989
TL;DR: A multi-processor system and method, arranged in one embodiment as an image and graphics processor, is presented, although it does not address the problem of multiple processors sharing the same memory.
Abstract: A multi-processor system and method arranged, in one embodiment, as an image and graphics processor. The multiprocessor system includes several individual processors all having communication links to several memories. Additional instruction memories are dedicated individually as cache memories to particular processors so that the processors can function in the multiple instruction, multiple data (MIMD) mode. When the processors function in the single instruction, multiple data mode (SIMD) the dedicated memories are reassigned for access by all of the processors for data. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

288 citations


Journal ArticleDOI
01 Apr 1989
TL;DR: A set of efficient primitives for process synchronization in multiprocessors that make use of synchronization bits to provide a simple mechanism for mutual exclusion and to implement Fetch and Add with combining in software rather than hardware is proposed.
Abstract: This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the interconnect, and (in one case) that the interconnect supports broadcast. The primitives make use of synchronization bits (syncbits) to provide a simple mechanism for mutual exclusion. The proposed implementation of the primitives includes efficient (i.e. local) busy-waiting for syncbits. In addition, a hardware-supported mechanism for maintaining a first-come first-serve queue of requests for a syncbit is proposed. This queueing mechanism allows for a very efficient implementation of, as well as fair access to, binary semaphores. We also propose to implement Fetch and Add with combining in software rather than hardware. This allows an architecture to scale to a large number of processors while avoiding the cost of hardware combining. Scenarios for common synchronization events such as work queues and barriers are presented to demonstrate the generality and ease of use of the proposed primitives. The efficient implementation of the primitives is simpler if the multiprocessor has a hardware cache-consistency protocol. To illustrate this point, we outline how the primitives would be implemented in the Multicube multiprocessor [GoWo88].

277 citations
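A hedged sketch of the mutual-exclusion flavor of the proposal: one synchronization bit acquired with an atomic exchange, with the busy-wait done on an ordinary cacheable read so spinning stays local until release. The C11 atomics are a modern software stand-in for the paper's hardware syncbits; names are illustrative, and the queueing and software-combining mechanisms are not shown.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One syncbit protecting a critical section (illustrative only). */
static atomic_bool syncbit = false;

void syncbit_acquire(void)
{
    for (;;) {
        /* Local busy-wait: this read spins on the processor's cached copy
         * and generates no interconnect traffic until the bit changes.   */
        while (atomic_load_explicit(&syncbit, memory_order_relaxed))
            ;
        /* Attempt the real test-and-set only once the bit looks free. */
        if (!atomic_exchange_explicit(&syncbit, true, memory_order_acquire))
            return;
    }
}

void syncbit_release(void)
{
    atomic_store_explicit(&syncbit, false, memory_order_release);
}
```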


Patent
06 Jun 1989
TL;DR: In this article, a super-scaler processor with branch-prediction information is described, where each instruction cache block stored in the instruction cache memory includes branch prediction information fields in addition to instruction fields, which indicate the address of the instruction block's successor and information indicating the location of a branch instruction within an instruction block.
Abstract: A super-scaler processor is disclosed wherein branch-prediction information is provided within an instruction cache memory. Each instruction cache block stored in the instruction cache memory includes branch-prediction information fields in addition to instruction fields, which indicate the address of the instruction block's successor and information indicating the location of a branch instruction within the instruction block. Thus, the next cache block can be easily fetched without waiting on a decoder or execution unit to indicate the proper fetch action to be taken for correctly predicted branching.

254 citations
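A rough data-layout sketch of what the patent describes: each instruction-cache block carries, next to the instruction words, the predicted address of its successor block and the position of the branch inside the block, so the fetch unit can chase blocks without waiting for decode. Field names and sizes below are assumptions.

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 4

/* Hypothetical instruction-cache block with branch-prediction fields. */
typedef struct {
    uint32_t tag;                     /* block address tag                     */
    uint32_t instr[WORDS_PER_BLOCK];  /* the cached instruction words          */
    uint32_t successor_addr;          /* predicted address of the next block   */
    uint8_t  branch_index;            /* which word in the block is the branch */
    uint8_t  valid;
} icache_block_t;

/* Fetch sketch: follow the stored successor instead of waiting on decode;
 * a misprediction would later repair successor_addr and refetch.          */
uint32_t next_fetch_address(const icache_block_t *blk, uint32_t fall_through)
{
    return blk->valid ? blk->successor_addr : fall_through;
}
```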


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding density, and this approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead.
Abstract: Increasing processor execution power requires a high instruction issue bandwidth, while decreased instruction encoding density and code-improving transformations cause code expansion. Therefore, the performance of the instruction memory hierarchy has become an important factor in overall system performance. An instruction placement algorithm has been implemented in the IMPACT-I (Illinois Microarchitecture Project using Advanced Compiler Technology - Stage I) C compiler to maximize the sequential and spatial localities, and to minimize mapping conflicts. This approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead. For ten realistic UNIX programs, we report low miss ratios (average 0.5%) and low memory traffic ratios (average 8%) for a 2048-byte, direct-mapped instruction cache using 64-byte blocks. This result compares favorably with the fully associative cache results reported by other researchers. We also present the effect of cache size, block size, block sectoring, and partial loading on the cache performance. The code performance with instruction placement optimization is shown to be stable across architectures with different instruction encoding density.

227 citations
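The placement algorithm itself is too involved for a short sketch, but the conflict it minimizes is simple to state: in a direct-mapped cache, two code blocks collide exactly when their addresses map to the same index, so the compiler lays frequently interleaved blocks at addresses whose indices differ. The check below is only an illustration of that mapping, using the 2048-byte, 64-byte-block configuration reported above; it is not the IMPACT-I algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE 2048                      /* bytes, as in the abstract */
#define BLOCK_SIZE 64                        /* bytes per cache block     */
#define NUM_SETS   (CACHE_SIZE / BLOCK_SIZE)

/* Index of a code address in a direct-mapped instruction cache. */
unsigned cache_index(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_SETS;
}

/* Two code blocks that alternate in the dynamic trace conflict iff they share
 * an index; the placement pass would relocate one so the indices differ.      */
bool mapping_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return cache_index(addr_a) == cache_index(addr_b);
}
```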


Journal ArticleDOI
S. McFarling1
01 Apr 1989
TL;DR: This paper presents an optimization algorithm for reducing instruction cache misses that uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future.
Abstract: This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics is demonstrated with empirical results for a set of 10 programs for various cache sizes. The improvement depends on cache size. For a 512 word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as increasing the cache size by a factor of three. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.

217 citations


Dissertation
01 Jan 1989
TL;DR: Measurements of actual supercomputer cache performance have not previously been undertaken; PFC-Sim, a program-driven event tracing facility that can simulate data cache performance of very long programs, is used to measure the performance of various cache structures.
Abstract: Measurements of actual supercomputer cache performance have not previously been undertaken. PFC-Sim is a program-driven event tracing facility that can simulate data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, allowing very long traces to be used. Programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures. PFC-Sim was used to measure the cache performance of array references in a benchmark set of supercomputer applications, RiCEPS. Data cache hit ratios varied on average between 70% for a 16K cache and 91% for a 256K cache. Programs with very large working sets generate poor cache performance even with large caches. The hit ratios of individual references are measured to be either 0% or 100%. By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the number of loop iterations which can execute without filling the cache, the overflow iteration. The overflow iteration combined with the dependence graph can be used to determine at each reference whether execution will result in hits or misses. Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often do this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked when the components of dependence vectors are bounded. When the cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction to preload values into the cache. A cache load resembles a register load in structure, but it does not block computation and only moves the addressed data into the cache, where a later register load will find it. The compiler can inform the cache (on average) over 100 cycles before a load is required. Cache misses can be serviced in parallel with computation.

210 citations
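A hedged sketch of the dissertation's "cache load" idea in a vector loop: a non-blocking prefetch is issued far enough ahead of the use that the miss latency overlaps with computation. The GCC/Clang builtin __builtin_prefetch is used here as a modern stand-in for the special instruction the text describes, and the distance of 100 iterations merely echoes the "100 cycles ahead" figure.

```c
#define PREFETCH_DISTANCE 100   /* issue the cache load well before the use */

void scaled_sum(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Non-blocking "cache load": starts moving the line toward the
             * cache without stalling; the real register load happens later. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE]);
        }
        c[i] = 2.0 * a[i] + b[i];
    }
}
```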


Patent
Gregor Stephen Lee1
17 Jan 1989
TL;DR: In this paper, a hierarchical first-level and second-level memory system includes a first level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), and a second level store queue (26A2).
Abstract: A multiprocessor system includes a system of store queues and write buffers in a hierarchical first level and second level memory system including a first level store queue (18B1) for storing instructions and/or data from a processor (20B) of the multiprocessor system prior to storage in a first level of cache (18B), a second level store queue (26A2) for storing the instructions and/or data from the first level store queue (18B1) and a plurality of write buffers (26A2(A); 26A2(B)) for storing the instructions and/or data from the second level store queue prior to storage in a second level of cache. The multiprocessor system includes hierarchical levels of caches and write buffers. When stored in the second level write buffers, access to the shared second level cache is requested; and, when access is granted, the data and/or instructions are moved from the second level write buffers to the shared second level cache. When stored in the shared second level cache, corresponding obsolete entries in the first level of cache are invalidated before any other processor "sees" the obsolete data and the new data and/or instructions are over-written in the first level of cache.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: The results indicate that the benefits of the extensions to the protocols are limited, and read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache.
Abstract: Write-invalidate and write-broadcast coherency protocols have been criticized for being unable to achieve good bus performance across all cache configurations. In particular, write-invalidate performance can suffer as block size increases, and large cache sizes will hurt write-broadcast. Read-broadcast and competitive snooping extensions to the protocols have been proposed to solve each problem. Our results indicate that the benefits of the extensions are limited. Read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache. The net effect can be an increase in total execution cycles. Competitive snooping benefits only those programs with high per-processor locality of reference to shared data. For programs characterized by inter-processor contention for shared addresses, competitive snooping can degrade performance by causing a slight increase in bus utilization and total execution time.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: The extent to which multiple hardware contexts per processor can help to mitigate the negative effects of high latency is explored and it is shown that two or four contexts can achieve substantial performance gains over a single context.
Abstract: A fundamental problem that any scalable multiprocessor must address is the ability to tolerate high latency memory operations. This paper explores the extent to which multiple hardware contexts per processor can help to mitigate the negative effects of high latency. In particular, we evaluate the performance of a directory-based cache coherent multiprocessor using memory reference traces obtained from three parallel applications. We explore the case where there are a small fixed number (2-4) of hardware contexts per processor and the context switch overhead is low. In contrast to previously proposed approaches, we also use a very simple context switch criterion, namely a cache miss or a write-hit to shared data. Our results show that the effectiveness of multiple contexts depends on the nature of the applications, the context switch overhead, and the inherent latency of the machine architecture. Given reasonably low overhead hardware context switches, we show that two or four contexts can achieve substantial performance gains over a single context. For one application, the processor utilization increased by about 46% with two contexts and by about 80% with four contexts.

Patent
15 May 1989
TL;DR: In this paper, a bus snoop control method for maintaining coherency between a write-back cache and main memory during memory accesses by an alternate bus master is proposed.
Abstract: A bus snoop control method for maintaining coherency between a write-back cache and main memory during memory accesses by an alternate bus master. The method and apparatus incorporates an option to source `dirty` or altered data from the write-back cache to the alternate bus master during a memory read operation, and simultaneously invalidate `dirty` or altered data from the write-back cache. The method minimizes the number of cache accesses required to maintain coherency between the cache and main memory during page-out/page-in sequences initiated by the alternate bus master, thereby improving system performance.

Patent
18 Jan 1989
TL;DR: In this article, the authors present a user-oriented approach to flexible cache system design by specifying desired cache features through the setting of appropriate cache option bits, which allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.
Abstract: Methods and apparatus are disclosed for realizing an integrated cache unit which may be flexibly used for cache system design. The preferred embodiment of the invention comprises both a cache memory and a cache controller on a single chip. In accordance with an alternative embodiment of the invention, the cache memory may be externally located. Flexible cache system design is achieved by the specification of desired cache features through the setting of appropriate cache option bits. The disclosed methods and apparatus support this user-oriented approach to flexible system design. The actual setting of option bits may be performed under software control and allows a high performance cache system to be designed with few parts, at low cost and with the ability to perform with high efficiency.

Patent
06 Nov 1989
TL;DR: In this paper, a multiprocessor data processing system is implemented with processors, each of which may request for a temporary time the exclusive lock on an object which is stored on a data base.
Abstract: A multiprocessor data processing system is implemented with processors, each of which may request for a temporary time the exclusive lock on an object which is stored on a data base. To achieve this a lock processor synchronizes the locking and unlocking of the objects. The requesting processor directs the storage of the object from the data base into a selected high performance storage unit, where it has exclusive rights to modify or write into the object until the object is unlocked by the processor. An audit tape or disk records all modifications made to any object during a transaction. A non-volatile cache memory is inserted in the audit trail to store a before-look image of the object that resides in the high performance storage unit. Data compaction occurs by comparison of the before-look image with an after-look image to provide a difference image, which is supplied to an audit buffer that is coupled to the audit tape. The locking processor may unlock the secured object once the after-look image has been committed from either a stored version in the non-volatile cache or from a high performance main memory unit to the data base disk. The difference image and the after-look image associated with the difference image may then be stored in the non-volatile cache, and provided to the audit tape or disk and the data base disk in a sequence which is independent of the operating sequence of the requesting processor.

Journal ArticleDOI
01 Apr 1989
TL;DR: Traces of parallel programs are used to evaluate the cache and bus performance of shared memory multiprocessors in which coherency is maintained by a write-invalidate protocol, and show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs.
Abstract: Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: This work proposes a class of adaptive back-off methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization variables and shows that when the number of processors participating in a barrier synchronization is small, reductions of 20 percent to over 95 percent in synchronization traffic can be achieved at no extra cost.
Abstract: Shared-memory multiprocessors commonly use shared variables for synchronization. Our simulations of real parallel applications show that large-scale cache-coherent multiprocessors suffer significant amounts of invalidation traffic due to synchronization. Large multiprocessors that do not cache synchronization variables are often more severely impacted. If this synchronization traffic is not reduced or managed adequately, synchronization references can cause severe congestion in the network. We propose a class of adaptive back-off methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization variables. These methods use synchronization state to reduce polling of synchronization variables. Our simulations show that when the number of processors participating in a barrier synchronization is small compared to the time of arrival of the processors, reductions of 20 percent to over 95 percent in synchronization traffic can be achieved at no extra cost. In other situations adaptive backoff techniques result in a tradeoff between reduced network accesses and increased processor idle time.
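A minimal sketch of adaptive back-off applied to a barrier: each unsuccessful poll of the synchronization variable lengthens the local delay before the next poll, cutting traffic to the shared location at the cost of some added idle time. The exponential policy, names, and single-use barrier below are illustrative; the paper studies a class of such back-off methods.

```c
#include <stdatomic.h>

static atomic_int arrived = 0;   /* processors that have reached the barrier */

/* Single-use barrier with adaptive (here, exponential) back-off between polls. */
void barrier_wait(int nprocs)
{
    atomic_fetch_add(&arrived, 1);

    int delay = 1;
    while (atomic_load(&arrived) < nprocs) {
        for (volatile int spin = 0; spin < delay; spin++)
            ;                        /* wait locally instead of re-polling      */
        if (delay < (1 << 16))
            delay *= 2;              /* back off further after each failed poll */
    }
}
```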

Patent
30 Oct 1989
TL;DR: In this article, the authors describe a coherency protocol for a multiprocessor network where every processor has its own private cache and bus interface and the network is connected via a common system bus.
Abstract: This disclosure describes a snooping coherency protocol for a multiprocessor network wherein every processor has its own private cache and bus interface means and the network is connected via a common system bus. Each processor has its own cache directory and image directory that duplicate each other non-atomically. The snooping protocol utilizes the duality of directories coupled with the non-atomicity of directory updates to maximize processor-cache availability and minimize processor-cache access times thus supporting high performance architectures.

Patent
18 Jan 1989
TL;DR: In this paper, a cache block status field is provided for each cache block to indicate the block's state, such as shared or exclusive, and to control whether a write hit to the block is handled in a write-through or copy-back mode; the field may be updated either by a TLB write policy field contained within a translation look-aside buffer entry, or by a second input, independent of the TLB entry, which may be provided from the system on a line basis.
Abstract: A computer system having a cache memory subsystem which allows flexible setting of caching policies on a page basis and a line basis. A cache block status field is provided for each cache block to indicate the cache block's state, such as shared or exclusive. The cache block status field controls whether the cache control unit operates in a write-through write mode or in a copy-back write mode when a write hit access to the block occurs. The cache block status field may be updated by either a TLB write policy field contained within a translation look-aside buffer entry which corresponds to the page of the access, or by a second input independent of the TLB entry which may be provided from the system on a line basis.

Patent
John T. Robinson1
08 Aug 1989
TL;DR: In this paper, a cache directory keeps track of which blocks are in the cache, the number of times each block in cache has been referenced after aging at least a predetermined amount (reference count), and the age of each block since the last reference to that block, for use in determining which of the cache blocks is replaced when there is a cache miss.
Abstract: A cache directory keeps track of which blocks are in the cache, the number of times each block in the cache has been referenced after aging at least a predetermined amount (reference count), and the age of each block since the last reference to that block, for use in determining which of the cache blocks is replaced when there is a cache miss. At least one preselected age boundary threshold is utilized to determine when to adjust the reference count for a given block on a cache hit and to select a cache block for replacement as a function of reference count value and block age.
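A rough sketch of the bookkeeping the patent describes: each directory entry carries a reference count that is bumped only when the block has aged past a threshold since its last reference, plus the block's current age, and the victim on a miss is chosen from both. The age threshold, the scoring function, and all names below are assumptions used purely for illustration.

```c
#include <limits.h>

#define NUM_BLOCKS    64
#define AGE_THRESHOLD 16    /* a block must age this much before a hit counts */

typedef struct {
    int ref_count;   /* references seen after aging at least AGE_THRESHOLD */
    int age;         /* time since the last reference to this block        */
} dir_entry_t;

static dir_entry_t directory[NUM_BLOCKS];

/* On a hit: only re-references to sufficiently aged blocks raise the count. */
void on_cache_hit(dir_entry_t *e)
{
    if (e->age >= AGE_THRESHOLD)
        e->ref_count++;
    e->age = 0;
}

/* On a miss: pick a victim as a function of count and age; this particular
 * score (old and rarely re-referenced loses) is just one plausible choice.  */
int pick_victim(void)
{
    int victim = 0, worst = INT_MIN;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        int score = directory[i].age - 8 * directory[i].ref_count;
        if (score > worst) { worst = score; victim = i; }
    }
    return victim;
}
```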

Proceedings ArticleDOI
01 Apr 1989
TL;DR: Alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache are examined; they are expected to suit caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
Abstract: The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache. One approach scans tags serially from most-recently used to least-recently used. Another uses a partial compare of a few bits from each tag to reduce the number of tags that must be examined serially. The drawback of both approaches is that they increase cache access time by a factor of two or more over the traditional implementation of set-associativity, making them inappropriate for cache designs in which a fast access time is crucial (e.g. level one caches, caches directly servicing processor requests). These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower access time of these approaches can be tolerated. We expect these conditions to be true for caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
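A hedged sketch of the first alternative above: the set's tags are kept in MRU-to-LRU order and probed one at a time with direct-mapped-style hardware, so a hit on the most recently used way costs one probe and a miss costs a full scan. The structure and names are illustrative; the partial-compare variant is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    uint32_t tag[WAYS];    /* kept in MRU (index 0) to LRU (index WAYS-1) order */
    bool     valid[WAYS];
} cache_set_t;

/* Probe tags serially, most recently used first; on a hit, rotate the matching
 * way to the MRU slot. Returns the number of probes, which models access time. */
int serial_lookup(cache_set_t *s, uint32_t tag, bool *hit)
{
    for (int i = 0; i < WAYS; i++) {
        if (s->valid[i] && s->tag[i] == tag) {
            for (int j = i; j > 0; j--) {                 /* maintain MRU order */
                uint32_t t = s->tag[j]; s->tag[j] = s->tag[j - 1]; s->tag[j - 1] = t;
                bool v = s->valid[j]; s->valid[j] = s->valid[j - 1]; s->valid[j - 1] = v;
            }
            *hit = true;
            return i + 1;
        }
    }
    *hit = false;
    return WAYS;    /* every tag was examined before declaring a miss */
}
```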

Proceedings ArticleDOI
01 Apr 1989
TL;DR: It is shown how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level and how this organization has a performance advantage over a hierarchy of physically-add addressed caches in a multiprocessor environment.
Abstract: We propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to match the processor speed. The virtually-addressed cache is backed up by a large physically-addressed cache; this second-level cache provides a high hit ratio and greatly reduces memory traffic. We show how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level. Moreover, the second-level cache can be used to shield the virtually-addressed first-level cache from irrelevant cache coherence interference. Finally, simulation results show that this organization has a performance advantage over a hierarchy of physically-addressed caches in a multiprocessor environment.

Patent
15 May 1989
TL;DR: In this article, the authors propose a data cache controller that uses the multiple dirty bits to determine the quantity and type of accesses required to write the dirty data to memory, so as to minimize the number of memory accesses used to unload a dirty entry.
Abstract: A data cache capable of operation in a write-back (copyback) mode. The data cache design provides a mechanism for making the data cache coherent with memory, without writing the entire cache entry to memory, thereby reducing bus utilization. Each data cache entry is comprised of three items: data, a tag address, and a mixed size status field. The mixed size status fields provide one bit to indicate the validity of the data cache entry and multiple bits to indicate if the entry contains data that has not been written to memory (dirtiness). Multiple dirty bits provide a data cache controller with sufficient information to minimize the number of memory accesses used to unload a dirty entry. The data cache controller uses the multiple dirty bits to determine the quantity and type of accesses required to write the dirty data to memory. The portions of the entry being replaced that are clean (unmodified) are not written to memory.
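A minimal sketch of the status layout the abstract describes: one valid bit for the entry plus a dirty bit per sub-block, so that when a dirty entry is unloaded only the modified sub-blocks are written back. Sub-block count, sizes, and names here are assumptions, not the patent's encoding.

```c
#include <stdint.h>
#include <string.h>

#define SUBBLOCKS      4
#define SUBBLOCK_BYTES 8

typedef struct {
    uint32_t tag;
    uint8_t  valid;                           /* single validity bit for the entry    */
    uint8_t  dirty;                           /* one dirty bit per sub-block (4 used) */
    uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
} dcache_entry_t;

/* Unload a replaced entry: clean sub-blocks are never written to memory. */
void unload_entry(dcache_entry_t *e, uint8_t *memory_block)
{
    for (int i = 0; i < SUBBLOCKS; i++) {
        if (e->dirty & (1u << i))
            memcpy(memory_block + i * SUBBLOCK_BYTES, e->data[i], SUBBLOCK_BYTES);
    }
    e->dirty = 0;                             /* entry is coherent with memory again */
}
```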

Patent
14 Sep 1989
TL;DR: In this paper, the authors propose a scheme to optimize the amount of data to be promoted to the cache from a backing store in anticipation of future host processor references, based on the examination of a group of the tracks in a cache.
Abstract: The disclosure relates to sequential performance of a cached data storage subsystem with minimal control signal processing. Sequential access is first detected by monitoring and examining the quantity of data accessed per unit of data storage (track) across a set of contiguously addressable tracks. Since the occupancy of the data in the cache is usually time limited, this examination provides an indication of the rate of sequential processing for a data set, i.e., a data set is usually being processed in contiguously addressable data storage units of a data storage system. Based upon the examination of a group of the tracks in a cache, the amount of data to be promoted to the cache from a backing store in anticipation of future host processor references is optimized. A promotion factor is calculated by combining the access extents monitored in the individual data storage areas and is expressed as a number of track units to be promoted. The examination of the group of track units and the implementation of the data promotion and demotion (early cast-out) are synchronized, which results in a synergistic effect for increasing throughput of the cache for sequentially-processed data. A limit of promotion is determined to create a window of sequential data processing.

Journal ArticleDOI
Dominique Thiebaut1
TL;DR: Fractal geometry is proposed as a powerful measure of program behavior, and its application to the prediction of the miss ratio of programs in fully associative caches is presented.
Abstract: Fractal geometry is proposed as a powerful measure of program behavior, and its application to the prediction of the miss ratio of programs in fully associative caches is presented. Programs are modeled as one-dimensional fractal random-walks. The fractal cache model is based on the parameterization of a program trace by a small number of constants, one of which is the fractal dimension of the program. The model is validated by trace-driven simulations of several program traces. With this model, it is possible to read the trace of a program once, and then predict the behavior of the miss ratio curve of that program in fully associative caches of varying sizes.
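The paper's parameterization is not reproduced here, but the underlying picture is easy to simulate: treat the reference stream as a one-dimensional random walk with heavy-tailed (power-law) step lengths and measure the miss ratio of a fully associative LRU cache on it. The sketch below is only an illustration of that picture under assumed constants (theta, cache size); it is not the published fractal model or its prediction formula.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINES 256
#define REFERENCES  100000

/* Fully associative LRU cache over line addresses (linear search suffices here). */
static long lru[CACHE_LINES];
static int  used = 0;

static int access_line(long line)
{
    for (int i = 0; i < used; i++) {
        if (lru[i] == line) {                       /* hit: move to MRU position */
            for (int j = i; j > 0; j--) lru[j] = lru[j - 1];
            lru[0] = line;
            return 1;
        }
    }
    if (used < CACHE_LINES) used++;                 /* miss: insert, evict LRU */
    for (int j = used - 1; j > 0; j--) lru[j] = lru[j - 1];
    lru[0] = line;
    return 0;
}

int main(void)
{
    double addr = 0.0, theta = 1.5;                 /* assumed tail exponent       */
    long hits = 0;
    for (long n = 0; n < REFERENCES; n++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double step = pow(u, -1.0 / theta);         /* heavy-tailed jump length    */
        addr += (rand() & 1) ? step : -step;        /* one-dimensional random walk */
        hits += access_line((long)addr);
    }
    printf("simulated miss ratio = %.3f\n", 1.0 - (double)hits / REFERENCES);
    return 0;
}
```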

Patent
26 May 1989
TL;DR: In this paper, an instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested, and one of a plurality of replacement schemes is selected for swapping a data block out of the cache.
Abstract: An instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested. Based on the cache control specifier, one of a plurality of replacement schemes is selected for swapping a data block out of the cache.

Patent
19 Jun 1989
TL;DR: In this article, the linker inserts a very short stylized subroutine call to a routine that logs each data memory reference in a large trace buffer; when the trace buffer fills up with recorded memory references, the contents of the buffer are processed, either by dumping them to an output device to empty the buffer or by running a cache simulation routine to analyze the data.
Abstract: The present invention utilizes link time code modification to instrument the code which is to be executed, typically comprising a plurality of kernel operations and user programs. When the code is instrumented, wherever a data memory reference appears, the linker inserts a very short stylized subroutine call to a routine that logs the reference in a large trace buffer. The same call is inserted at the beginning of each basic block to record instruction references. When the trace buffer fills up with recorded memory references, the contents of the buffer are processed, either by dumping the contents to an output device to empty the trace buffer, or by running a cache simulation routine to analyze the data. The results of the analysis are stored, rather than the entire output of the tracing program.
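A hedged sketch of the logging side of this scheme: the short routine the linker calls at each reference appends the address to a large trace buffer and, when the buffer fills, hands it to a consumer that either dumps it or runs a cache simulation over it. All names are illustrative, and the consumer here is a stub.

```c
#include <stdint.h>

#define TRACE_ENTRIES (1 << 20)

static uint32_t trace_buf[TRACE_ENTRIES];
static long     trace_len = 0;

/* Consumer invoked when the buffer is full: dump the entries to an output
 * device or feed them to a cache-simulation pass (both omitted in this stub). */
static void process_trace(const uint32_t *buf, long len)
{
    (void)buf;
    (void)len;
}

/* The short stylized routine called at every data reference and basic block. */
void log_reference(uint32_t addr)
{
    trace_buf[trace_len++] = addr;
    if (trace_len == TRACE_ENTRIES) {
        process_trace(trace_buf, trace_len);   /* analyze or dump, then reuse */
        trace_len = 0;
    }
}
```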

Patent
03 Feb 1989
TL;DR: In this paper, lock granularity is defined at the level of individual cache blocks for the CPUs, with the cache blocks also representing the unit of memory allocation in the computer system; a lock directory is defined by a plurality of lock bits so that addresses in the same block of memory are mapped to the same location in the lock directory.
Abstract: All monitoring and control of locked memory access requests in a multiprocessing computer system is handled by a system control unit (SCU) which controls the parallel operation of a plurality of central processing units (CPUs) and I/O units relative to a common main memory. Locking granularity is defined at the level of individual cache blocks for the CPUs, and the cache blocks also represent the unit of memory allocation in the computer system. The SCU is provided with a lock directory defined by a plurality of lock bits so that addresses in the same block of memory are mapped to the same location in the lock directory. Incoming lock requests for a given memory location are processed by interrogating the corresponding lock bit in the lock directory in the SCU by using the associated memory address as an index into the directory. If the lock bit is not set, the lock request is granted. The lock bit is subsequently set and maintained in that state until the unit requesting the lock has completed its memory access operation and sends an "unlock" request. If the interrogated lock bit is found to be set, the lock request is denied and the requesting port is notified of the denial. Fairness for the processing of denied lock requests is ensured by a reserve list onto which denied requests are sequentially positioned on a first-come-first-served basis.
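A small sketch of the lock directory as described: a memory address maps, via its cache-block address, to a single lock bit; a lock request is granted only if the bit is clear, and a denied requester would be placed on the reserve list (not shown). Directory size, the index function, and names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_BLOCK_BYTES 64
#define LOCK_DIR_ENTRIES  4096

static bool lock_dir[LOCK_DIR_ENTRIES];     /* one lock bit per directory slot */

/* Addresses in the same cache block map to the same lock bit. */
static unsigned lock_index(uint32_t addr)
{
    return (addr / CACHE_BLOCK_BYTES) % LOCK_DIR_ENTRIES;
}

/* Returns true if the lock is granted; a denial would queue the requester on
 * the first-come-first-served reserve list, which is omitted from this sketch. */
bool request_lock(uint32_t addr)
{
    unsigned i = lock_index(addr);
    if (lock_dir[i])
        return false;                        /* bit already set: deny the request */
    lock_dir[i] = true;                      /* grant and record the lock         */
    return true;
}

void release_lock(uint32_t addr)
{
    lock_dir[lock_index(addr)] = false;      /* the "unlock" request clears the bit */
}
```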

Patent
05 Jun 1989
TL;DR: In this article, a block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache, and each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information.
Abstract: A controller (10) for use with a hard disk (38) or other mass storage medium provides a memory cache (36). A block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache (36). Each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information. Upon a read command, an index is generated corresponding to the address requested by the host computer, and the tag information is matched with a tag generated from the address. Each set is checked until a hit occurs or a miss occurs in every set. After each miss, the usage information (50) corresponding to the memory group (44) is decremented. When reading information from the storage device (32) to the memory cache (36), the controller (10) may selectively read additional sectors. The number of sectors read from the storage device may be selectively controlled by the user or the host processor. Further, a cap may be provided to set a maximum number of sectors to be read.
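A rough sketch of the lookup path the patent outlines: the requested disk block yields an index and a tag, each set's descriptor at that index is checked for a tag match, and on a miss in a set that entry's usage value is decremented so it becomes a better replacement candidate. Field names and the particular tag/index split are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS       4
#define GROUPS_PER_SET 256

typedef struct {
    uint32_t tag;
    uint32_t cache_address;   /* where the cached sectors live in the memory cache */
    int      usage;           /* decremented on misses; guides replacement          */
    bool     valid;
} block_descriptor_t;

static block_descriptor_t bdt[NUM_SETS][GROUPS_PER_SET];

/* Check every set at the index derived from the requested disk block. */
bool cache_lookup(uint32_t disk_block, uint32_t *cache_address)
{
    uint32_t index = disk_block % GROUPS_PER_SET;
    uint32_t tag   = disk_block / GROUPS_PER_SET;

    for (int s = 0; s < NUM_SETS; s++) {
        block_descriptor_t *g = &bdt[s][index];
        if (g->valid && g->tag == tag) {
            *cache_address = g->cache_address;      /* hit in this set */
            return true;
        }
        if (g->valid && g->usage > 0)
            g->usage--;                             /* miss here: age the entry */
    }
    return false;                                   /* miss in every set */
}
```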