
Showing papers on "Cache" published in 1992


01 Jan 1992
TL;DR: The symbolic model checking technique revealed subtle errors in this protocol, resulting from complex execution sequences that would occur with very low probability in random simulation runs, and an alternative method is developed for avoiding the state explosion in the case of asynchronous control circuits.
Abstract: Finite state models of concurrent systems grow exponentially as the number of components of the system increases. This is known widely as the state explosion problem in automatic verification, and has limited finite state verification methods to small systems. To avoid this problem, a method called symbolic model checking is proposed and studied. This method avoids building a state graph by using Boolean formulas to represent sets and relations. A variety of properties characterized by least and greatest fixed points can be verified purely by manipulations of these formulas using Ordered Binary Decision Diagrams. Theoretically, a structural class of sequential circuits is demonstrated whose transition relations can be represented by polynomial space OBDDs, though the number of states is exponential. This result is borne out by experimental results on example circuits and systems. The most complex of these is the cache consistency protocol of a commercial distributed multiprocessor. The symbolic model checking technique revealed subtle errors in this protocol, resulting from complex execution sequences that would occur with very low probability in random simulation runs. In order to model the cache protocol, a language was developed for describing sequential circuits and protocols at various levels of abstraction. This language has a synchronous dataflow semantics, but allows nondeterminism and supports interleaving processes with shared variables. A system called SMV can automatically verify programs in this language with respect to temporal logic formulas, using the symbolic model checking technique. A technique for proving properties of inductively generated classes of finite state systems is also developed. The proof is checked automatically, but requires a user supplied process called a process invariant to act as an inductive hypothesis. An invariant is developed for the distributed cache protocol, allowing properties of systems with an arbitrary number of processors to be proved. Finally, an alternative method is developed for avoiding the state explosion in the case of asynchronous control circuits. This technique is based on the unfolding of Petri nets, and is used to check for hazards in a distributed mutual exclusion circuit.
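
The core of the symbolic approach is an iterative fixed-point computation over sets of states. The sketch below illustrates that iteration with plain Python sets and a made-up three-state transition relation; the thesis represents the sets and the transition relation as OBDDs, which this toy deliberately does not attempt.

```python
# Minimal sketch of the fixed-point computation at the heart of symbolic model
# checking. State sets are plain Python sets purely for illustration (the real
# method manipulates OBDDs), and the tiny transition system is hypothetical.

def reachable(initial, transitions):
    """Least fixed point: smallest set containing `initial` and closed under
    the transition relation (a set of (src, dst) pairs)."""
    frontier = set(initial)
    reached = set(initial)
    while frontier:
        image = {d for (s, d) in transitions if s in frontier}  # successor states
        frontier = image - reached
        reached |= frontier
    return reached

if __name__ == "__main__":
    init = {"s0"}
    trans = {("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s3", "s3")}
    print(reachable(init, trans))   # {'s0', 's1', 's2'}; s3 is unreachable
```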

1,209 citations


Journal ArticleDOI
TL;DR: The directory architecture for shared memory (Dash) as discussed by the authors allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance; a distributed directory-based protocol provides cache coherence without compromising scalability.
Abstract: The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described.

961 citations


Proceedings ArticleDOI
01 Sep 1992
TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.
Abstract: Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today’s high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits. This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices. Our algorithm identifies those references that are likely to be cache misses, and issues prefetches only for them. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. By generating fully functional code, we have been able to measure not only the improvements in cache miss rates, but also the overall performance of a simulated system. We show that our algorithm significantly improves the execution speed of our benchmark programs; some of the programs improve by as much as a factor of two. When compared to an algorithm that indiscriminately prefetches all array accesses, our algorithm can eliminate many of the unnecessary prefetches without any significant decrease in the coverage of the cache misses.
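
The key selection step is deciding which references are likely misses so that prefetches are issued only for those. The sketch below illustrates one such rule for a unit-stride loop (prefetch once per cache line, a fixed number of iterations ahead); the line size, element size, and prefetch distance are illustrative assumptions, not values from the paper.

```python
# Rough sketch of a prefetch-selection rule for a unit-stride dense-matrix
# loop: only the first reference to each cache line is a likely miss, so a
# prefetch is issued once per line, DISTANCE iterations ahead of its use.

LINE_BYTES = 32      # assumed cache line size
ELEM_BYTES = 8       # assumed double-precision elements
DISTANCE   = 8       # assumed prefetch distance (iterations ahead)

def prefetch_plan(n_elements):
    """Return (iteration, element) pairs at which a prefetch for a[i + DISTANCE]
    would be inserted, one per cache line."""
    elems_per_line = LINE_BYTES // ELEM_BYTES
    plan = []
    for i in range(n_elements):
        target = i + DISTANCE
        if target < n_elements and target % elems_per_line == 0:
            plan.append((i, target))
    return plan

print(prefetch_plan(32)[:4])   # [(0, 8), (4, 12), (8, 16), (12, 20)]
```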

789 citations


Patent
21 Jul 1992
TL;DR: In this article, a distributed computer system has a trusted computing base that includes an authentication agent for authenticating requests received from principals at other nodes in the system, and the server process is provided with a local cache of authentication data that identifies requesters whose previous request messages have been authenticated.
Abstract: A distributed computer system has a number of computers coupled thereto at distinct nodes. The computer at each node of the distributed system has a trusted computing base that includes an authentication agent for authenticating requests received from principals at other nodes in the system. Requests are transmitted to servers as messages that include a first identifier provided by the requester and a second identifier provided by the authentication agent of the requester node. Each server process is provided with a local cache of authentication data that identifies requesters whose previous request messages have been authenticated. When a request is received, the server checks the request's first and second identifiers against the entries in its local cache. If there is a match, then the request is known to be authentic. Otherwise, the server node's authentication agent is called to obtain authentication credentials from the requester's node to authenticate the request message. The principal identifier of the requester and the received credentials are stored in a local cache by the server node's authentication agent. The server process also stores a record in its local cache indicating that request messages from the specified requester are known to be authentic, thereby expediting the process of authenticating received requests.
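
A minimal sketch of the server-side fast path described above: a hit on the (first identifier, second identifier) pair in the local cache accepts the request immediately, while a miss falls back to the node's authentication agent. The class name and the fetch_credentials callback are hypothetical, not the patent's interfaces.

```python
# Hedged sketch of a server-side authentication cache: a (first_id, second_id)
# hit means the requester's earlier messages were already authenticated, so
# the slow path to the authentication agent is skipped.

class AuthCache:
    def __init__(self, fetch_credentials):
        self._known = {}                      # (first_id, second_id) -> credentials
        self._fetch = fetch_credentials       # stands in for the node's authentication agent

    def authenticate(self, first_id, second_id):
        key = (first_id, second_id)
        if key in self._known:                # fast path: previously authenticated
            return True
        creds = self._fetch(first_id)         # slow path: contact requester's node
        if creds is None:
            return False
        self._known[key] = creds              # remember, expediting later requests
        return True

cache = AuthCache(lambda requester: {"principal": requester})   # stub agent
print(cache.authenticate("alice@nodeA", "agentA-17"))           # True (slow path)
print(cache.authenticate("alice@nodeA", "agentA-17"))           # True (cache hit)
```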

382 citations


Book ChapterDOI
01 Jan 1992
TL;DR: As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes that use a limited number of pointers per directory entry to keep track of all processors caching a memory block.
Abstract: As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes. These schemes rely on a directory to keep track of all processors caching a memory block. When a write to that block occurs, point-to-point invalidation messages are sent to keep the caches coherent. A straightforward way of recording the identities of processors caching a memory block is to use a bit vector per memory block, with one bit per processor. Unfortunately, when the main memory grows linearly with the number of processors, the total size of the directory memory grows as the square of the number of processors, which is prohibitive for large machines. To remedy this problem several schemes that use a limited number of pointers per directory entry have been suggested. These schemes often cause excessive invalidation traffic.
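
A quick back-of-the-envelope comparison of the two directory organizations mentioned above is sketched below, assuming 64-byte memory blocks and a four-pointer limited-pointer scheme; both parameter choices are illustrative.

```python
# Directory storage overhead per memory block: full bit-vector (one presence
# bit per processor) versus a limited-pointer scheme with i pointers of
# ceil(log2 P) bits each. Block size and pointer count are assumptions.

import math

BLOCK_BITS = 64 * 8          # assumed 64-byte memory blocks

def full_map_overhead(p):
    """Directory bits per block divided by block bits, for P processors."""
    return p / BLOCK_BITS

def limited_ptr_overhead(p, i=4):
    """Same ratio for a limited-pointer scheme with i pointers."""
    return i * math.ceil(math.log2(p)) / BLOCK_BITS

for p in (16, 256, 1024):
    print(p, round(full_map_overhead(p), 3), round(limited_ptr_overhead(p), 3))
# Full-map overhead grows linearly with P (and total directory memory as P^2
# when main memory also scales with P); limited pointers grow only with log P.
```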

321 citations


Journal ArticleDOI
10 Dec 1992
TL;DR: The results using selected programs from the PERFECT and SPEC benchmarks show that stride directed prefetching on a scalar processor can significantly reduce the cache miss rate of particular programs and that an SPT needs only a small number of entries to be effective.
Abstract: The execution of numerically intensive programs presents a challenge to memory system designers. Numerical program execution can be accelerated by pipelined arithmetic units, but, to be effective, these must be supported by high speed memory access. A cache memory is a well known hardware mechanism used to reduce the average memory access latency. Numerical programs, however, often have poor cache performance. Stride directed prefetching has been proposed to improve the cache performance of numerical programs executing on a vector processor. This paper shows how this approach can be extended to a scalar processor by using a simple hardware mechanism, called a stride prediction table (SPT), to calculate the stride distances of array accesses made from within the loop body of a program. The results using selected programs from the PERFECT and SPEC benchmarks show that stride directed prefetching on a scalar processor can significantly reduce the cache miss rate of particular programs and that an SPT needs only a small number of entries to be effective.
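
A minimal sketch of an SPT along the lines described above: entries are indexed by the load's program counter, record the last address and stride, and predict a prefetch address once the stride repeats. The table size and trigger policy are illustrative assumptions.

```python
# Sketch of a stride prediction table (SPT): each entry remembers the last
# address a given load issued and the observed stride; when the stride
# repeats, the next address (last + stride) is predicted for prefetching.

class StridePredictionTable:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}          # slot -> (last_addr, stride)

    def access(self, pc, addr):
        """Record a load; return a predicted prefetch address or None."""
        slot = pc % self.entries
        prev = self.table.get(slot)
        prediction = None
        if prev is not None:
            last_addr, stride = prev
            new_stride = addr - last_addr
            if new_stride == stride and stride != 0:
                prediction = addr + stride      # stable stride: prefetch ahead
            self.table[slot] = (addr, new_stride)
        else:
            self.table[slot] = (addr, 0)
        return prediction

spt = StridePredictionTable()
for a in (1000, 1008, 1016, 1024):              # unit-stride array of doubles
    print(spt.access(pc=0x40, addr=a))          # None, None, 1024, 1032
```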

309 citations


Journal ArticleDOI
TL;DR: This work develops several page placement algorithms, called careful-mapping algorithms, that try to select a page frame from a pool of available page frames that is likely to reduce cache contention.
Abstract: When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by the page replacement algorithm. We give a simple model that shows that this naive (arbitrary) page placement leads to up to 30% unnecessary cache conflicts. We develop several page placement algorithms, called careful-mapping algorithms, that try to select a page frame (from the pool of available page frames) that is likely to reduce cache contention. Using trace-driven simulation, we find that careful mapping results in 10–20% fewer (dynamic) cache misses than naive mapping (for a direct-mapped real-indexed multimegabyte cache). Thus, our results suggest that careful mapping by the operating system can get about half the cache miss reduction that a cache size (or associativity) doubling can.
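
One careful-mapping policy can be sketched as page coloring: prefer a free frame whose cache color matches the virtual page's color, and fall back to the naive arbitrary choice otherwise. The sketch below assumes a direct-mapped 4 MB cache and 4 KB pages purely for illustration; it is not the paper's exact algorithm.

```python
# Page-coloring sketch: a "color" is the page-sized bin of a direct-mapped,
# real-indexed cache that a page frame maps into. Matching virtual-page and
# frame colors avoids the unnecessary conflicts of naive placement.

CACHE_BYTES = 4 * 1024 * 1024     # assumed direct-mapped multimegabyte cache
PAGE_BYTES  = 4 * 1024
NUM_COLORS  = CACHE_BYTES // PAGE_BYTES

def color(page_number):
    return page_number % NUM_COLORS

def careful_map(virtual_page, free_frames):
    """Pick a free frame with the same cache color as the virtual page if
    possible; otherwise fall back to an arbitrary (naive) choice."""
    want = color(virtual_page)
    for frame in free_frames:
        if color(frame) == want:
            free_frames.remove(frame)
            return frame
    return free_frames.pop()       # naive fallback

free = [6, 1029, 2048, 3073]
print(careful_map(virtual_page=7173, free_frames=free))   # 1029: shares color 5 with page 7173
```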

289 citations


Journal ArticleDOI
10 Dec 1992
TL;DR: A new RISC system architecture called a Compressed Code RISC Processor is presented, which depends on a code-expanding instruction cache to manage compressed programs.
Abstract: The difference in code size between RISC and CISC processors appears to be a significant factor limiting the use of RISC architectures in embedded systems. Fortunately, RISC programs can be effectively compressed. An ideal solution is to design a RISC system that can directly execute compressed programs. A new RISC system architecture called a Compressed Code RISC Processor is presented. This processor depends on a code-expanding instruction cache to manage compressed programs. The compression is transparent to the processor since all instructions are executed from cache. Experimental simulations show that a significant degree of compression can be achieved from a fixed encoding scheme. The impact on system performance is slight and for some memory implementations the reduced memory bandwidth actually increases performance.

288 citations


Patent
20 Oct 1992
TL;DR: In this paper, the authors proposed selective multiple sector erase, in which any combinations of Flash sectors may be erased together, and select sectors among the selected combination may also be de-selected during the erase operation.
Abstract: A system of Flash EEprom memory chips with controlling circuits serves as non-volatile memory such as that provided by magnetic disk drives. Improvements include selective multiple sector erase, in which any combinations of Flash sectors may be erased together. Selective sectors among the selected combination may also be de-selected during the erase operation. Another improvement is the ability to remap and replace defective cells with substitute cells. The remapping is performed automatically as soon as a defective cell is detected. When the number of defects in a Flash sector becomes large, the whole sector is remapped. Yet another improvement is the use of a write cache to reduce the number of writes to the Flash EEprom memory, thereby minimizing the stress to the device from undergoing too many write/erase cycling.

273 citations


Proceedings ArticleDOI
01 Sep 1992
TL;DR: The trace-driven simulation and analysis of two uses of NVRAM to improve I/O performance in distributed file systems are presented: non-volatile file caches on client workstations to reduce write traffic to file servers, and write buffers for write-optimized file systems to reduce server disk accesses.
Abstract: Given the decreasing cost of non-volatile RAM (NVRAM), by the late 1990s it will be feasible for most workstations to include a megabyte or more of NVRAM, enabling the design of higher-performance, more reliable systems. We present the trace-driven simulation and analysis of two uses of NVRAM to improve I/O performance in distributed file systems: non-volatile file caches on client workstations to reduce write traffic to file servers, and write buffers for write-optimized file systems to reduce server disk accesses. Our results show that a megabyte of NVRAM on diskless clients reduces the amount of file data written to the server by 40 to 50%. Increasing the amount of NVRAM shows rapidly diminishing returns, and the particular NVRAM block replacement policy makes little difference to write traffic. Closely integrating the NVRAM with the volatile cache provides the best total traffic reduction. At today’s prices, volatile memory provides a better performance improvement per dollar than NVRAM for client caching, but as volatile cache sizes increase and NVRAM becomes cheaper, NVRAM will become cost effective. On the server side, providing a one-half megabyte write-buffer per file system reduces disk accesses by about 20% on most of the measured log-structured file systems (LFS), and by 90% on one heavily-used file system that includes transaction-processing workloads.

251 citations


Patent
23 Dec 1992
TL;DR: In this article, a computer cluster architecture including a plurality of CPUs at each of a plurality-of- nodes is described, where each CPU has the property of coherency and includes a primary cache.
Abstract: A computer cluster architecture including a plurality of CPUs at each of a plurality of nodes. Each CPU has the property of coherency and includes a primary cache. A local bus at each node couples: all the local caches, a local main memory having physical space assignable as shared space and non-shared space and a local external coherency unit (ECU). An inter-node communication bus couples all the ECUs. Each ECU includes a monitoring section for monitoring the local and inter-node busses and a coherency section for a) responding to a non-shared cache-line request appearing on the local bus by directing the request to the non-shared space of the local memory and b) responding to a shared cache-line request appearing on the local bus by examining its coherence state to further determine if inter-node action is required to service the request and, if such action is required, transmitting a unique identifier and a coherency command to all the other ECUs. Each unit of information present in the shared space of the local memory is assigned, by the local ECU, a coherency state which may be: 1) exclusive (the local copy of the requested information is unique in the cluster); 2) modified (the local copy has been updated by a CPU in the same node); 3) invalid (a local copy either does not exist or is known to be out-of-date); or 4) shared (the local copy is one of a plurality of current copies present in a plurality of nodes).

Proceedings ArticleDOI
01 Sep 1992
TL;DR: A hybrid design based on the combination of non-blocking and prefetching caches is proposed, which is found to be very effective in reducing the memory latency penalty for many applications.
Abstract: Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimization to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.

Journal ArticleDOI
Harold S. Stone1, John Turek1, Joel L. Wolf1
TL;DR: An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described and generalizes to multilevel cache memories.
Abstract: A model for studying the optimal allocation of cache memory among two or more competing processes is developed and used to show that, for the examples studied, the least recently used (LRU) replacement strategy produces cache allocations that are very close to optimal. It is also shown that when program behavior changes, LRU replacement moves quickly toward the steady-state allocation if it is far from optimal, but converges slowly as the allocation approaches the steady-state allocation. An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described. The algorithm generalizes to multilevel cache memories. For multiprogrammed systems, a cache-replacement policy better than LRU replacement is given. The policy increases the memory available to the running process until the allocation reaches a threshold time beyond which the replacement policy does not increase the cache memory allocated to the running process.
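
The flavor of allocating cache among competing processes can be illustrated with a greedy marginal-gain allocator, which reaches the optimal partition when each process's hit curve shows diminishing returns. This is not the paper's combinatorial algorithm, and the curves below are invented for the example.

```python
# Illustrative sketch: give each successive unit of cache to the process whose
# hit count improves the most, assuming concave (diminishing-returns) curves.

def greedy_partition(hit_curves, total_units):
    """hit_curves[p][k] = expected hits for process p with k cache units."""
    alloc = [0] * len(hit_curves)
    for _ in range(total_units):
        gains = [curve[a + 1] - curve[a] for curve, a in zip(hit_curves, alloc)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc

# Two hypothetical processes sharing 4 cache units.
p0 = [0, 50, 80, 95, 100]      # diminishing returns
p1 = [0, 30, 55, 75, 90]
print(greedy_partition([p0, p1], total_units=4))   # [2, 2]
```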

Proceedings ArticleDOI
01 Jun 1992
TL;DR: MemSpy is described, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs and introduces the notion of data oriented, in addition to code oriented, performance tuning.
Abstract: To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior—if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.

Patent
02 Mar 1992
TL;DR: An improved branch prediction cache (BPC) as mentioned in this paper utilizes a hybrid cache structure that provides two levels of branch information caching, a shallow but wide structure (36 32-byte entries), which caches full prediction information for a limited number of branch instructions.
Abstract: An improved branch prediction cache (BPC) scheme that utilizes a hybrid cache structure. The BPC provides two levels of branch information caching. The fully associative first level BPC is a shallow but wide structure (36 32-byte entries), which caches full prediction information for a limited number of branch instructions. The second direct mapped level BPC is a deep but narrow structure (256 2-byte entries), which caches only partial prediction information, but does so for a much larger number of branch instructions. As each branch instruction is fetched and decoded, its address is used to perform parallel look-ups in the two branch prediction caches.
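
A minimal sketch of the two-level lookup described above: a small fully associative first level holding full prediction state, backed by a larger direct-mapped second level holding only partial state, both probed with the branch address. The entry counts follow the numbers quoted above; the entry contents and replacement policy are simplified assumptions.

```python
# Sketch of a two-level branch prediction cache (BPC): L1 is small, fully
# associative, and stores full prediction info; L2 is larger, direct-mapped,
# and stores only partial info for many more branches.

class TwoLevelBPC:
    L1_ENTRIES = 36
    L2_ENTRIES = 256

    def __init__(self):
        self.l1 = {}                            # branch address -> full prediction info
        self.l2 = [None] * self.L2_ENTRIES      # index -> (tag, partial info)

    def lookup(self, branch_addr):
        """Hardware probes both levels in parallel; modeled sequentially here."""
        full = self.l1.get(branch_addr)
        slot = self.l2[branch_addr % self.L2_ENTRIES]
        partial = slot[1] if slot and slot[0] == branch_addr else None
        return full, partial

    def update(self, branch_addr, full_info, partial_info):
        if len(self.l1) >= self.L1_ENTRIES:
            self.l1.pop(next(iter(self.l1)))    # simplistic replacement stand-in
        self.l1[branch_addr] = full_info
        self.l2[branch_addr % self.L2_ENTRIES] = (branch_addr, partial_info)

bpc = TwoLevelBPC()
bpc.update(0x4010, full_info={"target": 0x4200, "taken": True}, partial_info="taken")
print(bpc.lookup(0x4010))     # full info from L1, partial info from L2
```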

Proceedings ArticleDOI
04 May 1992
TL;DR: The author discusses how this channel can be closed and the performance effects of closing the channel, and the lattice scheduler is introduced, and its use in closing the cache channel is demonstrated.
Abstract: The lattice scheduler is a process scheduler that reduces the performance penalty of certain covert-channel countermeasures by scheduling processes using access class attributes. The lattice scheduler was developed as part of the covert-channel analysis of the VAX security kernel. The VAX security kernel is a virtual-machine monitor security kernel for the VAX architecture designed to meet the requirements of the A1 rating from the US National Computer Security Center. After describing the cache channel, a description is given of how this channel can be exploited using the VAX security kernel as an example. The author discusses how this channel can be closed and the performance effects of closing the channel. The lattice scheduler is introduced, and its use in closing the cache channel is demonstrated. Finally, the work illustrates the operation of the lattice scheduler through an extended example and concludes with a discussion of some variations of the basic scheduling algorithm. >

Patent
30 Mar 1992
TL;DR: In this article, a method for controlling coherence of data elements sharable among a plurality of independently-operating central processing complexes (CPCs) in a multi-system complex (called a parallel sysplex) which contains sysplex DASDs (direct access storage devices) and a high-speed SES (shared electronic storage) facility is presented.
Abstract: A method for controlling coherence of data elements sharable among a plurality of independently-operating CPCs (central processing complexes) in a multi-system complex (called a parallel sysplex) which contains sysplex DASDs (direct access storage devices) and a high-speed SES (shared electronic storage) facility. Sysplex shared data elements are stored in the sysplex DASD under a unique sysplex data element name, which is used for sysplex coherence control. Any CPC may copy any sysplex data element into a local cache buffer (LCB) in the CPC's main storage, where it has an associated sysplex validity bit. The copying CPC executes a sysplex coherence registration command which requests a SES processor to verify that the data element name already exists in the SES cache, and to store the name of the data element in a SES cache entry if found in the SES cache. Importantly, the registration command communicates to SES the CPC location of the validity bit for the LCB containing that data element copy. Each time another copy of the data element is stored in any CPC LCB, a registration command is executed to store the location of that copy's CPC validity bit into a local cache register (LCR) associated with its data element name. In this manner, each LCR accumulates all CPC locations for all LCB validity bits for all valid copies of the associated data element in the sysplex -- for maintaining data coherency throughout the sysplex.
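
A simplified sketch of the registration and cross-invalidation flow: each registration records where a CPC keeps the validity bit for its local copy, and a later write to the element flips every registered bit. The class and method names are illustrative, not the actual SES command set.

```python
# Hedged sketch of SES-style coherence: registrations accumulate validity-bit
# locations per data element name; a write cross-invalidates all registered
# local copies by clearing those bits.

class SESCache:
    def __init__(self):
        self.registers = {}      # data element name -> list of validity-bit refs

    def register(self, name, validity_bit_ref):
        """Record where a CPC keeps the validity bit for its local copy."""
        self.registers.setdefault(name, []).append(validity_bit_ref)

    def write(self, name):
        """On an update, cross-invalidate every registered local copy."""
        for validity_bit in self.registers.pop(name, []):
            validity_bit["valid"] = False

ses = SESCache()
cpc1_bit = {"valid": True}
cpc2_bit = {"valid": True}
ses.register("DB.PAGE.42", cpc1_bit)     # CPC 1 caches the element
ses.register("DB.PAGE.42", cpc2_bit)     # CPC 2 caches it too
ses.write("DB.PAGE.42")                  # some CPC updates the element
print(cpc1_bit, cpc2_bit)                # both local copies now marked invalid
```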

Proceedings ArticleDOI
01 Apr 1992
TL;DR: In this article, the authors compare the performance of CC-NUMA and COMA and show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses.
Abstract: Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.

Journal ArticleDOI
TL;DR: An analytical access time model for on-chip cache memories that shows the dependence of the cache access time on the cache parameters is described and it is shown that for given C, B, and A, optimum array configuration parameters can be used to minimize the access time.
Abstract: An analytical access time model for on-chip cache memories that shows the dependence of the cache access time on the cache parameters is described. The model includes general cache parameters, such as cache size (C), block size (B), and associativity (A), and array configuration parameters that are responsible for determining the subarray aspect ratio and the number of subarrays. With this model, a large cache design space can be covered, which cannot be done using only SPICE circuit simulation within a limited time. Using the model, it is shown that for given C, B, and A, optimum array configuration parameters can be used to minimize the access time. When the optimum array parameters are used, the optimum access time is roughly proportional to log(cache size), and larger block sizes give smaller access times, but larger associativity does not give smaller access times because of the increase of the data-bus capacitances.
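
The headline dependence can be illustrated with a toy model of the form t ≈ a + b·log2(C); the constants below are made up, since the paper derives the real coefficients from subarray organization and wire and bit-line loads.

```python
# Toy illustration of the model's result that, with optimal array parameters,
# access time grows roughly with the logarithm of the cache size. The
# coefficients a and b are illustrative assumptions only.

import math

def access_time_ns(cache_bytes, a=1.5, b=0.35):
    """t ~ a + b * log2(C), with C in bytes (illustrative constants)."""
    return a + b * math.log2(cache_bytes)

for kb in (4, 16, 64, 256):
    print(f"{kb:4d} KB -> {access_time_ns(kb * 1024):.2f} ns (model)")
```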

Journal ArticleDOI
TL;DR: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented and it is shown that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.
Abstract: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented. The on-line algorithm extends an earlier model due to H.S. Stone et al. (1989) and partitions the cache storage in disjoint blocks whose sizes are determined by the locality of the processes accessing the cache. Simulation results of traces for 32-MB disk caches show a relative improvement in the overall and read hit-ratios in the range of 1% to 2% over those generated by a conventional least recently used replacement algorithm. The analysis of a queuing network model shows that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.

Patent
05 Jun 1992
TL;DR: In this article, the posted write cache is further mirrored and parity-checked to assure data validity, and a mirror test is executed to verify data validity; posted write operations are only enabled when error-free operation is assured.
Abstract: A host computer including a posted write cache for a disk drive system where the posted write cache includes battery backup to protect against potential loss of data in case of a power failure, and also including means for performing a method for determining if live data is present in the posted write cache upon power-up. The posted write cache is further mirrored and parity-checked to assure data validity. Performance increase is achieved since during normal operation data is written to the much faster cache and a completion indication is returned, and the data is flushed to the slower disk drive system at a more opportune time. Batteries provide power to the posted write cache in the event of a power failure. Upon subsequent power-up, a cache signature previously written in the posted write cache indicates that live data still resides in the posted write cache. If the cache signature is not present and the batteries are not fully discharged, a normal power up condition exists. If the cache signature is not present and the batteries are fully discharged, then the user is warned of possible data loss. A configuration identification code assures a proper correspondence between the posted write cache board and the disk drive system. A mirror test is executed to verify data validity. Temporary and permanent error conditions are monitored so that posted write operations are only enabled when error-free operation is assured.

Patent
30 Mar 1992
TL;DR: In this article, a high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex, where each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data.
Abstract: A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data. Management logic and the high-speed cache assure that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache, with non-blocking serialization allowing access to a changed version in the cache while castout is being performed. Castout classes are provided to facilitate efficient movement from the shared cache to DASD.

Patent
Richard Lewis Mattson1
04 Sep 1992
TL;DR: In this article, a method and means for dynamically partitioning an LRU cache partitioned into a global cache storing referenced objects of k different data types and k local caches storing objects of a single type is described.
Abstract: A method and means is disclosed for dynamically partitioning an LRU cache partitioned into a global cache storing referenced objects of k different data types and k local caches each storing objects of a single type. Referenced objects are stored in the MRU position of the global cache and overflow is managed by destaging the LRU object from the global to the local cache having the same data type. Dynamic partitioning is accomplished by recursively creating and maintaining from a trace of objects an LRU list of referenced objects and associated data structures for each subcache, creating and maintaining a multi-planar array of partition distribution data from the lists and the trace as a collection of all possible maximum and minimum subcache sizings, optimally resizing the subcache partitions by applying a dynamic programming heuristic to the multi-planar array, and readjusting the partitions accordingly.
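
The global-to-local destage rule can be sketched as follows; the partition sizes and type names are illustrative, and the dynamic-programming resizing step is omitted.

```python
# Sketch of the destage rule: referenced objects enter the MRU end of a shared
# global cache; when the global cache overflows, its LRU object is destaged
# into the local cache for that object's data type.

from collections import OrderedDict

class PartitionedCache:
    def __init__(self, global_size, local_sizes):
        self.global_size = global_size
        self.global_cache = OrderedDict()              # key -> data type
        self.local = {t: OrderedDict() for t in local_sizes}
        self.local_sizes = local_sizes

    def reference(self, key, dtype):
        self.global_cache.pop(key, None)
        self.local[dtype].pop(key, None)
        self.global_cache[key] = dtype                 # MRU position of the global cache
        if len(self.global_cache) > self.global_size:
            victim, vtype = self.global_cache.popitem(last=False)   # global LRU
            self.local[vtype][victim] = vtype          # destage to matching local cache
            if len(self.local[vtype]) > self.local_sizes[vtype]:
                self.local[vtype].popitem(last=False)  # evict local LRU

c = PartitionedCache(global_size=2, local_sizes={"index": 2, "data": 2})
for key, t in [("i1", "index"), ("d1", "data"), ("d2", "data")]:
    c.reference(key, t)
print(list(c.global_cache), dict(c.local["index"]))    # ['d1', 'd2'] {'i1': 'index'}
```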

Patent
15 Jul 1992
TL;DR: In this paper, an error transition mode (ETM) is used to prevent cache data not owned in the cache from being accessed by the main memory after the first write request while permitting writeback of the data owned by the cache.
Abstract: Writeback transactions from a processor and cache are fed to a main memory through a writeback queue, and non-writeback transactions from the processor and cache are fed to the main memory through a non-writeback queue. When a cache error is detected, an error transition mode (ETM) is entered that provides limited use of the data in the cache; a read or write request for data not owned in the cache is made to the main memory instead of the cache, even when the data is valid in the cache, although owned data is read from the cache. In ETM, when the processor makes a first write request to data not owned in the cache followed by a second write request to data owned in the cache, write data of the first write request is prevented from being received by the main memory after write data of the second request while permitting writeback of the data owned by the cache. Preferably this is done by sending the write requests from the processor through the non-writeback queue, and when a write request accesses data in a block of data owned by the cache, disowning the block of data in the cache and writing the disowned block of data back to the main memory.

Patent
28 May 1992
TL;DR: In this article, a dynamic determination is made on a cycle-by-cycle basis of whether data should be written to the cache with a dirty bit asserted, or the data to both the cache and main memory.
Abstract: Method and apparatus for reducing the access time required to write to memory and read from memory in a computer system having a cache-based memory. A dynamic determination is made on a cycle by cycle basis of whether data should be written to the cache with a dirty bit asserted, or the data should be written to both the cache and main memory. The write-through method is chosen where the write-through method is approximately as fast as the write-back method. Where the write-back method is substantially faster than the write-through method, the write-back method is chosen.

Patent
14 Dec 1992
TL;DR: In this article, a battery powered computer system determines when the system is not in use by monitoring various events associated with the operation of the system, such as cache read misses and write operations.
Abstract: A battery powered computer system determines when the system is not in use by monitoring various events associated with the operation of the system. The system preferably monitors the number of cache read misses and write operations, i.e., the cache hit rate, and reduces the system clock frequency when the cache hit rate rises above a certain level. When the cache hit rate is above a certain level, then it can be assumed that the processor is executing a tight loop, such as when the processor is waiting for a key to be pressed and then the frequency can be reduced without affecting system performance. Alternatively, the apparatus monitors the occurrence of memory page misses, I/O write cycles or other events to determine the level of activity of the computer system.
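
The monitoring idea can be sketched as a per-interval decision on the clock frequency; the hit-rate threshold, the treatment of writes as non-hits, and the two frequencies are assumptions for illustration, not values from the patent.

```python
# Illustrative sketch: count cache read misses and writes over an interval and
# drop the clock frequency when the hit rate stays very high, a sign the CPU
# is spinning in a tight idle loop (e.g. polling for a keypress).

HIGH_HIT_RATE = 0.98
FULL_MHZ, SLOW_MHZ = 33, 8

def choose_clock(reads, read_misses, writes):
    """Return the clock frequency (MHz) for the next interval."""
    accesses = reads + writes
    if accesses == 0:
        return SLOW_MHZ                      # nothing happening at all
    hit_rate = 1.0 - (read_misses + writes) / accesses
    return SLOW_MHZ if hit_rate >= HIGH_HIT_RATE else FULL_MHZ

print(choose_clock(reads=10_000, read_misses=5, writes=20))      # 8 (idle loop)
print(choose_clock(reads=10_000, read_misses=900, writes=1500))  # 33 (real work)
```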

Journal ArticleDOI
TL;DR: By comparing synthetic traces with real traces of identical locality parameters, it is demonstrated that synthetic traces exhibit miss ratios and lifetime functions that compare well with those of the real traces they mimic, both in fully associative and in set-associative memories.
Abstract: Two techniques for producing synthetic address traces that produce good emulations of the locality of reference of real programs are presented. The first algorithm generates synthetic addresses by simulating a random walk in an infinite address-space with references governed by a hyperbolic probability law. The second algorithm is a refinement of the first in which the address space has a given finite size. The basic model for the random walk has two parameters that correspond to the working set size and the locality of reference. By comparing synthetic traces with real traces of identical locality parameters, it is demonstrated that synthetic traces exhibit miss ratios and lifetime functions that compare well with those of the real traces they mimic, both in fully associative and in set-associative memories.
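
A minimal sketch of the first algorithm: a random walk whose jump distances follow a heavy-tailed (roughly 1/d^theta) law, so most references stay within the current locality while occasional long jumps move it. The parameter values below are illustrative.

```python
# Sketch of synthetic trace generation by a random walk with a hyperbolic
# jump-distance law; theta and max_jump are illustrative parameters, not the
# paper's calibrated locality/working-set values.

import itertools
import random

def synthetic_trace(length, theta=2.0, max_jump=1 << 16, seed=1):
    rng = random.Random(seed)
    distances = list(range(1, max_jump + 1))
    cum = list(itertools.accumulate(d ** -theta for d in distances))  # hyperbolic law
    addr, trace = 0, []
    for _ in range(length):
        step = rng.choices(distances, cum_weights=cum)[0]
        addr += step if rng.random() < 0.5 else -step   # random walk direction
        trace.append(addr)
    return trace

print(synthetic_trace(10))
```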

01 Sep 1992
TL;DR: Using two separate benchmark suites, the SPEC benchmarks and the Perfect Club, and concentrating on multiplication, a surprising amount of trivial and redundant operation is found.
Abstract: This report introduces the notion of trivial computation, where the appearance of simple operands reduces the complexity of a potentially difficult operation. An example of a trivial operation is integer divide-by-two; the division becomes a simple shift operation. Also discussed is the concept of redundant computation, where some operation repeatedly does the same function because it repeatedly sees the same operands. Using two separate benchmark suites, the SPEC benchmarks and the Perfect Club, and concentrating on multiplication, we find a surprising amount of trivial and redundant operation. Various architectural means of exploiting this knowledge to improve computational efficiency include detection of trivial operands, memoization, and the result cache.
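
Both ideas are easy to sketch for multiplication: check for trivial operands (zero, one, powers of two) first, and otherwise consult a result cache keyed by the operand pair. The unbounded dict used as the result cache is purely for illustration.

```python
# Sketch of trivial-operand detection plus a result cache (memo table) for
# multiplication: trivial cases bypass the multiplier, repeated operand pairs
# hit the result cache instead of recomputing.

def trivial_multiply(a, b):
    """Return the product if it can be formed trivially, else None."""
    for x, y in ((a, b), (b, a)):
        if x == 0:
            return 0
        if x == 1:
            return y
        if x > 0 and (x & (x - 1)) == 0:           # power of two: shift instead
            return y << (x.bit_length() - 1)
    return None

_result_cache = {}                                  # (a, b) -> product

def multiply(a, b):
    quick = trivial_multiply(a, b)
    if quick is not None:
        return quick
    key = (a, b)
    if key not in _result_cache:                    # redundant-computation check
        _result_cache[key] = a * b                  # stands in for the real multiplier
    return _result_cache[key]

print(multiply(7, 8), multiply(13, 13), multiply(13, 13))   # 56 169 169 (second 13*13 hits the cache)
```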

Patent
21 Jan 1992
TL;DR: A simple mixed first level cache memory system as mentioned in this paper includes a level 1 cache (52) connected to a processor (54) by read data and write data lines (56) and (58).
Abstract: A simple mixed first level cache memory system (50) includes a level 1 cache (52) connected to a processor (54) by read data and write data lines (56) and (58). The level 1 cache (52) is connected to the level 2 cache (60) by swap tag lines (62) and (64), swap data lines (66) and (68), multiplexer (70) and swap/read line (72). The level 2 cache (60) is connected to the next lower level in the memory hierarchy by write tag and write data lines (74) and (76). The next lower level in the memory hierarchy below the level 2 cache (60) is also connected by a read data line (78) through the multiplexer (70) and the swap/read line (72) to the level 1 cache (52). When processor (54) requires an instruction or data, it puts out an address on lines (80). If the instruction or data is present in the level 1 cache (52), it is supplied to the processor (54) on read data line (56). If the instruction or data is not present in the level 1 cache (52), the processor looks for it in the level 2 cache (60) by putting out the address of the instruction or data on lines (80). If the instruction or data is in the level 2 cache, it is supplied to the processor (54) through the level 1 cache (52) by means of a swap operation on tag swap lines (62) and (64), swap data lines (66) and (68), multiplexer (70) and swap/read data line (72). If the instruction or data is present in neither the level 1 cache (52) nor the level 2 cache (60), the address on lines (80) fetches the instruction or data from successively lower levels in the memory hierarchy as required via read data line (78), multiplexer (70) and swap/read data line (72). The instruction or data is then supplied from the level 1 cache to the processor (54).

Patent
08 May 1992
TL;DR: In this paper, an apparatus and method for switching the context of state elements of a very fast processor within a clock cycle when a cache miss occurs is presented, which is particularly useful for minimizing the average instruction cycle time for a processor with a main memory access time exceeding 15 processor clock cycles.
Abstract: An apparatus and method are disclosed for switching the context of state elements of a very fast processor within a clock cycle when a cache miss occurs. To date, processors either stay idle or execute instructions out of order when they encounter cache misses. As the speed of processors increases, the penalty for a cache miss becomes heavier. Having multiple copies of state elements on the processor and coupling them to a multiplexer permits the processor to save the context of the current instructions and resume executing new instructions within one clock cycle. The invention disclosed is particularly useful for minimizing the average instruction cycle time for a processor with a main memory access time exceeding 15 processor clock cycles. It is understood that the number of processes whose states are duplicated may easily be a large number n.