
Showing papers on "Cache" published in 1992


01 Jan 1992
TL;DR: The symbolic model checking technique revealed subtle errors in this protocol, resulting from complex execution sequences that would occur with very low probability in random simulation runs, and an alternative method is developed for avoiding the state explosion in the case of asynchronous control circuits.
Abstract: Finite state models of concurrent systems grow exponentially as the number of components of the system increases. This is known widely as the state explosion problem in automatic verification, and has limited finite state verification methods to small systems. To avoid this problem, a method called symbolic model checking is proposed and studied. This method avoids building a state graph by using Boolean formulas to represent sets and relations. A variety of properties characterized by least and greatest fixed points can be verified purely by manipulations of these formulas using Ordered Binary Decision Diagrams. Theoretically, a structural class of sequential circuits is demonstrated whose transition relations can be represented by polynomial space OBDDs, though the number of states is exponential. This result is borne out by experimental results on example circuits and systems. The most complex of these is the cache consistency protocol of a commercial distributed multiprocessor. The symbolic model checking technique revealed subtle errors in this protocol, resulting from complex execution sequences that would occur with very low probability in random simulation runs. In order to model the cache protocol, a language was developed for describing sequential circuits and protocols at various levels of abstraction. This language has a synchronous dataflow semantics, but allows nondeterminism and supports interleaving processes with shared variables. A system called SMV can automatically verify programs in this language with respect to temporal logic formulas, using the symbolic model checking technique. A technique for proving properties of inductively generated classes of finite state systems is also developed. The proof is checked automatically, but requires a user supplied process called a process invariant to act as an inductive hypothesis. An invariant is developed for the distributed cache protocol, allowing properties of systems with an arbitrary number of processors to be proved. Finally, an alternative method is developed for avoiding the state explosion in the case of asynchronous control circuits. This technique is based on the unfolding of Petri nets, and is used to check for hazards in a distributed mutual exclusion circuit.
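
The core of the symbolic approach is an iterative fixed-point computation over sets of states. The sketch below illustrates that iteration with plain Python sets and a made-up three-state transition relation; the thesis represents the sets and the transition relation as OBDDs, which this toy deliberately does not attempt.

```python
# Minimal sketch of the fixed-point computation at the heart of symbolic model
# checking. State sets are plain Python sets purely for illustration (the real
# method manipulates OBDDs), and the tiny transition system is hypothetical.

def reachable(initial, transitions):
    """Least fixed point: smallest set containing `initial` and closed under
    the transition relation (a set of (src, dst) pairs)."""
    frontier = set(initial)
    reached = set(initial)
    while frontier:
        image = {d for (s, d) in transitions if s in frontier}  # successor states
        frontier = image - reached
        reached |= frontier
    return reached

if __name__ == "__main__":
    init = {"s0"}
    trans = {("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("s3", "s3")}
    print(reachable(init, trans))   # {'s0', 's1', 's2'}; s3 is unreachable
```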

1,209 citations


Journal ArticleDOI
TL;DR: The directory architecture for shared memory (Dash) as discussed by the authors allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance; a distributed directory-based protocol provides cache coherence without compromising scalability.
Abstract: The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described.

961 citations


Proceedings ArticleDOI
01 Sep 1992
TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.
Abstract: Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today’s high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits. This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices. Our algorithm identifies those references that are likely to be cache misses, and issues prefetches only for them. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. By generating fully functional code, we have been able to measure not only the improvements in cache miss rates, but also the overall performance of a simulated system. We show that our algorithm significantly improves the execution speed of our benchmark programs; some of the programs improve by as much as a factor of two. When compared to an algorithm that indiscriminately prefetches all array accesses, our algorithm can eliminate many of the unnecessary prefetches without any significant decrease in the coverage of the cache misses.
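
The key selection step is deciding which references are likely misses so that prefetches are issued only for those. The sketch below illustrates one such rule for a unit-stride loop (prefetch once per cache line, a fixed number of iterations ahead); the line size, element size, and prefetch distance are illustrative assumptions, not values from the paper.

```python
# Rough sketch of a prefetch-selection rule for a unit-stride dense-matrix
# loop: only the first reference to each cache line is a likely miss, so a
# prefetch is issued once per line, DISTANCE iterations ahead of its use.

LINE_BYTES = 32      # assumed cache line size
ELEM_BYTES = 8       # assumed double-precision elements
DISTANCE   = 8       # assumed prefetch distance (iterations ahead)

def prefetch_plan(n_elements):
    """Return (iteration, element) pairs at which a prefetch for a[i + DISTANCE]
    would be inserted, one per cache line."""
    elems_per_line = LINE_BYTES // ELEM_BYTES
    plan = []
    for i in range(n_elements):
        target = i + DISTANCE
        if target < n_elements and target % elems_per_line == 0:
            plan.append((i, target))
    return plan

print(prefetch_plan(32)[:4])   # [(0, 8), (4, 12), (8, 16), (12, 20)]
```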

789 citations


Patent
21 Jul 1992
TL;DR: In this article, a distributed computer system has a trusted computing base that includes an authentication agent for authenticating requests received from principals at other nodes in the system, and the server process is provided with a local cache of authentication data that identifies requesters whose previous request messages have been authenticated.
Abstract: A distributed computer system has a number of computers coupled thereto at distinct nodes. The computer at each node of the distributed system has a trusted computing base that includes an authentication agent for authenticating requests received from principals at other nodes in the system. Requests are transmitted to servers as messages that include a first identifier provided by the requester and a second identifier provided by the authentication agent of the requester node. Each server process is provided with a local cache of authentication data that identifies requesters whose previous request messages have been authenticated. When a request is received, the server checks the request's first and second identifiers against the entries in its local cache. If there is a match, then the request is known to be authentic. Otherwise, the server node's authentication agent is called to obtain authentication credentials from the requester's node to authenticate the request message. The principal identifier of the requester and the received credentials are stored in a local cache by the server node's authentication agent. The server process also stores a record in its local cache indicating that request messages from the specified requester are known to be authentic, thereby expediting the process of authenticating received requests.
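
A minimal sketch of the server-side fast path described above: a hit on the (first identifier, second identifier) pair in the local cache accepts the request immediately, while a miss falls back to the node's authentication agent. The class name and the fetch_credentials callback are hypothetical, not the patent's interfaces.

```python
# Hedged sketch of a server-side authentication cache: a (first_id, second_id)
# hit means the requester's earlier messages were already authenticated, so
# the slow path to the authentication agent is skipped.

class AuthCache:
    def __init__(self, fetch_credentials):
        self._known = {}                      # (first_id, second_id) -> credentials
        self._fetch = fetch_credentials       # stands in for the node's authentication agent

    def authenticate(self, first_id, second_id):
        key = (first_id, second_id)
        if key in self._known:                # fast path: previously authenticated
            return True
        creds = self._fetch(first_id)         # slow path: contact requester's node
        if creds is None:
            return False
        self._known[key] = creds              # remember, expediting later requests
        return True

cache = AuthCache(lambda requester: {"principal": requester})   # stub agent
print(cache.authenticate("alice@nodeA", "agentA-17"))           # True (slow path)
print(cache.authenticate("alice@nodeA", "agentA-17"))           # True (cache hit)
```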

382 citations


Book ChapterDOI
01 Jan 1992
TL;DR: As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes that use a limited number of pointers per directory entry to keep track of all processors caching a memory block.
Abstract: As multiprocessors are scaled beyond single bus systems, there is renewed interest in directory-based cache coherence schemes. These schemes rely on a directory to keep track of all processors caching a memory block. When a write to that block occurs, point-to-point invalidation messages are sent to keep the caches coherent. A straightforward way of recording the identities of processors caching a memory block is to use a bit vector per memory block, with one bit per processor. Unfortunately, when the main memory grows linearly with the number of processors, the total size of the directory memory grows as the square of the number of processors, which is prohibitive for large machines. To remedy this problem several schemes that use a limited number of pointers per directory entry have been suggested. These schemes often cause excessive invalidation traffic.
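
A quick back-of-the-envelope comparison of the two directory organizations mentioned above is sketched below, assuming 64-byte memory blocks and a four-pointer limited-pointer scheme; both parameter choices are illustrative.

```python
# Directory storage overhead per memory block: full bit-vector (one presence
# bit per processor) versus a limited-pointer scheme with i pointers of
# ceil(log2 P) bits each. Block size and pointer count are assumptions.

import math

BLOCK_BITS = 64 * 8          # assumed 64-byte memory blocks

def full_map_overhead(p):
    """Directory bits per block divided by block bits, for P processors."""
    return p / BLOCK_BITS

def limited_ptr_overhead(p, i=4):
    """Same ratio for a limited-pointer scheme with i pointers."""
    return i * math.ceil(math.log2(p)) / BLOCK_BITS

for p in (16, 256, 1024):
    print(p, round(full_map_overhead(p), 3), round(limited_ptr_overhead(p), 3))
# Full-map overhead grows linearly with P (and total directory memory as P^2
# when main memory also scales with P); limited pointers grow only with log P.
```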

321 citations


Journal ArticleDOI
10 Dec 1992
TL;DR: The results using selected programs from the PERFECT and SPEC benchmarks show that stride directed prefetching on a scalar processor can significantly reduce the cache miss rate of particular programs and that an SPT needs only a small number of entries to be effective.
Abstract: The execution of numerically intensive programs presents a challenge to memory system designers. Numerical program execution can be accelerated by pipelined arithmetic units, but, to be effective, these must be supported by high speed memory access. A cache memory is a well known hardware mechanism used to reduce the average memory access latency. Numerical programs, however, often have poor cache performance. Stride directed prefetching has been proposed to improve the cache performance of numerical programs executing on a vector processor. This paper shows how this approach can be extended to a scalar processor by using a simple hardware mechanism, called a stride prediction table (SPT), to calculate the stride distances of array accesses made from within the loop body of a program. The results using selected programs from the PERFECT and SPEC benchmarks show that stride directed prefetching on a scalar processor can significantly reduce the cache miss rate of particular programs and that an SPT needs only a small number of entries to be effective.
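
A minimal sketch of an SPT along the lines described above: entries are indexed by the load's program counter, record the last address and stride, and predict a prefetch address once the stride repeats. The table size and trigger policy are illustrative assumptions.

```python
# Sketch of a stride prediction table (SPT): each entry remembers the last
# address a given load issued and the observed stride; when the stride
# repeats, the next address (last + stride) is predicted for prefetching.

class StridePredictionTable:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}          # slot -> (last_addr, stride)

    def access(self, pc, addr):
        """Record a load; return a predicted prefetch address or None."""
        slot = pc % self.entries
        prev = self.table.get(slot)
        prediction = None
        if prev is not None:
            last_addr, stride = prev
            new_stride = addr - last_addr
            if new_stride == stride and stride != 0:
                prediction = addr + stride      # stable stride: prefetch ahead
            self.table[slot] = (addr, new_stride)
        else:
            self.table[slot] = (addr, 0)
        return prediction

spt = StridePredictionTable()
for a in (1000, 1008, 1016, 1024):              # unit-stride array of doubles
    print(spt.access(pc=0x40, addr=a))          # None, None, 1024, 1032
```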

309 citations


Journal ArticleDOI
TL;DR: This work develops several page placement algorithms, called careful-mapping algorithms, that try to select a page frame from a pool of available page frames that is likely to reduce cache contention.
Abstract: When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by the page replacement algorithm. We give a simple model that shows that this naive (arbitrary) page placement leads to up to 30% unnecessary cache conflicts. We develop several page placement algorithms, called careful-mapping algorithms, that try to select a page frame (from the pool of available page frames) that is likely to reduce cache contention. Using trace-driven simulation, we find that careful mapping results in 10–20% fewer (dynamic) cache misses than naive mapping (for a direct-mapped real-indexed multimegabyte cache). Thus, our results suggest that careful mapping by the operating system can get about half the cache miss reduction that a cache size (or associativity) doubling can.
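
One careful-mapping policy can be sketched as page coloring: prefer a free frame whose cache color matches the virtual page's color, and fall back to the naive arbitrary choice otherwise. The sketch below assumes a direct-mapped 4 MB cache and 4 KB pages purely for illustration; it is not the paper's exact algorithm.

```python
# Page-coloring sketch: a "color" is the page-sized bin of a direct-mapped,
# real-indexed cache that a page frame maps into. Matching virtual-page and
# frame colors avoids the unnecessary conflicts of naive placement.

CACHE_BYTES = 4 * 1024 * 1024     # assumed direct-mapped multimegabyte cache
PAGE_BYTES  = 4 * 1024
NUM_COLORS  = CACHE_BYTES // PAGE_BYTES

def color(page_number):
    return page_number % NUM_COLORS

def careful_map(virtual_page, free_frames):
    """Pick a free frame with the same cache color as the virtual page if
    possible; otherwise fall back to an arbitrary (naive) choice."""
    want = color(virtual_page)
    for frame in free_frames:
        if color(frame) == want:
            free_frames.remove(frame)
            return frame
    return free_frames.pop()       # naive fallback

free = [6, 1029, 2048, 3073]
print(careful_map(virtual_page=7173, free_frames=free))   # 1029: shares color 5 with page 7173
```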

289 citations


Journal ArticleDOI
10 Dec 1992
TL;DR: A new RISC system architecture called a Compressed Code RISC Processor is presented, which depends on a code-expanding instruction cache to manage compressed programs.
Abstract: The difference in code size between RISC and CISC processors appears to be a significant factor limiting the use of RISC architectures in embedded systems. Fortunately, RISC programs can be effectively compressed. An ideal solution is to design a RISC system that can directly execute compressed programs. A new RISC system architecture called a Compressed Code RISC Processor is presented. This processor depends on a code-expanding instruction cache to manage compressed programs. The compression is transparent to the processor since all instructions are executed from cache. Experimental simulations show that a significant degree of compression can be achieved from a fixed encoding scheme. The impact on system performance is slight and for some memory implementations the reduced memory bandwidth actually increases performance.

288 citations


Patent
20 Oct 1992
TL;DR: In this paper, the authors proposed selective multiple sector erase, in which any combinations of Flash sectors may be erased together, and select sectors among the selected combination may also be de-selected during the erase operation.
Abstract: A system of Flash EEprom memory chips with controlling circuits serves as non-volatile memory such as that provided by magnetic disk drives. Improvements include selective multiple sector erase, in which any combinations of Flash sectors may be erased together. Selective sectors among the selected combination may also be de-selected during the erase operation. Another improvement is the ability to remap and replace defective cells with substitute cells. The remapping is performed automatically as soon as a defective cell is detected. When the number of defects in a Flash sector becomes large, the whole sector is remapped. Yet another improvement is the use of a write cache to reduce the number of writes to the Flash EEprom memory, thereby minimizing the stress to the device from undergoing too many write/erase cycling.

273 citations


Proceedings ArticleDOI
01 Sep 1992
TL;DR: The trace-driven simulation and analysis of two uses of NVRAM to improve I/O performance in distributed file systems are presented: non-volatile file caches on client workstations to reduce write traffic to file servers, and write buffers for write-optimized file systems to reduce server disk accesses.
Abstract: Given the decreasing cost of non-volatile RAM (NVRAM), by the late 1990s it will be feasible for most workstations to include a megabyte or more of NVRAM, enabling the design of higher-performance, more reliable systems. We present the trace-driven simulation and analysis of two uses of NVRAM to improve I/O performance in distributed file systems: non-volatile file caches on client workstations to reduce write traffic to file servers, and write buffers for write-optimized file systems to reduce server disk accesses. Our results show that a megabyte of NVRAM on diskless clients reduces the amount of file data written to the server by 40 to 50%. Increasing the amount of NVRAM shows rapidly diminishing returns, and the particular NVRAM block replacement policy makes little difference to write traffic. Closely integrating the NVRAM with the volatile cache provides the best total traffic reduction. At today’s prices, volatile memory provides a better performance improvement per dollar than NVRAM for client caching, but as volatile cache sizes increase and NVRAM becomes cheaper, NVRAM will become cost effective. On the server side, providing a one-half megabyte write-buffer per file system reduces disk accesses by about 20% on most of the measured log-structured file systems (LFS), and by 90% on one heavily-used file system that includes transaction-processing workloads.

251 citations


Patent
23 Dec 1992
TL;DR: In this article, a computer cluster architecture including a plurality of CPUs at each of a plurality-of- nodes is described, where each CPU has the property of coherency and includes a primary cache.
Abstract: A computer cluster architecture including a plurality of CPUs at each of a plurality of nodes. Each CPU has the property of coherency and includes a primary cache. A local bus at each node couples: all the local caches, a local main memory having physical space assignable as shared space and non-shared space and a local external coherency unit (ECU). An inter-node communication bus couples all the ECUs. Each ECU includes a monitoring section for monitoring the local and inter-node busses and a coherency section for a) responding to a non-shared cache-line request appearing on the local bus by directing the request to the non-shared space of the local memory and b) responding to a shared cache-line request appearing on the local bus by examining its coherence state to further determine if inter-node action is required to service the request and, if such action is required, transmitting a unique identifier and a coherency command to all the other ECUs. Each unit of information present in the shared space of the local memory is assigned, by the local ECU, a coherency state which may be: 1) exclusive (the local copy of the requested information is unique in the cluster); 2) modified (the local copy has been updated by a CPU in the same node); 3) invalid (a local copy either does not exist or is known to be out-of-date); or 4) shared (the local copy is one of a plurality of current copies present in a plurality of nodes).

Proceedings ArticleDOI
01 Sep 1992
TL;DR: A hybrid design based on the combination of non-blocking and prefetching caches is proposed, which is found to be very effective in reducing the memory latency penalty for many applications.
Abstract: Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data into the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimization to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.

Journal ArticleDOI
Harold S. Stone1, John Turek1, Joel L. Wolf1
TL;DR: An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described and generalizes to multilevel cache memories.
Abstract: A model for studying the optimal allocation of cache memory among two or more competing processes is developed and used to show that, for the examples studied, the least recently used (LRU) replacement strategy produces cache allocations that are very close to optimal. It is also shown that when program behavior changes, LRU replacement moves quickly toward the steady-state allocation if it is far from optimal, but converges slowly as the allocation approaches the steady-state allocation. An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described. The algorithm generalizes to multilevel cache memories. For multiprogrammed systems, a cache-replacement policy better than LRU replacement is given. The policy increases the memory available to the running process until the allocation reaches a threshold time beyond which the replacement policy does not increase the cache memory allocated to the running process.
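
The flavor of allocating cache among competing processes can be illustrated with a greedy marginal-gain allocator, which reaches the optimal partition when each process's hit curve shows diminishing returns. This is not the paper's combinatorial algorithm, and the curves below are invented for the example.

```python
# Illustrative sketch: give each successive unit of cache to the process whose
# hit count improves the most, assuming concave (diminishing-returns) curves.

def greedy_partition(hit_curves, total_units):
    """hit_curves[p][k] = expected hits for process p with k cache units."""
    alloc = [0] * len(hit_curves)
    for _ in range(total_units):
        gains = [curve[a + 1] - curve[a] for curve, a in zip(hit_curves, alloc)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc

# Two hypothetical processes sharing 4 cache units.
p0 = [0, 50, 80, 95, 100]      # diminishing returns
p1 = [0, 30, 55, 75, 90]
print(greedy_partition([p0, p1], total_units=4))   # [2, 2]
```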

Proceedings ArticleDOI
01 Jun 1992
TL;DR: MemSpy is described, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs and introduces the notion of data oriented, in addition to code oriented, performance tuning.
Abstract: To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior—if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.

Patent
02 Mar 1992
TL;DR: An improved branch prediction cache (BPC) as mentioned in this paper utilizes a hybrid cache structure that provides two levels of branch information caching, a shallow but wide structure (36 32-byte entries), which caches full prediction information for a limited number of branch instructions.
Abstract: An improved branch prediction cache (BPC) scheme that utilizes a hybrid cache structure. The BPC provides two levels of branch information caching. The fully associative first level BPC is a shallow but wide structure (36 32-byte entries), which caches full prediction information for a limited number of branch instructions. The second direct mapped level BPC is a deep but narrow structure (256 2-byte entries), which caches only partial prediction information, but does so for a much larger number of branch instructions. As each branch instruction is fetched and decoded, its address is used to perform parallel look-ups in the two branch prediction caches.
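
A minimal sketch of the two-level lookup described above: a small fully associative first level holding full prediction state, backed by a larger direct-mapped second level holding only partial state, both probed with the branch address. The entry counts follow the numbers quoted above; the entry contents and replacement policy are simplified assumptions.

```python
# Sketch of a two-level branch prediction cache (BPC): L1 is small, fully
# associative, and stores full prediction info; L2 is larger, direct-mapped,
# and stores only partial info for many more branches.

class TwoLevelBPC:
    L1_ENTRIES = 36
    L2_ENTRIES = 256

    def __init__(self):
        self.l1 = {}                            # branch address -> full prediction info
        self.l2 = [None] * self.L2_ENTRIES      # index -> (tag, partial info)

    def lookup(self, branch_addr):
        """Hardware probes both levels in parallel; modeled sequentially here."""
        full = self.l1.get(branch_addr)
        slot = self.l2[branch_addr % self.L2_ENTRIES]
        partial = slot[1] if slot and slot[0] == branch_addr else None
        return full, partial

    def update(self, branch_addr, full_info, partial_info):
        if len(self.l1) >= self.L1_ENTRIES:
            self.l1.pop(next(iter(self.l1)))    # simplistic replacement stand-in
        self.l1[branch_addr] = full_info
        self.l2[branch_addr % self.L2_ENTRIES] = (branch_addr, partial_info)

bpc = TwoLevelBPC()
bpc.update(0x4010, full_info={"target": 0x4200, "taken": True}, partial_info="taken")
print(bpc.lookup(0x4010))     # full info from L1, partial info from L2
```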

Proceedings ArticleDOI
04 May 1992
TL;DR: The author discusses how this channel can be closed and the performance effects of closing the channel, and the lattice scheduler is introduced, and its use in closing the cache channel is demonstrated.
Abstract: The lattice scheduler is a process scheduler that reduces the performance penalty of certain covert-channel countermeasures by scheduling processes using access class attributes. The lattice scheduler was developed as part of the covert-channel analysis of the VAX security kernel. The VAX security kernel is a virtual-machine monitor security kernel for the VAX architecture designed to meet the requirements of the A1 rating from the US National Computer Security Center. After describing the cache channel, a description is given of how this channel can be exploited using the VAX security kernel as an example. The author discusses how this channel can be closed and the performance effects of closing the channel. The lattice scheduler is introduced, and its use in closing the cache channel is demonstrated. Finally, the work illustrates the operation of the lattice scheduler through an extended example and concludes with a discussion of some variations of the basic scheduling algorithm. >

Patent
30 Mar 1992
TL;DR: In this article, a method for controlling coherence of data elements sharable among a plurality of independently-operating central processing complexes (CPCs) in a multi-system complex (called a parallel sysplex) which contains sysplex DASDs (direct access storage devices) and a high-speed SES (shared electronic storage) facility is presented.
Abstract: A method for controlling coherence of data elements sharable among a plurality of independently-operating CPCs (central processing complexes) in a multi-system complex (called a parallel sysplex) which contains sysplex DASDs (direct access storage devices) and a high-speed SES (shared electronic storage) facility. Sysplex shared data elements are stored in the sysplex DASD under a unique sysplex data element name, which is used for sysplex coherence control. Any CPC may copy any sysplex data element into a local cache buffer (LCB) in the CPC's main storage, where it has an associated sysplex validity bit. The copying CPC executes a sysplex coherence registration command which requests a SES processor to verify that the data element name already exists in the SES cache, and to store the name of the data element in a SES cache entry if found in the SES cache. Importantly, the registration command communicates to SES the CPC location of the validity bit for the LCB containing that data element copy. Each time another copy of the data element is stored in any CPC LCB, a registration command is executed to store the location of that copy's CPC validity bit into a local cache register (LCR) associated with its data element name. In this manner, each LCR accumulates all CPC locations for all LCB validity bits for all valid copies of the associated data element in the sysplex -- for maintaining data coherency throughout the sysplex.
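
A simplified sketch of the registration and cross-invalidation flow: each registration records where a CPC keeps the validity bit for its local copy, and a later write to the element flips every registered bit. The class and method names are illustrative, not the actual SES command set.

```python
# Hedged sketch of SES-style coherence: registrations accumulate validity-bit
# locations per data element name; a write cross-invalidates all registered
# local copies by clearing those bits.

class SESCache:
    def __init__(self):
        self.registers = {}      # data element name -> list of validity-bit refs

    def register(self, name, validity_bit_ref):
        """Record where a CPC keeps the validity bit for its local copy."""
        self.registers.setdefault(name, []).append(validity_bit_ref)

    def write(self, name):
        """On an update, cross-invalidate every registered local copy."""
        for validity_bit in self.registers.pop(name, []):
            validity_bit["valid"] = False

ses = SESCache()
cpc1_bit = {"valid": True}
cpc2_bit = {"valid": True}
ses.register("DB.PAGE.42", cpc1_bit)     # CPC 1 caches the element
ses.register("DB.PAGE.42", cpc2_bit)     # CPC 2 caches it too
ses.write("DB.PAGE.42")                  # some CPC updates the element
print(cpc1_bit, cpc2_bit)                # both local copies now marked invalid
```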

Proceedings ArticleDOI
01 Apr 1992
TL;DR: In this article, the authors compare the performance of CC-NUMA and COMA and show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses.
Abstract: Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.

Journal ArticleDOI
TL;DR: An analytical access time model for on-chip cache memories that shows the dependence of the cache access time on the cache parameters is described and it is shown that for given C, B, and A, optimum array configuration parameters can be used to minimize the access time.
Abstract: An analytical access time model for on-chip cache memories that shows the dependence of the cache access time on the cache parameters is described. The model includes general cache parameters, such as cache size (C), block size (B), and associativity (A), and array configuration parameters that are responsible for determining the subarray aspect ratio and the number of subarrays. With this model, a large cache design space can be covered, which cannot be done using only SPICE circuit simulation within a limited time. Using the model, it is shown that for given C, B, and A, optimum array configuration parameters can be used to minimize the access time. When the optimum array parameters are used, the optimum access time is roughly proportional to log(cache size), and larger block sizes give smaller access times, but larger associativity does not give smaller access times because of the increase of the data-bus capacitances.
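
The headline dependence can be illustrated with a toy model of the form t ≈ a + b·log2(C); the constants below are made up, since the paper derives the real coefficients from subarray organization and wire and bit-line loads.

```python
# Toy illustration of the model's result that, with optimal array parameters,
# access time grows roughly with the logarithm of the cache size. The
# coefficients a and b are illustrative assumptions only.

import math

def access_time_ns(cache_bytes, a=1.5, b=0.35):
    """t ~ a + b * log2(C), with C in bytes (illustrative constants)."""
    return a + b * math.log2(cache_bytes)

for kb in (4, 16, 64, 256):
    print(f"{kb:4d} KB -> {access_time_ns(kb * 1024):.2f} ns (model)")
```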

Journal ArticleDOI
TL;DR: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented and it is shown that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.
Abstract: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented. The on-line algorithm extends an earlier model due to H.S. Stone et al. (1989) and partitions the cache storage in disjoint blocks whose sizes are determined by the locality of the processes accessing the cache. Simulation results of traces for 32-MB disk caches show a relative improvement in the overall and read hit-ratios in the range of 1% to 2% over those generated by a conventional least recently used replacement algorithm. The analysis of a queuing network model shows that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.

Patent
05 Jun 1992
TL;DR: In this article, the posted write cache is further mirrored and parity-checked to assure data validity, and a mirror test is executed to verify data validity; posted write operations are only enabled when error-free operation is assured.
Abstract: A host computer including a posted write cache for a disk drive system where the posted write cache includes battery backup to protect against potential loss of data in case of a power failure, and also including means for performing a method for determining if live data is present in the posted write cache upon power-up. The posted write cache is further mirrored and parity-checked to assure data validity. Performance increase is achieved since during normal operation data is written to the much faster cache and a completion indication is returned, and the data is flushed to the slower disk drive system at a more opportune time. Batteries provide power to the posted write cache in the event of a power failure. Upon subsequent power-up, a cache signature previously written in the posted write cache indicates that live data still resides in the posted write cache. If the cache signature is not present and the batteries are not fully discharged, a normal power up condition exists. If the cache signature is not present and the batteries are fully discharged, then the user is warned of possible data loss. A configuration identification code assures a proper correspondence between the posted write cache board and the disk drive system. A mirror test is executed to verify data validity. Temporary and permanent error conditions are monitored so that posted write operations are only enabled when error-free operation is assured.

Patent
30 Mar 1992
TL;DR: In this article, a high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex, where each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data.
Abstract: A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data. Management logic and the high-speed cache assure that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache, with non-blocking serialization allowing access to a changed version in the cache while castout is being performed. Castout classes are provided to facilitate efficient movement from the shared cache to DASD.

Patent
Richard Lewis Mattson1
04 Sep 1992
TL;DR: In this article, a method and means for dynamically partitioning an LRU cache partitioned into a global cache storing referenced objects of k different data types and k local caches storing objects of a single type is described.
Abstract: A method and means is disclosed for dynamically partitioning an LRU cache partitioned into a global cache storing referenced objects of k different data types and k local caches each storing objects of a single type. Referenced objects are stored in the MRU position of the global cache and overflow is managed by destaging the LRU object from the global to the local cache having the same data type. Dynamic partitioning is accomplished by recursively creating and maintaining from a trace of objects an LRU list of referenced objects and associated data structures for each subcache, creating and maintaining a multi-planar array of partition distribution data from the lists and the trace as a collection of all possible maximum and minimum subcache sizings, optimally resizing the subcache partitions by applying a dynamic programming heuristic to the multi-planar array, and readjusting the partitions accordingly.
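
The global-to-local destage rule can be sketched as follows; the partition sizes and type names are illustrative, and the dynamic-programming resizing step is omitted.

```python
# Sketch of the destage rule: referenced objects enter the MRU end of a shared
# global cache; when the global cache overflows, its LRU object is destaged
# into the local cache for that object's data type.

from collections import OrderedDict

class PartitionedCache:
    def __init__(self, global_size, local_sizes):
        self.global_size = global_size
        self.global_cache = OrderedDict()              # key -> data type
        self.local = {t: OrderedDict() for t in local_sizes}
        self.local_sizes = local_sizes

    def reference(self, key, dtype):
        self.global_cache.pop(key, None)
        self.local[dtype].pop(key, None)
        self.global_cache[key] = dtype                 # MRU position of the global cache
        if len(self.global_cache) > self.global_size:
            victim, vtype = self.global_cache.popitem(last=False)   # global LRU
            self.local[vtype][victim] = vtype          # destage to matching local cache
            if len(self.local[vtype]) > self.local_sizes[vtype]:
                self.local[vtype].popitem(last=False)  # evict local LRU

c = PartitionedCache(global_size=2, local_sizes={"index": 2, "data": 2})
for key, t in [("i1", "index"), ("d1", "data"), ("d2", "data")]:
    c.reference(key, t)
print(list(c.global_cache), dict(c.local["index"]))    # ['d1', 'd2'] {'i1': 'index'}
```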

Patent
15 Jul 1992
TL;DR: In this paper, an error transition mode (ETM) is used to prevent cache data not owned in the cache from being accessed by the main memory after the first write request while permitting writeback of the data owned by the cache.
Abstract: Writeback transactions from a processor and cache are fed to a main memory through a writeback queue, and non-writeback transactions from the processor and cache are fed to the main memory through a non-writeback queue. When a cache error is detected, an error transition mode (ETM) is entered that provides limited use of the data in the cache; a read or write request for data not owned in the cache is made to the main memory instead of the cache, even when the data is valid in the cache, although owned data is read from the cache. In ETM, when the processor makes a first write request to data not owned in the cache followed by a second write request to data owned in the cache, write data of the first write request is prevented from being received by the main memory after write data of the second request while permitting writeback of the data owned by the cache. Preferably this is done by sending the write requests from the processor through the non-writeback queue, and when a write request accesses data in a block of data owned by the cache, disowning the block of data in the cache and writing the disowned block of data back to the main memory.

Patent
28 May 1992
TL;DR: In this article, a dynamic determination is made on a cycle-by-cycle basis of whether data should be written to the cache with a dirty bit asserted, or the data to both the cache and main memory.
Abstract: Method and apparatus for reducing the access time required to write to memory and read from memory in a computer system having a cache-based memory. A dynamic determination is made on a cycle by cycle basis of whether data should be written to the cache with a dirty bit asserted, or the data should be written to both the cache and main memory. The write-through method is chosen where the write-through method is approximately as fast as the write-back method. Where the write-back method is substantially faster than the write-through method, the write-back method is chosen.

Patent
14 Dec 1992
TL;DR: In this article, a battery powered computer system determines when the system is not in use by monitoring various events associated with the operation of the system, such as cache read misses and write operations.
Abstract: A battery powered computer system determines when the system is not in use by monitoring various events associated with the operation of the system. The system preferably monitors the number of cache read misses and write operations, i.e., the cache hit rate, and reduces the system clock frequency when the cache hit rate rises above a certain level. When the cache hit rate is above a certain level, then it can be assumed that the processor is executing a tight loop, such as when the processor is waiting for a key to be pressed and then the frequency can be reduced without affecting system performance. Alternatively, the apparatus monitors the occurrence of memory page misses, I/O write cycles or other events to determine the level of activity of the computer system.
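
The monitoring idea can be sketched as a per-interval decision on the clock frequency; the hit-rate threshold, the treatment of writes as non-hits, and the two frequencies are assumptions for illustration, not values from the patent.

```python
# Illustrative sketch: count cache read misses and writes over an interval and
# drop the clock frequency when the hit rate stays very high, a sign the CPU
# is spinning in a tight idle loop (e.g. polling for a keypress).

HIGH_HIT_RATE = 0.98
FULL_MHZ, SLOW_MHZ = 33, 8

def choose_clock(reads, read_misses, writes):
    """Return the clock frequency (MHz) for the next interval."""
    accesses = reads + writes
    if accesses == 0:
        return SLOW_MHZ                      # nothing happening at all
    hit_rate = 1.0 - (read_misses + writes) / accesses
    return SLOW_MHZ if hit_rate >= HIGH_HIT_RATE else FULL_MHZ

print(choose_clock(reads=10_000, read_misses=5, writes=20))      # 8 (idle loop)
print(choose_clock(reads=10_000, read_misses=900, writes=1500))  # 33 (real work)
```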

Journal ArticleDOI
TL;DR: By comparing synthetic traces with real traces of identical locality parameters, it is demonstrated that synthetic traces exhibit miss ratios and lifetime functions that compare well with those of the real traces they mimic, both in fully associative and in set-associative memories.
Abstract: Two techniques for producing synthetic address traces that produce good emulations of the locality of reference of real programs are presented. The first algorithm generates synthetic addresses by simulating a random walk in an infinite address-space with references governed by a hyperbolic probability law. The second algorithm is a refinement of the first in which the address space has a given finite size. The basic model for the random walk has two parameters that correspond to the working set size and the locality of reference. By comparing synthetic traces with real traces of identical locality parameters, it is demonstrated that synthetic traces exhibit miss ratios and lifetime functions that compare well with those of the real traces they mimic, both in fully associative and in set-associative memories.
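
A minimal sketch of the first algorithm: a random walk whose jump distances follow a heavy-tailed (roughly 1/d^theta) law, so most references stay within the current locality while occasional long jumps move it. The parameter values below are illustrative.

```python
# Sketch of synthetic trace generation by a random walk with a hyperbolic
# jump-distance law; theta and max_jump are illustrative parameters, not the
# paper's calibrated locality/working-set values.

import itertools
import random

def synthetic_trace(length, theta=2.0, max_jump=1 << 16, seed=1):
    rng = random.Random(seed)
    distances = list(range(1, max_jump + 1))
    cum = list(itertools.accumulate(d ** -theta for d in distances))  # hyperbolic law
    addr, trace = 0, []
    for _ in range(length):
        step = rng.choices(distances, cum_weights=cum)[0]
        addr += step if rng.random() < 0.5 else -step   # random walk direction
        trace.append(addr)
    return trace

print(synthetic_trace(10))
```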

01 Sep 1992
TL;DR: Using two separate benchmark suites, the SPEC benchmarks and the Perfect Club, and concentrating on multiplication, a surprising amount of trivial and redundant operation is found.
Abstract: This report introduces the notion of trivial computation, where the appearance of simple operands reduces the complexity of a potentially difficult operation. An example of a trivial operation is integer divide-by-two; the division becomes a simple shift operation. Also discussed is the concept of redundant computation, where some operation repeatedly does the same function because it repeatedly sees the same operands. Using two separate benchmark suites, the SPEC benchmarks and the Perfect Club, and concentrating on multiplication, we find a surprising amount of trivial and redundant operation. Various architectural means of exploiting this knowledge to improve computational efficiency include detection of trivial operands, memoization, and the result cache.
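
Both ideas are easy to sketch for multiplication: check for trivial operands (zero, one, powers of two) first, and otherwise consult a result cache keyed by the operand pair. The unbounded dict used as the result cache is purely for illustration.

```python
# Sketch of trivial-operand detection plus a result cache (memo table) for
# multiplication: trivial cases bypass the multiplier, repeated operand pairs
# hit the result cache instead of recomputing.

def trivial_multiply(a, b):
    """Return the product if it can be formed trivially, else None."""
    for x, y in ((a, b), (b, a)):
        if x == 0:
            return 0
        if x == 1:
            return y
        if x > 0 and (x & (x - 1)) == 0:           # power of two: shift instead
            return y << (x.bit_length() - 1)
    return None

_result_cache = {}                                  # (a, b) -> product

def multiply(a, b):
    quick = trivial_multiply(a, b)
    if quick is not None:
        return quick
    key = (a, b)
    if key not in _result_cache:                    # redundant-computation check
        _result_cache[key] = a * b                  # stands in for the real multiplier
    return _result_cache[key]

print(multiply(7, 8), multiply(13, 13), multiply(13, 13))   # 56 169 169 (second 13*13 hits the cache)
```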

Patent
21 Jan 1992
TL;DR: A simple mixed first level cache memory system as mentioned in this paper includes a level 1 cache (52) connected to a processor (54) by read data and write data lines (56) and (58).
Abstract: A simple mixed first level cache memory system (50) includes a level 1 cache (52) connected to a processor (54) by read data and write data lines (56) and (58). The level 1 cache (52) is connected to the level 2 cache (60) by swap tag lines (62) and (64), swap data lines (66) and (68), multiplexer (70) and swap/read line (72). The level 2 cache (60) is connected to the next lower level in the memory hierarchy by write tag and write data lines (74) and (76). The next lower level in the memory hierarchy below the level 2 cache (60) is also connected by a read data line (78) through the multiplexer (70) and the swap/read line (72) to the level 1 cache (52). When processor (54) requires an instruction or data, it puts out an address on lines (80). If the instruction or data is present in the level 1 cache (52), it is supplied to the processor (54) on read data line (56). If the instruction or data is not present in the level 1 cache (52), the processor looks for it in the level 2 cache (60) by putting out the address of the instruction or data on lines (80). If the instruction or data is in the level 2 cache, it is supplied to the processor (54) through the level 1 cache (52) by means of a swap operation on tag swap lines (62) and (64), swap data lines (66) and (68), multiplexer (70) and swap/read data line (72). If the instruction or data is present in neither the level 1 cache (52) nor the level 2 cache (60), the address on lines (80) fetches the instruction or data from successively lower levels in the memory hierarchy as required via read data line (78), multiplexer (70) and swap/read data line (72). The instruction or data is then supplied from the level 1 cache to the processor (54).

Patent
08 May 1992
TL;DR: In this paper, an apparatus and method for switching the context of state elements of a very fast processor within a clock cycle when a cache miss occurs is presented, which is particularly useful for minimizing the average instruction cycle time for a processor with a main memory access time exceeding 15 processor clock cycles.
Abstract: An apparatus and method are disclosed for switching the context of state elements of a very fast processor within a clock cycle when a cache miss occurs. To date, processors either stay idle or execute instructions out of order when they encounter cache misses. As the speed of processors increases, the penalty for a cache miss becomes heavier. Having multiple copies of state elements on the processor and coupling them to a multiplexer permits the processor to save the context of the current instructions and resume executing new instructions within one clock cycle. The invention disclosed is particularly useful for minimizing the average instruction cycle time for a processor with a main memory access time exceeding 15 processor clock cycles. It is understood that the number of processes whose states are duplicated may easily be a large number n.