
Showing papers on "Smart Cache published in 1991"


Proceedings ArticleDOI
01 Apr 1991
TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
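
The blocking transformation the paper studies can be summarized in a few lines of C. The sketch below is illustrative only: the block size is a parameter so it can be tailored to the matrix size and cache geometry, as the paper recommends, rather than fixed at a fraction of the cache; the function name and signature are assumptions.

```c
#include <stddef.h>

/* Blocked (tiled) matrix multiply C += A * B for n x n row-major matrices.
 * Operating on bsize x bsize submatrices keeps the working set resident in
 * the faster levels of the memory hierarchy so each loaded element is
 * reused.  Following the paper's argument, bsize should be chosen per
 * matrix size and cache geometry rather than fixed. */
void matmul_blocked(size_t n, size_t bsize,
                    const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += bsize)
        for (size_t kk = 0; kk < n; kk += bsize)
            for (size_t jj = 0; jj < n; jj += bsize)
                /* multiply one pair of blocks */
                for (size_t i = ii; i < ii + bsize && i < n; i++)
                    for (size_t k = kk; k < kk + bsize && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + bsize && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```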

982 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This work fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals, and estimated the cache performance reduction caused by a context switch.
Abstract: The sustained performance of fast processors is critically dependent on cache performance. Cache performance in turn depends on locality of reference. When an operating system switches contexts, the assumption of locality may be violated because the instructions and data of the newly-scheduled process may no longer be in the cache(s). Context-switching thus has a cost above that of the operations performed by the kernel. We fed address traces of the processes running on a multi-tasking operating system through a cache simulator, to compute accurate cache-hit rates over short intervals. By marking the output of such a simulation whenever a context switch occurs, and then aggregating the post-context-switch results of a large number of context switches, it is possible to estimate the cache performance reduction caused by a switch. Depending on cache parameters the net cost of a context switch appears to be in the thousands of cycles, or tens to hundreds of microseconds.

272 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper presents a range of lock-based cache consistency algorithms that arise by viewing cache consistency as a variant of the well-understood problem of replicated data management, and uses a detailed simulation model to study the performance of these algorithms over a wide range of workloads and system resource configurations.
Abstract: In this paper, we examine the performance tradeoffs that are raised by caching data in the client workstations of a client-server DBMS. We begin by presenting a range of lock-based cache consistency algorithms that arise by viewing cache consistency as a variant of the well-understood problem of replicated data management. We then use a detailed simulation model to study the performance of these algorithms over a wide range of workloads and system resource configurations. The results illustrate the key performance tradeoffs related to client-server cache consistency, and should be of use to designers of next-generation DBMS prototypes and products.

230 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks and describes two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches.
Abstract: This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks. Using a low-cost trace-driven simulation technique we show how a non-prefetching vector cache can result in unpredictable performance and how this unpredictability makes it difficult to find a good block size. We describe two simple prefetch schemes to reduce the influence of long stride vector accesses and misses due to block invalidations in multiprocessor vector caches. These two schemes are shown to have better performance than a non-prefetching cache.
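
The abstract does not detail the paper's two schemes, so the self-contained sketch below shows only the general idea of stride-directed prefetch-on-miss on a toy direct-mapped cache model: when a strided vector access misses, the line that the next element will touch is fetched as well. The cache geometry, stride, and synthetic trace are illustrative assumptions, not the paper's configuration.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINE   32u                      /* bytes per cache line (assumed)  */
#define NSETS  1024u                    /* 32 KB direct-mapped cache       */

static uint64_t tags[NSETS];
static bool     valid[NSETS];
static unsigned demand_misses;

static bool touch(uint64_t addr)        /* access one line; true on hit */
{
    uint64_t line = addr / LINE;
    unsigned set  = (unsigned)(line % NSETS);
    uint64_t tag  = line / NSETS;
    if (valid[set] && tags[set] == tag)
        return true;
    valid[set] = true;                  /* fill the line on a miss */
    tags[set]  = tag;
    return false;
}

/* One vector element access with a known byte stride. */
static void vector_access(uint64_t addr, uint64_t stride, bool prefetch)
{
    if (!touch(addr)) {
        demand_misses++;                /* demand miss */
        if (prefetch && (addr + stride) / LINE != addr / LINE)
            (void)touch(addr + stride); /* prefetch the next element's line */
    }
}

int main(void)
{
    const uint64_t stride = 4096;       /* long-stride vector sweep */
    for (int pass = 0; pass < 2; pass++) {
        bool prefetch = (pass == 1);
        for (unsigned s = 0; s < NSETS; s++) valid[s] = false;
        demand_misses = 0;
        for (uint64_t i = 0; i < 10000; i++)
            vector_access(i * stride, stride, prefetch);
        printf("%s prefetch: %u demand misses\n",
               prefetch ? "with" : "without", demand_misses);
    }
    return 0;
}
```

With this toy trace, prefetch-on-miss roughly halves the demand misses of the long-stride sweep, which is the kind of effect the paper's schemes are aiming at.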

163 citations


Patent
20 Aug 1991
TL;DR: In this paper, a multilevel cache buffer for a multiprocessor system is described, where each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors.
Abstract: A multilevel cache buffer for a multiprocessor system in which each processor has a level one cache storage unit which interfaces with a level two cache unit and main storage unit shared by all processors. The multiprocessors share the level two cache according to a priority algorithm. When data in the level two cache is updated, corresponding data in level one caches is invalidated until it is updated.

115 citations


Patent
30 Aug 1991
TL;DR: In this article, a method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system is presented, which can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each way to distinguish between least recently used groups.
Abstract: A method and apparatus for incorporating cache line replacement and cache write policy information into the tag directories in a cache system. In a 2-way set-associative cache, one bit in each way's tag RAM is reserved for LRU information, and the bits are manipulated such that the Exclusive-OR of each way's bits points to the actual LRU cache way. Since all of these bits must be read when the cache controller determines whether a hit or miss has occurred, the bits are available when a cache miss occurs and a cache line replacement is required. The method can be generalized to caches which include a number of ways greater than two by using a pseudo-LRU algorithm and utilizing group select bits in each of the ways to distinguish between least recently used groups. Cache write policy information is stored in the tag RAMs to designate various memory areas as write-back or write-through. In this manner, system memory situated on an I/O bus which does not recognize inhibit cycles can have its data cached.
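
A minimal sketch of the XOR trick described above, assuming a 2-way set-associative cache with one LRU bit stored alongside each way's tag; the structure and function names are illustrative. The point of the scheme is that only the accessed way's bit is rewritten, so the LRU update rides along with the tag write that is already happening on that way.

```c
#include <stdint.h>

/* Per-set LRU bookkeeping for a 2-way set-associative cache: one bit lives
 * in each way's tag RAM and the XOR of the two bits identifies the LRU way. */
struct set_lru {
    uint8_t bit[2];              /* one bit stored alongside each way's tag */
};

/* Way to replace on a miss in this set. */
int lru_way(const struct set_lru *s)
{
    return s->bit[0] ^ s->bit[1];
}

/* Called on a hit to (or a fill of) way w: rewrite only way w's bit so that
 * the XOR of the two bits now points at the other way. */
void touch_way(struct set_lru *s, int w)
{
    s->bit[w] = (uint8_t)((1 - w) ^ s->bit[1 - w]);
}
```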

113 citations


Patent
Jamshed H. Mirza
15 Apr 1991
TL;DR: A cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit a low cache hit ratio; a record of each instruction's behavior in the immediate past is used to decide whether its future references should be cached or not.
Abstract: A cache bypass mechanism automatically avoids caching of data for instructions whose data references, for whatever reason, exhibit low cache hit ratio. The mechanism keeps a record of an instruction's behavior in the immediate past, and this record is used to decide whether its future references should be cached or not. If an instruction is experiencing bad cache hit ratio, it is marked as non-cacheable, and its data references are made to bypass the cache. This avoids the additional penalty of unnecessarily fetching the remaining words in the line, reduces the demand on the memory bandwidth, avoids flushing the cache of useful data and, in parallel processing environments, prevents line thrashing. The cache management scheme is automatic and requires no compiler or user intervention.
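
The patent abstract describes the record only at a high level, so the following is a hedged sketch of one way such a per-instruction history could look: a small table, indexed by a hash of the instruction address, counts recent misses and marks an instruction's data references non-cacheable once the count crosses a threshold. Table size, counter width, hashing, and thresholds are assumptions, not the patent's exact design.

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 256
#define MISS_SAT   8     /* counter saturates here                 */
#define BYPASS_AT  6     /* this many recent misses => bypass      */

static uint8_t recent_misses[TABLE_SIZE];   /* zero-init: cache by default */

static unsigned slot(uint64_t pc) { return (unsigned)(pc >> 2) % TABLE_SIZE; }

/* Should this instruction's next data reference bypass the cache? */
bool should_bypass(uint64_t pc)
{
    return recent_misses[slot(pc)] >= BYPASS_AT;
}

/* Update the record after a cached reference resolves as a hit or a miss.
 * (A real design would also let a bypassed instruction periodically retry
 * the cache so it can be re-marked cacheable when its behaviour changes.) */
void record_outcome(uint64_t pc, bool hit)
{
    uint8_t *c = &recent_misses[slot(pc)];
    if (hit) { if (*c > 0) (*c)--; }
    else     { if (*c < MISS_SAT) (*c)++; }
}
```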

82 citations


Journal ArticleDOI
TL;DR: This work reduces the program traces to the extent that exact performance can still be obtained from the reduced traces and devises an algorithm that can produce performance results for a variety of metrics for a large number of set-associative write-back caches in just a single simulation run.
Abstract: We propose improvements to current trace-driven cache simulation methods to make them faster and more economical. We attack the large time and space demands of cache simulation in two ways. First, we reduce the program traces to the extent that exact performance can still be obtained from the reduced traces. Second, we devise an algorithm that can produce performance results for a variety of metrics (hit ratio, write-back counts, bus traffic) for a large number of set-associative write-back caches in just a single simulation run. The trace reduction and the efficient simulation techniques are extended to parallel multiprocessor cache simulations. Our simulation results show that our approach substantially reduces the disk space needed to store the program traces and can dramatically speed up cache simulations while still producing exact results.
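
Single-pass simulation of many cache sizes rests on the classic stack-distance property: a reference found at depth d of an LRU stack hits in every fully associative LRU cache holding at least d lines, so one pass over the trace yields hit ratios for all sizes at once. The paper extends this to many set-associative write-back caches and to reduced traces; the self-contained sketch below shows only the fully associative core, on a synthetic trace.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_LINES 4096            /* distinct lines tracked (illustrative) */

static uint64_t lru_stack[MAX_LINES];   /* lru_stack[0] = most recent */
static int      depth = 0;
static uint64_t hits_at_dist[MAX_LINES + 1];
static uint64_t refs = 0;

static void reference(uint64_t line)
{
    int d = 0;
    refs++;
    while (d < depth && lru_stack[d] != line)
        d++;
    if (d < depth) {
        hits_at_dist[d + 1]++;            /* hit at 1-based stack depth d+1 */
    } else {                              /* cold miss                      */
        if (depth < MAX_LINES) depth++;
        d = depth - 1;                    /* overwrite the LRU-most slot    */
    }
    for (int i = d; i > 0; i--)           /* move the line to the top       */
        lru_stack[i] = lru_stack[i - 1];
    lru_stack[0] = line;
}

/* Hit ratio of a fully associative LRU cache holding `lines` lines. */
static double hit_ratio(int lines)
{
    uint64_t h = 0;
    for (int d = 1; d <= lines && d <= MAX_LINES; d++)
        h += hits_at_dist[d];
    return refs ? (double)h / (double)refs : 0.0;
}

int main(void)
{
    /* Synthetic trace: a loop over 8 lines, repeated 1000 times. */
    for (int r = 0; r < 1000; r++)
        for (uint64_t line = 0; line < 8; line++)
            reference(line);

    for (int size = 2; size <= 16; size *= 2)
        printf("%2d lines: hit ratio %.3f\n", size, hit_ratio(size));
    return 0;
}
```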

77 citations


Patent
16 May 1991
TL;DR: In this article, a microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference is presented.
Abstract: A microprocessor architecture that includes capabilities for locking individual entries into its integrated instruction cache and data cache while leaving the remainder of the cache unlocked and available for use in capturing the microprocessor's dynamic locality of reference. The microprocessor also includes the capability for locking instruction cache entries without requiring that the instructions be executed during the locking process.

57 citations


Proceedings Article
01 Jan 1991
TL;DR: This paper introduces the Express Ring architecture and presents a snooping cache coherence protocol for this machine, and shows how consistency of shared memory accesses can be efficiently maintained in a ring-connected multiprocessor.
Abstract: The Express Ring is a new architecture under investigation at the University of Southern California. Its main goal is to demonstrate that a slotted unidirectional ring with very fast point-to-point interconnections can be at least ten times faster than a shared bus, using the same technology, and may be the topology of choice for future shared-memory multiprocessors. In this paper we introduce the Express Ring architecture and present a snooping cache coherence protocol for this machine. This protocol shows how consistency of shared memory accesses can be efficiently maintained in a ring-connected multiprocessor. We analyze the proposed protocol and compare it to other more usual alternatives for point-to-point connected machines, such as the SCI cache coherence protocol and directory based protocols.

47 citations


ReportDOI
01 May 1991
TL;DR: Results suggest that garbage collection algorithms will play an important part in improving cache performance as processor speeds increase; two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms often by a factor of two and sometimes by almost five over direct-mapped caches.
Abstract: Cache performance is an important part of total performance in modern computer systems. This paper describes the use of trace-driven simulation to estimate the effect of garbage collection algorithms on cache performance. Traces from four large Common Lisp programs have been collected and analyzed with an all-associativity cache simulator. While previous work has focused on the effect of garbage collection on page reference locality, this evaluation unambiguously shows that garbage collection algorithms can have a profound effect on cache performance as well. On processors with a direct-mapped cache, a generation stop-and-copy algorithm exhibits a miss rate up to four times higher than a comparable generation mark-and-sweep algorithm. Furthermore, two-way set-associative caches are shown to reduce the miss rate in stop-and-copy algorithms often by a factor of two and sometimes by a factor of almost five over direct-mapped caches. As processor speeds increase, cache performance will play an increasing role in total performance. These results suggest that garbage collection algorithms will play an important part in improving that performance.

Patent
10 Jan 1991
TL;DR: In this paper, a data processing system (10) is provided having a secondary cache (34) for performing a deferred cache load, in which a prefetch address is translated into a physical address that is compared with the indexed entries in a primary cache (26) and with the physical address corresponding to the single cache line stored in the secondary cache (34).
Abstract: A data processing system (10) is provided having a secondary cache (34) for performing a deferred cache load. The data processing system (10) has a pipelined integer unit (12) which uses an instruction prefetch unit (IPU) (12). The IPU issues prefetch requests to a cache controller (22) and transfers a prefetch address to a cache address memory management unit (CAMMU) (24) for translation into a corresponding physical address. The physical address is compared with the indexed entries in a primary cache (26), and compared with the physical address corresponding to the single cache line stored in the secondary cache (34). When a prefetch miss occurs in both the primary (26) and the secondary cache (34), the cache controller (22) issues a bus transfer request to retrieve the requested cache line from an external memory (20). While a bus controller (16) performs the bus transfer, the cache controller (22) loads the primary cache (26) with the cache line currently stored in the secondary cache (34).

Patent
James E. Bohner, Thang T. Do, Richard J. Gusefski, Kevin Huang, Chon I. Lei
25 Feb 1991
TL;DR: An inpage buffer is used between a cache and a slower storage device; data returned from the slower storage is provided to the processor from the buffer, which can also serve subsequent requests for that data until it has been written into the cache.
Abstract: An inpage buffer is used between a cache and a slower storage device. When a processor requests data, the cache is checked to see if the data is already in the cache. If not, a request for the data is sent to the slower storage device. The buffer receives the data from the slower storage device and provides the data to the processor that requested the data. The buffer then provides the data to the cache for storage provided that the cache is not working on a separate storage request from the processor. The data will be written into the cache from the buffer when the cache is free from such requests. The buffer is also able to provide data corresponding to subsequent requests provided it contains such data. This may happen if a request for the same data occurs and the buffer has not yet written the data into the cache. It can also occur if the areas of the cache which can hold data from an area of the slower storage are inoperable for some reason. The buffer acts as a minicache when such a catastrophic error in the cache occurs.
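
A schematic sketch of the inpage-buffer flow described above, assuming a single-entry buffer and a toy direct-mapped cache; the structures and the slow_storage_read stub are illustrative, not the patent's design. Data arriving from the slower storage is served from the buffer immediately and written into the cache only when the cache is not busy with another request, and a later request for the same line can still hit in the buffer before it is drained.

```c
#include <stdint.h>
#include <stdbool.h>

#define NSETS 256u

struct cache {
    uint64_t tag[NSETS];
    bool     valid[NSETS];
    bool     busy;                   /* serving another processor request */
};

struct inpage_buffer {
    uint64_t line;
    bool     valid;
};

/* Stand-in for the slower storage device (assumed interface). */
static void slow_storage_read(uint64_t line) { (void)line; }

static bool cache_hit(const struct cache *c, uint64_t line)
{
    return c->valid[line % NSETS] && c->tag[line % NSETS] == line / NSETS;
}

static void cache_install(struct cache *c, uint64_t line)
{
    c->valid[line % NSETS] = true;
    c->tag[line % NSETS]   = line / NSETS;
}

/* Processor read of one line. */
void processor_read(struct cache *c, struct inpage_buffer *b, uint64_t line)
{
    if (cache_hit(c, line))
        return;                                 /* normal cache hit        */
    if (!(b->valid && b->line == line)) {       /* buffer can also hit     */
        slow_storage_read(line);                /* miss: fetch into buffer */
        b->line  = line;
        b->valid = true;
    }
    if (!c->busy) {                             /* drain buffer into cache */
        cache_install(c, b->line);
        b->valid = false;
    }
}
```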


Patent
James T. Brady, Balakrishna R. Iyer
23 Dec 1991
TL;DR: In this article, a method and apparatus for avoiding line-accessed cache misses during a replacement/selection (tournament) sorting process is presented, which avoids the second merge phase overhead that formerly doubled the sorting time necessary for larger cache sizes.
Abstract: A method and apparatus for avoiding line-accessed cache misses during a replacement/selection (tournament) sorting process. Prior to the sorting phase, the method includes the steps of sizing and writing maximal sets of sub-tree nodes of a nested ordering of keys, suitable for staging as cache lines. During the sort phase, the method includes the steps of prefetching into cache from CPU main memory one or more cache lines formed from a sub-tree of ancestor nodes immediate to the node in cache just selected for replacement. The combination of the clustering of ancestor nodes within individual cache lines and the prefetching of cache lines upon replacement node selection permits execution of the full tournament sort procedure without the normally-expected cache miss rate. For selection trees larger than those that can fit entirely into cache, the method avoids the second merge phase overhead that formerly doubled the sorting time necessary for larger cache sizes.

Book ChapterDOI
TL;DR: An overview of the SMART caching strategy is presented, along with a dynamic programming algorithm which finds an allocation of cache segments to a set of periodic tasks that both minimizes the utilization of the task set and guarantees that the task set remains schedulable using rate monotonic scheduling.
Abstract: Since they were first introduced in the IBM 360/85 in 1969, cache designs have been optimized for average case performance, which has opened a wide gap between average case performance and the worst case performance that is critical to the real-time computing community. The SMART (Strategic Memory Allocation for Real-Time) cache design narrows this gap. This paper focuses on an analytical approach to cache allocation. An overview of the SMART caching strategy is presented, as well as a dynamic programming algorithm which finds an allocation of cache segments to a set of periodic tasks that both minimizes the utilization of the task set and guarantees that the task set remains schedulable using rate monotonic scheduling. Results which show SMART caches narrowing the gap between average and worst case performance to less than 10% are then presented.
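
The dynamic-programming allocation can be sketched compactly: dp[i][s] is the minimum total utilization of the first i tasks when they share s cache segments, and each task's worst-case execution time is a non-increasing function of the segments it receives. The task set, the wcet table, and the closing check against the rate-monotonic utilization bound below are all illustrative assumptions, not the paper's data or its exact schedulability test. Compile with -lm.

```c
#include <stdio.h>
#include <math.h>

#define NTASKS    3
#define NSEGMENTS 8

static const double period[NTASKS] = { 10.0, 20.0, 50.0 };

/* Worst-case execution time of each task as a function of the cache
 * segments it receives (made-up numbers that shrink with more segments). */
static const double wcet[NTASKS][NSEGMENTS + 1] = {
    {  3.5,  3.0,  2.7,  2.5,  2.4,  2.35, 2.3,  2.3,  2.3  },
    {  7.0,  6.0,  5.4,  5.0,  4.8,  4.7,  4.6,  4.6,  4.6  },
    { 16.0, 14.0, 12.8, 12.0, 11.6, 11.3, 11.1, 11.0, 11.0  },
};

int main(void)
{
    /* dp[i][s] = minimum utilization of tasks 0..i-1 using s segments. */
    double dp[NTASKS + 1][NSEGMENTS + 1];
    for (int s = 0; s <= NSEGMENTS; s++) dp[0][s] = 0.0;

    for (int i = 1; i <= NTASKS; i++)
        for (int s = 0; s <= NSEGMENTS; s++) {
            dp[i][s] = INFINITY;
            for (int give = 0; give <= s; give++) {
                double u = dp[i - 1][s - give]
                         + wcet[i - 1][give] / period[i - 1];
                if (u < dp[i][s]) dp[i][s] = u;
            }
        }

    double best     = dp[NTASKS][NSEGMENTS];
    double rm_bound = NTASKS * (pow(2.0, 1.0 / NTASKS) - 1.0);
    printf("minimum utilization: %.3f (RM bound for %d tasks: %.3f)\n",
           best, NTASKS, rm_bound);
    printf("schedulable by the RM utilization test: %s\n",
           best <= rm_bound ? "yes" : "no");
    return 0;
}
```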

Journal ArticleDOI
TL;DR: A new cache design approach is described that makes use of a selective invalidation technique that invalidates only those cache entries that are not fresh, without interrupting the processor execution stream and without degrading the cache performance.
Abstract: On-chip memories are becoming an established feature in single-chip microprocessor designs because they significantly improve performance. It is particularly important for single-chip reduced instruction set computer (RISC) microprocessors to include large, high-speed memories, because RISC chips must reduce off-chip memory delays to achieve the shortest possible cycle time. The use of dynamic RAM for all on-chip cache results in an important increase in the density of local memory for a given amount of scarce chip area, but complicates the processor control due to the inherent requirement for refreshing. By using simple circuit techniques and making a few modifications to cache organization, the refreshing requirement of dynamic RAM can be eliminated. This new cache design approach is described. It makes use of a selective invalidation technique that invalidates only those cache entries that are not fresh. This is accomplished without interrupting the processor execution stream and without degrading the cache performance.

Patent
Steven Lee Gregor
26 Jun 1991
TL;DR: In this article, a cache storage system having hardware for in-cache execution of storage-storage and storage-immediate instructions is proposed to obviate the need for data to be moved from the cache to a separate execution unit and back to cache.
Abstract: A cache storage system having hardware for in-cache execution of storage-storage and storage-immediate instructions thereby obviating the need for data to be moved from the cache to a separate execution unit and back to cache.


Patent
14 Jun 1991
TL;DR: In this article, the caches align themselves on a "way" basis by their respective cache controllers communicating with each other as to which blocks of data they are replacing and which of their cache ways are being filled with data.
Abstract: A method for achieving multilevel inclusion in a computer system with first and second level caches. The caches align themselves on a "way" basis by their respective cache controllers communicating with each other as to which blocks of data they are replacing and which of their cache ways are being filled with data. On first and second level cache read misses the first level cache controller provides way information to the second level cache controller to allow received data to be placed in the same way. On first level cache read misses and second level cache read hits, the second level cache controller provides way information to the first level cache controller, which ignores its replacement indication and places data in the indicated way. On processor writes the first level cache controller caches the writes and provides the way information to the second level cache controller, which also caches the writes and uses the way information to select the proper way for data storage. An inclusion bit is set on data in the second level cache that is duplicated in the first level cache. Multilevel inclusion allows the second level cache controller to perform the principal snooping responsibilities for both caches, thereby enabling the first level cache controller to avoid snooping duties until a first level cache snoop hit occurs. On a second level cache snoop hit, the second level cache controller checks the respective inclusion bit to determine if a copy of this data also resides in the first level cache. The first level cache controller is directed to snoop the bus only if the respective inclusion bit is set.

Patent
19 Dec 1991
TL;DR: In this article, a chipset is provided which powers up in a default state with caching disabled and which writes permanently non-cacheable tags into tag RAM entries corresponding to memory addresses being read while caching is disabled.
Abstract: According to the invention, a chipset is provided which powers up in a default state with caching disabled and which writes permanently non-cacheable tags into tag RAM entries corresponding to memory addresses being read while caching is disabled. Even though no "valid" bit is cleared, erroneous cache hits after caching is enabled are automatically prevented, since any address which does match a tag in the tag RAM is a non-cacheable address and will force retrieval directly from main memory anyway.

Journal ArticleDOI
TL;DR: A simple program model for data and block sharing is introduced, and an analytical closed-form solution is found for all components of the cache coherence overhead based on the observation that shared writable blocks are accessed in critical or in semicritical sections.
Abstract: Simulation is used to analyze shared block contention in eight parallel algorithms and its effects on the performance of a cache coherence protocol under the assumption of infinite cache sizes. A simple program model for data and block sharing is introduced, and an analytical closed-form solution is found for all components of the cache coherence overhead. This model is based on the observation that shared writable blocks are accessed in critical or in semicritical sections. The program model is applied to the analysis of multiprocessor systems with finite cache sizes and for steady state computations. The authors compare the model predictions to the results of execution-driven simulations of eight parallel algorithms. The simulation is conducted for various numbers of processors and different cache block sizes.

Proceedings ArticleDOI
01 Apr 1991
TL;DR: The results show that memory access time and page-size constraints limit the size of the primary data and instruction caches to 4I
Abstract: In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) packaging technology to reduce chip-crossing delays. In this paper we present the results of a study for the design of a 250 MHz Gallium Arsenide (GaAs) microprocessor that employs MCM technology to improve performance. The design study for the resulting two-level split cache starts with a baseline cache architecture and then examines the following aspects: 1) primary cache size and degree of associativity; 2) primary data-cache write policy; 3) secondary cache size and organization; 4) primary cache fetch size; 5) concurrency between instruction and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints effectively limit the size of the primary data and instruction caches to 4I

Patent
25 Mar 1991
TL;DR: In this paper, a memory system utilizes miss caching by incorporating a small fully-associative miss cache between a cache (18 or 20) and second-level cache (26).
Abstract: (of EP0449540) A memory system (10) utilizes miss caching by incorporating a small fully-associative miss cache (42) between a cache (18 or 20) and a second-level cache (26). Misses in the cache (18 or 20) that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache (42). Victim caching is an improvement to miss caching that loads a small, fully associative cache (52) with the victim of a miss and not the requested line. Small victim caches (52) of 1 to 4 entries are even more effective at removing conflict misses than miss caching. Stream buffers (62) prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer (62) and not in the cache (18 or 20). Stream buffers (62) are useful in removing capacity and compulsory cache misses, as well as some instruction cache misses. Stream buffers (62) are more effective than previously investigated prefetch techniques when the next slower level in the memory hierarchy is pipelined. An extension to the basic stream buffer, called multi-way stream buffers (62), is useful for prefetching along multiple intertwined data reference streams.
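
A compact sketch of the miss-cache/victim-cache idea in the abstract: a direct-mapped cache backed by a small fully associative victim store. When a miss hits among the victims, the line is swapped back in at a small penalty; otherwise the displaced line becomes the new victim. Sizes and the FIFO victim replacement are illustrative choices, and stream buffers are not modeled here.

```c
#include <stdint.h>
#include <stdbool.h>

#define NSETS        1024u         /* direct-mapped first-level cache   */
#define VICTIM_WAYS  4u            /* "1 to 4 entries" per the abstract */

static struct { uint64_t tag;  bool valid; } dcache[NSETS];
static struct { uint64_t line; bool valid; } victims[VICTIM_WAYS];
static unsigned victim_next;       /* FIFO replacement pointer */

/* 0 = first-level hit, 1 = victim-cache hit (one-cycle penalty),
 * 2 = full miss serviced by the second-level cache. */
int cache_access(uint64_t line)
{
    unsigned set = (unsigned)(line % NSETS);
    uint64_t tag = line / NSETS;

    if (dcache[set].valid && dcache[set].tag == tag)
        return 0;

    bool     displaced      = dcache[set].valid;
    uint64_t displaced_line = dcache[set].tag * NSETS + set;

    dcache[set].tag   = tag;       /* the requested line moves into the set */
    dcache[set].valid = true;

    for (unsigned w = 0; w < VICTIM_WAYS; w++)
        if (victims[w].valid && victims[w].line == line) {
            if (displaced)         /* swap: old resident becomes the victim */
                victims[w].line = displaced_line;
            else
                victims[w].valid = false;
            return 1;
        }

    if (displaced) {               /* full miss: save the displaced line */
        victims[victim_next].line  = displaced_line;
        victims[victim_next].valid = true;
        victim_next = (victim_next + 1) % VICTIM_WAYS;
    }
    return 2;
}
```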

Proceedings ArticleDOI
01 Sep 1991
TL;DR: New cache architectures that address the problems of conflict misses and non-optimal line sizes in the context of direct-mapped caches and can be reconfigured by software in a way that matches the reference pattern for array data structures are presented.
Abstract: Cache memory has been shown to be the most important technique to bridge the gap between the processor speed and the memory access time. The advent of high-speed RISC and superscalar processors, however, calls for small on-chip data caches. Due to physical limitations, these should be simply designed and yet yield good performance. In this paper, we present new cache architectures that address the problems of conflict misses and non-optimal line sizes in the context of direct-mapped caches. Our cache architectures can be reconfigured by software in a way that matches the reference pattern for array data structures. We show that the implementation cost of the reconfiguration capability is negligible. We also show simulation results that demonstrate significant performance improvements for both methods.

Proceedings ArticleDOI
30 Apr 1991
TL;DR: Two simple prefetch schemes that reduce the influence of long stride vector accesses on cache performance and have better performance than the nonprefetching cache are presented.
Abstract: Reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks. Using a low cost trace driven simulation technique it is shown how a nonprefetching vector cache can result in unpredictable performance and how this unpredictability makes it difficult to find a good block size. Two simple prefetch schemes that reduce the influence of long stride vector accesses on cache performance and have better performance than the nonprefetching cache are presented.


01 May 1991
TL;DR: The SMART (Strategic Memory Allocation for Real-Time) cache design approach narrows the gap between average and worst case performance and the impressive average case performance provided by conventional caches.
Abstract: Since they were first introduced in the IBM 360/85 in 1969, the primary application of cache memories has been in the general purpose computing community. Thus, it is no surprise that modern cache designs are optimized for average case performance. This optimization criterion has opened a wide gap between the average case performance which is important to general purpose computing and the worst case performance that is critical to real-time computing, thereby delaying the adoption of caches by the real-time community. The SMART (Strategic Memory Allocation for Real-Time) cache design approach narrows the gap between this worst case performance and the impressive average case performance provided by conventional caches. The SMART design approach is a software controlled partitioning strategy which allocates cache partitions to qualifying tasks. The hardware requirements for this partitioning are minimal as demonstrated through an example implementation with the MIPS R3000 processor. An algorithm which optimally allocates cache segments to a set of periodic tasks using rate monotonic scheduling has been developed. This algorithm, which minimizes task set utilization while guaranteeing schedulability, uses dynamic programming to reduce the tree-based exponential search space to a polynomial one. This reduction of the search space is critical to the goal of providing dynamic reallocation of cache segments during mode switches in real-time systems. Simulation results show SMART caches narrowing the gap between average and worst case performance to less than 10%.

Proceedings Article
01 Jan 1991
TL;DR: It is found that the performance of the cache grouping scheme closely approaches that of a full-directory scheme, and the system performance is relatively insensitive to cache group size or the availability of sophisticated multicast and combining features in the network.
Abstract: A scheme which employs cache grouping and incomplete directory state in order to reduce the cost of maintaining directory state in a shared memory coherent cache system was introduced in earlier work by the authors and others. In this paper we report on detailed simulation studies of a cache grouping scheme employing multistaged network interconnects. We examine the effects of cache group size and support for multicast and combining in the network. We find that the performance of the cache grouping scheme closely approaches that of a full-directory scheme. We also learn that, due to the dominance of one-to-one invalidates in our example applications, the system performance is relatively insensitive to cache group size or the availability of sophisticated multicast and combining features in the network, at least for the relatively small systems we are capable of simulating.

Patent
Jeffrey L. Nye
30 Apr 1991
TL;DR: A data memory management unit for providing cache access in a signal processing system is described; a programmable translation unit and an address processor are used to alter the manner in which addresses are translated and the cache is filled, in accordance with a cache replacement mechanism selected so that the cache is filled in a manner most likely to have a high hit rate for the processing algorithm currently in operation.
Abstract: A data memory management unit for providing cache access in a signal processing system. A programmable translation unit and an address processor are used to alter the manner in which addresses are translated and the cache is filled, respectively. The translation unit and address processor operate in accordance with a selected cache replacement mechanism, which is selected so that the cache is filled in a manner most likely to have a high hit rate for the processing algorithm currently in operation.