
Showing papers on "Cache algorithms published in 1994"


Proceedings ArticleDOI
14 Nov 1994
TL;DR: In this article, the authors examine four cooperative caching algorithms using a trace-driven simulation study and conclude that cooperative caching can significantly improve file system read response time and that relatively simple cooperative caching methods are sufficient to realize most of the potential performance gain.
Abstract: Emerging high-speed networks will allow machines to access remote data nearly as quickly as they can access local data. This trend motivates the use of cooperative caching: coordinating the file caches of many machines distributed on a LAN to form a more effective overall file cache. In this paper we examine four cooperative caching algorithms using a trace-driven simulation study. These simulations indicate that for the systems studied cooperative caching can halve the number of disk accesses, improving file system read response time by as much as 73%. Based on these simulations we conclude that cooperative caching can significantly improve file system read response time and that relatively simple cooperative caching algorithms are sufficient to realize most of the potential performance gain.

491 citations
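The read path shared by the cooperative-caching algorithms can be pictured with a minimal sketch: check the local cache, then the caches of other clients on the network, and go to disk only if no client holds the block. The class, the FIFO eviction, and the example names below are illustrative stand-ins, not the four algorithms simulated in the paper.

```python
# Illustrative cooperative-caching read path: local cache -> peer caches -> disk.

class CoopClient:
    def __init__(self, name, capacity, peers):
        self.name = name
        self.capacity = capacity
        self.peers = peers          # other clients on the LAN
        self.cache = {}             # block_id -> data, insertion-ordered

    def lookup_local(self, block_id):
        return self.cache.get(block_id)

    def read(self, block_id, disk):
        data = self.lookup_local(block_id)          # 1. local memory hit
        if data is not None:
            return data, "local"
        for peer in self.peers:                     # 2. remote memory hit
            data = peer.lookup_local(block_id)
            if data is not None:
                self.insert(block_id, data)
                return data, "remote"
        data = disk[block_id]                       # 3. disk access
        self.insert(block_id, data)
        return data, "disk"

    def insert(self, block_id, data):
        if len(self.cache) >= self.capacity:        # evict oldest block (FIFO stand-in)
            self.cache.pop(next(iter(self.cache)))
        self.cache[block_id] = data

disk = {b: f"block{b}" for b in range(100)}
a = CoopClient("A", capacity=4, peers=[])
b = CoopClient("B", capacity=4, peers=[a])
a.peers.append(b)
a.read(7, disk)              # misses everywhere, goes to disk
print(b.read(7, disk)[1])    # "remote": served from A's memory, not disk
```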


Proceedings ArticleDOI
01 Apr 1994
TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches, and as the data-set size of the scientific workload increases the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data- set sizes.
Abstract: Today's commodity microprocessors require a low-latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches). We evaluate a memory system design that can be both cost-effective and higher-performing, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers [10] and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. We use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in our evaluation. We present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement and a scheme that enables stream buffers to prefetch data being accessed in large strides. Our results show that, for the majority of our benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, we find that as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.

368 citations
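A rough sketch of the mechanism, assuming a single buffer and a caller-supplied stride: on a miss the buffer is reallocated and prefetches the next few blocks at the observed stride, and a later reference that matches the head of the buffer hits. The paper's filtering schemes and multi-buffer allocation are omitted.

```python
from collections import deque

class StreamBuffer:
    """Toy single stream buffer with stride prefetch; depth is illustrative."""
    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()      # prefetched block addresses
        self.next_addr = None
        self.stride = 1

    def allocate(self, miss_addr, stride):
        self.stride = stride
        self.entries.clear()
        addr = miss_addr + stride
        for _ in range(self.depth):
            self.entries.append(addr)       # model: prefetch issued to memory
            addr += stride
        self.next_addr = addr

    def access(self, addr):
        if self.entries and addr == self.entries[0]:
            self.entries.popleft()                  # hit at head of buffer
            self.entries.append(self.next_addr)     # keep the buffer full
            self.next_addr += self.stride
            return True
        return False

buf = StreamBuffer(depth=4)
last_miss = None
hits = misses = 0
for block in range(0, 40, 2):          # strided block reference stream
    if buf.access(block):
        hits += 1
    else:
        misses += 1
        stride = block - last_miss if last_miss is not None else 1
        buf.allocate(block, stride)
        last_miss = block
print(hits, misses)                    # most references hit after the 2nd miss
```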


Journal ArticleDOI
TL;DR: This article describes a caching strategy that offers the performance of a cache twice its size and investigates three cache replacement algorithms: random replacement, least recently used, and a frequency-based variation of LRU known as segmented LRU (SLRU).
Abstract: I/O subsystem manufacturers attempt to reduce latency by increasing disk rotation speeds, incorporating more intelligent disk scheduling algorithms, increasing I/O bus speed, using solid-state disks, and implementing caches at various places in the I/O stream. In this article, we examine the use of caching as a means to reduce system response time and improve the data throughput of the disk subsystem. Caching can help to alleviate I/O subsystem bottlenecks caused by mechanical latencies. This article describes a caching strategy that offers the performance of a cache twice its size. After explaining some basic caching issues, we examine some popular caching strategies and cache replacement algorithms, as well as the advantages and disadvantages of caching at different levels of the computer system hierarchy. Finally, we investigate the performance of three cache replacement algorithms: random replacement (RR), least recently used (LRU), and a frequency-based variation of LRU known as segmented LRU (SLRU).

325 citations
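Segmented LRU can be sketched briefly: a block enters a probationary segment on its first reference, is promoted to a protected segment on a re-reference, and protected victims are demoted rather than discarded. The segment sizes below are arbitrary illustrative choices, not the article's configuration.

```python
from collections import OrderedDict

class SLRUCache:
    """Minimal segmented LRU sketch (probationary + protected segments)."""
    def __init__(self, prob_size, prot_size):
        self.prob = OrderedDict()   # probationary segment, MRU at the end
        self.prot = OrderedDict()   # protected segment,    MRU at the end
        self.prob_size, self.prot_size = prob_size, prot_size

    def access(self, key):
        if key in self.prot:                        # hit in protected
            self.prot.move_to_end(key)
            return True
        if key in self.prob:                        # second hit: promote
            del self.prob[key]
            self._insert_protected(key)
            return True
        if len(self.prob) >= self.prob_size:        # miss: enter probationary
            self.prob.popitem(last=False)           # evict probationary LRU
        self.prob[key] = None
        return False

    def _insert_protected(self, key):
        if len(self.prot) >= self.prot_size:
            demoted, _ = self.prot.popitem(last=False)
            self.prob[demoted] = None               # demote, don't discard
            if len(self.prob) > self.prob_size:
                self.prob.popitem(last=False)
        self.prot[key] = None

cache = SLRUCache(prob_size=3, prot_size=3)
trace = [1, 2, 1, 3, 4, 5, 1, 2, 2, 6, 1]
hits = sum(cache.access(x) for x in trace)
print(f"hits: {hits}/{len(trace)}")
```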


Journal ArticleDOI
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
Abstract: The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.

265 citations
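One way to picture the layout optimization, under the simplifying assumption that each shared variable has a single dominant writer: pack variables into cache blocks grouped by the writing processor, so that updates from different processors never land in the same block. The paper's transformations are more general (and also improve spatial locality); this sketch shows only the grouping idea, with an assumed block size.

```python
BLOCK_BYTES = 64   # assumed cache block size

def layout_by_writer(variables):
    """variables: list of (name, size_bytes, writer_id).
    Returns (writer, [names]) cache blocks with no cross-processor sharing."""
    by_writer = {}
    for name, size, writer in variables:
        by_writer.setdefault(writer, []).append((name, size))

    blocks = []
    for writer, vars_ in sorted(by_writer.items()):
        current, used = [], 0
        for name, size in vars_:
            if current and used + size > BLOCK_BYTES:   # start a new block
                blocks.append((writer, current))
                current, used = [], 0
            current.append(name)
            used += size
        if current:
            blocks.append((writer, current))
    return blocks

shared = [("ctr_p0", 8, 0), ("flag_p1", 8, 1), ("ctr_p1", 8, 1),
          ("buf_p0", 48, 0), ("flag_p0", 8, 0)]
for writer, names in layout_by_writer(shared):
    print(f"block (written by P{writer}): {names}")
```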


Journal ArticleDOI
TL;DR: It is shown that cache profiling, using the CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.
Abstract: A vital tool-box component, the CProf cache profiling system lets programmers identify hot spots by providing cache performance information at the source-line and data-structure level. Our purpose is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.

242 citations
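The flavor of source-level cache profiling can be shown with a toy model: feed (source line, address) pairs through a small direct-mapped cache and charge each miss to the source line that issued it, then rank lines by miss count. CProf additionally classifies misses (compulsory, capacity, conflict) and maps them to data structures; those features are not modeled here, and the cache parameters are arbitrary.

```python
from collections import Counter

LINES, LINE_BYTES = 64, 32          # toy direct-mapped data cache

def profile(trace):
    """trace: iterable of (source_line, address) pairs."""
    tags = [None] * LINES
    misses = Counter()
    for src_line, addr in trace:
        block = addr // LINE_BYTES
        idx, tag = block % LINES, block // LINES
        if tags[idx] != tag:                # miss: charge the source line
            tags[idx] = tag
            misses[src_line] += 1
    return misses.most_common()

# "Source line 12" strides 512 bytes per reference (misses every time);
# "source line 27" walks sequentially (misses once per 32-byte block).
trace  = [(12, i * 512) for i in range(1024)]
trace += [(27, i * 8)   for i in range(1024)]
for src_line, n in profile(trace):
    print(f"source line {src_line}: {n} misses")
```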


Patent
29 Jul 1994
TL;DR: In this article, a technique called key swizzling is proposed to reduce the volume of queries to the structured database by using explicit relationship pointers between object instances in the object cache.
Abstract: In an object-oriented application being executed in a digital computing system comprising a processor, a method and apparatus are provided for managing information retrieved from a structured database, such as a relational database, wherein the processor is used to construct a plurality of object instances, each of these object instances having its own unique object ID that provides a mapping between the object instance and at least one row in the structured database. The processor is used to construct a single cohesive data structure, called an object cache, that comprises all the object instances and that represents information retrieved from the structured database in a form suitable for use by one or more object-oriented applications. A mechanism for managing the object cache is provided that has these three properties: First, through a technique called key swizzling, it uses explicit relationship pointers between object instances in the object cache to reduce the volume of queries to the structured database. Second, it ensures that only one copy of an object instance is in the cache at any given time, even if several different queries return the same information from the database. Third, the mechanism guarantees the integrity of data in the cache by locking data appropriately in the structured database during a database transaction, flushing cache data at the end of each transaction, and transparently re-reading the data and reacquiring the appropriate locks for an object instance whose data has been flushed.

242 citations
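A compact sketch of the swizzling idea, using hypothetical table and class names: the first traversal of a relationship resolves the foreign key through the object cache (issuing a query only on a cache miss) and replaces it with a direct object reference, so later traversals are pure pointer chases; the cache also keeps at most one in-memory copy per row. The patent's locking, flushing, and transaction handling are omitted.

```python
# Object cache with key swizzling (illustrative names; not the patent's API).
FAKE_DB = {("dept", 1): {"id": 1, "name": "Tools"},
           ("emp", 7):  {"id": 7, "name": "Ada", "dept_id": 1}}
QUERIES = 0

def db_fetch(table, key):
    global QUERIES
    QUERIES += 1                          # count round trips to the database
    return dict(FAKE_DB[(table, key)])

class ObjectCache:
    def __init__(self):
        self._objects = {}                # (table, key) -> single instance

    def get(self, table, key):
        oid = (table, key)
        if oid not in self._objects:      # miss -> one query, one copy
            self._objects[oid] = db_fetch(table, key)
        return self._objects[oid]

class Employee:
    def __init__(self, cache, key):
        self._cache = cache
        self.row = cache.get("emp", key)
        self._dept = None                 # unswizzled: only the key is known

    @property
    def dept(self):
        if self._dept is None:            # first traversal: resolve + swizzle
            self._dept = self._cache.get("dept", self.row["dept_id"])
        return self._dept                 # later traversals: pointer chase only

cache = ObjectCache()
e = Employee(cache, 7)
print(e.dept["name"], e.dept["name"], "queries issued:", QUERIES)  # 2 queries total
```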


Proceedings ArticleDOI
07 Dec 1994
TL;DR: This paper describes an approach for bounding the worst-case instruction cache performance of large code segments by using static cache simulation to analyze a program's control flow to statically categorize the caching behavior of each instruction.
Abstract: The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable, since the behavior of a cache reference depends upon the history of the previous references. The use of caches is only suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.

233 citations
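A heavily simplified sketch of the categorization step, for a straight-line loop body and a direct-mapped instruction cache: each instruction is classified as always-hit, first-miss (misses only on the first iteration), or always-miss (its line is evicted within the loop). The real static cache simulation handles arbitrary control flow, nesting, and function calls; the cache parameters here are arbitrary.

```python
LINES, LINE_BYTES = 4, 16        # tiny direct-mapped instruction cache

def categorize(loop_addresses):
    """Classify each instruction of a straight-line loop body."""
    blocks = [a // LINE_BYTES for a in loop_addresses]
    lines_per_index = {}                              # cache index -> lines used
    for b in blocks:
        lines_per_index.setdefault(b % LINES, set()).add(b)

    cats, seen = [], set()
    for b in blocks:
        if b in seen:
            cats.append("always-hit")    # an earlier instruction loads the line
        elif len(lines_per_index[b % LINES]) > 1:
            seen.add(b)
            cats.append("always-miss")   # its line is evicted every iteration
        else:
            seen.add(b)
            cats.append("first-miss")    # only the first iteration misses
    return cats

loop = [0x100, 0x104, 0x108, 0x10c, 0x110, 0x140]   # 0x100 and 0x140 conflict
for addr, cat in zip(loop, categorize(loop)):
    print(hex(addr), cat)
```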


Proceedings Article
12 Sep 1994
TL;DR: It is shown that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of the cache, and new algorithms run 8%-200% faster than the traditional ones.
Abstract: The current main memory (DRAM) access speeds lag far behind CPU speeds. Cache memory, made of static RAM, is being used in today's architectures to bridge this gap. It provides access latencies of 2-4 processor cycles, in contrast to main memory which requires 15-25 cycles. Therefore, the performance of the CPU depends upon how well the cache can be utilized. We show that there are significant benefits in redesigning our traditional query processing algorithms so that they can make better use of the cache. The new algorithms run 8%-200% faster than the traditional ones.

215 citations
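As a generic illustration of the idea (not one of the paper's specific algorithms), a hash join can be restructured so that each build-side hash table fits in the cache: partition both inputs first, then join one cache-sized partition at a time. The partition count below is an assumed tuning knob.

```python
from collections import defaultdict

NUM_PARTITIONS = 8        # assumed: chosen so each partition's table fits in cache

def partitioned_hash_join(build, probe):
    """build, probe: lists of (key, payload). Returns joined tuples."""
    build_parts = defaultdict(list)
    probe_parts = defaultdict(list)
    for key, val in build:                      # pass 1: scatter into
        build_parts[hash(key) % NUM_PARTITIONS].append((key, val))
    for key, val in probe:                      # cache-sized partitions
        probe_parts[hash(key) % NUM_PARTITIONS].append((key, val))

    results = []
    for p in range(NUM_PARTITIONS):             # pass 2: join one small,
        table = defaultdict(list)               # cache-resident table at a time
        for key, val in build_parts[p]:
            table[key].append(val)
        for key, val in probe_parts[p]:
            for bval in table.get(key, ()):
                results.append((key, bval, val))
    return results

orders = [(i % 100, f"order{i}") for i in range(1000)]
customers = [(i, f"cust{i}") for i in range(100)]
print(len(partitioned_hash_join(customers, orders)))   # 1000 joined rows
```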


Proceedings ArticleDOI
01 Nov 1994
TL;DR: The data presented shows that increasing the associativity of second-level caches can reduce miss rates significantly and should help system designers choose a cache configuration that will perform well in commercial markets.
Abstract: Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user commercial workloads. This paper presents research, using traces of industry-standard commercial benchmarks, which examines the characteristic differences between technical and commercial workloads and illustrates how those differences affect cache performance. Commercial and technical environments differ in their respective branch behavior, operating system activity, I/O, and dispatching characteristics. A wide range of uniprocessor instruction and data cache geometries were studied. The instruction cache results for commercial workloads demonstrate that instruction cache performance can no longer be neglected because these workloads have much larger code working sets than technical applications. For database workloads, a breakdown of kernel and user behavior reveals that the application component can exhibit behavior similar to the operating system and, therefore, can experience miss rates equally high. This paper also indicates that “dispatching” or process switching characteristics must be considered when designing level-two caches. The data presented shows that increasing the associativity of second-level caches can reduce miss rates significantly. Overall, the results of this research should help system designers choose a cache configuration that will perform well in commercial markets.

196 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
Abstract: The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.

195 citations
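The exclusive policy can be sketched as follows, modeling both levels as fully associative LRU for brevity: a block lives in L1 or L2 but never both, and an L2 hit promotes the block into L1 while demoting the L1 victim into L2, so the combined capacity approaches |L1| + |L2|. This illustrates the replacement policy only, not the paper's area and access-time models.

```python
from collections import OrderedDict

class ExclusiveTwoLevel:
    """Two-level exclusive caching sketch (fully associative LRU levels)."""
    def __init__(self, l1_size, l2_size):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def _evict_l1(self):
        victim, _ = self.l1.popitem(last=False)      # L1 LRU victim
        if len(self.l2) >= self.l2_size:
            self.l2.popitem(last=False)
        self.l2[victim] = None                       # demote into L2

    def access(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)
            return "l1_hit"
        if block in self.l2:
            del self.l2[block]                       # exclusive: leave L2
            if len(self.l1) >= self.l1_size:
                self._evict_l1()                     # swap: L1 victim goes down
            self.l1[block] = None
            return "l2_hit"
        if len(self.l1) >= self.l1_size:             # miss: fill into L1 only
            self._evict_l1()
        self.l1[block] = None
        return "miss"

c = ExclusiveTwoLevel(l1_size=2, l2_size=4)
print([c.access(b) for b in [1, 2, 3, 1, 4, 5, 6, 2, 3]])
```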


Journal ArticleDOI
01 Nov 1994
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
Abstract: This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
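A software-level sketch of the mechanism, with illustrative sizes and threshold: count misses per physical page (the role of the CML buffer's miss summary) and, once a page crosses the threshold, have the "OS" recolor it to a free cache bin. The real design is a small hardware table plus a virtual-memory remapping policy; none of the hardware details are modeled here.

```python
from collections import Counter

PAGE = 4096
CACHE_BINS = 4            # direct-mapped cache of 4 page-sized bins (assumed)
THRESHOLD = 8             # misses per page before remapping (assumed)

page_to_bin = {}          # the OS's page-coloring decision
miss_count = Counter()    # stand-in for the CML buffer's miss summary
resident = {}             # bin -> page currently cached there

def color(page):
    return page_to_bin.setdefault(page, page % CACHE_BINS)

def access(addr):
    page = addr // PAGE
    b = color(page)
    if resident.get(b) == page:
        return "hit"
    resident[b] = page                       # conflict or cold miss
    miss_count[page] += 1
    if miss_count[page] >= THRESHOLD:        # too many conflicts: recolor
        bins_in_use = set(page_to_bin.values())
        free = [x for x in range(CACHE_BINS) if x not in bins_in_use]
        if free:
            page_to_bin[page] = free[0]
            miss_count[page] = 0
    return "miss"

# Pages 0 and 4 collide in bin 0 until they are recolored to separate bins.
trace = [p * PAGE for p in (0, 4) * 20]
print(Counter(access(a) for a in trace))
```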

Patent
Alexander D. Peleg1, Uri Weiser1
30 Mar 1994
TL;DR: In this paper, the authors propose an improved cache and organization particularly suitable for superscalar architectures, where the cache is organized around trace segments of running programs rather than an organization based on memory addresses.
Abstract: An improved cache and organization particularly suitable for superscalar architectures. The cache is organized around trace segments of running programs rather than an organization based on memory addresses. A single access to the cache memory may cross virtual address line boundaries. Branch prediction is integrally incorporated into the cache array permitting the crossing of branch boundaries with a single access.
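A minimal sketch of the indexing idea: entries hold short dynamic instruction sequences keyed by their start address plus the branch outcomes observed when the trace was built, so a single lookup can return instructions that span branch (and line) boundaries. This is only a lookup/fill skeleton, not the patent's array organization or its integrated branch prediction.

```python
class TraceCache:
    def __init__(self, max_trace_len=8):
        self.max_trace_len = max_trace_len
        self.entries = {}      # (start_pc, branch_outcomes) -> list of PCs

    def lookup(self, start_pc, predicted_outcomes):
        return self.entries.get((start_pc, predicted_outcomes))

    def fill(self, executed_pcs, branch_outcomes):
        key = (executed_pcs[0], branch_outcomes)
        self.entries[key] = executed_pcs[: self.max_trace_len]

tc = TraceCache()
# A dynamic path: basic block at 0x100, taken branch to 0x200, fall-through.
path = [0x100, 0x104, 0x108, 0x200, 0x204]
tc.fill(path, branch_outcomes=("taken", "not_taken"))

# When the predictor predicts the same outcomes, one lookup returns the whole
# cross-branch sequence; a different prediction misses the trace cache.
print(tc.lookup(0x100, ("taken", "not_taken")))
print(tc.lookup(0x100, ("not_taken", "not_taken")))
```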

Proceedings Article
06 Jun 1994
TL;DR: The main contribution of this paper is the solution to the allocation problem, which allows processes to manage their own cache blocks, while at the same time maintains the dynamic allocation of cache blocks among processes.
Abstract: We consider how to improve the performance of file caching by allowing user-level control over file cache replacement decisions. We use two-level cache management: the kernel allocates physical pages to individual applications (allocation), and each application is responsible for deciding how to use its physical pages (replacement). Previous work on two-level memory management has focused on replacement, largely ignoring allocation. The main contribution of this paper is our solution to the allocation problem. Our solution allows processes to manage their own cache blocks, while at the same time maintains the dynamic allocation of cache blocks among processes. Our solution makes sure that good user-level policies can improve the file cache hit ratios of the entire system over the existing replacement approach. We evaluate our scheme by trace-based simulation, demonstrating that it leads to significant improvements in hit ratios for a variety of applications.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: It is demonstrated that for applications that do not perform well under traditional caching policies, the combination of good application-chosen replacement strategies, and the kernel allocation policy LRU-SP, can reduce the number of block I/Os and reduce the elapsed time by up to 45%.
Abstract: Traditional file system implementations do not allow applications to control file caching replacement decisions. We have implemented two-level replacement, a scheme that allows applications to control their own cache replacement, while letting the kernel control the allocation of cache space among processes. We designed an interface to let applications exert control on replacement via a set of directives to the kernel. This is effective and requires low overhead. We demonstrate that for applications that do not perform well under traditional caching policies, the combination of good application-chosen replacement strategies and our kernel allocation policy LRU-SP can reduce the number of block I/Os by up to 80%, and can reduce the elapsed time by up to 45%. We also show that LRU-SP is crucial to the performance improvement for multiple concurrent applications: LRU-SP fairly distributes cache blocks and offers protection against foolish applications.
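A simplified sketch of the split, in the spirit of LRU-SP: the kernel's global LRU list decides which process must give up a block, but that process's own policy chooses which of its blocks to release (a "swap"). The placeholder mechanism that protects against poorly behaved user policies is omitted, and the policy callbacks are illustrative.

```python
from collections import OrderedDict

class Kernel:
    def __init__(self, total_blocks):
        self.total = total_blocks
        self.global_lru = OrderedDict()      # (proc_name, block) -> None, MRU last

    def access(self, proc, block):
        key = (proc.name, block)
        hit = key in self.global_lru
        if hit:
            self.global_lru.move_to_end(key)
        else:
            if len(self.global_lru) >= self.total:
                self._reclaim_one()
            self.global_lru[key] = None
            proc.blocks.add(block)
        return hit

    def _reclaim_one(self):
        victim_proc_name, lru_block = next(iter(self.global_lru))
        proc = PROCS[victim_proc_name]
        chosen = proc.choose_victim(lru_block)       # user-level decision ("swap")
        del self.global_lru[(victim_proc_name, chosen)]
        proc.blocks.discard(chosen)

class Proc:
    def __init__(self, name, policy):
        self.name, self.policy, self.blocks = name, policy, set()

    def choose_victim(self, kernel_suggestion):
        return self.policy(self.blocks, kernel_suggestion)

# Process A uses an MRU-style policy (good for sequential scans);
# process B simply accepts the kernel's LRU suggestion.
PROCS = {"A": Proc("A", lambda blks, s: max(blks)),
         "B": Proc("B", lambda blks, s: s)}

k = Kernel(total_blocks=4)
for blk in [1, 2, 3]:
    k.access(PROCS["A"], blk)
k.access(PROCS["B"], 10)
k.access(PROCS["A"], 4)      # kernel charges A (its block 1 is the global LRU),
                             # but A chooses to drop its MRU block 3 instead
print(sorted(PROCS["A"].blocks), sorted(PROCS["B"].blocks))
```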

Patent
16 May 1994
TL;DR: In this article, a cache memory is organized in multiple levels, each level having multiple entries, the entries of each level receiving information of a predetermined category, each entry being accessible independently, links are defined between entries of one level of the cache memory and entries at another level of cache memory, the links corresponding to information relationships specified by a user of information stored in the secondary storage.
Abstract: Information needed by application programs from a secondary storage is cached in a cache memory which is organized in multiple levels, each level having multiple entries, the entries of each level receiving information of a predetermined category, each entry being accessible independently. Links are defined between entries of one level of the cache memory and entries at another level of the cache memory, the links corresponding to information relationships specified by a user of information stored in the secondary storage. In response to a request to a file system from an application for needed information, the needed information is fetched into the cache, and in connection with fetching the needed information, other information is prefetched from the system of files which is not immediately needed. Quotas are established on information which may be fetched from a secondary storage into the cache, the quotas being applicable to file contents within a file and to the number of files within a directory. Upon a request from an application program to open a file, an entry is created in a cache corresponding to the file, the entry including file header information. The file header information is retained in the cache so long as the file remains open, whether or not any file contents of the file remain in cache. The entries of the three levels of the cache respectively receive directory information, file header information, and file contents information.

Patent
26 Apr 1994
TL;DR: In this article, a hierarchical memory system is provided which includes a cache and long-term storage, in which an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the longterm storage.
Abstract: A data processing system (10) has a processor with a processor memory and a mechanism for specifying an address that corresponds to a processor-requested data block located within another memory to be accessed by the processor. A hierarchical memory system is provided which includes a cache (16) and long-term storage (20). In accordance with a mapping and meshing process performed by a memory subsystem (22), an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage. In accordance with a cache drain mechanism, the cache will drain data from the cache to the physical target disks under different specified conditions. A further mechanism is provided for preserving data within the cache that is frequently accessed by the requester processor. A user-configuration mechanism is provided.
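The meshing step itself can be sketched in a few lines, with an assumed disk count: translate each logical block address so that consecutive (proximate) blocks land on different physical target disks, which lets cache drains and sequential reads proceed on several spindles at once.

```python
NUM_DISKS = 4        # assumed number of physical target disks

def mesh(logical_block):
    """Translate a logical block to (target_disk, block_on_disk)."""
    disk = logical_block % NUM_DISKS            # neighbors -> different disks
    offset = logical_block // NUM_DISKS
    return disk, offset

def unmesh(disk, offset):
    return offset * NUM_DISKS + disk            # inverse mapping

for lb in range(6):                             # six consecutive blocks
    d, off = mesh(lb)
    assert unmesh(d, off) == lb
    print(f"logical {lb} -> disk {d}, offset {off}")
```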

Proceedings ArticleDOI
01 Apr 1994
TL;DR: This paper has investigated the SPEC92 benchmarks and found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost, however, for most of the numeric benchmarks, more expensive implementations are worthwhile.
Abstract: Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. In this paper, we describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and instrumentation system. We have investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in non-blocking systems.


Patent
20 Jun 1994
TL;DR: In this paper, a cache storage drawer containing a plurality of DASD devices for implementing a RAID parity data protection scheme, and permanently storing data, is coupled with a cache controller.
Abstract: A system and method for reducing device wait time in response to a host initiated write operation modifying a data block. The system includes a host computer channel connected to a storage controller which has cache memory and a nonvolatile storage buffer in a first embodiment. An identical system makes up the second embodiment with the exception that there is no nonvolatile storage buffer in the storage controller of the second embodiment. The controller in either embodiment is coupled to a cache storage drawer containing a plurality of DASD devices for implementing a RAID parity data protection scheme, and for permanently storing data. The drawer has nonvolatile cache memory which is used for accepting data destaged from controller cache. In a first embodiment, no commit reply is sent to the controller to indicate that data has been written to DASD. Instead a status information block is created to indicate that the data has been destaged from controller cache but is not committed. The status information is stored in directory means attached to the controller. The system uses this information to create a list of data which is in the state of Not committed. In this way data can be committed according to a cache management algorithm of least recently used (LRU), rather than requiring synchronous commit which is inefficient because it requires waiting on a commit response and ties up nonvolatile storage space allocated to back-up copies of cache data. In a second embodiment, directory means attached to the controller stores information about status blocks that may be modified or unmodified. The status information is used to eliminate wait times associated with waiting for data to be written to HDAs below.

Proceedings ArticleDOI
28 Feb 1994
TL;DR: A new processor implementing Hewlett-Packard's PA-RISC 1.1 (Precision Architecture) has been designed, which incorporates many improvements over the HP PA7100 CPU, including increased frequency, instruction and data cache prefetching, enhanced superscalar execution, and enhanced multiprocessor support.
Abstract: A new processor implementing Hewlett-Packard's PA-RISC 1.1 (Precision Architecture) has been designed. This latest design incorporates many improvements over the HP PA7100 CPU, including increased frequency, instruction and data cache prefetching, enhanced superscalar execution, and enhanced multiprocessor support. The PA7200 connects directly to a new split transaction, 120 MHz, 64-bit bus capable of supporting multiple processors and multiple outstanding memory reads per processor. A novel fully associative on-chip data cache, which is accessed in parallel with an external data cache, is used to reduce the miss rate and facilitate hardware and software directed prefetching to reduce average memory access time.

Patent
12 Dec 1994
TL;DR: In this paper, a cache indexer maintains a current index of data elements which are stored in cache memory and a sequential data access indicator, responsive to the cache index and to a user selectable sequential access threshold, determines that a sequential access is in progress for a given process and provides an indication of the same.
Abstract: A cache management system and method monitors and controls the contents of cache memory coupled to at least one host and at least one data storage device. A cache indexer maintains a current index of data elements which are stored in cache memory. A sequential data access indicator, responsive to the cache index and to a user selectable sequential data access threshold, determines that a sequential data access is in progress for a given process and provides an indication of the same. The system and method allocate a micro-cache memory to any process performing a sequential data access. In response to the indication of a sequential data access in progress and to a user selectable maximum number of data elements to be prefetched, a data retrieval requestor requests retrieval of up to the selected maximum number of data elements from a data storage device. A user selectable number of sequential data elements determines when previously used micro-cache memory locations will be overwritten. A method of dynamically monitoring and adjusting cache management parameters is also presented.
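A sketch of the detection-and-prefetch loop, with stand-ins for the user-selectable parameters: track each process's run of consecutive blocks, and once the run exceeds the threshold, prefetch up to the configured maximum number of following data elements into that process's micro-cache.

```python
SEQ_THRESHOLD = 3        # assumed: consecutive blocks before declaring "sequential"
MAX_PREFETCH = 4         # assumed: maximum data elements to prefetch

class ProcessCacheState:
    def __init__(self):
        self.last_block = None
        self.run_length = 0
        self.micro_cache = set()     # blocks prefetched for this process

    def on_access(self, block):
        hit = block in self.micro_cache
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_block = block
        if self.run_length >= SEQ_THRESHOLD:          # sequential access detected:
            for b in range(block + 1, block + 1 + MAX_PREFETCH):
                self.micro_cache.add(b)               # issue prefetch requests
        return hit

p = ProcessCacheState()
trace = list(range(100, 112))                 # a purely sequential scan
results = [p.on_access(b) for b in trace]
print(results)    # a few misses until the run is detected, then all hits
```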

Patent
23 Dec 1994
TL;DR: In this paper, a column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time is proposed, where the cache lines represent a column of sets.
Abstract: A column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time. The column-associative cache indexes data from a main memory into a plurality of cache lines according to a tag and index field through hash and rehash functions. The cache lines represent a column of sets. Each cache line contains a rehash block indicating whether the set is a rehash location. To increase the performance of the column-associative cache, a content addressable memory (CAM) is used to predict future conflict misses.
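A sketch of the hash/rehash lookup: probe the set given by the primary index; on a miss, probe the alternate set obtained by flipping the top index bit, and on a second-probe hit swap the two lines so the next reference hits in one probe. The per-line rehash bit marks lines stored under the rehash function; the CAM-based miss prediction from the patent is not modeled, and the sizes are illustrative.

```python
SETS = 8                       # power of two (illustrative)
FLIP = SETS >> 1               # flips the most-significant index bit

class ColumnAssociativeCache:
    def __init__(self):
        self.tag = [None] * SETS
        self.rehash_bit = [False] * SETS

    def access(self, block):
        i = block % SETS                       # hash index  h1
        j = i ^ FLIP                           # rehash index h2
        if self.tag[i] == block:
            return "first-probe hit"
        if self.rehash_bit[i]:                 # the line here was itself rehashed,
            self.tag[i] = block                # so the block cannot be at h2:
            self.rehash_bit[i] = False         # replace this line directly
            return "miss"
        if self.tag[j] == block:               # second-probe (rehash) hit: swap so
            self.tag[i], self.tag[j] = self.tag[j], self.tag[i]   # the next access
            self.rehash_bit[i], self.rehash_bit[j] = False, True  # hits in one probe
            return "rehash hit"
        self.tag[j] = self.tag[i]              # miss: displace the h1 line into h2
        self.rehash_bit[j] = True
        self.tag[i] = block
        self.rehash_bit[i] = False
        return "miss"

c = ColumnAssociativeCache()
# Blocks 3 and 11 collide in set 3 of a direct-mapped cache, but here they
# coexist: one in set 3, the other in the rehash set 7.
print([c.access(b) for b in (3, 11, 3, 11, 3)])
```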

Patent
13 Jul 1994
TL;DR: An analytical model is presented for calculating cache hit rate for combinations of data sets and LRU sizes; the model can be implemented directly in software to predict cache hit rates using statistics accumulated for each element independently.
Abstract: Method and structure for collecting statistics for quantifying locality of data and thus selecting elements to be cached, and then calculating the overall cache hit rate as a function of cached elements. LRU stack distance has a straight-forward probabilistic interpretation and is part of statistics to quantify locality of data for each element considered for caching. Request rates for additional slots in the LRU are a function of file request rate and LRU size. Cache hit rate is a function of locality of data and the relative request rates for data sets. Specific locality parameters for each data set and arrival rate of requests for data-sets are used to produce an analytical model for calculating cache hit rate for combinations of data sets and LRU sizes. This invention provides algorithms that can be directly implemented in software for constructing a precise model that can be used to predict cache hit rates for a cache, using statistics accumulated for each element independently. The model can rank the elements to find the best candidates for caching. Instead of considering the cache as a whole, the average arrival rates and re-reference statistics for each element are estimated, and then used to consider various combinations of elements and cache sizes in predicting the cache hit rate. Cache hit rate is directly calculated using the to-be-cached files' arrival rates and re-reference statistics and used to rank the elements to find the set that produces the optimal cache hit rate.
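The LRU stack-distance statistic that such models build on can be sketched directly: the distance of a reference is the number of distinct elements touched since the last reference to the same element, and the hit rate of an LRU cache of size C is the fraction of references with distance at most C, so one pass over a trace yields predicted hit rates for every cache size. The patent's per-element arrival-rate model is not reproduced here.

```python
def stack_distances(trace):
    stack, dists = [], []                  # most recently used at the end
    for x in trace:
        if x in stack:
            pos = stack.index(x)
            dists.append(len(stack) - pos) # 1 = re-reference of the MRU item
            stack.pop(pos)
        else:
            dists.append(float("inf"))     # cold reference: misses at any size
        stack.append(x)
    return dists

def predicted_hit_rate(dists, cache_size):
    hits = sum(1 for d in dists if d <= cache_size)
    return hits / len(dists)

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4, 5, 1, 2]
dists = stack_distances(trace)
for size in (1, 2, 3, 4, 5):
    print(f"LRU size {size}: predicted hit rate {predicted_hit_rate(dists, size):.2f}")
```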

Proceedings ArticleDOI
12 Sep 1994
TL;DR: In this article, the cache kernel is proposed to provide a hardware adaptation layer to operating system services rather than just providing a key subset of OS services, as has been the common approach in previous microkernel work.
Abstract: Operating system design has had limited success in providing adequate application functionality and a poor record in avoiding excessive growth in size and complexity, especially with protected operating systems. Applications require far greater control over memory, I/O and processing resources to meet their requirements. For example, database transaction processing systems include their own "kernel" which can much better manage resources for the application than can the application-ignorant general-purpose conventional operating system mechanisms. Large-scale parallel applications have similar requirements. The same requirements arise with servers implemented outside the operating system kernel.In our research, we have been exploring the approach of making the operating system kernel a cache for active operating systems objects such as processes, address spaces and communication channels, rather than a complete manager of these objects. The resulting system is smaller than recent so-called microkernels, and also provides greater flexibility for applications, including real-time applications, database management systems and large-scale simulations. As part of this research, we have developed what we call a cache kernel, a new generation of microkernel that supports operating system configurations across these dimensions.The cache kernel can also be regarded as providing a hardware adaptation layer (HAL) to operating system services rather than trying to just provide a key subset of OS services, as has been the common approach in previous microkernel work. However, in contrast to conventional HALs, the cache kernel is fault-tolerant because it is protected from the rest of the operating system (and applications), it is replicated in large-scale configurations and it includes audit and recovery mechanisms. A cache kernel has been implemented on a scalable shared-memory and networked multi-computer [1] hardware which provides architectural support for the cache kernel approach.Fig 1 illustrates a typical target configuration. There is an instance of the cache kernel per multi-processor module (MPM), each managing the processors, second-level cache and network interface of that MPM. The cache kernel executes out of PROM and local memory of the MPM, making it hardware-independent of the rest of the system except for power. That is, the separate cache kernels and MPMs fail independently. Operating system services are provided by application kernels, server kernels and conventional operating system emulation kernels in conjunction with privileged MPM resource managers (MRM) that execute on top of the cache kernel. These kernels may be in separate protected address spaces or a shared library within a sophisticated application address space. A system bus connects the MPMs to each other and the memory modules. A high-speed network interface per MPM connects this node to file servers and other similarly configured processing nodes. This overall design can be simplified for real-time applications and similar restricted scenarios. For example, with relatively static partitioning of resources, an embedded real-time application could be structured as one or more application spaces incorporating application kernels as shared libraries executing directly on top of the cache kernel.

Patent
Srinivas Raman1
15 Apr 1994
TL;DR: In this paper, cache coherency is maintained by writing back all modified data in the cache prior to execution of the command that initiates the DMA or Bus Master transfer to or from main memory.
Abstract: A system and method for guaranteeing coherency between a write-back cache and main memory in a computer system that does not have the bus-level signals for a conventional write-back cache memory. Cache coherency can be maintained by writing back all modified data in the cache prior to execution of the command that initiates the DMA or Bus Master transfer to or from main memory. When bus snooping logic detects writes from the CPU, the cache and main memory are synchronized. After synchronization, the bus snooper continues to look for access hits to modified data in the cache. If a hit occurs and it is a DMA cycle, the CPU is prevented from further accesses to the cache until, after the DMA transfer, the modified bytes are written back to main memory. If it is a bus master device seeking access to main memory, the CPU is prevented from further accesses to the cache until the modified bytes are written back to main memory.

Patent
30 Sep 1994
TL;DR: In this paper, a data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions, where each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor.
Abstract: The data cache unit includes a separate fill buffer and a separate write-back buffer. The fill buffer stores one or more cache lines for transference into data cache banks of the data cache unit. The write-back buffer stores a single cache line evicted from the data cache banks prior to write-back to main memory. Circuitry is provided for transferring a cache line from the fill buffer into the data cache banks while simultaneously transferring a victim cache line from the data cache banks into the write-back buffer. This allows the overall replace operation to be performed in only a single clock cycle. In a particular implementation, the data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions. Moreover, the microprocessor is incorporated within a multiprocessor computer system wherein each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor. The data cache unit is also a non-blocking cache.

Proceedings ArticleDOI
18 Apr 1994
TL;DR: A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache introduced in the paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache, but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
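The decoupling can be sketched for a single set: a sector frame of L line slots shares K address tags, and each slot records which tag it currently belongs to, so lines from up to K different sectors can coexist instead of forcing whole-sector replacement. Replacement below is deliberately naive and the sizes are illustrative, not the paper's configurations.

```python
LINES_PER_SECTOR = 4        # L line slots per sector frame (assumed)
TAGS_PER_SET = 2            # K address tags shared by the frame (assumed)

class DecoupledSectoredSet:
    def __init__(self):
        self.tags = [None] * TAGS_PER_SET
        self.line_owner = [None] * LINES_PER_SECTOR   # which tag owns each slot
        self.line_valid = [False] * LINES_PER_SECTOR

    def access(self, sector_tag, line_offset):
        if sector_tag in self.tags:
            t = self.tags.index(sector_tag)
            if self.line_valid[line_offset] and self.line_owner[line_offset] == t:
                return "hit"
        else:                                         # allocate a tag entry
            t = self.tags.index(None) if None in self.tags else 0
            if self.tags[t] is not None:              # evict the old sector:
                for i in range(LINES_PER_SECTOR):     # invalidate its lines
                    if self.line_owner[i] == t:
                        self.line_valid[i] = False
            self.tags[t] = sector_tag
        self.line_owner[line_offset] = t              # fetch the line into its slot
        self.line_valid[line_offset] = True
        return "miss"

s = DecoupledSectoredSet()
# Sectors A and B interleave accesses to different lines of the same frame.
# A traditional sectored cache with one tag would miss on every access here.
trace = [("A", 0), ("B", 1), ("A", 0), ("B", 1), ("A", 2), ("B", 3), ("A", 0)]
print([s.access(tag, off) for tag, off in trace])
```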

Patent
29 Jun 1994
TL;DR: A master-slave cache system as discussed by the authors uses a set-associative master cache and two smaller direct-mapped slave caches, a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for providing data operands to an execution pipeline of the processor.
Abstract: A master-slave cache system has a large, set-associative master cache, and two smaller direct-mapped slave caches, a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for supplying data operands to an execution pipeline of the processor. The master cache and the slave caches are tightly coupled to each other. This tight coupling allows the master cache to perform most cache management operations for the slave caches, freeing the slave caches to supply a high bandwidth of instructions and operands to the processor's pipelines. The master cache contains tags that include valid bits for each slave, allowing the master cache to determine if a line is present and valid in either of the slave caches without interrupting the slave caches. The master cache performs all search operations required by external snooping, cache invalidation, cache data zeroing instructions, and store-to-instruction-stream detection. The master cache interrupts the slave caches only when the search reveals that a line is valid in a slave cache, the master cache causing the slave cache to invalidate the line. A store queue is shared between the master cache and the slave data cache. Store data is written from the store queue directly into both the slave data cache and the master cache, eliminating the need for the slave data cache to write data through to the master cache. The master-slave cache system also eliminates the need for a second set of address tags for snooping and coherency operations. The master cache can be large and designed for a low miss rate, while the slave caches are designed for the high speed required by the processor's pipelines.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The decoupled sectored cache introduced in this paper will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Maintaining a low tag array size is a major issue in many cache designs (e.g. L2 caches). Using a sectored cache is a design trade-off between a low size of the tag array, which is possible with a large line size, and a low memory traffic, which requires a small line size. This technique has been used in many cache designs, including small on-chip microprocessor caches and large external second-level caches. Unfortunately, as on some applications the miss ratio on a sectored cache is significantly higher than the miss ratio on a non-sectored cache (factors higher than two are commonly observed), a significant part of the potential performance may be wasted in miss penalties. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache we introduce in this paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache, but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.

Patent
Konrad K. Lai1
23 Mar 1994
TL;DR: In this paper, a multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized; a request that misses in the primary cache but hits in the secondary cache is served by the secondary cache.
Abstract: A multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized. If a memory access request misses in the primary cache, but hits in the secondary cache, then the secondary cache responds to the request. If, however, the request also misses in the secondary cache, but is found in main memory, then main memory responds to the request. In responding to the request, the secondary cache or main memory returns the requested data to the primary cache. If an address tag of a primary cache victim line does not match an address tag in the secondary cache or the primary cache victim line is dirty, then the victim is stored in the secondary cache. The primary cache victim line includes a first bit for indicating whether the address tag of the primary cache victim line matches an address tag of the secondary cache.
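The rule that avoids unnecessary swaps can be sketched directly from the claim: a displaced primary-cache line is written into the secondary cache only if it is dirty or its tag is not already present there. The structures below are illustrative, not the patent's hardware.

```python
class TwoLevelWithVictimFilter:
    def __init__(self):
        self.l2 = {}                 # tag -> data (illustrative secondary cache)
        self.swaps = 0               # victim write-backs actually performed

    def handle_l1_victim(self, tag, data, dirty):
        if dirty or tag not in self.l2:
            self.l2[tag] = data      # necessary: L2 copy missing or stale
            self.swaps += 1
        # else: clean and already duplicated in L2 -> the swap is skipped

m = TwoLevelWithVictimFilter()
m.l2[0x40] = "old"                             # L2 already holds line 0x40
m.handle_l1_victim(0x40, "old", dirty=False)   # skipped: clean duplicate
m.handle_l1_victim(0x40, "new", dirty=True)    # written: dirty
m.handle_l1_victim(0x80, "x",   dirty=False)   # written: not present in L2
print(m.swaps, m.l2[0x40])                     # 2 swaps; L2 now holds "new"
```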