
Showing papers on "Cache pollution published in 1994"


Proceedings ArticleDOI
01 Apr 1994
TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches, and as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
Abstract: Today's commodity microprocessors require a low latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches). We evaluate a memory system design that can be both cost-effective and higher-performing, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers [10] and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. We use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in our evaluation. We present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement and a scheme that enables stream buffers to prefetch data being accessed in large strides. Our results show that, for the majority of our benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, we find that as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.

368 citations
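
The stride-prefetching enhancement described in this abstract can be sketched compactly. The Python toy below is an illustration under assumed parameters (64-byte blocks, a 4-entry buffer; none of the names come from the paper, and it is not the authors' simulator): it watches the miss stream for a repeated stride and, once one is confirmed, fills a small buffer with the predicted next blocks.

```python
# Toy sketch of a stride-detecting stream buffer (not the paper's implementation).
# Assumption: 64-byte cache blocks and a 4-entry buffer depth are illustrative only.
BLOCK = 64
DEPTH = 4

class StreamBuffer:
    def __init__(self):
        self.last_miss = None
        self.stride = None
        self.entries = []          # predicted block addresses held by the buffer

    def access(self, addr):
        block = addr // BLOCK * BLOCK
        if block in self.entries:  # hit in the stream buffer
            self.entries.remove(block)
            self._refill()
            return True
        # miss: update the stride history and (re)allocate the stream when a stride repeats
        if self.last_miss is not None:
            new_stride = block - self.last_miss
            if new_stride != 0 and new_stride == self.stride:
                self._allocate(block + self.stride)
            self.stride = new_stride
        self.last_miss = block
        return False

    def _allocate(self, start):
        self.entries = [start + i * self.stride for i in range(DEPTH)]

    def _refill(self):
        if self.entries and self.stride:
            self.entries.append(self.entries[-1] + self.stride)

buf = StreamBuffer()
trace = range(0, 128 * 20, 128)          # a scan with a constant 2-block stride
hits = sum(buf.access(a) for a in trace)
print(f"stream buffer hits: {hits} of {len(trace)} accesses")
```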


Journal ArticleDOI
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
Abstract: The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.

265 citations
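
A minimal illustration of the layout effect the paper measures (my own example, not the paper's methodology or benchmarks): with an interleaved layout, words updated by different processors share cache blocks and invite false sharing, while grouping each processor's words into contiguous blocks removes the sharing entirely.

```python
# Toy layout experiment (my own example, not the paper's measurements):
# count how many cache blocks are updated by more than one processor
# under an interleaved layout versus a per-processor (grouped) layout.
BLOCK_WORDS = 8          # assumed 8 words per cache block

def writers_per_block(word_owner):
    """word_owner[i] = processor that updates word i; return blocks with >1 writer."""
    blocks = {}
    for word, owner in enumerate(word_owner):
        blocks.setdefault(word // BLOCK_WORDS, set()).add(owner)
    return sum(1 for owners in blocks.values() if len(owners) > 1)

N = 64
interleaved = [i % 2 for i in range(N)]            # procs 0 and 1 own alternating words
grouped = [0] * (N // 2) + [1] * (N // 2)          # each proc's words laid out contiguously

print("falsely shared blocks, interleaved layout:", writers_per_block(interleaved))  # 8
print("falsely shared blocks, grouped layout:   ", writers_per_block(grouped))       # 0
```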


Proceedings ArticleDOI
07 Dec 1994
TL;DR: This paper describes an approach for bounding the worst-case instruction cache performance of large code segments by using static cache simulation to analyze a program's control flow to statically categorize the caching behavior of each instruction.
Abstract: The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable, since the behavior of a cache reference depends upon the history of the previous references. The use of caches is only suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.

233 citations
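
As a rough illustration of what "statically categorizing the caching behavior of each instruction" can mean (a simplified sketch, not the paper's static cache simulation algorithm; the cache geometry is assumed), the toy below classifies the instructions of a single loop in a direct-mapped instruction cache as first-miss when their line is conflict-free, and as conflict otherwise.

```python
# Simplified sketch of static categorization for a direct-mapped I-cache
# (not the paper's algorithm; line size and cache size are assumptions).
LINE = 16          # bytes per cache line
LINES = 64         # number of cache lines

def categorize(loop_addrs):
    """Classify each instruction of a single loop as 'first-miss' (its line is
    brought in once and then reused) or 'conflict' (another loop instruction
    maps to the same cache line, so hits cannot be guaranteed)."""
    line_of = {a: (a // LINE) % LINES for a in loop_addrs}
    occupants = {}
    for a, l in line_of.items():
        occupants.setdefault(l, set()).add(a // LINE)   # distinct memory lines per cache line
    return {a: ('first-miss' if len(occupants[line_of[a]]) == 1 else 'conflict')
            for a in loop_addrs}

# a 1 KB direct-mapped cache: instructions at 0x1000 and 0x1400 collide on line 0
loop = list(range(0x1000, 0x1040, 4)) + list(range(0x1400, 0x1410, 4))
cats = categorize(loop)
print(sum(c == 'conflict' for c in cats.values()), "conflicting instructions of", len(cats))
```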


Proceedings Article
12 Sep 1994
TL;DR: It is shown that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of the cache, and new algorithms run 8%-200% faster than the traditional ones.
Abstract: The current main memory (DRAM) access speeds lag far behind CPU speeds. Cache memory, made of static RAM, is being used in today's architectures to bridge this gap. It provides access latencies of 2-4 processor cycles, in contrast to main memory which requires 15-25 cycles. Therefore, the performance of the CPU depends upon how well the cache can be utilized. We show that there are significant benefits in redesigning our traditional query processing algorithms so that they can make better use of the cache. The new algorithms run 8%-200% faster than the traditional ones.

215 citations
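
One flavor of cache-conscious restructuring in this spirit is to partition a hash join's build input into cache-sized chunks so that each partition's hash table stays cache-resident while it is probed. The sketch below is illustrative only; the 256-tuple budget and the code are mine, not the paper's algorithms.

```python
# Illustrative cache-sized partitioning for a hash join (a sketch, not the
# paper's algorithms; the 256-entry "cache budget" is an arbitrary stand-in).
CACHE_BUDGET = 256            # max build tuples whose hash table we keep "cache-resident"

def partitioned_hash_join(build, probe, key=lambda t: t[0]):
    nparts = max(1, (len(build) + CACHE_BUDGET - 1) // CACHE_BUDGET)
    bparts = [[] for _ in range(nparts)]
    pparts = [[] for _ in range(nparts)]
    for t in build:
        bparts[hash(key(t)) % nparts].append(t)
    for t in probe:
        pparts[hash(key(t)) % nparts].append(t)
    out = []
    for bp, pp in zip(bparts, pparts):      # each small hash table fits in cache
        table = {}
        for t in bp:
            table.setdefault(key(t), []).append(t)
        for t in pp:
            out.extend((b, t) for b in table.get(key(t), []))
    return out

build = [(i % 500, 'b') for i in range(1000)]
probe = [(i % 500, 'p') for i in range(2000)]
print(len(partitioned_hash_join(build, probe)), "join results")
```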


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
Abstract: The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.

195 citations
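
The exclusive property introduced here can be modeled in a few lines (a toy under assumed block counts, not the paper's area and access-time model): on an L1 miss that hits in L2, the block is moved rather than copied into L1, and the L1 victim drops into L2, so the two levels never duplicate a block and their capacities add.

```python
from collections import OrderedDict

# Toy exclusive two-level cache (a sketch; block counts are arbitrary assumptions).
class Level:
    def __init__(self, nblocks):
        self.n = nblocks
        self.blocks = OrderedDict()          # LRU order: oldest first
    def hit(self, b):
        if b in self.blocks:
            self.blocks.move_to_end(b)
            return True
        return False
    def insert(self, b):
        """Insert block b, returning an evicted victim or None."""
        self.blocks[b] = True
        self.blocks.move_to_end(b)
        if len(self.blocks) > self.n:
            return self.blocks.popitem(last=False)[0]
        return None

class ExclusiveCache:
    def __init__(self, l1_blocks, l2_blocks):
        self.l1, self.l2 = Level(l1_blocks), Level(l2_blocks)
    def access(self, b):
        if self.l1.hit(b):
            return 'L1'
        if b in self.l2.blocks:
            del self.l2.blocks[b]            # move (not copy) L2 -> L1: exclusivity
            level = 'L2'
        else:
            level = 'MEM'
        victim = self.l1.insert(b)
        if victim is not None:
            self.l2.insert(victim)           # the L1 victim goes to L2
        return level

c = ExclusiveCache(4, 8)
trace = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4] * 2   # working set of 5 blocks fits the combined capacity
print([c.access(b) for b in trace].count('MEM'), "memory fetches for", len(trace), "accesses")
```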


Journal ArticleDOI
01 Nov 1994
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
Abstract: This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set-associative cache of equivalent size and speed, although with lower hardware cost and complexity.

187 citations
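
The CML idea, recording and summarizing miss history in hardware and letting the virtual memory system remap pages when conflicts pile up, can be sketched as follows (assumed page size, cache size, and threshold; not the authors' hardware/OS design).

```python
from collections import Counter

# Toy Cache Miss Lookaside sketch (assumed parameters, not the paper's design):
# count misses per cache-index "bin"; when a bin's miss count crosses a threshold,
# flag the pages mapping to it as candidates for dynamic remapping by the VM system.
PAGE = 4096
CACHE_SIZE = 1 << 20                # 1 MB direct-mapped cache (assumption)
BIN = PAGE                          # track conflicts at page granularity
THRESHOLD = 8

def find_remap_candidates(miss_addresses):
    bins = Counter((a % CACHE_SIZE) // BIN for a in miss_addresses)
    hot = {b for b, n in bins.items() if n >= THRESHOLD}
    # pages whose cache color falls in a hot bin are remap candidates
    return sorted({a // PAGE for a in miss_addresses
                   if (a % CACHE_SIZE) // BIN in hot})

# two pages 1 MB apart conflict in a 1 MB direct-mapped cache
trace = [0x100000 * i + off for i in range(2) for off in range(0, PAGE, 512)] * 2
print("pages to remap:", find_remap_candidates(trace))
```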


Patent
Alexander D. Peleg1, Uri Weiser1
30 Mar 1994
TL;DR: In this paper, the authors propose an improved cache and organization particularly suitable for superscalar architectures, where the cache is organized around trace segments of running programs rather than an organization based on memory addresses.
Abstract: An improved cache and organization particularly suitable for superscalar architectures. The cache is organized around trace segments of running programs rather than an organization based on memory addresses. A single access to the cache memory may cross virtual address line boundaries. Branch prediction is integrally incorporated into the cache array permitting the crossing of branch boundaries with a single access.

186 citations


Patent
08 Mar 1994
TL;DR: A second level cache memory controller, implemented as an integrated circuit unit, operates in conjunction with a secondary random access cache memory and a main memory (system) bus controller to form a second-level cache memory subsystem.
Abstract: A second level cache memory controller, implemented as an integrated circuit unit, operates in conjunction with a secondary random access cache memory and a main memory (system) bus controller to form a second level cache memory subsystem. The subsystem is interfaced to the local processor (CPU) bus and to the main memory bus, providing independent access by both buses and thereby reducing traffic on the main memory bus when the data required by the CPU is located in the secondary cache. Similarly, CPU bus traffic is minimized when the secondary cache is accessed by the main memory bus for snoops and write-backs to main memory. Snoop latches interfaced with the main memory bus provide snoop access to the cache memory via the cache directory in the secondary cache controller unit. The controller also supports parallel look-up in the controller tag array and the secondary cache using most-recently-used (MRU) main memory write-through and pipelining of memory bus cycle requests.

159 citations


Proceedings Article
06 Jun 1994
TL;DR: The main contribution of this paper is the solution to the allocation problem, which allows processes to manage their own cache blocks, while at the same time maintains the dynamic allocation of cache blocks among processes.
Abstract: We consider how to improve the performance of file caching by allowing user-level control over file cache replacement decisions. We use two-level cache management: the kernel allocates physical pages to individual applications (allocation), and each application is responsible for deciding how to use its physical pages (replacement). Previous work on two-level memory management has focused on replacement, largely ignoring allocation. The main contribution of this paper is our solution to the allocation problem. Our solution allows processes to manage their own cache blocks, while at the same time maintaining the dynamic allocation of cache blocks among processes. Our solution ensures that good user-level policies can improve the file cache hit ratios of the entire system over the existing replacement approach. We evaluate our scheme by trace-based simulation, demonstrating that it leads to significant improvements in hit ratios for a variety of applications.

138 citations
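
A heavily simplified sketch of the two-level split (not the paper's exact allocation algorithm; the policy hook and all names are assumptions): the kernel decides which process must give up a block using global LRU, but that process's own policy chooses which of its blocks to surrender.

```python
from collections import OrderedDict

# Highly simplified two-level cache management sketch (not the paper's policy):
# the kernel picks which *process* must give up a block (global LRU), but lets
# that process choose *which* of its own blocks to evict.
class TwoLevelCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.owner = {}                       # block -> owning process
        self.global_lru = OrderedDict()       # block -> True, oldest first

    def access(self, proc, block, choose_victim=None):
        if block in self.global_lru:
            self.global_lru.move_to_end(block)
            return 'hit'
        if len(self.global_lru) >= self.capacity:
            lru_block = next(iter(self.global_lru))
            victim_proc = self.owner[lru_block]
            mine = [b for b, p in self.owner.items() if p == victim_proc]
            # the chosen process's user-level policy may override global LRU
            victim = choose_victim(mine) if (choose_victim and victim_proc == proc) else lru_block
            del self.global_lru[victim]
            del self.owner[victim]
        self.global_lru[block] = True
        self.owner[block] = proc
        return 'miss'

cache = TwoLevelCache(capacity=3)
# a scanning process tells the kernel to evict its most recently inserted block
mru_policy = lambda blocks: blocks[-1]
for b in ['a', 'b', 'c', 'd']:
    print(b, cache.access('scan', b, choose_victim=mru_policy))
```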


Patent
16 May 1994
TL;DR: In this article, a cache memory is organized in multiple levels, each level having multiple entries, the entries of each level receiving information of a predetermined category, each entry being accessible independently; links are defined between entries of one level of the cache memory and entries at another level, the links corresponding to information relationships specified by a user of information stored in the secondary storage.
Abstract: Information needed by application programs from a secondary storage is cached in a cache memory which is organized in multiple levels, each level having multiple entries, the entries of each level receiving information of a predetermined category, each entry being accessible independently. Links are defined between entries of one level of the cache memory and entries at another level of the cache memory, the links corresponding to information relationships specified by a user of information stored in the secondary storage. In response to a request to a file system from an application for needed information, the needed information is fetched into the cache, and in connection with fetching the needed information, other information that is not immediately needed is prefetched from the system of files. Quotas are established on information which may be fetched from a secondary storage into the cache, the quotas being applicable to file contents within a file and to the number of files within a directory. Upon a request from an application program to open a file, an entry is created in a cache corresponding to the file, the entry including file header information. The file header information is retained in the cache so long as the file remains open, whether or not any file contents of the file remain in cache. The entries of the three levels of the cache respectively receive directory information, file header information, and file contents information.

134 citations


Patent
26 Apr 1994
TL;DR: In this article, a hierarchical memory system is provided which includes a cache and long-term storage, in which an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage.
Abstract: A data processing system (10) has a processor with a processor memory and a mechanism for specifying an address that corresponds to a processor-requested data block located within another memory to be accessed by the processor. A hierarchical memory system is provided which includes a cache (16) and long-term storage (20). In accordance with a mapping and meshing process performed by a memory subsystem (22), an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage. In accordance with a cache drain mechanism, the cache drains data to the physical target disks under different specified conditions. A further mechanism is provided for preserving data within the cache that is frequently accessed by the requester processor. A user-configuration mechanism is provided.


Patent
Akihisa Fujimoto1
22 Mar 1994
TL;DR: In this article, a frame buffer cache is arranged to store part of image data in an image memory so that a CPU and a drawing processor can perform image data read/write operations by only accessing the frame buffer.
Abstract: A frame buffer cache is arranged to store part of image data in an image memory so that a CPU and a drawing processor can perform image data read/write operations by only accessing the frame buffer cache. Therefore, the image data read/write operations of the CPU and the drawing processor can be performed simultaneously with the access to a dual port image memory, thus improving the drawing performance of the CPU and the drawing processor.

Patent
20 Jun 1994
TL;DR: In this paper, a cache storage drawer containing a plurality of DASD devices for implementing a RAID parity data protection scheme, and permanently storing data, is coupled with a cache controller.
Abstract: A system and method for reducing device wait time in response to a host initiated write operation modifying a data block. The system includes a host computer channel connected to a storage controller which has cache memory and a nonvolatile storage buffer in a first embodiment. An identical system makes up the second embodiment with the exception that there is no nonvolatile storage buffer in the storage controller of the second embodiment. The controller in either embodiment is coupled to a cache storage drawer containing a plurality of DASD devices for implementing a RAID parity data protection scheme, and for permanently storing data. The drawer has nonvolatile cache memory which is used for accepting data destaged from controller cache. In a first embodiment, no commit reply is sent to the controller to indicate that data has been written to DASD. Instead a status information block is created to indicate that the data has been destaged from controller cache but is not committed. The status information is stored in directory means attached to the controller. The system uses this information to create a list of data which is in the state of Not committed. In this way data can be committed according to a cache management algorithm of least recently used (LRU), rather than requiring synchronous commit which is inefficient because it requires waiting on a commit response and ties up nonvolatile storage space allocated to back-up copies of cache data. In a second embodiment, directory means attached to the controller stores information about status blocks that may be modified or unmodified. The status information is used to eliminate wait times associated with waiting for data to be written to HDAs below.

Proceedings ArticleDOI
28 Feb 1994
TL;DR: A new processor implementing Hewlett-Packard's PA-RISC 1.1 (Precision Architecture) has been designed, which incorporates many improvements over the HP PA7100 CPU, including increased frequency, instruction and data cache prefetching, enhanced superscalar execution, and enhanced multiprocessor support.
Abstract: A new processor implementing Hewlett-Packard's PA-RISC 1.1 (Precision Architecture) has been designed. This latest design incorporates many improvements over the HP PA7100 CPU, including increased frequency, instruction and data cache prefetching, enhanced superscalar execution, and enhanced multiprocessor support. The PA7200 connects directly to a new split transaction, 120 MHz, 64-bit bus capable of supporting multiple processors and multiple outstanding memory reads per processor. A novel fully associative on-chip data cache, which is accessed in parallel with an external data cache, is used to reduce the miss rate and facilitate hardware and software directed prefetching to reduce average memory access time.

Patent
30 Dec 1994
TL;DR: In this article, the cache memory space in a computer system is controlled on a dynamic basis by adjusting the low threshold which triggers the release of more cache free space and the high threshold which ceases the free space.
Abstract: The cache memory space in a computer system is controlled on a dynamic basis by adjusting the low threshold which triggers the release of more cache free space and by adjusting the high threshold which ceases the release of free space. The low and high thresholds are predicted based on the number of allocations which are accomplished in response to I/O requests, and based on the number of blockages which occur when an allocation can not be accomplished. The predictions may be based on weighted values of different historical time periods, and the high and low thresholds may be made equal to one another. In this manner the performance degradation resulting from variations in workload caused by prior art fixed or static high and low thresholds is avoided. Instead, only a predicted amount of cache memory space is freed, and that amount of free space is more likely to accommodate the predicted output requests without releasing so much cache space that an unacceptable number of blockages occur.
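
An illustrative reading of the prediction step (the weighting and formulas below are assumptions made for the sketch, not the patent's claimed method): blend allocation and blockage counts over recent history periods, weighting recent periods more heavily, and derive the free-space thresholds from the blended estimate, with the high threshold allowed to equal the low one.

```python
# Illustrative threshold prediction (not the patent's exact formula): weight
# recent history periods more heavily and size the free-space target from the
# predicted allocation demand, nudged upward when blockages have been occurring.
def predict_thresholds(alloc_history, blockage_history, weights=(0.5, 0.3, 0.2)):
    """Histories are most-recent-first counts per time period."""
    predicted_allocs = sum(w * a for w, a in zip(weights, alloc_history))
    predicted_blocks = sum(w * b for w, b in zip(weights, blockage_history))
    low = predicted_allocs + 2 * predicted_blocks   # start freeing below this much free space
    high = low                                      # the patent allows high == low
    return low, high

low, high = predict_thresholds(alloc_history=[120, 100, 80],
                               blockage_history=[4, 1, 0])
print(f"free-space thresholds: low={low:.0f} blocks, high={high:.0f} blocks")
```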

Patent
12 Dec 1994
TL;DR: In this paper, a cache indexer maintains a current index of data elements which are stored in cache memory and a sequential data access indicator, responsive to the cache index and to a user selectable sequential access threshold, determines that a sequential access is in progress for a given process and provides an indication of the same.
Abstract: A cache management system and method monitors and controls the contents of cache memory coupled to at least one host and at least one data storage device. A cache indexer maintains a current index of data elements which are stored in cache memory. A sequential data access indicator, responsive to the cache index and to a user selectable sequential data access threshold, determines that a sequential data access is in progress for a given process and provides an indication of the same. The system and method allocate a micro-cache memory to any process performing a sequential data access. In response to the indication of a sequential data access in progress and to a user selectable maximum number of data elements to be prefetched, a data retrieval requestor requests retrieval of up to the selected maximum number of data elements from a data storage device. A user selectable number of sequential data elements determines when previously used micro-cache memory locations will be overwritten. A method of dynamically monitoring and adjusting cache management parameters is also presented.
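
The sequential-access detection described here reduces to a small state machine (the parameter names below are mine; the patent's thresholds are user-selectable): once a run of consecutive element accesses reaches the threshold, prefetch up to the selected maximum number of elements into the process's micro-cache.

```python
# Toy sketch of the sequential-access detector (parameter names are mine, not
# the patent's): once a process has touched `seq_threshold` consecutive data
# elements, prefetch up to `max_prefetch` elements ahead into its micro-cache.
class SequentialDetector:
    def __init__(self, seq_threshold=3, max_prefetch=4):
        self.seq_threshold = seq_threshold
        self.max_prefetch = max_prefetch
        self.last = None
        self.run = 0

    def access(self, element):
        """Return the list of elements to prefetch after this access (may be empty)."""
        self.run = self.run + 1 if self.last is not None and element == self.last + 1 else 1
        self.last = element
        if self.run >= self.seq_threshold:
            return [element + i for i in range(1, self.max_prefetch + 1)]
        return []

d = SequentialDetector()
for e in [10, 11, 12, 13, 50]:
    print(e, '-> prefetch', d.access(e))
```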

Patent
23 Dec 1994
TL;DR: In this paper, a column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time is proposed, where the cache lines represent a column of sets.
Abstract: A column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time. The column-associative cache indexes data from a main memory into a plurality of cache lines according to a tag and index field through hash and rehash functions. The cache lines represent a column of sets. Each cache line contains a rehash block indicating whether the set is a rehash location. To increase the performance of the column-associative cache, a content addressable memory (CAM) is used to predict future conflict misses.
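
A minimal sketch of the hash/rehash lookup follows (simplified from the patent; the rehash function shown, flipping the high-order index bit, and the swap-on-rehash-hit policy are common formulations of column associativity rather than necessarily the claimed embodiment).

```python
# Simplified column-associative lookup (a sketch, not the patent's full design):
# hash h1 is the usual index, the rehash h2 flips the top index bit; on an h1
# miss the h2 location is probed, and on a rehash hit the two lines are swapped
# so the most recent block sits at its primary location.
SETS = 8

def h1(block):                 # block = address without the block offset
    return block % SETS

def h2(block):
    return h1(block) ^ (SETS >> 1)     # flip the high-order index bit

class ColumnAssociativeCache:
    def __init__(self):
        self.lines = [None] * SETS     # each entry holds a full block id

    def access(self, block):
        a, b = h1(block), h2(block)
        if self.lines[a] == block:
            return 'hit'
        if self.lines[b] == block:     # rehash hit: swap into the primary slot
            self.lines[a], self.lines[b] = self.lines[b], self.lines[a]
            return 'rehash-hit'
        self.lines[b] = self.lines[a]  # displace the primary occupant to the rehash slot
        self.lines[a] = block
        return 'miss'

c = ColumnAssociativeCache()
print([c.access(b) for b in [0, 8, 0, 8, 0]])   # 0 and 8 would thrash a direct-mapped cache
```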

Patent
13 Jul 1994
TL;DR: In this article, an analytical model is presented for calculating the cache hit rate for combinations of data sets and LRU sizes; the algorithms can be directly implemented in software to construct a precise model that predicts cache hit rates using statistics accumulated for each element independently.
Abstract: Method and structure for collecting statistics for quantifying locality of data and thus selecting elements to be cached, and then calculating the overall cache hit rate as a function of cached elements. LRU stack distance has a straight-forward probabilistic interpretation and is part of statistics to quantify locality of data for each element considered for caching. Request rates for additional slots in the LRU are a function of file request rate and LRU size. Cache hit rate is a function of locality of data and the relative request rates for data sets. Specific locality parameters for each data set and arrival rate of requests for data-sets are used to produce an analytical model for calculating cache hit rate for combinations of data sets and LRU sizes. This invention provides algorithms that can be directly implemented in software for constructing a precise model that can be used to predict cache hit rates for a cache, using statistics accumulated for each element independently. The model can rank the elements to find the best candidates for caching. Instead of considering the cache as a whole, the average arrival rates and re-reference statistics for each element are estimated, and then used to consider various combinations of elements and cache sizes in predicting the cache hit rate. Cache hit rate is directly calculated using the to-be-cached files' arrival rates and re-reference statistics and used to rank the elements to find the set that produces the optimal cache hit rate.
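
The LRU stack distance statistic at the core of the model is easy to compute from a reference trace. The sketch below is my own illustration of the statistic, not the patented analytical model that combines per-element arrival rates: it builds the distance list and reads off the hit ratio for any LRU size as the fraction of references whose distance is smaller than that size.

```python
# Compute LRU stack distances from a reference trace and derive hit ratios
# for several cache sizes (an illustration of the statistic only, not the
# patent's analytical model).
def stack_distances(trace):
    stack, dists = [], []
    for x in trace:
        if x in stack:
            d = stack.index(x)          # 0 = re-reference of the most recent item
            stack.remove(x)
        else:
            d = None                    # cold (infinite-distance) reference
        dists.append(d)
        stack.insert(0, x)
    return dists

def hit_ratio(dists, cache_size):
    hits = sum(1 for d in dists if d is not None and d < cache_size)
    return hits / len(dists)

trace = ['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a', 'b', 'c']
dists = stack_distances(trace)
for size in (1, 2, 3, 4):
    print(f"LRU size {size}: predicted hit ratio {hit_ratio(dists, size):.2f}")
```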

Proceedings ArticleDOI
12 Sep 1994
TL;DR: In this article, the cache kernel is proposed to provide a hardware adaptation layer to operating system services rather than just providing a key subset of OS services, as has been the common approach in previous microkernel work.
Abstract: Operating system design has had limited success in providing adequate application functionality and a poor record in avoiding excessive growth in size and complexity, especially with protected operating systems. Applications require far greater control over memory, I/O and processing resources to meet their requirements. For example, database transaction processing systems include their own "kernel" which can much better manage resources for the application than can the application-ignorant general-purpose conventional operating system mechanisms. Large-scale parallel applications have similar requirements. The same requirements arise with servers implemented outside the operating system kernel. In our research, we have been exploring the approach of making the operating system kernel a cache for active operating system objects such as processes, address spaces and communication channels, rather than a complete manager of these objects. The resulting system is smaller than recent so-called microkernels, and also provides greater flexibility for applications, including real-time applications, database management systems and large-scale simulations. As part of this research, we have developed what we call a cache kernel, a new generation of microkernel that supports operating system configurations across these dimensions. The cache kernel can also be regarded as providing a hardware adaptation layer (HAL) to operating system services rather than trying to just provide a key subset of OS services, as has been the common approach in previous microkernel work. However, in contrast to conventional HALs, the cache kernel is fault-tolerant because it is protected from the rest of the operating system (and applications), it is replicated in large-scale configurations and it includes audit and recovery mechanisms. A cache kernel has been implemented on a scalable shared-memory and networked multi-computer [1] hardware which provides architectural support for the cache kernel approach. Fig. 1 illustrates a typical target configuration. There is an instance of the cache kernel per multi-processor module (MPM), each managing the processors, second-level cache and network interface of that MPM. The cache kernel executes out of PROM and local memory of the MPM, making it hardware-independent of the rest of the system except for power. That is, the separate cache kernels and MPMs fail independently. Operating system services are provided by application kernels, server kernels and conventional operating system emulation kernels in conjunction with privileged MPM resource managers (MRM) that execute on top of the cache kernel. These kernels may be in separate protected address spaces or a shared library within a sophisticated application address space. A system bus connects the MPMs to each other and the memory modules. A high-speed network interface per MPM connects this node to file servers and other similarly configured processing nodes. This overall design can be simplified for real-time applications and similar restricted scenarios. For example, with relatively static partitioning of resources, an embedded real-time application could be structured as one or more application spaces incorporating application kernels as shared libraries executing directly on top of the cache kernel.

Patent
Srinivas Raman1
15 Apr 1994
TL;DR: In this paper, the cache coherency is maintained by writing back all modified data in the cache prior to execution of the command that initiates the DMA or Bus Master transfer to or from main memory.
Abstract: A system and method for guaranteeing coherency between a write back cache and main memory in a computer system that does not have the bus level signals for a conventional write back cache memory. Cache coherency can be maintained by writing back all modified data in the cache prior to execution of the command that initiates the DMA or Bus Master transfer to or from main memory. When bus snooping logic detects writes from the CPU, the cache and main memory are synchronized. After synchronization, the bus snooper continues to look for access hits to modified data in the cache. If a hit occurs and it is a DMA cycle, the CPU is prevented from further accesses to the cache until after the DMA transfer, and the modified bytes are then written back to main memory. If it is a bus master device seeking access to main memory, the CPU is prevented from further accesses to the cache until the modified bytes are written back to main memory.

Patent
01 Nov 1994
TL;DR: In this article, a filtered stream buffer coupled with a memory and a processor is proposed to prefetch data from the memory, where the filter controller determines whether a pattern of references has a predetermined relationship, and if so, prefetches stream data into the cache block storage area.
Abstract: Method and apparatus for a filtered stream buffer coupled to a memory and a processor, and operating to prefetch data from the memory. The filtered stream buffer includes a cache block storage area and a filter controller. The filter controller determines whether a pattern of references has a predetermined relationship, and if so, prefetches stream data into the cache block storage area. Such stream data prefetches are particularly useful in vector processing computers, where once the processor starts to fetch a vector, the addresses of future fetches can be predicted based on the pattern of past fetches. According to various aspects of the present invention, the filtered stream buffer further includes a history table and a validity indicator which is associated with the cache block storage area and indicates which cache blocks, if any, are valid. According to yet another aspect of the present invention, the filtered stream buffer controls random access memory (RAM) chips to stream the plurality of consecutive cache blocks from the RAM into the cache block storage area. According to yet another aspect of the present invention, the stream data includes data for a plurality of strided cache blocks, wherein each of these strided cache blocks corresponds to an address determined by adding to the first address an integer multiple of the difference between the second address and the first address. According to yet another aspect of the present invention, the processor generates three addresses of data words in the memory, and the filter controller determines whether a predetermined relationship exists among the three addresses, and if so, prefetches strided stream data into said cache block storage area.

Patent
15 Nov 1994
TL;DR: In this paper, a cache subsystem for a computer system having a processor and a main memory is described, which includes a prefetch buffer coupled to the processor and the main memory.
Abstract: A cache subsystem for a computer system having a processor and a main memory is described. The cache subsystem includes a prefetch buffer coupled to the processor and the main memory. The prefetch buffer stores a first data prefetched from the main memory in accordance with a predicted address for a next memory fetch by the processor. The predicted address is based upon an address for a last memory fetch from the processor. A main cache is coupled to the processor and the main memory. The main cache is not coupled to the prefetch buffer and does not receive data from the prefetch buffer. The main cache stores a second data fetched from the main memory in accordance with the address for the last memory fetch by the processor only if the address for the last memory fetch is an unpredictable address. The address for the last memory fetch is the unpredictable address if both of the prefetch buffer and the main cache do not contain the address and the second data associated with the address.

Patent
30 Sep 1994
TL;DR: In this paper, a data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions, where each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor.
Abstract: The data cache unit includes a separate fill buffer and a separate write-back buffer. The fill buffer stores one or more cache lines for transference into data cache banks of the data cache unit. The write-back buffer stores a single cache line evicted from the data cache banks prior to write-back to main memory. Circuitry is provided for transferring a cache line from the fill buffer into the data cache banks while simultaneously transferring a victim cache line from the data cache banks into the write-back buffer. Such allows the overall replace operation to be performed in only a single clock cycle. In a particular implementation, the data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions. Moreover, the microprocessor is incorporated within a multiprocessor computer system wherein each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor. The data cache unit is also a non-blocking cache.

Proceedings ArticleDOI
18 Apr 1994
TL;DR: A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache introduced in the paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
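
The decoupling can be pictured as giving every cache-line slot a small pointer into a pool of sector tags instead of one fixed tag. The toy set below uses assumed sizes and a naive victim choice (it is not the paper's design or replacement policy); it shows two sectors that would thrash a conventional sectored set coexisting once the tags are decoupled.

```python
# Toy decoupled sectored cache set (a sketch of the idea, not the paper's design):
# a set holds LINES_PER_SECTOR line slots and a small pool of NTAGS sector tags;
# each line slot stores a pointer into the tag pool instead of being tied to one
# fixed tag, so lines of different sectors can coexist in one set.
LINES_PER_SECTOR = 4
NTAGS = 2

class DecoupledSectoredSet:
    def __init__(self):
        self.tags = [None] * NTAGS                       # sector address tags
        self.line_tag = [None] * LINES_PER_SECTOR        # per-line pointer into self.tags
        self.line_valid = [False] * LINES_PER_SECTOR

    def access(self, sector, line):
        if self.line_valid[line] and self.tags[self.line_tag[line]] == sector:
            return 'hit'
        # miss: find (or victimize) a tag entry for this sector, then claim the line slot
        if sector in self.tags:
            t = self.tags.index(sector)
        else:
            t = self.tags.index(None) if None in self.tags else 0   # naive tag victim
            for i in range(LINES_PER_SECTOR):                       # drop the victim's lines
                if self.line_tag[i] == t:
                    self.line_valid[i] = False
            self.tags[t] = sector
        self.line_tag[line] = t
        self.line_valid[line] = True
        return 'miss'

s = DecoupledSectoredSet()
trace = [(100, 0), (200, 1), (100, 0), (200, 1), (100, 2)]   # two sectors share the set
print([s.access(sec, ln) for sec, ln in trace])
```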

Patent
29 Jun 1994
TL;DR: A master-slave cache system as discussed by the authors uses a set-associative master cache and two smaller direct-mapped slave caches, a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for providing data operands to an execution pipeline of the processor.
Abstract: A master-slave cache system has a large, set-associative master cache and two smaller direct-mapped slave caches: a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for supplying data operands to an execution pipeline of the processor. The master cache and the slave caches are tightly coupled to each other. This tight coupling allows the master cache to perform most cache management operations for the slave caches, freeing the slave caches to supply a high bandwidth of instructions and operands to the processor's pipelines. The master cache contains tags that include valid bits for each slave, allowing the master cache to determine if a line is present and valid in either of the slave caches without interrupting the slave caches. The master cache performs all search operations required by external snooping, cache invalidation, cache data zeroing instructions, and store-to-instruction-stream detection. The master cache interrupts the slave caches only when the search reveals that a line is valid in a slave cache, the master cache causing the slave cache to invalidate the line. A store queue is shared between the master cache and the slave data cache. Store data is written from the store queue directly into both the slave data cache and the master cache, eliminating the need for the slave data cache to write data through to the master cache. The master-slave cache system also eliminates the need for a second set of address tags for snooping and coherency operations. The master cache can be large and designed for a low miss rate, while the slave caches are designed for the high speed required by the processor's pipelines.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The decoupled sectored cache introduced in this paper will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Maintaining a low tag array size is a major issue in many cache designs (e.g. L2 caches). Using a sectored cache is a design trade-off between a low size of the tag array, which is possible with a large line size, and a low memory traffic, which requires a small line size. This technique has been used in many cache designs including small on-chip microprocessor caches and large external second level caches. Unfortunately, as the miss ratio on a sectored cache is significantly higher on some applications than the miss ratio on a non-sectored cache (factors higher than two are commonly observed), a significant part of the potential performance may be wasted in miss penalties. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache we introduce in this paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.

Patent
Konrad K. Lai1
23 Mar 1994
TL;DR: In this paper, a multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized; when a memory access request misses in the primary cache but hits in the secondary cache, the secondary cache responds to the request.
Abstract: A multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized. If a memory access request misses in the primary cache, but hits in the secondary cache, then the secondary cache responds to the request. If, however, the request also misses in the secondary cache, but is found in main memory, then main memory responds to the request. In responding to the request, the secondary cache or main memory returns the requested data to the primary cache. If an address tag of a primary cache victim line does not match an address tag in the secondary cache or the primary cache victim line is dirty, then the victim is stored in the secondary cache. The primary cache victim line includes a first bit for indicating whether the address tag of the primary cache victim line matches an address tag of the secondary cache.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.
Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.

Patent
24 Mar 1994
TL;DR: In this article, the authors present a computing system which includes a memory, an input/output adapter and a processor, and the processor includes a write back cache in which dirty data may be stored.
Abstract: A computing system is presented which includes a memory, an input/output adapter and a processor. The processor includes a write back cache in which dirty data may be stored. When performing a coherent write from the input/output adapter to the memory, a block of data is written from the input/output adapter to a memory location within the memory. The block of data contains less data than a full cache line in the write back cache. The write back cache is searched to determine whether the write back cache contains data for the memory location. When the search determines that the write back cache contains data for the memory location a full cache line which contains the data for the memory location is purged.