
Showing papers on "Cache invalidation published in 1995"


Proceedings Article
01 Jan 1995
TL;DR: In this paper, the authors present a taxonomy of different cache invalidation strategies and study the impact of clients' disconnection times on their performance, determining that for the units which are often disconnected (sleepers) the best cache invalidation strategy is based on signatures previously used for efficient file comparison.
Abstract: In the mobile wireless computing environment of the future a large number of users equipped with low powered palm-top machines will query databases over the wireless communication channels. Palmtop based units will often be disconnected for prolonged periods of time due to battery power saving measures; palmtops will also frequently relocate between different cells and connect to different data servers at different times. Caching of frequently accessed data items will be an important technique that will reduce contention on the narrow bandwidth wireless channel. However, cache invalidation strategies will be severely affected by the disconnection and mobility of the clients. The server may no longer know which clients are currently residing under its cell and which of them are currently on. We propose a taxonomy of different cache invalidation strategies and study the impact of clients' disconnection times on their performance. We determine that for the units which are often disconnected (sleepers) the best cache invalidation strategy is based on signatures previously used for efficient file comparison. On the other hand, for units which are connected most of the time (workaholics), the best cache invalidation strategy is based on the periodic broadcast of changed data items.

509 citations
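
To make the broadcast-based strategy concrete, the following C fragment is a minimal sketch, not the paper's protocol: a hypothetical server periodically broadcasts the identifiers of data items changed since the last report, a connected client invalidates only those items, and a client that slept through a report discards its whole cache. All names, sizes, and the broadcast period are illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS  8
#define BCAST_PERIOD 10            /* hypothetical broadcast interval (time units) */

typedef struct {
    int item_id;
    int value;
    int valid;
} cache_entry;

typedef struct {
    cache_entry entries[CACHE_SLOTS];
    long last_report_seen;         /* timestamp of the last report processed */
} client_cache;

/* Apply one invalidation report broadcast at time 'now' listing 'n' changed items.
 * A sleeper that missed a report cannot trust anything it holds. */
void apply_report(client_cache *c, long now, const int *changed, int n)
{
    if (now - c->last_report_seen > BCAST_PERIOD) {
        memset(c->entries, 0, sizeof c->entries);   /* missed a report: drop all */
    } else {
        for (int i = 0; i < CACHE_SLOTS; i++)
            for (int j = 0; j < n; j++)
                if (c->entries[i].valid && c->entries[i].item_id == changed[j])
                    c->entries[i].valid = 0;        /* invalidate changed item */
    }
    c->last_report_seen = now;
}

int main(void)
{
    client_cache c = { .last_report_seen = 0 };
    c.entries[0] = (cache_entry){ .item_id = 42, .value = 7, .valid = 1 };

    int changed[] = { 42 };
    apply_report(&c, 10, changed, 1);               /* connected client: selective drop */
    printf("item 42 valid after report: %d\n", c.entries[0].valid);
    return 0;
}
```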


18 Jul 1995
TL;DR: This work assesses the potential of proxy servers to cache documents retrieved with the HTTP protocol, and finds that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users.
Abstract: As the number of World-Wide Web users grows, so does the number of connections made to servers. This increases both network load and server load. Caching can reduce both loads by migrating copies of server files closer to the clients that use those files. Caching can either be done at a client or in the network (by a proxy server or gateway). We assess the potential of proxy servers to cache documents retrieved with the HTTP protocol. We monitored traffic corresponding to three types of educational workloads over a one semester period, and used this as input to a cache simulation. Our main findings are (1) that with our workloads a proxy has a 30-50% maximum possible hit rate no matter how it is designed; (2) that when the cache is full and a document is replaced, least recently used (LRU) is a poor policy, but simple variations can dramatically improve hit rate and reduce cache size; (3) that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users; and (4) that certain tuning configuration parameters for a cache may have little benefit.

495 citations
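
The kind of trace-driven cache simulation the study relies on can be sketched in a few lines. The C program below replays a hypothetical trace of document identifiers against a small LRU cache and reports the hit rate; it is an illustration of the method only, ignoring document sizes and the replacement-policy variations the paper evaluates.

```c
#include <stdio.h>

#define CACHE_SLOTS 3

/* Replay a trace of document ids against an LRU cache and return the hit count. */
int simulate_lru(const int *trace, int n, int *hits_out)
{
    int cache[CACHE_SLOTS];              /* cache[0] = most recently used      */
    int used = 0, hits = 0;

    for (int t = 0; t < n; t++) {
        int doc = trace[t], pos = -1;
        for (int i = 0; i < used; i++)
            if (cache[i] == doc) { pos = i; break; }

        if (pos >= 0) hits++;                       /* hit: will move to front  */
        else if (used < CACHE_SLOTS) pos = used++;  /* cold miss: grow          */
        else pos = CACHE_SLOTS - 1;                 /* miss: evict LRU (last)   */

        for (int i = pos; i > 0; i--)               /* shift to make room       */
            cache[i] = cache[i - 1];
        cache[0] = doc;                             /* install / refresh as MRU */
    }
    *hits_out = hits;
    return n;
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 4, 1, 2, 5, 1 };
    int n = sizeof trace / sizeof trace[0], hits;
    simulate_lru(trace, n, &hits);
    printf("hit rate: %.0f%% (%d/%d)\n", 100.0 * hits / n, hits, n);
    return 0;
}
```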


Proceedings ArticleDOI
01 Jun 1995
TL;DR: This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache; the algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses.
Abstract: When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However, on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.

434 citations
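
The paper's tile-size algorithm uses the cache size and line size of a direct-mapped cache to eliminate self-interference; the sketch below shows only the simpler capacity-based starting point (keep the working set of three tiles inside the cache) applied to a tiled matrix multiply. It is an assumed simplification, not the authors' algorithm, and all sizes are hypothetical.

```c
#include <stdio.h>

#define CACHE_BYTES 16384          /* hypothetical direct-mapped cache capacity */

/* Capacity-only heuristic: largest square tile T such that three T x T double
 * tiles fit in the cache.  (The paper additionally eliminates self-interference
 * using the cache line size; that analysis is omitted here.) */
static int pick_tile(void)
{
    int t = 1;
    while (3 * (t + 1) * (t + 1) * (int)sizeof(double) <= CACHE_BYTES)
        t++;
    return t;
}

/* Tiled matrix multiply C += A * B for n x n matrices stored row-major. */
static void tiled_matmul(int n, const double *A, const double *B, double *C)
{
    int T = pick_tile();
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < n && i < ii + T; i++)
                    for (int k = kk; k < n && k < kk + T; k++)
                        for (int j = jj; j < n && j < jj + T; j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main(void)
{
    enum { N = 4 };
    double A[N * N], B[N * N], C[N * N] = { 0 };
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    tiled_matmul(N, A, B, C);
    printf("tile = %d, C[0] = %.1f (expect %.1f)\n", pick_tile(), C[0], 2.0 * N);
    return 0;
}
```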



Proceedings ArticleDOI
01 May 1995
TL;DR: The results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%.
Abstract: This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks---which eliminate both invalidation and acknowledgment messages---for a total reduction in messages of up to 26%.

216 citations


Journal ArticleDOI
01 Oct 1995
TL;DR: A taxonomy of different cache invalidation strategies is proposed, and the impact of clients' disconnection times on their performance is studied to improve further the efficiency of the invalidation techniques described.
Abstract: In the mobile wireless computing environment of the future, a large number of users, equipped with low-powered palmtop machines, will query databases over wireless communication channels. Palmtop-based units will often be disconnected for prolonged periods of time, due to battery power saving measures; palmtops also will frequently relocate between different cells, and will connect to different data servers at different times. Caching of frequently accessed data items will be an important technique that will reduce contention on the narrow-bandwidth wireless channel. However, cache invalidation strategies will be severely affected by the disconnection and mobility of the clients. The server may no longer know which clients are currently residing under its cell, and which of them are currently on. We propose a taxonomy of different cache invalidation strategies, and study the impact of clients' disconnection times on their performance. We study ways to improve further the efficiency of the invalidation techniques described. We also describe how our techniques can be implemented over different network environments.

212 citations


Patent
23 May 1995
TL;DR: In this paper, an apparatus and method are presented that enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications. A cache-flushing parameter is transferred from the host computer to a controller which has a cache memory, and a quantity of write request data is written from the cache memory to a storage medium in accordance with the cache-flushing parameter.
Abstract: An apparatus and method is disclosed which enables a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications. The method includes the step of generating a cache-flushing parameter in the host computer. The cache-flushing parameter is then transferred from the host computer to a controller which has a cache memory. Thereafter, a quantity of write request data is written from the cache memory to a storage medium in accordance with the cache-flushing parameter.

194 citations


Patent
07 Jun 1995
TL;DR: In this article, a method and system for retrieving and maintaining presentation data in a presentation cache is presented, where the presentation cache object can return presentation data to a requesting client program even if the server program implementing the source object is unavailable or not running.
Abstract: A method and system for retrieving and maintaining presentation data in a presentation cache is provided. In a preferred embodiment, a presentation cache object provides a presentation cache with multiple cache entries. Each entry contains an indication of the format of the presentation data and the presentation data stored in that particular format. In addition, other information, such as the advisees of advisory connections for notification of cache updates, can be maintained. The presentation cache object responds to requests for retrieving source object data by returning presentation data cached within the presentation cache when it is available. In addition, the presentation cache object determines when it should delegate requests to the source object and when it can satisfy them on its own. The presentation cache object can return presentation data to a requesting client program even if the server program implementing the source object is unavailable or not running. The presentation cache object can also choose to persistently store its cache entries so that the presentation cache is maintained when the source object is closed. In addition, client programs can provide control over the frequency and subject of presentation data updates within the presentation cache.

140 citations


Journal ArticleDOI
01 Nov 1995
TL;DR: A method to maintain predictability of execution time within preemptive, cached real-time systems is introduced and the impact on compilation support for such a system is discussed.
Abstract: Cache memories have become an essential part of modern processors to bridge the increasing gap between fast processors and slower main memory. Until recently, cache memories were thought to impose unpredictable execution time behavior for hard real-time systems. But recent results show that the speedup of caches can be exploited without a significant sacrifice of predictability. These results were obtained under the assumption that real-time tasks be scheduled non-preemptively. This paper introduces a method to maintain predictability of execution time within preemptive, cached real-time systems and discusses the impact on compilation support for such a system. Preemptive systems with caches are made predictable via software-based cache partitioning. With this approach, the cache is divided into distinct portions associated with a real-time task, such that a task may only use its portion. The compiler has to support instruction and data partitioning for each task. Instruction partitioning involves non-linear control-flow transformations, while data partitioning involves code transformations of data references. The impact on execution time of these transformations is also discussed.

139 citations
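
Software-based cache partitioning confines each task's memory to addresses that map into that task's share of the cache sets. The C sketch below illustrates the idea for data with a hypothetical bump allocator over a cache-aligned arena; the paper itself performs the partitioning through compiler transformations, so every name and parameter here is an illustrative assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical direct-mapped cache: 16 KB with 32-byte lines (512 sets). */
#define CACHE_SIZE 16384
#define LINE_SIZE  32

typedef struct {
    uint8_t *base;      /* arena aligned to CACHE_SIZE                 */
    size_t   arena_len;
    size_t   cursor;    /* next candidate offset within the arena      */
    size_t   set_lo;    /* first cache set owned by this task          */
    size_t   set_hi;    /* one past the last cache set owned           */
} task_heap;

/* Bump-allocate n bytes whose addresses map only to this task's cache sets. */
static void *task_alloc(task_heap *h, size_t n)
{
    size_t lo = h->set_lo * LINE_SIZE, hi = h->set_hi * LINE_SIZE;
    if (n > hi - lo)
        return NULL;                               /* cannot fit in one window */
    for (;;) {
        size_t off = h->cursor % CACHE_SIZE;       /* position inside window   */
        if (off < lo)
            h->cursor += lo - off;                 /* skip to our partition    */
        else if (off + n > hi)
            h->cursor += CACHE_SIZE - off + lo;    /* jump to next window      */
        else {
            void *p = h->base + h->cursor;
            h->cursor += n;
            return p;
        }
        if (h->cursor + n > h->arena_len)
            return NULL;                           /* arena exhausted          */
    }
}

int main(void)
{
    uint8_t *arena = aligned_alloc(CACHE_SIZE, 4 * CACHE_SIZE);
    if (!arena) return 1;
    /* Task A owns sets [0, 256), task B owns sets [256, 512). */
    task_heap a = { arena, 4 * CACHE_SIZE, 0, 0, 256 };
    task_heap b = { arena, 4 * CACHE_SIZE, 0, 256, 512 };

    uint8_t *pa = task_alloc(&a, 1000);
    uint8_t *pb = task_alloc(&b, 1000);
    printf("task A block maps to set %zu, task B block maps to set %zu\n",
           ((size_t)(pa - arena) % CACHE_SIZE) / LINE_SIZE,
           ((size_t)(pb - arena) % CACHE_SIZE) / LINE_SIZE);
    free(arena);
    return 0;
}
```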


Patent
31 Aug 1995
TL;DR: In this article, a data cache configured to perform store accesses in a single clock cycle is provided, where the data cache speculatively stores data within a predicted way of the cache after capturing the data currently being stored in that predicted way.
Abstract: A data cache configured to perform store accesses in a single clock cycle is provided. The data cache speculatively stores data within a predicted way of the cache after capturing the data currently being stored in that predicted way. During a subsequent clock cycle, the cache hit information for the store access validates the way prediction. If the way prediction is correct, then the store is complete. If the way prediction is incorrect, then the captured data is restored to the predicted way. If the store access hits in an unpredicted way, the store data is transferred into the correct storage location within the data cache concurrently with the restoration of data in the predicted storage location. Each store for which the way prediction is correct utilizes a single clock cycle of data cache bandwidth. Additionally, the way prediction structure implemented within the data cache bypasses the tag comparisons of the data cache to select data bytes for the output. Therefore, the access time of the associative data cache may be substantially similar to a direct-mapped cache access time. The present data cache is therefore suitable for high frequency superscalar microprocessors.

114 citations


Patent
13 Nov 1995
TL;DR: In this paper, when a reload cache line is received, a matching store queue entry's data is merged with the cache line prior to storage in the cache (71), and other matching entries become active and are allowed to reaccess the cache (71).
Abstract: A data processor (40) keeps track of misses to a cache (71) so that multiple misses within the same cache line can be merged or folded at reload time. A load/store unit (60) includes a completed store queue (61) for presenting store requests to the cache (71) in order. If a store request misses in the cache (71), the completed store queue (61) requests the cache line from a lower-level memory system (90) and thereafter inactivates the store request. When a reload cache line is received, the completed store queue (61) compares the reload address to all entries. If at least one address matches the reload address, one entry's data is merged with the cache line prior to storage in the cache (71). Other matching entries become active and are allowed to reaccess the cache (71). A miss queue (80) coupled between the load/store unit (60) and the lower-level memory system (90) implements reload folding to improve efficiency.

Patent
13 Oct 1995
TL;DR: In this paper, an adaptive read ahead cache is provided with a real cache and a virtual cache, where the real cache has a data buffer, an address buffer, and a status buffer.
Abstract: An adaptive read ahead cache is provided with a real cache and a virtual cache. The real cache has a data buffer, an address buffer, and a status buffer. The virtual cache contains only an address buffer and a status buffer. Upon receiving an address associated with the consumer's request, the cache stores the address in the virtual cache address buffer if the address is not found in the real cache address buffer and the virtual cache address buffer. Further, the cache fills the real cache data buffer with data responsive to the address from said memory if the address is found only in the virtual cache address buffer. The invention thus loads data into the cache only when sequential accesses are occurring and minimizes the overhead of unnecessarily filling the real cache when the host is accessing data in a random access mode.
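
A rough software model of the real/virtual split described above: the virtual cache holds only addresses, a request that misses both caches merely records its address, and a request that hits only the virtual cache triggers filling the real cache. This is a hedged simplification of the patent's buffers, with hypothetical sizes and names, not an implementation of the claims.

```c
#include <stdio.h>

#define REAL_SLOTS    4
#define VIRTUAL_SLOTS 8

typedef struct { long addr; int valid; int data; } real_entry;
typedef struct { long addr; int valid; }           virt_entry;

static real_entry real_cache[REAL_SLOTS];
static virt_entry virt_cache[VIRTUAL_SLOTS];

static int backing_read(long addr) { return (int)(addr * 10); } /* stand-in for memory */

/* Handle one consumer request; returns the data for 'addr'. */
static int cache_read(long addr)
{
    for (int i = 0; i < REAL_SLOTS; i++)
        if (real_cache[i].valid && real_cache[i].addr == addr)
            return real_cache[i].data;                     /* real-cache hit */

    for (int i = 0; i < VIRTUAL_SLOTS; i++)
        if (virt_cache[i].valid && virt_cache[i].addr == addr) {
            /* Address seen recently but not cached: fill the real cache
             * (simple FIFO victim selection for this sketch). */
            static int victim;
            real_cache[victim] = (real_entry){ addr, 1, backing_read(addr) };
            int data = real_cache[victim].data;
            victim = (victim + 1) % REAL_SLOTS;
            return data;
        }

    /* Miss in both: only remember the address in the virtual cache. */
    static int vvictim;
    virt_cache[vvictim] = (virt_entry){ addr, 1 };
    vvictim = (vvictim + 1) % VIRTUAL_SLOTS;
    return backing_read(addr);
}

int main(void)
{
    cache_read(100);                 /* first access: recorded in virtual cache   */
    cache_read(100);                 /* second access: filled into the real cache */
    printf("third read of 100 -> %d\n", cache_read(100));    /* real-cache hit    */
    return 0;
}
```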

20 Nov 1995
TL;DR: The technique of static cache simulation is shown to address the issue of predicting cache behavior, contrary to the belief that cache memories introduce unpredictability to real-time systems that cannot be efficiently analyzed.
Abstract: This work takes a fresh look at the simulation of cache memories. It introduces the technique of static cache simulation that statically predicts a large portion of cache references. To efficiently utilize this technique, a method to perform efficient on-the-fly analysis of programs in general is developed and proved correct. This method is combined with static cache simulation for a number of applications. The application of fast instruction cache analysis provides a new framework to evaluate instruction cache memories that outperforms even the fastest techniques published. Static cache simulation is shown to address the issue of predicting cache behavior, contrary to the belief that cache memories introduce unpredictability to real-time systems that cannot be efficiently analyzed. Static cache simulation for instruction caches provides a large degree of predictability for real-time systems. In addition, an architectural modification through bit-encoding is introduced that provides fully predictable caching behavior. Even for regular instruction caches without architectural modifications, tight bounds for the execution time of real-time programs can be derived from the information provided by the static cache simulator. Finally, the debugging of real-time applications can be enhanced by displaying the timing information of the debugged program at breakpoints. The timing information is determined by simulating the instruction cache behavior during program execution and can be used, for example, to detect missed deadlines and locate time-consuming code portions. Overall, the technique of static cache simulation provides a novel approach to analyze cache memories and has been shown to be very efficient for numerous applications.

Patent
31 Mar 1995
TL;DR: In this paper, a multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller, where each data processor includes a master interface having master classes for sending memory transaction requests to the system controller.
Abstract: A multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. At least two of the sub-systems are data processors, each having a respective cache memory that stores multiple blocks of data and a respective master cache index. Each master cache index has a set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. Each data processor includes a master interface having master classes for sending memory transaction requests to the system controller. The system controller includes memory transaction request logic for processing each memory transaction request by a data processor. The system controller maintains a duplicate cache index having a set of duplicate cache tags (Dtags) for each data processor. Each data processor has a writeback buffer for storing the data block previously stored in a victimized cache line until its respective writeback transaction is completed and an Nth+1 Dtag for storing the cache state of a cache line associated with a read transaction which is executed prior to an associated writeback transaction of a read-writeback transaction pair. Accordingly, upon a cache miss, the interconnect may execute the read and writeback transactions in parallel relying on the writeback buffer or Nth+1 Dtag to accommodate any ordering of the transactions.

Proceedings ArticleDOI
01 Dec 1995
TL;DR: This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs; the technique 'preloads' the data that are likely to cause a cache miss before they are used, thereby hiding the cache miss latency.
Abstract: Previous research on hiding memory latencies has tended to focus on regular numerical programs. This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs. By assuming a lock-up free cache and instruction score-boarding, our technique 'preloads' the data that are likely to cause a cache miss before they are used, thereby hiding the cache miss latency. We have developed simple compiler heuristics to identify load instructions that are likely to cause a cache miss. Experimentation with a set of SPEC92 benchmarks shows that our heuristics are successful in identifying 85% of cache misses. We have also developed an algorithm that flexibly schedules the selected load instructions and the instructions that use the loaded data to hide memory latency. Our simulation suggests that our technique is successful in hiding memory latency and improves the overall performance.
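
The effect of the 'preload' transformation can be approximated by hand with a compiler prefetch intrinsic. The sketch below applies the idea to a pointer-chasing loop, the kind of access such heuristics flag as likely to miss; it uses the GCC/Clang __builtin_prefetch builtin rather than the paper's compiler pass, and the data structure is an illustrative assumption.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct node { struct node *next; long payload; } node;

/* Sum a linked list, prefetching the next node while the current one is used.
 * This mimics, by hand, a preload inserted before a load that is likely to miss. */
long sum_list(node *head)
{
    long total = 0;
    for (node *p = head; p != NULL; p = p->next) {
        if (p->next)
            __builtin_prefetch(p->next, 0, 1);  /* read prefetch, low temporal locality */
        total += p->payload;
    }
    return total;
}

int main(void)
{
    enum { N = 1000 };
    node *nodes = malloc(N * sizeof *nodes);
    if (!nodes) return 1;
    for (int i = 0; i < N; i++) {
        nodes[i].payload = i;
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
    }
    printf("sum = %ld\n", sum_list(&nodes[0]));  /* 0 + 1 + ... + 999 = 499500 */
    free(nodes);
    return 0;
}
```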

Patent
31 Mar 1995
TL;DR: In this article, a multiprocessor computer system has a multiplicity of sub-systems and a main memory coupled to a system controller, and the system controller maintains a set of duplicate cache tags (Dtags) for each data processor.
Abstract: A multiprocessor computer system has a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. All of the sub-systems include a port that transmits and receives data as data packets of a fixed size. At least two of the sub-systems are data processors, each having a respective cache memory and a respective set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. The system controller maintains a set of duplicate cache tags (Dtags) for each of the data processors. The data processors each include master cache logic for updating the master cache tags, while the system controller includes logic for updating the duplicate cache tags. Memory transaction request logic simultaneously looks up the second cache tag in each of the sets of duplicate cache tags corresponding to the memory transaction request. It then determines which one of the cache memories and main memory to couple to the requesting data processor based on the second cache states and the address tags stored in the corresponding second cache tags. Duplicate cache update logic simultaneously updates all of the corresponding second cache tags in accordance with predefined cache tag update criteria.

Patent
03 Nov 1995
TL;DR: In this paper, a doubly linked list is used to track the most recently used channels and the corresponding entry is moved to the top of the list as cached channel information is accessed, and the bottom pointer points to the channel data to be removed from the cache.
Abstract: An on-chip cache memory is used to provide a high speed access mechanism to frequently used channel state information for operation of a DMA device that supports multiple virtual channels in a high speed network interface. When an access to a particular channel state is performed, e.g., by a host processor or the DMA device, the cache is first accessed and if the state information is not located currently in the cache, external memory is read and the state information is written to the cache. As the cache does not store all the states stored in external memory, replacement algorithms are utilized to determine which channel state information to remove from the cache in order to provide room to store a recently accessed channel. A doubly linked list is used to track the most recently used channel. As cached channel information is accessed, the corresponding entry is moved to the top of the list. The doubly linked list provides a rapid apparatus and method for updating pointers to the cache. Top and bottom pointers are maintained, pointing to the most recently used and least recently used channels. When a channel is used, it is moved to the top of the list. When channel data is moved from external memory to the cache, the bottom pointer points to the channel data to be removed from the cache.
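
The doubly linked list bookkeeping described above is a conventional LRU structure. A minimal C sketch, with hypothetical channel-state slots: touching a channel moves its node to the top of the list, and the bottom pointer always names the entry to replace.

```c
#include <stdio.h>

#define NUM_CHANNELS 4               /* hypothetical on-chip cache capacity */

typedef struct chan {
    int          channel_id;
    struct chan *prev, *next;        /* doubly linked LRU list links        */
} chan;

static chan  slots[NUM_CHANNELS];
static chan *top, *bottom;           /* most / least recently used          */

static void unlink_node(chan *c)
{
    if (c->prev) c->prev->next = c->next; else top    = c->next;
    if (c->next) c->next->prev = c->prev; else bottom = c->prev;
}

static void push_top(chan *c)
{
    c->prev = NULL;
    c->next = top;
    if (top) top->prev = c; else bottom = c;
    top = c;
}

/* Mark channel slot 'c' as most recently used. */
static void touch(chan *c) { unlink_node(c); push_top(c); }

int main(void)
{
    for (int i = 0; i < NUM_CHANNELS; i++) {   /* fill cache; channel 3 ends up MRU */
        slots[i].channel_id = i;
        push_top(&slots[i]);
    }
    touch(&slots[2]);                          /* channel 2 just used               */

    printf("victim (bottom of list): channel %d\n", bottom->channel_id);
    printf("most recently used:      channel %d\n", top->channel_id);
    return 0;
}
```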

Patent
18 Dec 1995
TL;DR: In this article, an x86 microprocessor system with a process identification system which stores a number assigned to each process run by the microprocessor systems and associates this number with instructions, data, and information fetched and stored in a cache or translation lookaside buffer (TLB) during the execution of the process is described.
Abstract: An x86 microprocessor system with a process identification system which stores a number assigned to each process run by the microprocessor system and associates this number with instructions, data, and information fetched and stored in a cache or translation lookaside buffer (TLB) during the execution of the process. Upon a process or context switch, the instructions, data, and information are not automatically flushed from the cache and TLB. The instructions, data, and information are replaced only when instructions, data, and information for a new process require the same cache memory locations or the same TLB memory location. The cache and TLB may include a valid bit block and a tag block that includes memory locations for storing the pertinent process identification number for each entry. The cache, which may be a set associative cache, and TLB include logic for comparing a process identification number stored in a process identification register with the process identification number stored in the tag block.
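
Tagging entries with a process identification number means a context switch only changes the current identifier instead of flushing the cache or TLB. The sketch below models that behavior for a tiny software TLB; the structures and sizes are hypothetical and stand in for the patent's hardware.

```c
#include <stdio.h>

#define TLB_ENTRIES 4

typedef struct {
    unsigned vpn, pfn;     /* virtual / physical page numbers        */
    unsigned pid;          /* process identification number tag      */
    int      valid;
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];
static unsigned  current_pid;       /* set on a context switch; no flush */

/* Look up a virtual page; a hit requires both the VPN and the PID to match. */
static int tlb_lookup(unsigned vpn, unsigned *pfn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].pid == current_pid) {
            *pfn_out = tlb[i].pfn;
            return 1;
        }
    return 0;                       /* miss: would be refilled from page tables */
}

int main(void)
{
    current_pid = 1;
    tlb[0] = (tlb_entry){ .vpn = 0x10, .pfn = 0x80, .pid = 1, .valid = 1 };

    unsigned pfn;
    printf("pid 1 lookup: %s\n", tlb_lookup(0x10, &pfn) ? "hit" : "miss");

    current_pid = 2;                /* context switch: entries are NOT flushed  */
    printf("pid 2 lookup: %s\n", tlb_lookup(0x10, &pfn) ? "hit" : "miss");

    current_pid = 1;                /* switch back: old entry is still usable   */
    printf("pid 1 again:  %s\n", tlb_lookup(0x10, &pfn) ? "hit" : "miss");
    return 0;
}
```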

Proceedings ArticleDOI
04 Jan 1995
TL;DR: Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs which tend to be accessed in a consecutive sequence and can achieve an order of magnitude energy reduction on caches.
Abstract: Caches usually consume a significant amount of energy in modern microprocessors (e.g., superpipelined or superscalar processors). In this paper, we examine contemporary cache design techniques and provide an analytical model for estimating cache energy consumption. We also present several novel techniques for designing energy-efficient caches, which include block buffering, cache sub-banking, and Gray code addressing. Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs, which tend to be accessed in a consecutive sequence. Cache sub-banking is ideal for both instruction and data caches. Overall, these techniques can achieve an order of magnitude energy reduction on caches.
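
Gray code addressing helps because consecutive addresses differ in exactly one bit, so sequential instruction fetches toggle fewer address lines. The short C check below demonstrates that property with the standard binary-to-Gray conversion; it is an illustration of the principle, not the paper's circuit or energy model.

```c
#include <stdio.h>

/* Standard binary-to-Gray conversion. */
static unsigned to_gray(unsigned n) { return n ^ (n >> 1); }

/* Count how many bits change between two encodings. */
static int bit_flips(unsigned a, unsigned b)
{
    unsigned d = a ^ b;
    int c = 0;
    while (d) { c += d & 1u; d >>= 1; }
    return c;
}

int main(void)
{
    int binary_flips = 0, gray_flips = 0;

    /* Walk 256 consecutive addresses, as a sequential instruction fetch would. */
    for (unsigned a = 0; a < 255; a++) {
        binary_flips += bit_flips(a, a + 1);
        gray_flips   += bit_flips(to_gray(a), to_gray(a + 1));
    }
    printf("address-bit transitions, binary: %d\n", binary_flips);  /* about 2 per step */
    printf("address-bit transitions, Gray:   %d\n", gray_flips);    /* exactly 1 per step */
    return 0;
}
```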

Patent
07 Jul 1995
TL;DR: In this paper, the PCI-bus controller receives a request from a PCIbus master to transfer data with an address in secondary memory, and the controller performs an initial inquire cycle and withholds TRDY# to the master until any write-back cycle completes.
Abstract: When a PCI-bus controller receives a request from a PCI-bus master to transfer data with an address in secondary memory, the controller performs an initial inquire cycle and withholds TRDY# to the PCI-bus master until any write-back cycle completes. The controller then allows the burst access to take place between secondary memory and the PCI-bus master, and simultaneously and predictively, performs an inquire cycle of the L1 cache for the next cache line. In this manner, if the PCI burst continues past the cache line boundary, the new inquire cycle will already have taken place, or will already be in progress, thereby allowing the burst to proceed with, at most, a short delay. Predictive snoop cycles are not performed if the first transfer of a PCI-bus master access would be the last transfer before a cache line boundary is reached.

Patent
Akio Shigeeda
15 Mar 1995
TL;DR: In this paper, an electronic device for use in a computer system, and having a small second-level write-back cache, is disclosed, where the device may be implemented into a single integrated circuit, as a microprocessor unit, to include a micro processor core, a memory controller circuit, and first and second level caches.
Abstract: An electronic device for use in a computer system, and having a small second-level write-back cache, is disclosed. The device may be implemented into a single integrated circuit, as a microprocessor unit, to include a microprocessor core, a memory controller circuit, and first and second level caches. In a system implementation, the device is connected to external dynamic random access memory (DRAM). The first level cache is a write-through cache, while the second level cache is a write-back cache that is much smaller than the first level cache. In operation, a write access that is a cache hit in the second level cache writes to the second level cache, rather than to DRAM, thus saving a wait state. A dirty bit is set for each modified entry in the second level cache. Upon the second level cache being full of modified data, a cache flush to DRAM is automatically performed. In addition, each entry of the second level cache is flushed to DRAM upon each of its byte locations being modified. The computer system may also include one or more additional integrated circuit devices, such as a direct memory access (DMA) circuit and a bus bridge interface circuit for bidirectional communication with the microprocessor unit. The microprocessor unit may also include handshaking control to prohibit configuration register updating when a memory access is in progress or is imminent. The disclosed microprocessor unit also includes circuitry for determining memory bank size and memory address type.

Patent
31 Aug 1995
TL;DR: In this article, a superscalar microprocessor employing a way prediction structure is provided, which predicts a way of an associative cache in which an access will hit, and causes the data bytes from the predicted way to be conveyed as the output of the cache.
Abstract: A superscalar microprocessor employing a way prediction structure is provided. The way prediction structure predicts a way of an associative cache in which an access will hit, and causes the data bytes from the predicted way to be conveyed as the output of the cache. The typical tag comparisons to the request address are bypassed for data byte selection, causing the access time of the associative cache to be substantially the access time of the direct-mapped way prediction array within the way prediction structure. Also included in the way prediction structure is a way prediction control unit configured to update the way prediction array when an incorrect way prediction is detected. The clock cycle of the superscalar microprocessor including the way prediction structure with its caches may be increased if the cache access time is limiting the clock cycle. Additionally, the associative cache may be retained in the high frequency superscalar microprocessor (which might otherwise employ a direct-mapped cache for access time reasons). Single clock cycle cache access to an associative data cache is maintained for high frequency operation.

Proceedings ArticleDOI
01 May 1995
TL;DR: This paper presents the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores, and responds well to software support, in many cases providing better program speedups and reducing cache bandwidth requirements.
Abstract: For many programs, especially integer codes, untolerated load instruction latencies account for a significant portion of total execution time. In this paper, we present the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores.Our approach works by predicting early in the pipeline (part of) the effective address of a memory access and using this predicted address to speculatively access the data cache. If the prediction is correct, the cache access is overlapped with non-speculative effective address calculation. Otherwise, the cache is accessed again in the following cycle, this time using the correct effective address. The impact on the cache access critical path is minimal; the prediction circuitry adds only a single OR operation before cache access can commence. In addition, verification of the predicted effective address is completely decoupled from the cache access critical path.Analyses of program reference behavior and subsequent performance analysis of this approach shows that this design is a good one, servicing enough accesses early enough to result in speedups for all the programs we tested. Our approach also responds well to software support, which can significantly reduce the number of mispredicted effective addresses, in many cases providing better program speedups and reducing cache bandwidth requirements.
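
The essence of fast address generation is a carry-free calculation: predict the effective address as base OR offset, which equals base + offset whenever the operands share no set bits, and fall back to the full add when verification fails. The C sketch below illustrates only that general idea under assumed inputs; the paper's pipeline integration and index-bit handling are more involved.

```c
#include <stdio.h>

/* Predict the effective address with a carry-free OR; verify against the real sum. */
static unsigned predict_ea(unsigned base, unsigned offset, int *correct)
{
    unsigned predicted = base | offset;        /* single OR, available very early   */
    unsigned actual    = base + offset;        /* full add, available a cycle later */
    *correct = (predicted == actual);          /* true whenever (base & offset) == 0 */
    return *correct ? predicted : actual;
}

int main(void)
{
    int ok;

    /* Typical case: small offset into an aligned base, no carries -> prediction holds. */
    predict_ea(0x1000, 0x24, &ok);
    printf("0x1000 + 0x24: prediction %s\n", ok ? "correct" : "wrong, replay access");

    /* Overlapping bits generate a carry -> the speculative access must be redone. */
    predict_ea(0x1030, 0x24, &ok);
    printf("0x1030 + 0x24: prediction %s\n", ok ? "correct" : "wrong, replay access");
    return 0;
}
```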

Patent
Michael Kagan, David Perlmutter
10 May 1995
TL;DR: In this article, a multiprocessor computer system which maintains cache coherency includes first and second microprocessors each having an associated cache memory storing lines of data; each line of data has associated protocol bits that indicate a protocol state consistent with write-through, write-back, or write-once cache coherency policies that are selected via a protocol selection terminal for different system configurations.
Abstract: A multiprocessor computer system which maintains cache coherency includes first and second microprocessors each having an associated cache memory storing lines of data. Each line of data has associated protocol bits that indicate a protocol state consistent with write-through, write-back, or write-once cache coherency policies that are selected via a protocol selection terminal for different system configurations. In one configuration, the output and external address terminals of the first microprocessor are coupled to the external and output address terminals, respectively, of the second microprocessor. This configuration enables each microprocessor to snoop memory cycles to main memory initiated by the other microprocessor so that it can be readily determined if a particular cache has the latest version of data.

Book
01 Jan 1995
TL;DR: It is found that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components.
Abstract: The usefulness of shared-data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency are addressed. The research presented is part of an effort to build a high-performance, large-scale multiprocessor. The various classes of cache directory schemes are described, and a method of measuring cache coherence is presented. The various directory schemes are analyzed, and ways of improving the performance of directories are considered. It is found that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components.

Patent
Asit Dan, Dinkar Sitaram
31 Jul 1995
TL;DR: In this article, a system and method for caching sequential data streams in a cache storage device is presented, where, for each information stream, a determination is made as to whether its data blocks should be discarded from cache as they are read by a consuming process.
Abstract: A system and method for caching sequential data streams in a cache storage device. For each information stream, a determination is made as to whether its data blocks should be discarded from cache as they are read by a consuming process. Responsive to a determination that the data blocks of a stream should be discarded from the cache as they are read by the consuming process, the data blocks associated with that stream are cached in accordance with an interval caching algorithm. Alternatively, responsive to a determination that the data blocks of a stream should not be discarded from the cache storage device as they are read by the consuming process, the data blocks of that stream are cached in accordance with a segment caching algorithm.

Proceedings ArticleDOI
22 Jan 1995
TL;DR: This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality, and proposes an algorithm to expose these localities and reduce interference.
Abstract: High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference within popular execution paths dominates instruction cache misses. Based on our observations, we propose an algorithm to expose these localities and reduce interference. For a range of cache sizes, associativities, line sizes, and other organizations, we show that we reduce total instruction miss rates by 31-86% (up to 2.9 absolute points). Using a simple model, this corresponds to execution time reductions on the order of 12-26%. In addition, our optimized operating system combines well with optimized or unoptimized applications.

Patent
23 Oct 1995
TL;DR: In this article, a two-level cache data structure and associated methods are implemented with a RAID controller to reduce the overhead of the RAID controller in determining which blocks are present in the lower level cache.
Abstract: Methods and associated data structures operable in a RAID subsystem to improve I/O performance. A two level cache data structure and associated methods are implemented with a RAID controller. The lower level cache comprises buffers holding recently utilized blocks of the disk devices. The upper level cache records which blocks are present in the lower level cache for each stripe in the RAID level 5 configuration. The upper level cache serves to reduce the overhead processing required of the RAID controller to determine which blocks are present in the lower level cache. Having more rapid access to this information by lowering the processing overhead enables the present invention to rapidly select between different write techniques to post data and error blocks from low level cache to the disk array. A RMW write technique is used to post data and error checking blocks to disk when insufficient information resides in the lower level cache. A faster Full Write technique (also referred to as Stripe Write) is used to post data and error checking blocks to disk when all required, related blocks are resident in the lower level cache. The Full Write technique reduces the total number of I/O operations required of the disk devices to post the update as compared to the RMW technique. The two level cache of the present invention enables a rapid selection between the RMW and Full Write techniques.
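
The choice between the RMW and Full (Stripe) Write techniques hinges on whether every data block of the stripe is resident in the lower level cache. The C sketch below shows the parity arithmetic for both paths and that decision; it is a hypothetical simplification, not the patent's controller logic.

```c
#include <stdio.h>
#include <string.h>

#define STRIPE_BLOCKS 4            /* data blocks per RAID-5 stripe (hypothetical) */
#define BLOCK_BYTES   8

typedef struct {
    unsigned char data[STRIPE_BLOCKS][BLOCK_BYTES];
    int           cached[STRIPE_BLOCKS];    /* is the block in the lower level cache? */
} stripe_cache;

/* Full (Stripe) Write: parity is the XOR of every data block -- no disk reads. */
static void full_write_parity(const stripe_cache *s, unsigned char *parity)
{
    memset(parity, 0, BLOCK_BYTES);
    for (int b = 0; b < STRIPE_BLOCKS; b++)
        for (int i = 0; i < BLOCK_BYTES; i++)
            parity[i] ^= s->data[b][i];
}

/* RMW write: new parity = old parity ^ old data ^ new data (needs two reads). */
static void rmw_parity(const unsigned char *old_parity, const unsigned char *old_data,
                       const unsigned char *new_data, unsigned char *parity)
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}

/* Decision mirrored from the two-level cache: Full Write only if all blocks are cached. */
static int use_full_write(const stripe_cache *s)
{
    for (int b = 0; b < STRIPE_BLOCKS; b++)
        if (!s->cached[b])
            return 0;
    return 1;
}

int main(void)
{
    stripe_cache s = { .cached = { 1, 1, 1, 0 } };   /* one block missing from cache */
    unsigned char parity[BLOCK_BYTES], old_parity[BLOCK_BYTES] = { 0 };

    if (use_full_write(&s)) {
        full_write_parity(&s, parity);
        printf("Full (Stripe) Write: parity computed from cached blocks only\n");
    } else {
        rmw_parity(old_parity, s.data[3], s.data[3], parity);  /* dummy update */
        printf("RMW: read old data + old parity, then write new data + parity\n");
    }
    return 0;
}
```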

Patent
18 Apr 1995
TL;DR: In this article, the authors propose a self-recovery mechanism for errors in the associated cache directory or the shared cache itself by invalidating all the entries in the cache directory of the accessed congruence class by resetting Valid bits to "0" and setting the Parity bit to a correct value.
Abstract: A high available shared cache memory in a tightly coupled multiprocessor system provides an error self-recovery mechanism for errors in the associated cache directory or the shared cache itself. After an error in a congruence class of the cache is indicated by an error status register, self-recovery is accomplished by invalidating all the entries in the shared cache directory means of the accessed congruence class by resetting Valid bits to '0' and by setting the Parity bit to a correct value, wherein the request for data to the main memory is not cancelled. Multiple bit failures in the cached data are recovered by setting the Valid bit in the matching column to '0'. The processor reissues the request for data, which is loaded into the processor's private cache and the shared cache as well. Further requests to this data by other processors are served by the shared cache.

Patent
31 Oct 1995
TL;DR: In this article, the authors present a method and apparatus for controlling multiple cache memories with a single cache controller using a processor to control the operation of its on-chip level one cache memory and a level two cache memory.
Abstract: A method and apparatus for controlling multiple cache memories with a single cache controller. The present invention uses a processor to control the operation of its on-chip level one (L1) cache memory and a level two (L2) cache memory. In this manner, the processor is able to send operations to be performed to the L2 cache memory, such as writing state and/or cache line status to the L2 cache memory. A dedicated bus is coupled between dice. This dedicated bus is used to send control and other signals between the processor and the L2 cache memory.