
Showing papers on "Smart Cache published in 1994"


Journal ArticleDOI
TL;DR: This article describes a caching strategy that offers the performance of a cache twice its size and investigates three cache replacement algorithms: random replacement, least recently used, and a frequency-based variation of LRU known as segmented LRU (SLRU).
Abstract: I/O subsystem manufacturers attempt to reduce latency by increasing disk rotation speeds, incorporating more intelligent disk scheduling algorithms, increasing I/O bus speed, using solid-state disks, and implementing caches at various places in the I/O stream. In this article, we examine the use of caching as a means to improve system response time and the data throughput of the disk subsystem. Caching can help to alleviate I/O subsystem bottlenecks caused by mechanical latencies. This article describes a caching strategy that offers the performance of a cache twice its size. After explaining some basic caching issues, we examine some popular caching strategies and cache replacement algorithms, as well as the advantages and disadvantages of caching at different levels of the computer system hierarchy. Finally, we investigate the performance of three cache replacement algorithms: random replacement (RR), least recently used (LRU), and a frequency-based variation of LRU known as segmented LRU (SLRU).

325 citations
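
To make the segmented LRU (SLRU) policy from the entry above concrete, here is a minimal Python sketch. It assumes a two-segment design with fixed probationary and protected sizes; the segment sizing and any tuning discussed in the article are not reproduced.

```python
from collections import OrderedDict

class SLRUCache:
    """Minimal segmented LRU: misses enter a probationary segment; a hit
    there promotes the block to a protected segment.  Blocks demoted from
    the protected segment fall back into the probationary segment, and the
    overall victim is the LRU block of the probationary segment."""

    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()   # oldest entry first
        self.protected = OrderedDict()

    def access(self, block):
        if block in self.protected:          # hit: refresh recency
            self.protected.move_to_end(block)
            return True
        if block in self.probation:          # second touch: promote
            del self.probation[block]
            self.protected[block] = None
            if len(self.protected) > self.protected_size:
                demoted, _ = self.protected.popitem(last=False)
                self._insert_probation(demoted)
            return True
        self._insert_probation(block)        # miss: enter probation
        return False

    def _insert_probation(self, block):
        self.probation[block] = None
        self.probation.move_to_end(block)
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)   # evict the cache victim
```

Blocks referenced only once never displace repeatedly referenced blocks from the protected segment, which is what gives SLRU its frequency-based behavior relative to plain LRU.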


Journal ArticleDOI
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
Abstract: The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.

265 citations
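
The distinction the paper draws (different processors writing different words of one block) can be checked mechanically from a write trace. The sketch below is only illustrative; the trace, block size and block numbering are made up, and it is not the paper's analysis tool.

```python
from collections import defaultdict

def false_sharing_candidates(writes, block_size):
    """writes: iterable of (processor_id, word_address) pairs.
    Returns block numbers written by more than one processor at more than
    one distinct word address -- blocks where interleaved updates would
    ping-pong between caches even though no word is actually shared."""
    writers = defaultdict(set)    # block -> processors writing it
    words = defaultdict(set)      # block -> distinct word addresses written
    for cpu, addr in writes:
        block = addr // block_size
        writers[block].add(cpu)
        words[block].add(addr)
    return {b for b in writers if len(writers[b]) > 1 and len(words[b]) > 1}

# Two processors updating adjacent words of the same 32-byte block:
trace = [(0, 0x1000), (1, 0x1004), (0, 0x1000), (1, 0x1004)]
print(false_sharing_candidates(trace, block_size=32))   # {128}
```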


Journal ArticleDOI
TL;DR: It is shown that cache profiling, using the CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.
Abstract: A vital tool-box component, the CProf cache profiling system lets programmers identify hot spots by providing cache performance information at the source-line and data-structure level. Our purpose is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.

242 citations


Proceedings ArticleDOI
07 Dec 1994
TL;DR: This paper describes an approach for bounding the worst-case instruction cache performance of large code segments by using static cache simulation to analyze a program's control flow to statically categorize the caching behavior of each instruction.
Abstract: The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable, since the behavior of a cache reference depends upon the history of the previous references. The use of caches is only suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.

233 citations
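
As an illustration of the categorization step only (not the paper's control-flow analysis), the sketch below assumes a single loop over a direct-mapped instruction cache and assigns each instruction a conservative category. The category names follow the paper's spirit; everything else is a deliberately simplified stand-in.

```python
def categorize_loop_instructions(instr_addrs, line_size, num_lines):
    """Toy stand-in for static cache simulation: categorize the instructions
    of a single loop body (addresses in execution order) for a direct-mapped
    instruction cache.  'always-hit' = same line as the preceding instruction,
    'first-miss' = miss on the first iteration only, 'always-miss' =
    conservative result when program lines conflict within the loop."""
    prog_line = [a // line_size for a in instr_addrs]     # memory line of each instr
    cache_set = [l % num_lines for l in prog_line]        # direct-mapped index

    lines_in_set = {}
    for l, s in zip(prog_line, cache_set):
        lines_in_set.setdefault(s, set()).add(l)

    categories = []
    for i, addr in enumerate(instr_addrs):
        if i > 0 and prog_line[i] == prog_line[i - 1]:
            cat = 'always-hit'        # same line as the preceding instruction
        elif len(lines_in_set[cache_set[i]]) > 1:
            cat = 'always-miss'       # conflicting lines map to this set
        else:
            cat = 'first-miss'        # loaded once, then resident for the loop
        categories.append((hex(addr), cat))
    return categories

# 8-byte lines, 4 cache lines: 0x00/0x04 conflict with 0x20; 0x48 is alone in its set.
print(categorize_loop_instructions([0x00, 0x04, 0x20, 0x48], line_size=8, num_lines=4))
# [('0x0', 'always-miss'), ('0x4', 'always-hit'), ('0x20', 'always-miss'), ('0x48', 'first-miss')]
```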


Proceedings Article
12 Sep 1994
TL;DR: It is shown that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of the cache, and new algorithms run 8%-200% faster than the traditional ones.
Abstract: The current main memory (DRAM) access speeds lag far behind CPU speeds. Cache memory, made of static RAM, is being used in today's architectures to bridge this gap. It provides access latencies of 2-4 processor cycles, in contrast to main memory which requires 15-25 cycles. Therefore, the performance of the CPU depends upon how well the cache can be utilized. We show that there are significant benefits in redesigning our traditional query processing algorithms so that they can make better use of the cache. The new algorithms run 8%-200% faster than the traditional ones.

215 citations
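
One cache-conscious technique in this spirit is to partition the inputs so that each working hash table fits in cache. The sketch below is an illustrative Python rendition, not the paper's algorithms; the partition count, dict-based rows and `key` argument are assumptions.

```python
def partitioned_hash_join(build_rows, probe_rows, key, num_partitions=64):
    """Cache-conscious hash join sketch: partition both inputs first so each
    build-side hash table is small enough to stay cache-resident, then join
    the matching partitions independently."""
    def partition(rows):
        parts = [[] for _ in range(num_partitions)]
        for row in rows:
            parts[hash(row[key]) % num_partitions].append(row)
        return parts

    out = []
    for bpart, ppart in zip(partition(build_rows), partition(probe_rows)):
        table = {}
        for row in bpart:                   # build a small, hot hash table
            table.setdefault(row[key], []).append(row)
        for row in ppart:                   # probe while the table is still cached
            for match in table.get(row[key], []):
                out.append((match, row))
    return out

r = [{'id': i, 'x': i * 10} for i in range(1000)]
s = [{'id': i % 1000, 'y': i} for i in range(5000)]
print(len(partitioned_hash_join(r, s, key='id')))   # 5000 matches
```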


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
Abstract: The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.

195 citations
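
A minimal behavioral sketch of two-level exclusive caching follows, assuming fully associative LRU at both levels in place of the paper's set-associative hardware: a block resides in exactly one level, an L2 hit moves the block up, and the L1 victim moves down.

```python
from collections import OrderedDict

class ExclusiveTwoLevelCache:
    """Two-level exclusive caching sketch: L1 and L2 never hold the same
    block, so together they behave like one larger, more associative cache."""

    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()        # oldest entry first
        self.l2 = OrderedDict()
        self.l1_size, self.l2_size = l1_size, l2_size

    def access(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)
            return 'L1 hit'
        if block in self.l2:
            del self.l2[block]          # exclusive: leaves L2 when promoted
            self._fill_l1(block)
            return 'L2 hit'
        self._fill_l1(block)            # miss: fetch from memory into L1 only
        return 'miss'

    def _fill_l1(self, block):
        self.l1[block] = None
        if len(self.l1) > self.l1_size:
            victim, _ = self.l1.popitem(last=False)
            self.l2[victim] = None      # demote the L1 victim into L2
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)
```

Because a demoted L1 victim always goes to L2 rather than being duplicated there, the two levels never waste capacity on copies, which is the source of the extra effective capacity and associativity the paper reports.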


Journal ArticleDOI
01 Nov 1994
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
Abstract: This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.

187 citations
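
A rough software model of the CML buffer's role is sketched below, assuming a simple per-page miss counter and a fixed threshold; the hardware summarization details and the operating system's remapping policy are abstracted away.

```python
from collections import Counter

class CacheMissLookaside:
    """Sketch of a Cache Miss Lookaside buffer: record cache misses
    summarized per virtual page and report pages whose miss counts suggest
    conflicts worth removing by remapping.  Threshold, page size and the
    decay step are illustrative parameters, not the paper's."""

    def __init__(self, page_size=4096, threshold=64):
        self.page_size = page_size
        self.threshold = threshold
        self.miss_counts = Counter()

    def record_miss(self, address):
        page = address // self.page_size
        self.miss_counts[page] += 1

    def pages_to_remap(self):
        """Pages with enough misses that the OS should consider remapping
        them to a different cache bin (page color)."""
        return [p for p, n in self.miss_counts.items() if n >= self.threshold]

    def decay(self):
        """Periodically age the history so stale conflicts fade out."""
        self.miss_counts = Counter(
            {p: n // 2 for p, n in self.miss_counts.items() if n // 2})
```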


Proceedings Article
06 Jun 1994
TL;DR: The main contribution of this paper is the solution to the allocation problem, which allows processes to manage their own cache blocks while maintaining the dynamic allocation of cache blocks among processes.
Abstract: We consider how to improve the performance of file caching by allowing user-level control over file cache replacement decisions. We use two-level cache management: the kernel allocates physical pages to individual applications (allocation), and each application is responsible for deciding how to use its physical pages (replacement). Previous work on two-level memory management has focused on replacement, largely ignoring allocation. The main contribution of this paper is our solution to the allocation problem. Our solution allows processes to manage their own cache blocks while maintaining the dynamic allocation of cache blocks among processes. It ensures that good user-level policies can improve the file cache hit ratios of the entire system over the existing replacement approach. We evaluate our scheme by trace-based simulation, demonstrating that it leads to significant improvements in hit ratios for a variety of applications.

138 citations
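
The allocation/replacement split can be illustrated with a small sketch. The kernel-side policy shown (charge the process owning the globally least recently used block) and the callback interface are assumptions for illustration, not the paper's actual design.

```python
from collections import OrderedDict

class TwoLevelFileCache:
    """Sketch of two-level cache management: the kernel decides which process
    must give up a block (allocation); that process decides which of its own
    blocks to release (replacement) through a user-supplied policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.global_lru = OrderedDict()   # (pid, block) -> None, oldest first
        self.resident = {}                # pid -> set of resident blocks
        self.choose_victim = {}           # pid -> callable(blocks) -> block

    def register(self, pid, choose_victim):
        self.resident[pid] = set()
        self.choose_victim[pid] = choose_victim

    def access(self, pid, block):
        hit = block in self.resident[pid]
        if not hit and sum(map(len, self.resident.values())) >= self.capacity:
            victim_pid = next(iter(self.global_lru))[0]            # allocation
            victim_blk = self.choose_victim[victim_pid](
                self.resident[victim_pid])                         # replacement
            self.resident[victim_pid].discard(victim_blk)
            self.global_lru.pop((victim_pid, victim_blk), None)
        self.resident[pid].add(block)
        self.global_lru[(pid, block)] = None
        self.global_lru.move_to_end((pid, block))
        return hit

cache = TwoLevelFileCache(capacity=100)
cache.register(1, choose_victim=lambda blocks: max(blocks))   # one app's own policy
cache.register(2, choose_victim=lambda blocks: min(blocks))   # another app's policy
```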


Proceedings ArticleDOI
14 Nov 1994
TL;DR: It is demonstrated that for applications that do not perform well under traditional caching policies, the combination of good application-chosen replacement strategies, and the kernel allocation policy LRU-SP, can reduce the number of block I/Os and reduce the elapsed time by up to 45%.
Abstract: Traditional file system implementations do not allow applications to control file caching replacement decisions. We have implemented two-level replacement, a scheme that allows applications to control their own cache replacement, while letting the kernel control the allocation of cache space among processes. We designed an interface to let applications exert control on replacement via a set of directives to the kernel. This is effective and requires low overhead. We demonstrate that for applications that do not perform well under traditional caching policies, the combination of good application-chosen replacement strategies and our kernel allocation policy LRU-SP can reduce the number of block I/Os by up to 80%, and can reduce the elapsed time by up to 45%. We also show that LRU-SP is crucial to the performance improvement for multiple concurrent applications: LRU-SP fairly distributes cache blocks and offers protection against foolish applications.

136 citations


Patent
26 Apr 1994
TL;DR: In this article, a hierarchical memory system is provided which includes a cache and long-term storage, in which an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage.
Abstract: A data processing system (10) has a processor with a processor memory and a mechanism for specifying an address that corresponds to a processor-requested data block located within another memory to be accessed by the processor. A hierarchical memory system is provided which includes a cache (16) and long-term storage (20). In accordance with a mapping and meshing process performed by a memory subsystem (22), an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage. In accordance with a cache drain mechanism, data is drained from the cache to the physical target disks under different specified conditions. A further mechanism is provided for preserving data within the cache that is frequently accessed by the requesting processor. A user-configuration mechanism is provided.

133 citations
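
A toy illustration of the meshing idea, using simple modulo striping; the patent's actual mapping and second addressing scheme are not specified here, and the disk count and stripe size are made up.

```python
def mesh(block_number, num_disks, blocks_per_stripe=1):
    """Illustrative meshing: spread consecutive (proximate) block numbers
    across different physical target disks so adjacent blocks can be drained
    or fetched in parallel.  Returns (disk index, offset on that disk)."""
    stripe = block_number // blocks_per_stripe
    disk = stripe % num_disks
    offset = (stripe // num_disks) * blocks_per_stripe + block_number % blocks_per_stripe
    return disk, offset

# Consecutive blocks land on different disks:
print([mesh(b, num_disks=4) for b in range(8)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```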


Patent
23 Dec 1994
TL;DR: In this paper, a column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time is proposed, where the cache lines represent a column of sets.
Abstract: A column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time. The column-associative cache indexes data from a main memory into a plurality of cache lines according to a tag and index field through hash and rehash functions. The cache lines represent a column of sets. Each cache line contains a rehash block indicating whether the set is a rehash location. To increase the performance of the column-associative cache, a content addressable memory (CAM) is used to predict future conflict misses.
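
A behavioral sketch of the hash/rehash lookup with a per-line rehash bit follows. The swap-on-rehash-hit step and the CAM-based conflict prediction described in the patent are omitted, and the replacement choice shown is a simplification.

```python
class ColumnAssociativeCache:
    """Column-associative lookup sketch: a direct-mapped array probed first
    at the hash index and, on a miss there, at a rehash index (the index
    with its top bit flipped).  Each line's rehash bit marks data living at
    its rehash location."""

    def __init__(self, num_lines=8, block_size=32):
        assert num_lines % 2 == 0
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines
        self.rehash_bit = [False] * num_lines

    def access(self, address):
        block = address // self.block_size
        i = block % self.num_lines                 # first probe (hash)
        if self.tags[i] == block:
            return 'hit'
        j = i ^ (self.num_lines // 2)              # second probe (rehash)
        if self.tags[j] == block:
            return 'rehash hit'                    # found, but one cycle later
        # Miss: reclaim the first-probe slot if it only holds rehashed data,
        # otherwise displace the rehash slot and mark the newcomer as rehashed.
        if self.tags[i] is None or self.rehash_bit[i]:
            self.tags[i], self.rehash_bit[i] = block, False
        else:
            self.tags[j], self.rehash_bit[j] = block, True
        return 'miss'
```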

Patent
13 Jul 1994
TL;DR: An analytical model for calculating cache hit rate for combinations of data sets and LRU sizes is presented; its algorithms can be implemented directly in software to predict cache hit rates from statistics accumulated for each element independently.
Abstract: Method and structure for collecting statistics for quantifying locality of data and thus selecting elements to be cached, and then calculating the overall cache hit rate as a function of cached elements. LRU stack distance has a straight-forward probabilistic interpretation and is part of statistics to quantify locality of data for each element considered for caching. Request rates for additional slots in the LRU are a function of file request rate and LRU size. Cache hit rate is a function of locality of data and the relative request rates for data sets. Specific locality parameters for each data set and arrival rate of requests for data-sets are used to produce an analytical model for calculating cache hit rate for combinations of data sets and LRU sizes. This invention provides algorithms that can be directly implemented in software for constructing a precise model that can be used to predict cache hit rates for a cache, using statistics accumulated for each element independently. The model can rank the elements to find the best candidates for caching. Instead of considering the cache as a whole, the average arrival rates and re-reference statistics for each element are estimated, and then used to consider various combinations of elements and cache sizes in predicting the cache hit rate. Cache hit rate is directly calculated using the to-be-cached files' arrival rates and re-reference statistics and used to rank the elements to find the set that produces the optimal cache hit rate.
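
The core calculation can be illustrated as follows. The per-element arrival rates and stack-distance distributions are invented for the example, and the patent's ranking of candidate elements and its search over cache sizes are not reproduced.

```python
def predicted_hit_rate(elements, cache_slots):
    """Sketch of the analytical idea: each element is described by its
    request arrival rate and the probability distribution of its LRU stack
    distance at re-reference time.  The overall hit rate is the rate-weighted
    probability that the stack distance does not exceed the number of slots.

    elements: list of (arrival_rate, stack_distance_probs) where
              stack_distance_probs[d] = P(re-reference occurs at LRU stack
              distance d+1)."""
    total_rate = sum(rate for rate, _ in elements)
    hit_rate = 0.0
    for rate, dist in elements:
        p_hit = sum(dist[:cache_slots])          # P(stack distance <= slots)
        hit_rate += (rate / total_rate) * p_hit
    return hit_rate

# Two data sets: a hot one with tight re-reference locality, a cold one without.
hot = (10.0, [0.5, 0.3, 0.1, 0.05, 0.05])
cold = (2.0, [0.1, 0.1, 0.1, 0.1, 0.6])
print(predicted_hit_rate([hot, cold], cache_slots=3))   # ~0.80
```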

Patent
30 Sep 1994
TL;DR: In this paper, a data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions, where each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor.
Abstract: The data cache unit includes a separate fill buffer and a separate write-back buffer. The fill buffer stores one or more cache lines for transference into data cache banks of the data cache unit. The write-back buffer stores a single cache line evicted from the data cache banks prior to write-back to main memory. Circuitry is provided for transferring a cache line from the fill buffer into the data cache banks while simultaneously transferring a victim cache line from the data cache banks into the write-back buffer. This allows the overall replace operation to be performed in only a single clock cycle. In a particular implementation, the data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions. Moreover, the microprocessor is incorporated within a multiprocessor computer system wherein each microprocessor is capable of snooping the cache lines of the data cache units of the other microprocessors. The data cache unit is also a non-blocking cache.

Proceedings ArticleDOI
18 Apr 1994
TL;DR: A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache introduced in the paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.

Patent
29 Jun 1994
TL;DR: A master-slave cache system as discussed by the authors uses a set-associative master cache and two smaller direct-mapped slave caches, a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for providing data operands to an execution pipeline of the processor.
Abstract: A master-slave cache system has a large, set-associative master cache, and two smaller direct-mapped slave caches, a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for supplying data operands to an execution pipeline of the processor. The master cache and the slave caches are tightly coupled to each other. This tight coupling allows the master cache to perform most cache management operations for the slave caches, freeing the slave caches to supply a high bandwidth of instructions and operands to the processor's pipelines. The master cache contains tags that include valid bits for each slave, allowing the master cache to determine if a line is present and valid in either of the slave caches without interrupting the slave caches. The master cache performs all search operations required by external snooping, cache invalidation, cache data zeroing instructions, and store-to-instruction-stream detection. The master cache interrupts the slave caches only when the search reveals that a line is valid in a slave cache, the master cache causing the slave cache to invalidate the line. A store queue is shared between the master cache and the slave data cache. Store data is written from the store queue directly into both the slave data cache and the master cache, eliminating the need for the slave data cache to write data through to the master cache. The master-slave cache system also eliminates the need for a second set of address tags for snooping and coherency operations. The master cache can be large and designed for a low miss rate, while the slave caches are designed for the high speed required by the processor's pipelines.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The decoupled sectored cache introduced in this paper will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Maintaining a low tag array size is a major issue in many cache designs (e.g. L2 caches). Using a sectored cache is a design trade-off between a low size of the tag array, which is possible with a large line size, and a low memory traffic, which requires a small line size. This technique has been used in many cache designs including small on-chip microprocessor caches and large external second level caches. Unfortunately, as on some applications the miss ratio on a sectored cache is significantly higher than the miss ratio on a non-sectored cache (factors higher than two are commonly observed), a significant part of the potential performance may be wasted in miss penalties. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache we introduce in this paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
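
A highly simplified sketch of one set of a decoupled sectored cache: the set has a few address-tag locations and each resident line records which tag location it is currently bound to, instead of being statically tied to a single tag. The set geometry, binding encoding and replacement choice are invented for illustration and do not reproduce the configurations evaluated in the paper.

```python
class DecoupledSectoredSet:
    """One set of a decoupled sectored cache, in miniature: line slots are
    dynamically bound to one of a small number of address-tag locations."""

    def __init__(self, lines_per_sector=4, tags_per_set=2):
        self.tags = [None] * tags_per_set         # sector addresses
        self.binding = [None] * lines_per_sector  # line offset -> tag index

    def access(self, sector_address, line_offset):
        bound = self.binding[line_offset]
        if bound is not None and self.tags[bound] == sector_address:
            return 'hit'
        # Miss: bind to an existing tag holding this sector, else a free tag,
        # else (simplistically) evict the sector held by tag 0.
        if sector_address in self.tags:
            t = self.tags.index(sector_address)
        elif None in self.tags:
            t = self.tags.index(None)
        else:
            t = 0
            self.binding = [b if b != t else None for b in self.binding]
        self.tags[t] = sector_address
        self.binding[line_offset] = t
        return 'miss'

s = DecoupledSectoredSet()
print(s.access(0xA, 1), s.access(0xB, 2), s.access(0xA, 1))  # miss miss hit
```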

Patent
Konrad K. Lai1
23 Mar 1994
TL;DR: In this paper, a multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized, and the secondary cache responds to the request.
Abstract: A multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized. If a memory access request misses in the primary cache, but hits in the secondary cache, then the secondary cache responds to the request. If, however, the request also misses in the secondary cache, but is found in main memory, then main memory responds to the request. In responding to the request, the secondary cache or main memory returns the requested data to the primary cache. If an address tag of a primary cache victim line does not match an address tag in the secondary cache or the primary cache victim line is dirty, then the victim is stored in the secondary cache. The primary cache victim line includes a first bit for indicating whether the address tag of the primary cache victim line matches an address tag of the secondary cache.
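
The swap-avoidance condition in this patent abstract is simple enough to state directly in code; the snippet below merely restates the rule, with invented function and argument names.

```python
def handle_primary_victim(victim_tag, victim_is_dirty, secondary_tags):
    """The primary-cache victim is written into the secondary cache only when
    it is dirty or its tag is absent from the secondary cache; otherwise the
    secondary cache already holds a current copy and the swap is skipped."""
    if victim_is_dirty or victim_tag not in secondary_tags:
        return 'store victim in secondary cache'
    return 'skip swap (secondary copy is already current)'

print(handle_primary_victim(0x1F, victim_is_dirty=False, secondary_tags={0x1F, 0x2A}))
# skip swap (secondary copy is already current)
```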

Proceedings ArticleDOI
01 Apr 1994
TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.
Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.

Patent
25 Feb 1994
TL;DR: In this article, a method for handling race conditions arising when multiple processors simultaneously write to a particular cache line is presented, where a determination is made as to whether the cache lines are in an exclusive, modified, invalid, or shared state.
Abstract: In a computer system having a plurality of processors with internal caches, a method for handling race conditions arising when multiple processors simultaneously write to a particular cache line. Initially, a determination is made as to whether the cache line is in an exclusive, modified, invalid, or shared state. If the cache line is in either the exclusive or modified state, the cache line is written to and then set to the modified state. If the cache line is in the invalid state, a Bus-Read-Invalidate operation is performed. However, if the cache line is in the shared state and multiple processors initiate Bus-Write-Invalidate operations, the invalidation request belonging to the first processor is allowed to complete. Thereupon, the cache line is set to the exclusive state, data is updated, and the cache line is set to the modified state. The second processor receives a second cache line, updates this second cache line, and sets the second cache line to the modified state.
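
A minimal sketch of the write-handling rules in the abstract, using MESI-style states. The bus and arbitration machinery is reduced to a flag saying whether this processor's invalidation request won the race; everything else is invented scaffolding for illustration.

```python
EXCLUSIVE, MODIFIED, SHARED, INVALID = 'E', 'M', 'S', 'I'

def write_line(state, won_invalidation_race=True):
    """Return the cache-line state after a processor write, per the rules
    summarized in the abstract above."""
    if state in (EXCLUSIVE, MODIFIED):
        return MODIFIED                 # write locally, mark dirty
    if state == INVALID:
        # Bus-Read-Invalidate: fetch the line with ownership, then write.
        return MODIFIED
    if state == SHARED:
        if won_invalidation_race:
            # The first invalidation completes: the line goes exclusive,
            # is updated, and ends up modified.
            return MODIFIED
        # The loser's copy was invalidated; it must re-fetch the updated
        # line with ownership before completing its own write.
        return write_line(INVALID)
    raise ValueError(state)

print(write_line(SHARED, won_invalidation_race=False))   # 'M', via re-fetch
```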

Patent
Ching-Farn E. Wu1
22 Nov 1994
TL;DR: A two-level virtual/real cache system, and a method for detecting and resolving synonyms in the two-level virtual/real cache system, are described; a translation-lookaside buffer (TLB) translates virtual to real addresses for accessing the second level real cache, where synonym detection is performed.
Abstract: A two-level virtual/real cache system, and a method for detecting and resolving synonyms in the two-level virtual/real cache system, are described. Lines of a first level virtual cache are tagged with a virtual address and a real pointer which points to a corresponding line in a second level real cache. Lines in the second level real cache are tagged with a real address and a virtual pointer which points to a corresponding line in the first level virtual cache, if one exists. A translation-lookaside buffer (TLB) is used for translating virtual to real addresses for accessing the second level real cache. Synonym detection is performed at the second level real cache. An inclusion bit I is set in a directory of the second level real cache to indicate that a particular line is included in the first level virtual cache. Another bit, called a buffer bit B, is set whenever a line in the first level virtual cache is placed in a first level virtual cache writeback buffer for updating main memory. When a first level cache miss occurs, the TLB generates a corresponding real address for that page and the first level virtual cache selects a line for replacement and also notifies the second level real cache which line it chooses for replacement. The real address is then used to access the second level real cache. Synonym detection and resolution are performed by the second level real cache.

Patent
Hoichi Cheong1, Dwain A. Hicks1, Kimming So1
09 Dec 1994
TL;DR: In this paper, the authors present a balanced cache performance in a data processing system consisting of a first processor, a second processor, an intermediate cache memory, and a control circuit.
Abstract: The present invention provides balanced cache performance in a data processing system. The data processing system includes a first processor, a second processor, a first cache memory, a second cache memory and a control circuit. The first processor is connected to the first cache memory, which serves as a first level cache for the first processor. The second processor and the first cache memory are connected to the second cache memory, which serves as a second level cache for the first processor and as a first level cache for the second processor. Replacement of a set in the second cache memory results in the set being invalidated in the first cache memory. The control circuit is connected to the second level cache and prevents replacing, from a second level cache congruence class, all sets that are in the first cache.

Patent
22 Feb 1994
TL;DR: In this article, a secondary cache memory system is described for use in a portable computer that increases system performance while also conserving battery life, which includes a cache controller for controlling the transfer to and from a cache memory, comprised of fast SRAM circuits.
Abstract: A secondary cache memory system is disclosed for use in a portable computer that increases system performance while also conserving battery life. The secondary cache includes a cache controller for controlling transfers to and from a cache memory, comprised of fast SRAM circuits. The cache controller includes a control and status register with at least three status bits to control power to the cache, and to ensure that the data stored in the cache memory is coherent with system memory. A control and power management logic checks the contents of the control and status register, and monitors the activity level of the processor. When the processor is determined to be inactive, the control and power management logic turns off the cache by changing the state of a bit in the control and status register. Before doing so, however, the control and power management logic checks the status of a second bit in the control register to determine if some or all of the contents of the cache need to be flushed to system memory. During power up, the control and power management logic checks another status bit in the control register to determine if the contents of the cache are invalid, and if so, clears the cache.
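
The register-driven power management described above can be sketched in a few lines. The bit names, callbacks and control flow are assumptions for illustration, not the patent's actual register layout.

```python
from dataclasses import dataclass

@dataclass
class CacheControlStatus:
    """Three status bits standing in for the patent's control/status register
    (names invented): power_on = SRAM cache currently powered, needs_flush =
    cache holds data not yet written to system memory, contents_bad = cache
    contents must be treated as invalid at power-up."""
    power_on: bool = True
    needs_flush: bool = False
    contents_bad: bool = False

def on_processor_idle(csr, flush_to_memory, power_off):
    """Power-management step when the processor is judged inactive."""
    if csr.needs_flush:
        flush_to_memory()          # keep the secondary cache coherent first
        csr.needs_flush = False
    power_off()
    csr.power_on = False
    csr.contents_bad = True        # SRAM loses state without power

def on_power_up(csr, power_on, clear_cache):
    power_on()
    csr.power_on = True
    if csr.contents_bad:
        clear_cache()              # invalid contents: start empty
        csr.contents_bad = False
```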

Patent
04 May 1994
TL;DR: In this article, the third level cache system is organized as a writethrough cache, where the shared or exclusive status of any cached data is also stored, and the data is provided directly from the third-level cache without requiring an access to main memory, reducing the use of the host bus.
Abstract: A computer system which utilizes processor boards including a first level cache system integrated with the microprocessor, a second level external cache system and a third level external cache system. The second level cache system is a conventional, high speed, SRAM-based, writeback cache system. The third level cache system is a large, writethrough cache system developed using conventional DRAMs as used in the main memory subsystem of the computer system. The three cache systems are arranged between the CPU and the host bus in a serial fashion. Because of the large size of the third level cache, a high hit rate is developed so that operations are not executed on the host bus but are completed locally on the processor board, reducing the use of the host bus by an individual processor board. This allows additional processor boards to be installed in the computer system without saturating the host bus. The third level cache system is organized as a writethrough cache. However, the shared or exclusive status of any cached data is also stored. If the second level cache performs a write allocate cycle and the data is exclusive in the third level cache, the data is provided directly from the third level cache, without requiring an access to main memory, reducing the use of the host bus.

Patent
Paul Borrill1
09 Mar 1994
TL;DR: In this paper, an improved multiprocessor computer system with an improved snarfing cache is disclosed, which includes a main memory, I/O interface, and a plurality of processor nodes.
Abstract: An improved multiprocessor computer system with an improved snarfing cache is disclosed. The multiprocessor system includes a main memory, I/O interface, and a plurality of processor nodes. Each processor node includes a CPU, and a cache. A shared interconnect couples the main memory, I/O interface, and the plurality of processor nodes. The snarfing cache of each processor node snarfs valid data that appears on the shared interconnect, regardless of whether the cache of the processor node has an invalid copy or no copy of the data. The net effect is that each processor node locally caches additional valid data, resulting in an expected improved cache hit rate, reduced processor latency, and fewer transactions on the shared interconnect.
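
The snarfing behavior reduces to a single rule, sketched below with a plain dict standing in for the tag/data arrays; the function name and arguments are invented for illustration.

```python
def snoop_bus_transaction(cache, line_address, data, line_is_valid_on_bus=True):
    """Snarfing sketch: whenever valid data for a line appears on the shared
    interconnect, the snooping cache captures it even if it holds no copy or
    only an invalid copy, so a later local read hits."""
    if line_is_valid_on_bus:
        cache[line_address] = (data, 'valid')     # take a local, valid copy

bus_cache = {}
snoop_bus_transaction(bus_cache, 0x80, b'payload')
print(bus_cache[0x80])    # (b'payload', 'valid') -- future local reads hit
```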

Proceedings ArticleDOI
28 Sep 1994
TL;DR: This work proposes a new client-side data caching scheme for relational databases with a central server and multiple clients, and examines various performance and optimization issues involved in addressing the questions of cache currency and completeness using predicate descriptions.
Abstract: We propose a new client-side data caching scheme for relational databases with a central server and multiple clients. Data is loaded into a client cache based on queries, which are used to form predicates describing the cache contents. A subsequent query at the client may be satisfied in its local cache if we can determine that the query result is entirely contained in the cache. This issue is called 'cache completeness'. On the other hand, 'cache currency' deals with the effect of updates at the central database on the client caches. We examine various performance and optimization issues involved in addressing the questions of cache currency and completeness using predicate descriptions. Expected benefits of our approach over commonly used object ID-based caching include lower query response times, reduced message traffic, higher server throughput, and better scalability.
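
For intuition, here is the cache-completeness test specialized to one-attribute range predicates: a new query can be answered locally only if its predicate is covered by the union of the cached predicates. The paper's predicate descriptions are richer; intervals are a toy stand-in.

```python
def covered(query_range, cached_ranges):
    """query_range and cached_ranges are half-open (lo, hi) intervals over a
    single attribute.  Returns True if the cached ranges jointly cover the
    query, i.e. the query result is guaranteed to be in the client cache."""
    lo, hi = query_range
    for clo, chi in sorted(cached_ranges):
        if clo > lo:
            return False          # gap before the next cached interval
        lo = max(lo, chi)
        if lo >= hi:
            return True
    return lo >= hi

cached = [(0, 50), (40, 80)]
print(covered((10, 70), cached))   # True  -> answer from the local cache
print(covered((10, 90), cached))   # False -> must contact the server
```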

Patent
07 Dec 1994
TL;DR: In this paper, a master-slave cache system has a large master cache and smaller slave caches, including a slave data cache for supplying operands to an execution pipeline of a processor.
Abstract: A master-slave cache system has a large master cache and smaller slave caches, including a slave data cache for supplying operands to an execution pipeline of a processor. The master cache performs all cache coherency operations, freeing the slaves to supply the processor's pipelines at their maximum bandwidth. A store queue is shared between the master cache and the slave data cache. Store data from the processor's execute pipeline is written from the store queue directly into both the master cache and the slave data cache, eliminating the need for the slave data cache to write data back to the master cache. Additionally, fill data from the master cache to the slave data cache is first written to the store queue. This fill data is available for use while in the store queue because the store queue acts as an extension to the slave data cache. Cache operations, diagnostic stores and TLB entries are also loaded into the store queue. A new store or line fill can be merged into an existing store queue entry. Each entry has valid bits for the master cache, the slave data cache, and the slave's tag. Separate byte enables are provided for the master and slave caches, but a single physical address field in each store queue entry is used.

Patent
04 Aug 1994
TL;DR: In this paper, a cache controller searches for the data word in the first level cache in response to a processor attempt to access a data word, and a new data line is fetched from the main memory, replacing the second level victim cache line.
Abstract: A computing system includes a processor, a main memory, a first level cache and a second level cache. The second level cache contains data lines. The first level cache contains data line fragments of data lines within the second level cache. In response to a processor attempt to access a data word, a cache controller searches for the data word in the first level cache. When a first level cache miss results from the attempted access, a search is made for the data word in the second level cache. When a second level cache miss results, a new data line, which contains the data word, is fetched from the main memory. Concurrently, the cache controller determines which entries of the first level cache are invalid. Once the new data line is fetched from the main memory, the new data line is placed in the second level cache, replacing the second level victim cache line. In addition, as many data line fragments as possible from the new data line are placed into invalid entries in the first level cache. One of the data line fragments from the new data line placed into the first level cache includes the data word.

Proceedings ArticleDOI
Lishing Liu1
30 Nov 1994
TL;DR: This paper investigates the possibility of accurately approximating the results of conventional directory search with faster matches of a few partial address bits to optimize cache access timing, particularly in a customized design environment.
Abstract: One critical aspect in designing a set-associative cache at a high clock rate is deriving timely results from directory lookup. In this paper we investigate the possibility of accurately approximating the results of conventional directory search with faster matches of a few partial address bits. Such fast and accurate approximations may be utilized to optimize cache access timing, particularly in a customized design environment. Through analytic and simulation studies we examine the trade-offs of various design choices. We also discuss a few other applications of partial address matching to computer designs.
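
A small sketch of the idea, assuming a fixed number of low-order tag bits for the fast compare; the bit count and tag values are illustrative, not the paper's design points.

```python
def partial_match_lookup(set_tags, lookup_tag, partial_bits=6):
    """Approximate directory search: compare only the low-order `partial_bits`
    of each way's tag to select a candidate way early, then verify with the
    full tag (conceptually later in the access)."""
    mask = (1 << partial_bits) - 1
    candidates = [w for w, t in enumerate(set_tags) if (t & mask) == (lookup_tag & mask)]
    if not candidates:
        return None, False                   # partial mismatch already proves a miss
    way = candidates[0]
    return way, set_tags[way] == lookup_tag  # fast prediction + full confirmation

tags = [0x12345, 0x0F0F0, 0xABCDE, 0x00FF0]
print(partial_match_lookup(tags, 0xABCDE))   # (2, True): way found and confirmed
print(partial_match_lookup(tags, 0xAAAAA))   # (None, False): no partial match
```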

Patent
28 Dec 1994
TL;DR: In this article, a write-back coherency system is proposed to prevent dirty data in the cache from being made dirty while the bus is arbitrated away, and an X%DIRTY latency control function is used to prevent the dirty data from being exported.
Abstract: A write-back coherency system is used, in an exemplary embodiment, to implement write-back caching in an x86 processor installed in a multi-master computer system that does not support a write-back protocol for maintaining coherency between an internal cache and main memory during DMA operations. The write-back coherency system interrupts the normal bus arbitration operation to allow export of dirty data, and includes an X%DIRTY latency-control function. In response to an arbitration-request (such as HOLD), if the internal cache contains dirty data, the processor is inhibited from providing arbitration-acknowledge (such as HLDA) until the dirty data is exported (the cache is dynamically switched to write-through mode to prevent data in the cache from being made dirty while the bus is arbitrated away). While the requesting bus master is accessing memory, bus snooping is performed and invalidation logic invalidates at least those cache locations corresponding to locations in memory that are affected by the requesting bus master. The X%DIRTY function provides write-back latency control by dynamically switching the cache from write-back to write-through mode if a cache write would cause the number of cache locations containing dirty data to exceed a predetermined maximum percentage of the total number of cache locations.
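
The X%DIRTY latency-control idea can be sketched as follows; the threshold value, the bookkeeping and the callback names are assumptions for illustration, not the patent's implementation.

```python
class WriteBackLatencyControl:
    """Run the cache in write-back mode until the fraction of dirty lines
    would exceed a ceiling, then switch to write-through so the dirty-data
    export required before acknowledging a bus-hold request stays bounded."""

    def __init__(self, total_lines, max_dirty_percent=25):
        self.max_dirty = total_lines * max_dirty_percent // 100
        self.dirty_lines = set()
        self.write_through = False

    def on_write(self, line, write_to_memory):
        if self.write_through or (line not in self.dirty_lines
                                  and len(self.dirty_lines) >= self.max_dirty):
            self.write_through = True
            write_to_memory(line)            # do not create more dirty data
        else:
            self.dirty_lines.add(line)       # normal write-back: mark dirty

    def on_hold_request(self, export_line, acknowledge_hold):
        """Delay the hold acknowledge until all dirty data has been exported."""
        self.write_through = True            # no new dirty lines while bus is away
        for line in sorted(self.dirty_lines):
            export_line(line)
        self.dirty_lines.clear()
        acknowledge_hold()
```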

Patent
09 Feb 1994
TL;DR: In this paper, a cache for improving access to optical media including a primary cache comprising RAM and a secondary cache comprising a portion of hard disk memory is defined, and a discrimination methodology is implemented for determining when data should not be cached.
Abstract: A cache for improving access to optical media includes a primary cache comprising RAM and a secondary cache comprising a portion of hard disk memory. Multiple aspects of the invention are defined: (1) Cache data discrimination: Discrimination methodology is implemented for determining when data should not be cached. Under certain conditions, caching of data is less likely to improve access time (e.g., when the transfer rate already exceeds a critical sustained throughput rate, or when the estimated time to complete a CD-ROM data request is within a specific percentage of the estimated time to complete a hard disk request). (2) Secondary cache fragmentation avoidance: To keep the access time to secondary cache faster than the access time to the optical media, fragmentation of the secondary cache (i.e., hard disk) is minimized. To do so, constraints are imposed: (i) an entire CD-ROM request is stored in contiguous sectors on the hard drive; (ii) sequential CD-ROM requests to adjacent sectors of CD-ROM are concatenated on the hard drive; (iii) data redundancy is permitted. (3) Alternative update methodologies: Cache updates are performed in sequence or in parallel to primary and secondary cache, depending upon the embodiment. (4) Data integrity: Integrity of data stored in non-volatile secondary cache is maintained for a substantial portion of secondary cache through power failures, shutdowns and media swaps.
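
The cache-data-discrimination tests can be sketched as a single decision function. The margin figure, parameter names and example numbers are assumptions; the patent only specifies the conditions qualitatively.

```python
def should_cache(cd_transfer_rate, critical_rate,
                 est_cd_time, est_disk_time, margin=0.2):
    """Skip caching when the CD-ROM stream already sustains the critical
    throughput, or when the CD-ROM request is estimated to finish within
    `margin` (e.g. 20%, an assumed figure) of the hard-disk estimate, so
    copying it to the secondary cache would not pay off."""
    if cd_transfer_rate >= critical_rate:
        return False
    if est_cd_time <= est_disk_time * (1.0 + margin):
        return False
    return True

print(should_cache(cd_transfer_rate=300, critical_rate=600,
                   est_cd_time=80, est_disk_time=30))   # True: caching helps
print(should_cache(cd_transfer_rate=300, critical_rate=600,
                   est_cd_time=33, est_disk_time=30))   # False: within 20% of disk
```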