
Showing papers on "Cache pollution published in 1995"


18 Jul 1995
TL;DR: This work assesses the potential of proxy servers to cache documents retrieved with the HTTP protocol, and finds that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users.
Abstract: As the number of World-Wide Web users grows, so does the number of connections made to servers. This increases both network load and server load. Caching can reduce both loads by migrating copies of server files closer to the clients that use those files. Caching can either be done at a client or in the network (by a proxy server or gateway). We assess the potential of proxy servers to cache documents retrieved with the HTTP protocol. We monitored traffic corresponding to three types of educational workloads over a one semester period, and used this as input to a cache simulation. Our main findings are (1) that with our workloads a proxy has a 30-50% maximum possible hit rate no matter how it is designed; (2) that when the cache is full and a document is replaced, least recently used (LRU) is a poor policy, but simple variations can dramatically improve hit rate and reduce cache size; (3) that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users; and (4) that certain tuning configuration parameters for a cache may have little benefit.
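A minimal sketch of the kind of trace-driven simulation the paper describes, assuming a byte-capacity proxy cache with plain LRU replacement; the trace format, capacity, and eviction policy here are illustrative stand-ins, not the authors' simulator or workloads.

```python
from collections import OrderedDict

def simulate_proxy_cache(requests, capacity_bytes):
    """Replay (url, size) requests through an LRU proxy cache and report the hit rate.

    `requests` is an iterable of (url, size_in_bytes) pairs -- an assumed trace
    format, not the one used in the paper.
    """
    cache = OrderedDict()              # url -> size, ordered oldest -> newest
    used = 0
    hits = total = 0
    for url, size in requests:
        total += 1
        if url in cache:
            hits += 1
            cache.move_to_end(url)     # refresh LRU position
            continue
        # Miss: fetch from the origin server, then insert, evicting LRU entries.
        while used + size > capacity_bytes and cache:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        if size <= capacity_bytes:
            cache[url] = size
            used += size
    return hits / total if total else 0.0

# Example: a tiny synthetic trace.
trace = [("a.html", 10_000), ("b.gif", 40_000), ("a.html", 10_000), ("c.html", 5_000)]
print(simulate_proxy_cache(trace, capacity_bytes=50_000))
```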

495 citations


Proceedings ArticleDOI
01 Jun 1995
TL;DR: This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache that eliminates both capacity and self-interference misses and reduces cross-interference misses.
Abstract: When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.
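The abstract does not reproduce the selection algorithm itself. The sketch below shows one simple, conservative way to derive a problem-size dependent tile height for a direct-mapped cache so that a tile's rows do not conflict with one another; the parameter names, the brute-force check, and the example numbers are assumptions of this sketch, illustrating the flavor of cache-size and line-size driven tile selection rather than the paper's exact algorithm.

```python
def pick_tile_height(n_cols, cache_elems, line_elems, tile_width):
    """Pick the largest tile height T such that T rows of `tile_width`
    contiguous elements, with rows spaced `n_cols` elements apart, occupy
    disjoint regions of a direct-mapped cache holding `cache_elems` elements.

    All sizes are in array elements.  This brute-force check is only a
    conservative stand-in for a closed-form tile-size algorithm.
    """
    # Round the tile width up to a whole number of cache lines.
    width = -(-tile_width // line_elems) * line_elems
    best = 1
    for t in range(2, cache_elems // width + 1):
        starts = sorted((i * n_cols) % cache_elems for i in range(t))
        ok = all(starts[i + 1] - starts[i] >= width for i in range(t - 1))
        ok = ok and (cache_elems - starts[-1]) + starts[0] >= width  # wrap-around gap
        if ok:
            best = t
        else:
            break
    return best

# Example: 1024-element direct-mapped cache, 4-element lines, 300-column matrix.
print(pick_tile_height(n_cols=300, cache_elems=1024, line_elems=4, tile_width=32))
```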

434 citations



Proceedings ArticleDOI
01 Dec 1995
TL;DR: The bare minimum amount of local memories that programs require to run without delay is measured by using the Value Reuse Profile, which contains the dynamic value reuse information of a program's execution, and by assuming the existence of efficient memory systems.
Abstract: As processor performance continues to improve, more emphasis must be placed on the performance of the memory system. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. We show that by selectively applying cache line allocation according to the characteristics of individual load instructions, overall performance can be improved for both the data cache and the memory system. This approach can improve some aspects of memory performance by as much as 60 percent on existing executables.
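A toy illustration of the selective-allocation idea: a direct-mapped data cache model that can be told, per load PC, whether a miss should allocate a line. The trace format, cache geometry, and the way the no-allocate set is chosen are assumptions of this sketch, not the paper's profiling machinery.

```python
def simulate(accesses, num_sets, line_bytes, no_allocate_pcs=frozenset()):
    """Direct-mapped data-cache model where misses from selected load PCs
    bypass allocation (a stand-in for a per-instruction allocation policy).

    `accesses` is an iterable of (pc, address) pairs -- an assumed trace format.
    Returns the overall miss rate.
    """
    tags = [None] * num_sets
    misses = total = 0
    for pc, addr in accesses:
        total += 1
        block = addr // line_bytes
        index = block % num_sets
        if tags[index] == block:
            continue                      # hit
        misses += 1
        if pc not in no_allocate_pcs:     # selective allocation decision
            tags[index] = block
    return misses / total if total else 0.0

# A streaming load (pc=0x40) that would otherwise evict reused data (pc=0x80).
trace = [(0x80, 0), (0x40, 4096), (0x80, 0)] * 100
print(simulate(trace, num_sets=64, line_bytes=64))                          # thrashing
print(simulate(trace, num_sets=64, line_bytes=64, no_allocate_pcs={0x40}))  # protected
```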

221 citations


Patent
23 May 1995
TL;DR: In this paper, an apparatus and method are presented that enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications. A cache-flushing parameter is transferred from the host computer to a controller which has a cache memory, and a quantity of write request data is written from the cache memory to a storage medium in accordance with the cache-flushing parameter.
Abstract: An apparatus and method are disclosed which enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications. The method includes the step of generating a cache-flushing parameter in the host computer. The cache-flushing parameter is then transferred from the host computer to a controller which has a cache memory. Thereafter, a quantity of write request data is written from the cache memory to a storage medium in accordance with the cache-flushing parameter.

194 citations


Journal ArticleDOI
01 Nov 1995
TL;DR: A method to maintain predictability of execution time within preemptive, cached real-time systems is introduced and the impact on compilation support for such a system is discussed.
Abstract: Cache memories have become an essential part of modern processors to bridge the increasing gap between fast processors and slower main memory. Until recently, cache memories were thought to impose unpredictable execution time behavior for hard real-time systems. But recent results show that the speedup of caches can be exploited without a significant sacrifice of predictability. These results were obtained under the assumption that real-time tasks be scheduled non-preemptively. This paper introduces a method to maintain predictability of execution time within preemptive, cached real-time systems and discusses the impact on compilation support for such a system. Preemptive systems with caches are made predictable via software-based cache partitioning. With this approach, the cache is divided into distinct portions associated with a real-time task, such that a task may only use its portion. The compiler has to support instruction and data partitioning for each task. Instruction partitioning involves non-linear control-flow transformations, while data partitioning involves code transformations of data references. The impact on execution time of these transformations is also discussed.
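A minimal sketch of the address-level idea behind software-based cache partitioning: each task is assigned a disjoint range of sets in a direct-mapped cache, and a partitioning compiler/linker would place that task's code and data only at addresses that map into its assigned sets. The function names, share sizes, and cache geometry are assumptions of this sketch; the paper's actual compiler transformations are not modeled.

```python
def partition_sets(cache_bytes, line_bytes, shares):
    """Split the sets of a direct-mapped cache among tasks.

    `shares` maps task name -> number of cache sets it may use; the sum must
    not exceed the total set count.  Returns task -> range of set indices.
    """
    total_sets = cache_bytes // line_bytes
    assert sum(shares.values()) <= total_sets
    parts, next_set = {}, 0
    for task, n in shares.items():
        parts[task] = range(next_set, next_set + n)
        next_set += n
    return parts

def maps_to_partition(addr, line_bytes, total_sets, allowed_sets):
    """True if `addr` falls in a cache set the task is allowed to occupy."""
    return (addr // line_bytes) % total_sets in allowed_sets

parts = partition_sets(cache_bytes=8192, line_bytes=32, shares={"task_a": 128, "task_b": 64})
print(parts)
print(maps_to_partition(0x0000, 32, 8192 // 32, parts["task_a"]))  # True: maps to set 0
print(maps_to_partition(0x1000, 32, 8192 // 32, parts["task_b"]))  # True: maps to set 128
```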

139 citations


Proceedings ArticleDOI
Jaishankar Moothedath Menon1
02 Aug 1995
TL;DR: This paper compares the performance of the well-known RAID-5 arrays to that of log-structured arrays (LSA), on transaction-processing workloads and looks at sensitivity of LSA performance to amount of free space on the physical disks and to the compression ratio achieved.
Abstract: In this paper, we compare the performance of the well-known RAID-5 arrays to that of log-structured arrays (LSA), on transaction-processing workloads. LSA borrows heavily from the log-structured file system (LFS) approach, but is executed in an outboard disk controller. The LSA technique we examine combines LFS, RAID, compression and non-volatile cache. We look at sensitivity of LSA performance to amount of free space on the physical disks and to the compression ratio achieved. We also evaluate a RAID-5 design that supports compression in cache.

120 citations


Patent
03 Jan 1995
TL;DR: In this article, a data processing system having flexibility coping with parallelism of a program comprises a plurality of processor elements for executing instructions, a main memory shared by the plurality of processors, and parallel operation control facilities for enabling the plurality processors to operate in synchronism.
Abstract: A data processing system having flexibility coping with parallelism of a program comprises a plurality of processor elements for executing instructions, a main memory shared by the plurality of processor elements, and a plurality of parallel operation control facilities for enabling the plurality of processor elements to operate in synchronism. The plurality of parallel operation control facilities are provided in correspondence to the plurality of processor elements, respectively. The data processing system further comprises a multiprocessor operation control facility for enabling the plurality of processor elements to operate independently, and a flag for holding a value indicating which of the parallel operation mode or the multiprocessor mode is to be activated. The shared cache memory is implemented in a blank instruction and controlled by a cache controller so that inconsistency of the data stored in the cache memory is eliminated.

116 citations


Patent
31 Aug 1995
TL;DR: In this article, a data cache configured to perform store accesses in a single clock cycle is provided, where the data cache speculatively stores data within a predicted way of the cache after capturing the data currently being stored in that predicted way.
Abstract: A data cache configured to perform store accesses in a single clock cycle is provided. The data cache speculatively stores data within a predicted way of the cache after capturing the data currently being stored in that predicted way. During a subsequent clock cycle, the cache hit information for the store access validates the way prediction. If the way prediction is correct, then the store is complete. If the way prediction is incorrect, then the captured data is restored to the predicted way. If the store access hits in an unpredicted way, the store data is transferred into the correct storage location within the data cache concurrently with the restoration of data in the predicted storage location. Each store for which the way prediction is correct utilizes a single clock cycle of data cache bandwidth. Additionally, the way prediction structure implemented within the data cache bypasses the tag comparisons of the data cache to select data bytes for the output. Therefore, the access time of the associative data cache may be substantially similar to a direct-mapped cache access time. The present data cache is therefore suitable for high frequency superscalar microprocessors.

114 citations


Patent
13 Nov 1995
TL;DR: In this paper, multiple misses within the same cache line are merged or folded at reload time: when a reload cache line is received, a matching store queue entry's data is merged with the cache line prior to storage in the cache (71), and other matching entries become active and are allowed to reaccess the cache (71).
Abstract: A data processor (40) keeps track of misses to a cache (71) so that multiple misses within the same cache line can be merged or folded at reload time. A load/store unit (60) includes a completed store queue (61) for presenting store requests to the cache (71) in order. If a store request misses in the cache (71), the completed store queue (61) requests the cache line from a lower-level memory system (90) and thereafter inactivates the store request. When a reload cache line is received, the completed store queue (61) compares the reload address to all entries. If at least one address matches the reload address, one entry's data is merged with the cache line prior to storage in the cache (71). Other matching entries become active and are allowed to reaccess the cache (71). A miss queue (80) coupled between the load/store unit (60) and the lower-level memory system (90) implements reload folding to improve efficiency.
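A sketch of the reload-folding idea described above: when a reloaded line arrives, one pending store to that line is merged into the line data before it is written to the cache, and the remaining matching stores are reactivated so they can reaccess the cache. The data structures, line size, and field names are illustrative assumptions, not the patent's hardware.

```python
from dataclasses import dataclass

LINE_BYTES = 64

@dataclass
class PendingStore:
    addr: int
    data: bytes
    active: bool = False      # inactive while waiting for the reload

def fold_reload(line_addr, line_data, store_queue):
    """Merge one matching pending store into the reloaded line and
    re-activate any other stores to the same line."""
    line = bytearray(line_data)
    merged_one = False
    for st in store_queue:
        if st.addr // LINE_BYTES != line_addr // LINE_BYTES or st.active:
            continue
        if not merged_one:
            off = st.addr % LINE_BYTES
            line[off:off + len(st.data)] = st.data   # merge into reload data
            merged_one = True
        else:
            st.active = True   # will re-access the cache after the reload
    return bytes(line)

queue = [PendingStore(0x1008, b"\xAA\xBB"), PendingStore(0x1010, b"\xCC")]
new_line = fold_reload(0x1000, bytes(LINE_BYTES), queue)
print(new_line[8:10], queue[1].active)   # b'\xaa\xbb' True
```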

111 citations


Patent
29 Nov 1995
TL;DR: In this article, a second, substantially identical CPU core is placed on a microprocessor die when the die contains a large cache, and the large cache is shared among the two CPU cores.
Abstract: Manufacturing yield is increased and cost lowered when a second, substantially identical CPU core is placed on a microprocessor die when the die contains a large cache. The large cache is shared among the two CPU cores. When one CPU core is defective, the large cache memory may be used by the other CPU core. Thus having two complete CPU cores on the die greatly increases the probability that the large cache can be used, and the manufacturing yield is therefore increased. When both CPU cores are functional, the die may be sold as a dual-processor. However, when no dual-processor chips are to be sold, the die are still manufactured as dual-processor die but packaged only as uni-processor chips. With the higher total yield of the dual-CPU die, the dual-CPU die may be packaged solely as uni-processor chips at lower cost than using uni-processor die. An on-chip ROM for generating test vectors, a floating point unit, and a bus-interface unit are also shared along with the large cache.

Patent
27 Apr 1995
TL;DR: In this paper, a central processing unit (CPU) activity monitor includes a timer and an activity event counter for receiving a plurality of mode signals from the CPU (28), a cache miss signal from a cache memory system (30), and a clock signal from a clock (26).
Abstract: A central processing unit ('CPU') activity monitor and method provides CPU (28) activity information. The CPU activity monitor includes a timer and an activity event counter for receiving a plurality of mode signals from the CPU (28), a cache miss signal from a cache memory system (30), and a clock signal from a clock (26). An activity-to-inactivity value defines when the CPU transitions from an active state to an inactive state. An activity threshold defines when the CPU transitions from an inactive state to an active state.

Patent
13 Oct 1995
TL;DR: In this paper, an adaptive read ahead cache is provided with a real cache and a virtual cache, where the real cache has a data buffer, an address buffer, and a status buffer.
Abstract: An adaptive read ahead cache is provided with a real cache and a virtual cache. The real cache has a data buffer, an address buffer, and a status buffer. The virtual cache contains only an address buffer and a status buffer. Upon receiving an address associated with the consumer's request, the cache stores the address in the virtual cache address buffer if the address is not found in the real cache address buffer and the virtual cache address buffer. Further, the cache fills the real cache data buffer with data responsive to the address from said memory if the address is found only in the virtual cache address buffer. The invention thus loads data into the cache only when sequential accesses are occurring and minimizes the overhead of unnecessarily filling the real cache when the host is accessing data in a random access mode.
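A sketch of the real-cache/virtual-cache decision logic described in the abstract: the virtual cache holds addresses only, and a hit in the virtual cache (but not the real cache) is taken as evidence of sequential access, triggering a fill of the real cache. Class and parameter names, the fixed entry counts, and the LRU bookkeeping are assumptions of this sketch.

```python
from collections import OrderedDict

class AdaptiveReadAheadCache:
    """Real cache (addresses + data) plus virtual cache (addresses only)."""

    def __init__(self, real_entries, virtual_entries, backing_store):
        self.real = OrderedDict()        # addr -> data
        self.virtual = OrderedDict()     # addr -> None (address/status only)
        self.real_entries = real_entries
        self.virtual_entries = virtual_entries
        self.backing = backing_store     # callable: addr -> data

    def read(self, addr):
        if addr in self.real:            # real-cache hit: serve from the data buffer
            self.real.move_to_end(addr)
            return self.real[addr]
        if addr in self.virtual:         # seen recently: treat as sequential access,
            data = self.backing(addr)    # so fill the real cache this time
            self._insert(self.real, addr, data, self.real_entries)
            return data
        # Unknown address: remember it in the virtual cache only; do not fill
        # the real cache (avoids polluting it during random accesses).
        self._insert(self.virtual, addr, None, self.virtual_entries)
        return self.backing(addr)

    @staticmethod
    def _insert(table, addr, value, limit):
        table[addr] = value
        table.move_to_end(addr)
        while len(table) > limit:
            table.popitem(last=False)

cache = AdaptiveReadAheadCache(4, 16, backing_store=lambda a: f"block-{a}")
cache.read(100)          # first touch: tracked in the virtual cache only
print(cache.read(100))   # second touch: now loaded into the real cache
```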

20 Nov 1995
TL;DR: The technique of static cache simulation is shown to address the issue of predicting cache behavior, contrary to the belief that cache memories introduce unpredictability to real-time systems that cannot be efficiently analyzed.
Abstract: This work takes a fresh look at the simulation of cache memories. It introduces the technique of static cache simulation that statically predicts a large portion of cache references. To efficiently utilize this technique, a method to perform efficient on-the-fly analysis of programs in general is developed and proved correct. This method is combined with static cache simulation for a number of applications. The application of fast instruction cache analysis provides a new framework to evaluate instruction cache memories that outperforms even the fastest techniques published. Static cache simulation is shown to address the issue of predicting cache behavior, contrary to the belief that cache memories introduce unpredictability to real-time systems that cannot be efficiently analyzed. Static cache simulation for instruction caches provides a large degree of predictability for real-time systems. In addition, an architectural modification through bit-encoding is introduced that provides fully predictable caching behavior. Even for regular instruction caches without architectural modifications, tight bounds for the execution time of real-time programs can be derived from the information provided by the static cache simulator. Finally, the debugging of real-time applications can be enhanced by displaying the timing information of the debugged program at breakpoints. The timing information is determined by simulating the instruction cache behavior during program execution and can be used, for example, to detect missed deadlines and locate time-consuming code portions. Overall, the technique of static cache simulation provides a novel approach to analyze cache memories and has been shown to be very efficient for numerous applications.

Patent
31 Mar 1995
TL;DR: In this paper, a multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller, each of which includes a master interface having master classes for sending memory transaction requests to the system controller.
Abstract: A multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. At least two of the sub-systems are data processors, each having a respective cache memory that stores multiple blocks of data and a respective master cache index. Each master cache index has a set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. Each data processor includes a master interface having master classes for sending memory transaction requests to the system controller. The system controller includes memory transaction request logic for processing each memory transaction request by a data processor. The system controller maintains a duplicate cache index having a set of duplicate cache tags (Dtags) for each data processor. Each data processor has a writeback buffer for storing the data block previously stored in a victimized cache line until its respective writeback transaction is completed and an Nth+1 Dtag for storing the cache state of a cache line associated with a read transaction which is executed prior to an associated writeback transaction of a read-writeback transaction pair. Accordingly, upon a cache miss, the interconnect may execute the read and writeback transactions in parallel relying on the writeback buffer or Nth+1 Dtag to accommodate any ordering of the transactions.

Proceedings ArticleDOI
01 Dec 1995
TL;DR: This paper presents a latency-hiding compiler technique applicable to general-purpose C programs that 'preloads' data likely to cause a cache miss before it is used, thereby hiding the cache-miss latency.
Abstract: Previous research on hiding memory latencies has tended to focus on regular numerical programs. This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs. By assuming a lock-up free cache and instruction score-boarding, our technique 'preloads' the data that are likely to cause a cache miss before they are used, thereby hiding the cache-miss latency. We have developed simple compiler heuristics to identify load instructions that are likely to cause a cache miss. Experimentation with a set of SPEC92 benchmarks shows that our heuristics are successful in identifying 85% of cache misses. We have also developed an algorithm that flexibly schedules the selected load instructions and the instructions that use the loaded data to hide memory latency. Our simulation suggests that our technique is successful in hiding memory latency and improves overall performance.

Patent
31 Mar 1995
TL;DR: In this article, a multiprocessor computer system has a multiplicity of sub-systems and a main memory coupled to a system controller, and the system controller maintains a set of duplicate cache tags (Dtags) for each data processor.
Abstract: A multiprocessor computer system has a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. All of the sub-systems include a port that transmits and receives data as data packets of a fixed size. At least two of the sub-systems are data processors, each having a respective cache memory and a respective set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. The system controller maintains a set of duplicate cache tags (Dtags) for each of the data processors. The data processors each include master cache logic for updating the master cache tags, while the system controller includes logic for updating the duplicate cache tags. Memory transaction request logic simultaneously looks up the second cache tag in each of the sets of duplicate cache tags corresponding to the memory transaction request. It then determines which one of the cache memories and main memory to couple to the requesting data processor based on the second cache states and the address tags stored in the corresponding second cache tags. Duplicate cache update logic simultaneously updates all of the corresponding second cache tags in accordance with predefined cache tag update criteria.

Patent
14 Jul 1995
TL;DR: In this article, a power conservation mechanism for the cache SRAM memory blocks is proposed. But the power conservation scheme is limited to the case where the CPU is not accessing the cache memory.
Abstract: A synchronous cache memory power conservation apparatus for conserving power of the cache SRAM memory blocks in cached computer systems. The power conservation apparatus is included as a portion of the logic of the cache controller of the computer system. The power conservation apparatus monitors the CPU bus cycles in order to shut off the clocking signals supplied to the cache SRAM memory blocks when the CPU is not accessing the cache memory, thereby reducing the power consumption of the high-power SRAM devices. The power conservation apparatus resumes standard synchronized clocking to the cache SRAM blocks when the CPU is performing a cache-hit memory access cycle for maximum cache access performance.

Patent
03 Nov 1995
TL;DR: In this paper, a doubly linked list is used to track the most recently used channels and the corresponding entry is moved to the top of the list as cached channel information is accessed, and the bottom pointer points to the channel data to be removed from the cache.
Abstract: An on-chip cache memory is used to provide a high speed access mechanism to frequently used channel state information for operation of a DMA device that supports multiple virtual channels in a high speed network interface. When an access to a particular channel state is performed, e.g., by a host processor or the DMA device, the cache is first accessed and if the state information is not located currently in the cache, external memory is read and the state information is written to the cache. As the cache does not store all the states stored in external memory, replacement algorithms are utilized to determine which channel state information to remove from the cache in order to provide room to store a recently accessed channel. A doubly linked list is used to track the most recently used channel. As cached channel information is accessed, the corresponding entry is moved to the top of the list. The doubly linked list provides a rapid apparatus and method for updating pointers to the cache. Top and bottom pointers are maintained, pointing to the most recently used and least recently used channels. When a channel is used, it is moved to the top of the list. When channel data is moved from external memory to the cache, the bottom pointer points to the channel data to be removed from the cache.
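A sketch of the doubly linked LRU list described in the abstract, with explicit top (most recently used) and bottom (least recently used) pointers; the capacity, the external-memory dictionary, and the state contents are placeholders of this sketch, not the patent's design.

```python
class Node:
    __slots__ = ("channel", "state", "prev", "next")
    def __init__(self, channel, state):
        self.channel, self.state = channel, state
        self.prev = self.next = None

class ChannelStateCache:
    """LRU cache of per-channel DMA state on an explicit doubly linked list."""

    def __init__(self, capacity, external_memory):
        self.capacity = capacity
        self.external = external_memory      # dict: channel -> state
        self.nodes = {}                      # channel -> Node
        self.top = self.bottom = None

    def access(self, channel):
        node = self.nodes.get(channel)
        if node is None:                     # cache miss: read external memory
            if len(self.nodes) >= self.capacity:
                self._evict_bottom()
            node = Node(channel, self.external[channel])
            self.nodes[channel] = node
            self._push_top(node)
        elif node is not self.top:           # hit: move the entry to the top
            self._unlink(node)
            self._push_top(node)
        return node.state

    def _push_top(self, node):
        node.prev, node.next = None, self.top
        if self.top:
            self.top.prev = node
        self.top = node
        if self.bottom is None:
            self.bottom = node

    def _unlink(self, node):
        if node.prev: node.prev.next = node.next
        if node.next: node.next.prev = node.prev
        if self.top is node: self.top = node.next
        if self.bottom is node: self.bottom = node.prev

    def _evict_bottom(self):                 # the bottom pointer names the victim
        victim = self.bottom
        self._unlink(victim)
        self.external[victim.channel] = victim.state   # write state back
        del self.nodes[victim.channel]

mem = {ch: {"seq": 0} for ch in range(8)}
cache = ChannelStateCache(capacity=4, external_memory=mem)
for ch in [0, 1, 2, 3, 0, 4]:                # channel 1 becomes LRU and is evicted
    cache.access(ch)
print(sorted(cache.nodes))                    # [0, 2, 3, 4]
```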

Patent
18 Dec 1995
TL;DR: In this article, an x86 microprocessor system with a process identification system which stores a number assigned to each process run by the microprocessor systems and associates this number with instructions, data, and information fetched and stored in a cache or translation lookaside buffer (TLB) during the execution of the process is described.
Abstract: An x86 microprocessor system with a process identification system which stores a number assigned to each process run by the microprocessor system and associates this number with instructions, data, and information fetched and stored in a cache or translation lookaside buffer (TLB) during the execution of the process. Upon a process or context switch, the instructions, data, and information are not automatically flushed from the cache and TLB. The instructions, data, and information are replaced only when instructions, data, and information for a new process require the same cache memory locations or the same TLB memory location. The cache and TLB may include a valid bit block and a tag block that includes memory locations for storing the pertinent process identification number for each entry. The cache, which may be a set associative cache, and TLB include logic for comparing a process identification number stored in a process identification register with the process identification number stored in the tag block.
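A toy illustration of the PID-tagging idea: TLB entries carry a process identification number, so a context switch needs no flush and an entry is replaced only when another process needs the same slot. The direct-mapped organization, sizes, and field names are assumptions of this sketch, not the patent's structures.

```python
class PidTaggedTlb:
    """Direct-mapped TLB whose entries carry a process-ID tag, so entries
    survive context switches and are replaced only on conflict."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries    # each slot: (valid, pid, vpn, pfn)

    def lookup(self, pid, vpn):
        slot = vpn % self.entries
        entry = self.table[slot]
        if entry and entry[0] and entry[1] == pid and entry[2] == vpn:
            return entry[3]              # hit: PID and virtual page both match
        return None                      # miss: would fall back to a page-table walk

    def fill(self, pid, vpn, pfn):
        self.table[vpn % self.entries] = (True, pid, vpn, pfn)

tlb = PidTaggedTlb()
tlb.fill(pid=7, vpn=0x10, pfn=0x200)
# Context switch to pid=9 and back: no flush needed, pid=7's entry still hits.
print(tlb.lookup(pid=9, vpn=0x10))   # None (different process, same virtual page)
print(tlb.lookup(pid=7, vpn=0x10))   # 0x200
```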

Proceedings ArticleDOI
04 Jan 1995
TL;DR: Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs which tend to be accessed in a consecutive sequence and can achieve an order of magnitude energy reduction on caches.
Abstract: Caches usually consume a significant amount of energy in modern microprocessors (e.g., superpipelined or superscalar processors). In this paper, we examine contemporary cache design techniques and provide an analytical model for estimating cache energy consumption. We also present several novel techniques for designing energy-efficient caches, which include block buffering, cache sub-banking, and Gray code addressing. Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs, which tend to be accessed in a consecutive sequence. Cache sub-banking is ideal for both instruction and data caches. Overall, these techniques can achieve an order of magnitude energy reduction on caches.
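Sequential addresses differ in exactly one bit under Gray code, which is why the paper pairs Gray code addressing with sequentially accessed instruction caches. The sketch below uses the standard binary-to-Gray conversion and counts address-bus bit transitions for a sequential fetch stream as a rough switching-energy proxy; the proxy itself is an assumption of this sketch, not the paper's analytical energy model.

```python
def to_gray(n: int) -> int:
    """Standard binary-to-Gray-code conversion: adjacent integers map to
    codewords that differ in exactly one bit."""
    return n ^ (n >> 1)

def bus_transitions(values):
    """Count total bit flips between successive values driven on a bus,
    a rough proxy for address-bus switching energy."""
    flips, prev = 0, values[0]
    for v in values[1:]:
        flips += bin(prev ^ v).count("1")
        prev = v
    return flips

addrs = list(range(256))                               # a sequential fetch stream
print(bus_transitions(addrs))                          # binary addressing: 502 transitions
print(bus_transitions([to_gray(a) for a in addrs]))    # Gray addressing: 255 transitions
```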

Patent
07 Jul 1995
TL;DR: In this paper, the PCI-bus controller receives a request from a PCIbus master to transfer data with an address in secondary memory, and the controller performs an initial inquire cycle and withholds TRDY# to the master until any write-back cycle completes.
Abstract: When a PCI-bus controller receives a request from a PCI-bus master to transfer data with an address in secondary memory, the controller performs an initial inquire cycle and withholds TRDY# to the PCI-bus master until any write-back cycle completes. The controller then allows the burst access to take place between secondary memory and the PCI-bus master, and simultaneously and predictively, performs an inquire cycle of the L1 cache for the next cache line. In this manner, if the PCI burst continues past the cache line boundary, the new inquire cycle will already have taken place, or will already be in progress, thereby allowing the burst to proceed with, at most, a short delay. Predictive snoop cycles are not performed if the first transfer of a PCI-bus master access would be the last transfer before a cache line boundary is reached.

Patent
Akio Shigeeda1
15 Mar 1995
TL;DR: In this paper, an electronic device for use in a computer system, and having a small second-level write-back cache, is disclosed, where the device may be implemented into a single integrated circuit, as a microprocessor unit, to include a micro processor core, a memory controller circuit, and first and second level caches.
Abstract: An electronic device for use in a computer system, and having a small second-level write-back cache, is disclosed. The device may be implemented into a single integrated circuit, as a microprocessor unit, to include a microprocessor core, a memory controller circuit, and first and second level caches. In a system implementation, the device is connected to external dynamic random access memory (DRAM). The first level cache is a write-through cache, while the second level cache is a write-back cache that is much smaller than the first level cache. In operation, a write access that is a cache hit in the second level cache writes to the second level cache, rather than to DRAM, thus saving a wait state. A dirty bit is set for each modified entry in the second level cache. Upon the second level cache being full of modified data, a cache flush to DRAM is automatically performed. In addition, each entry of the second level cache is flushed to DRAM upon each of its byte locations being modified. The computer system may also include one or more additional integrated circuit devices, such as a direct memory access (DMA) circuit and a bus bridge interface circuit for bidirectional communication with the microprocessor unit. The microprocessor unit may also include handshaking control to prohibit configuration register updating when a memory access is in progress or is imminent. The disclosed microprocessor unit also includes circuitry for determining memory bank size and memory address type.

Patent
10 Jul 1995
TL;DR: In this paper, a frame buffer memory device controller that schedules and dispatches cache control operations to reduce timing overheads caused by cache prefetch operations, and operations to write back dirty cache lines and clear cache lines in the Frame buffer memory devices.
Abstract: A frame buffer memory device controller that schedules and dispatches operations to frame buffer memory devices is disclosed. The frame buffer memory device controller schedules and dispatches cache control operations to reduce timing overheads caused by cache prefetch operations, and operations to write back dirty cache lines and clear cache lines in the frame buffer memory devices. The frame buffer memory device controller also schedules and dispatches control operations to reduce timing overheads caused by video refresh operations from the frame buffer memory devices' video output ports.

Patent
31 Aug 1995
TL;DR: In this article, a superscalar microprocessor employing a way prediction structure is provided, which predicts a way of an associative cache in which an access will hit, and causes the data bytes from the predicted way to be conveyed as the output of the cache.
Abstract: A superscalar microprocessor employing a way prediction structure is provided. The way prediction structure predicts a way of an associative cache in which an access will hit, and causes the data bytes from the predicted way to be conveyed as the output of the cache. The typical tag comparisons to the request address are bypassed for data byte selection, causing the access time of the associative cache to be substantially the access time of the direct-mapped way prediction array within the way prediction structure. Also included in the way prediction structure is a way prediction control unit configured to update the way prediction array when an incorrect way prediction is detected. The clock cycle of the superscalar microprocessor including the way prediction structure with its caches may be increased if the cache access time is limiting the clock cycle. Additionally, the associative cache may be retained in the high frequency superscalar microprocessor (which might otherwise employ a direct-mapped cache for access time reasons). Single clock cycle cache access to an associative data cache is maintained for high frequency operation.
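A sketch of the way-prediction flow described in the abstract: data from the predicted way is forwarded immediately without waiting for the tag compare, and a misprediction falls back to the other ways and retrains the predictor. The 2-way geometry, the per-set predictor, and the fill interface are illustrative assumptions of this sketch.

```python
class WayPredictedCache:
    """2-way set-associative cache with a per-set way predictor."""

    WAYS = 2

    def __init__(self, num_sets, line_bytes):
        self.num_sets, self.line_bytes = num_sets, line_bytes
        self.tags = [[None] * self.WAYS for _ in range(num_sets)]
        self.data = [[None] * self.WAYS for _ in range(num_sets)]
        self.predicted_way = [0] * num_sets

    def read(self, addr):
        block = addr // self.line_bytes
        s, tag = block % self.num_sets, block // self.num_sets
        way = self.predicted_way[s]
        speculative = self.data[s][way]           # forwarded without a tag compare
        if self.tags[s][way] == tag:
            return speculative, "fast hit"        # prediction validated afterwards
        for w in range(self.WAYS):                # slow path: check the other ways
            if self.tags[s][w] == tag:
                self.predicted_way[s] = w         # train the predictor
                return self.data[s][w], "slow hit (mispredicted way)"
        return None, "miss"

    def fill(self, addr, value, way):
        block = addr // self.line_bytes
        s, tag = block % self.num_sets, block // self.num_sets
        self.tags[s][way], self.data[s][way] = tag, value
        self.predicted_way[s] = way

cache = WayPredictedCache(num_sets=64, line_bytes=32)
cache.fill(0x0000, "A", way=0)
cache.fill(0x0800, "B", way=1)       # same set as 0x0000, other way
print(cache.read(0x0800))            # ('B', 'fast hit') -- predictor points at way 1
print(cache.read(0x0000))            # ('A', 'slow hit (mispredicted way)')
print(cache.read(0x0000))            # ('A', 'fast hit') -- predictor retrained
```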

Proceedings ArticleDOI
01 May 1995
TL;DR: This paper presents the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores, and responds well to software support, in many cases providing better program speedups and reducing cache bandwidth requirements.
Abstract: For many programs, especially integer codes, untolerated load instruction latencies account for a significant portion of total execution time. In this paper, we present the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores.Our approach works by predicting early in the pipeline (part of) the effective address of a memory access and using this predicted address to speculatively access the data cache. If the prediction is correct, the cache access is overlapped with non-speculative effective address calculation. Otherwise, the cache is accessed again in the following cycle, this time using the correct effective address. The impact on the cache access critical path is minimal; the prediction circuitry adds only a single OR operation before cache access can commence. In addition, verification of the predicted effective address is completely decoupled from the cache access critical path.Analyses of program reference behavior and subsequent performance analysis of this approach shows that this design is a good one, servicing enough accesses early enough to result in speedups for all the programs we tested. Our approach also responds well to software support, which can significantly reduce the number of mispredicted effective addresses, in many cases providing better program speedups and reducing cache bandwidth requirements.
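The abstract notes that the prediction circuitry adds only a single OR operation before cache access. The sketch below illustrates the carry-free intuition behind an OR-based address prediction: base | offset equals base + offset whenever the addition generates no carries between the bits the two operands occupy. The trace and the hit-rate measure are assumptions of this sketch; the paper's mechanism (and its software support) is more involved.

```python
def predicted_address(base: int, offset: int) -> int:
    """OR the offset into the base register: equal to base + offset exactly
    when the addition would generate no carries."""
    return base | offset

def prediction_hit_rate(refs):
    """Fraction of (base, offset) pairs for which the OR-prediction matches
    the true effective address (a toy model of the mechanism's coverage)."""
    hits = sum(1 for b, off in refs if predicted_address(b, off) == b + off)
    return hits / len(refs)

# Stack-frame style references: 64-byte-aligned bases plus small positive offsets.
refs = [(0x7FFF_FF00 + 64 * i, off) for i in range(100) for off in (0, 8, 16, 24)]
print(prediction_hit_rate(refs))    # 1.0 -- aligned bases never carry into the offset bits
```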

Patent
Michael Kagan1, David Perlmutter1
10 May 1995
TL;DR: In this article, a multiprocessor computer system which maintains cache coherency includes first and second microprocessors each having an associated cache memory storing lines of data, each line of data has associated protocol bits that indicate a protocol state consistent with write-through, write-back, or write-once cache cohemrency policies that are selected via a protocol selection terminal for different system configurations.
Abstract: A multiprocessor computer system which maintains cache coherency includes first and second microprocessors each having an associated cache memory storing lines of data. Each line of data has associated protocol bits that indicate a protocol state consistent with write-through, write-back, or write-once cache coherency policies that are selected via a protocol selection terminal for different system configurations. In one configuration, the output and external address terminals of the first microprocessor are coupled to the external and output address terminals, respectively, of the second microprocessor. This configuration enables each microprocessor to snoop memory cycles to main memory initiated by the other microprocessor so that it can be readily determined if a particular cache has the latest version of data.


Proceedings ArticleDOI
25 Apr 1995
TL;DR: This paper reports on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills access local memory in distributed shared memory multiprocessors, and finds no performance advantage in more sophisticated policies, including page migration and page replication.
Abstract: The cost of a cache miss depends heavily on the location of the main memory that backs the missing line. For certain applications, this cost is a major factor in overall performance. We report on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills access local memory in distributed shared memory multiprocessors. Even with the very simple policy of first-use placement, we find significant improvements over round-robin placement for many applications on both hardware- and software-coherent systems. For most of our applications, first-use placement allows 35 to 75 percent of cache fills to be performed locally, resulting in performance improvements of up to 40 percent with respect to round-robin placement. We were surprised to find no performance advantage in more sophisticated policies, including page migration and page replication. In fact, in many cases the performance of our applications suffered under these policies.
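A toy comparison of the two placement policies named in the abstract, counting the fraction of page accesses served by the accessing node's local memory. The trace, node count, and access pattern are assumptions of this sketch, not the paper's applications or coherence systems.

```python
def local_fill_fraction(accesses, num_nodes, policy):
    """Fraction of page accesses served by the accessing node's local memory.

    `accesses` is a list of (node, page) pairs; `policy` is "round_robin" or
    "first_use".
    """
    home = {}          # page -> node whose memory holds it
    next_rr = 0
    local = 0
    for node, page in accesses:
        if page not in home:
            if policy == "round_robin":
                home[page] = next_rr
                next_rr = (next_rr + 1) % num_nodes
            else:                         # first_use: place at the first toucher
                home[page] = node
        local += home[page] == node
    return local / len(accesses)

# Each of 4 nodes mostly touches its own 100 pages.
trace = [(n, n * 100 + (i % 100)) for n in range(4) for i in range(1000)]
print(local_fill_fraction(trace, 4, "round_robin"))  # 0.25
print(local_fill_fraction(trace, 4, "first_use"))    # 1.0
```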

Proceedings ArticleDOI
22 Jan 1995
TL;DR: This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality, and proposes an algorithm to expose these localities and reduce interference.
Abstract: High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference within popular execution paths dominates instruction cache misses. Based on our observations, we propose an algorithm to expose these localities and reduce interference. For a range of cache sizes, associativities, line sizes, and other organizations we show that we reduce total instruction miss rates by 31-86% (up to 2.9 absolute points). Using a simple model this corresponds to execution time reductions in the order of 12-26%. In addition, our optimized operating system combines well with optimized or unoptimized applications.

Patent
31 Mar 1995
TL;DR: In this article, a multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller, where each data processor includes a master interface for sending memory transaction requests to the system controller and for receiving cache access requests from the system controllers corresponding to memory transaction request by other ones of the data processors.
Abstract: A multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. At least two of the sub-systems are data processors, each having a respective cache memory that stores multiple blocks of data and a respective master cache index. Each master cache index has a set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. Each data processor includes a master interface for sending memory transaction requests to the system controller and for receiving cache access requests from the system controller corresponding to memory transaction requests by other ones of the data processors. In the preferred embodiment, each memory transaction request is classified into one of two distinct master classes: a first transaction class including read memory access requests and a second transaction class including writeback memory access requests. The master interface and system controller have corresponding parallel request queues, one for each master class, for transmitting and receiving memory access requests. The system controller further includes memory transaction request logic for processing each memory transaction request and a duplicate cache index having a set of duplicate cache tags (Dtags), including one cache tag corresponding to each master cache tag in an associated data processor.