
Showing papers on "Cache pollution" published in 1999


Proceedings ArticleDOI
16 Nov 1999
TL;DR: Selective cache ways disables a subset of the ways in a set-associative cache during periods of modest activity; trading a small performance degradation for energy savings yields a significant reduction in cache energy dissipation.
Abstract: Increasing levels of microprocessor power dissipation call for new approaches at the architectural level that save energy by better matching of on-chip resources to application requirements. Selective cache ways provides the ability to disable a subset of the ways in a set associative cache during periods of modest cache activity, while the full cache may remain operational for more cache-intensive periods. Because this approach leverages the subarray partitioning that is already present for performance reasons, only minor changes to a conventional cache are required, and therefore, full-speed cache operation can be maintained. Furthermore, the tradeoff between performance and energy is flexible, and can be dynamically tailored to meet changing application and machine environmental conditions. We show that trading off a small performance degradation for energy savings can produce a significant reduction in cache energy dissipation using this approach.

733 citations
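
To make the selective-ways mechanism concrete, the sketch below models a set-associative cache whose active ways can be shrunk or restored at run time. It is a minimal illustration assuming LRU replacement; the class and parameter names are invented, not taken from the paper.

```python
# Minimal model of a set-associative cache with selectively
# disabled ways (illustrative names; LRU replacement assumed).

class SelectiveWaysCache:
    def __init__(self, num_sets=64, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.enabled_ways = num_ways           # all ways active at start
        # sets[s] holds tags in LRU order, index 0 = most recently used
        self.sets = [[] for _ in range(num_sets)]

    def set_enabled_ways(self, n):
        """Disable ways during periods of modest cache activity."""
        assert 1 <= n <= self.num_ways
        self.enabled_ways = n
        for lines in self.sets:                # shrinking evicts LRU lines
            del lines[n:]

    def access(self, addr, block_bits=5):
        block = addr >> block_bits
        idx, tag = block % self.num_sets, block // self.num_sets
        lines = self.sets[idx]
        hit = tag in lines
        if hit:
            lines.remove(tag)
        elif len(lines) >= self.enabled_ways:
            lines.pop()                        # evict the LRU line
        lines.insert(0, tag)                   # move/fill to MRU position
        return hit
```

In a first-order model, per-access energy scales with enabled_ways, since only the active subarrays are exercised; that proportionality is where the saving comes from.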


Proceedings ArticleDOI
01 May 1999
TL;DR: It is demonstrated that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and consequently, their performance.
Abstract: Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and, consequently, their performance. It explores two placement techniques---clustering and coloring---that improve cache performance by increasing a pointer structure's spatial and temporal locality, and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies---cache-conscious reorganization and cache-conscious allocation---and describes two semi-automatic tools---ccmorph and ccmalloc---that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits---in most cases, significantly outperforming state-of-the-art prefetching.

382 citations
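
The allocation side of this idea can be sketched as follows: a ccmalloc-style allocator accepts a hint naming an existing object that will be accessed together with the new one, and tries to place both in the same cache-block-sized chunk. This is a conceptual model under assumed names and an assumed 64-byte block, not the tool's actual implementation.

```python
# Conceptual model of cache-conscious allocation: place a new node
# in the same cache-block-sized chunk as an object it will be
# accessed with (the spirit of ccmalloc; all details assumed).

BLOCK_SIZE = 64  # assumed cache block size in bytes

class CacheConsciousHeap:
    def __init__(self):
        self.blocks = []      # each block: {"used": bytes, "objs": [...]}
        self.home = {}        # id(obj) -> index of its block

    def alloc(self, obj, size, near=None):
        """Register obj, trying to co-locate it with 'near'."""
        if near is not None and id(near) in self.home:
            b_idx = self.home[id(near)]
            block = self.blocks[b_idx]
            if block["used"] + size <= BLOCK_SIZE:   # fits next to near
                block["used"] += size
                block["objs"].append(obj)
                self.home[id(obj)] = b_idx
                return obj
        self.blocks.append({"used": size, "objs": [obj]})
        self.home[id(obj)] = len(self.blocks) - 1
        return obj

# Building a linked list with alloc(node, 16, near=prev) packs about
# four nodes per 64-byte block, so a traversal touches far fewer blocks.
```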


Journal ArticleDOI
TL;DR: This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code within the SUIF compiler framework.
Abstract: With the ever-widening performance gap between processors and main memory, cache memory, which is used to bridge this gap, is becoming more and more significant. Caches work well for programs that exhibit sufficient locality. Other programs, however, have reference patterns that fail to exploit the cache, thereby suffering heavily from high memory latency. In order to get high cache efficiency and achieve good program performance, efficient memory accessing behavior is necessary. In fact, for many programs, program transformations or source-code changes can radically alter memory access patterns, significantly improving cache performance. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve cache utilization. Unfortunately, cache conflicts are difficult to predict and estimate, precluding effective transformations. Hence, effective transformations require detailed knowledge about the frequency and causes of cache misses in the code. This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general difficult, we show that it is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding. The generality of CMEs also allows us to reason about interactions between transformations applied in concert. The article also gives examples of their use to determine array padding and offset amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that offers the generality and precision needed for detailed compiler optimizations.

300 citations
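
The flavor of these equations can be shown schematically. Assuming a direct-mapped cache of size C_s with line size L_s, and two references R_A and R_B whose addresses Mem_A and Mem_B are affine functions of the loop indices, a potential conflict between iterations i and j arises when the following linear Diophantine equation has an integer solution within the loop bounds (notation here is illustrative, not the article's exact formulation):

```latex
% Schematic conflict-miss equation: R_A at iteration vector i and
% R_B at iteration vector j land in the same cache line when
\[
  \mathrm{Mem}_A(\vec{\imath}) \;-\; \mathrm{Mem}_B(\vec{\jmath})
  \;=\; n\,C_s \;+\; b,
  \qquad n \in \mathbb{Z},\quad 0 \le b < L_s .
\]
% Each solution within the loop bounds marks a potential conflict miss.
```

Padding and blocking change Mem_A and Mem_B, so their benefit can be measured directly as a reduction in the equation's solution count.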


Proceedings ArticleDOI
17 Aug 1999
TL;DR: In this paper, a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches is proposed, where only a single cache way is accessed, instead of accessing all the ways in a set.
Abstract: This paper proposes a new approach using way prediction for achieving high performance and low energy consumption in set-associative caches. By accessing only the single predicted cache way, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.

295 citations
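
A minimal sketch of the idea, assuming an MRU-based predictor (a common choice, though the paper's predictor may differ): probe only the predicted way first, and fall back to the remaining ways on a first-probe miss.

```python
# Sketch of MRU-based way prediction (illustrative; the paper's
# predictor details may differ). Only the predicted way is probed
# first; the remaining ways are probed on a first-probe miss.

class WayPredictedCache:
    def __init__(self, num_sets=64, num_ways=4):
        self.num_sets, self.num_ways = num_sets, num_ways
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.mru = [0] * num_sets            # predicted way per set

    def access(self, addr, block_bits=5):
        block = addr >> block_bits
        idx, tag = block % self.num_sets, block // self.num_sets
        ways, pred = self.tags[idx], self.mru[idx]
        if ways[pred] == tag:                # fast hit: 1 way activated
            return "hit", 1
        for w in range(self.num_ways):       # slow path: remaining ways
            if w != pred and ways[w] == tag:
                self.mru[idx] = w            # retrain the predictor
                return "slow-hit", self.num_ways
        ways[pred] = tag                     # miss: fill predicted way
        return "miss", self.num_ways         # (victim choice assumed)
```

The second value returned is the number of ways activated, which is what the energy saving rides on; the price is an extra cycle on slow hits.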


Journal ArticleDOI
TL;DR: This paper proposes Active Cache, a scheme for caching dynamic documents at Web proxies; it describes the scheme's protocol, interface, and security mechanisms, and shows that it can yield significant network bandwidth savings at the expense of moderate CPU costs.
Abstract: Dynamic documents constitute an increasing percentage of contents on the Web, and caching dynamic documents becomes an increasingly important issue that affects the scalability of the Web. In this paper, we propose the Active Cache scheme to support caching of dynamic contents at Web proxies. The scheme allows servers to supply cache applets to be attached with documents, and requires proxies to invoke cache applets upon cache hits to furnish the necessary processing without contacting the server. We describe the protocol, interface and security mechanisms of the Active Cache scheme, and illustrate its use via several examples. Through prototype implementation and performance measurements, we show that Active Cache is a feasible scheme that can result in significant network bandwidth savings at the expense of moderate CPU costs.

283 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: In this article, the authors describe two techniques, structure splitting and field reordering, that improve the cache behavior of data structures larger than a cache block by increasing the number of hot fields that can be placed in the cache block.
Abstract: A program's cache performance can be improved by changing the organization and layout of its data---even complex, pointer-based data structures. Previous techniques improved the cache performance of these structures by arranging distinct instances to increase reference locality. These techniques produced significant performance improvements, but worked best for small structures that could be packed into a cache block. This paper extends that work by concentrating on the internal organization of fields in a data structure. It describes two techniques---structure splitting and field reordering---that improve the cache behavior of structures larger than a cache block. For structures comparable in size to a cache block, structure splitting can increase the number of hot fields that can be placed in a cache block. In five Java programs, structure splitting reduced cache miss rates 10--27% and improved performance 6--18% beyond the benefits of previously described cache-conscious reorganization techniques. For large structures, which span many cache blocks, reordering fields to place those with high temporal affinity in the same cache block can also improve cache utilization. This paper describes bbcache, a tool that recommends C structure field reorderings. Preliminary measurements indicate that reordering fields in 5 active structures improves the performance of Microsoft SQL Server 7.0 by 2--3%.

278 citations
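
Hot/cold structure splitting can be pictured with a small example. The paper's tools target Java programs and C structures; the Python classes below are only a conceptual stand-in with invented field names: the hot class keeps the fields the traversal loop touches, and rarely used fields move behind a lazily allocated reference.

```python
# Conceptual hot/cold structure split (the paper targets Java and C;
# these classes and field names are invented for illustration).

class NodeCold:
    """Rarely accessed fields, moved out of the hot cache block."""
    __slots__ = ("debug_name", "creation_time", "stats")
    def __init__(self):
        self.debug_name, self.creation_time, self.stats = "", 0.0, None

class Node:
    """Hot fields only -- what the traversal loop actually reads.
    Packing just these lets more nodes share each cache block."""
    __slots__ = ("key", "left", "right", "_cold")
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self._cold = None          # cold part allocated only on demand

    @property
    def cold(self):
        if self._cold is None:
            self._cold = NodeCold()
        return self._cold
```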


Proceedings ArticleDOI
Chen Ding1, Ken Kennedy1
01 May 1999
TL;DR: It is demonstrated that run-time program transformations can substantially improve computation and data locality and, despite the complexity and cost involved, a compiler can automate such transformations, eliminating much of the associated run-time overhead.
Abstract: With the rapid improvement of processor speed, performance of the memory hierarchy has become the principal bottleneck for most applications. A number of compiler transformations have been developed to improve data reuse in cache and registers, thus reducing the total number of direct memory accesses in a program. Until now, however, most data reuse transformations have been static---applied only at compile time. As a result, these transformations cannot be used to optimize irregular and dynamic applications, in which the data layout and data access patterns remain unknown until run time and may even change during the computation. In this paper, we explore ways to achieve better data reuse in irregular and dynamic applications by building on the inspector-executor method used by Saltz for run-time parallelization. In particular, we present and evaluate a dynamic approach for improving both computation and data locality in irregular programs. Our results demonstrate that run-time program transformations can substantially improve computation and data locality and, despite the complexity and cost involved, a compiler can automate such transformations, eliminating much of the associated run-time overhead.

232 citations
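
The run-time transformation can be illustrated with a first-touch data reordering in the inspector-executor style; this is a minimal sketch rather than the paper's actual algorithms. The inspector scans the index array, derives a new layout ordered by first use, and the executor then runs over the relocated data.

```python
# Inspector/executor-style run-time data reordering (a minimal
# sketch; Ding & Kennedy's transformations are more sophisticated).

def inspect_first_touch(index_array, num_elems):
    """Inspector: order data elements by first use in the access stream."""
    order, seen = [], set()
    for i in index_array:
        if i not in seen:
            seen.add(i)
            order.append(i)
    order.extend(i for i in range(num_elems) if i not in seen)
    return order                      # new position -> old index

def reorder(data, index_array):
    order = inspect_first_touch(index_array, len(data))
    old_to_new = {old: new for new, old in enumerate(order)}
    new_data = [data[old] for old in order]           # relocate data once
    new_index = [old_to_new[i] for i in index_array]  # executor's indices
    return new_data, new_index

# Executor: the original loop, now walking data laid out in roughly
# the order it is accessed, improving spatial locality.
data = [x * x for x in range(8)]
idx = [5, 1, 5, 7, 1, 2]
nd, ni = reorder(data, idx)
assert [nd[i] for i in ni] == [data[i] for i in idx]
```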


Journal ArticleDOI
TL;DR: A unified cache maintenance algorithm, LNC-R-W3-U, is described, which integrates both cache replacement and consistency algorithms and factors into its eviction decisions the validation rate of each document, as provided by the cache consistency component of LNC-R-W3-U.
Abstract: Caching at proxy servers is one of the ways to reduce the response time perceived by World Wide Web users. Cache replacement algorithms play a central role in the response time reduction by selecting a subset of documents for caching, so that a given performance metric is maximized. At the same time, the cache must take extra steps to guarantee some form of consistency of the cached documents. Cache consistency algorithms enforce appropriate guarantees about the staleness of the cached documents. We describe a unified cache maintenance algorithm, LNC-R-W3-U, which integrates both cache replacement and consistency algorithms. The LNC-R-W3-U algorithm evicts documents from the cache based on the delay to fetch each document into the cache. Consequently, the documents that took a long time to fetch are preferentially kept in the cache. The LNC-R-W3-U algorithm also considers in its eviction decisions the validation rate of each document, as provided by the cache consistency component of LNC-R-W3-U. Consequently, documents that are infrequently updated and thus seldom require validations are preferentially retained in the cache. We describe the implementation of LNC-R-W3-U and its integration with the Apache 1.2.6 code base. Finally, we present a trace-driven experimental study of LNC-R-W3-U performance and its comparison with other previously published algorithms for cache maintenance.

211 citations
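
The eviction policy's flavor can be sketched with a profit function that prefers keeping documents that are expensive to fetch, frequently referenced, cheap to store, and rarely revalidated. The formula below is an illustrative stand-in; the published LNC-R-W3-U metric differs in detail.

```python
# Illustrative delay- and validation-aware replacement (the exact
# LNC-R-W3-U profit formula differs; this only shows the shape).

def profit(doc):
    """Higher profit = keep. Favors documents that are referenced
    often, slow to fetch, small, and rarely revalidated."""
    return (doc["ref_rate"] * doc["fetch_delay"]) / (
        doc["size"] * (1.0 + doc["validation_rate"]))

def make_room(cache, need_bytes, capacity):
    """Evict lowest-profit documents until need_bytes fits."""
    used = sum(d["size"] for d in cache)
    for victim in sorted(cache, key=profit):
        if used + need_bytes <= capacity:
            break
        cache.remove(victim)
        used -= victim["size"]
```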


Patent
22 Mar 1999
TL;DR: In this article, a cache system is described that includes a storage that is partitioned into a plurality of storage areas, each for storing one kind of objects received from remote sites and to be directed to target devices.
Abstract: A cache system is described that includes a storage that is partitioned into a plurality of storage areas, each for storing one kind of objects received from remote sites and to be directed to target devices. The cache system further includes a cache manager coupled to the storage to cause objects to be stored in the corresponding storage areas of the storage. The cache manager causes cached objects in each of the storage areas to be replaced in accordance with one of a plurality of replacement policies, each being optimized for one kind of objects.

197 citations


Proceedings ArticleDOI
01 Jun 1999
TL;DR: This work presents a memory exploration strategy based on three performance metrics, namely, cache size, the number of processor cycles and the energy consumption, and shows how the performance is affected by cache parameters such as caches size, line size, set associativity and tiling, and the off-chip data organization.
Abstract: In embedded system design, the designer has to choose an on-chip memory configuration that is suitable for a specific application. To aid in this design choice, we present a memory exploration strategy based on three performance metrics, namely, cache size, the number of processor cycles and the energy consumption. We show how the performance is affected by cache parameters such as cache size, line size, set associativity and tiling, and the off-chip data organization. We show the importance of including energy in the performance metrics, since an increase in the cache line size, cache size, tiling and set associativity reduces the number of cycles but does not necessarily reduce the energy consumption. These performance metrics help us find the minimum energy cache configuration if time is the hard constraint, or the minimum time cache configuration if energy is the hard constraint.

193 citations
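
The exploration itself is a sweep over cache parameters with cycle and energy estimates attached to each point. The sketch below shows a minimum-energy-under-a-time-constraint search; the cost models are invented placeholders, not the paper's.

```python
# Design-space sweep in the spirit of the paper: pick the minimum-
# energy cache configuration subject to a cycle budget (the cost
# models below are stand-ins, not the paper's).

from itertools import product

def explore(miss_rate_fn, cycle_budget):
    best = None
    for size_kb, line, ways in product([1, 2, 4, 8, 16],
                                       [16, 32, 64],
                                       [1, 2, 4]):
        m = miss_rate_fn(size_kb, line, ways)
        cycles = 1_000_000 * (1 + m * 20)          # assumed miss penalty
        # assumed energy model: per-access energy grows with size and
        # ways; misses add off-chip energy proportional to line size
        energy = (1_000_000 * (0.2 + 0.05 * ways + 0.01 * size_kb)
                  + 1_000_000 * m * 2.0 * line / 32)
        if cycles <= cycle_budget and (best is None or energy < best[0]):
            best = (energy, cycles, size_kb, line, ways)
    return best

# Example with a toy miss-rate model:
# explore(lambda s, l, w: 0.02 / (s * w), cycle_budget=1_200_000)
```

Swapping the roles of the two metrics (minimum cycles under an energy budget) is the symmetric search the abstract mentions.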


Proceedings ArticleDOI
17 Aug 1999
TL;DR: This paper proposes using a small instruction buffer, called a loop cache, to save power in caches; the loop cache has no address tag store and knows precisely, well ahead of time, whether the next instruction request will hit in it.
Abstract: A fair amount of work has been done in recent years on reducing power consumption in caches by using a small instruction buffer placed between the execution pipe and a larger main cache. These techniques, however, often degrade the overall system performance. In this paper, we propose using a small instruction buffer, also called a loop cache, to save power. A loop cache has no address tag store. It consists of a direct-mapped data array and a loop cache controller. The loop cache controller knows precisely whether the next instruction request will hit in the loop cache, well ahead of time. As a result, there is no performance degradation.
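
A sketch of the controller logic, with the fill policy assumed from the description above (capture a loop on a short backward branch, serve from the second iteration on): because hits are known exactly, the loop cache needs no tag check and incurs no misprediction penalty.

```python
# Tagless loop-cache controller sketch (state machine assumed from
# the paper's description; the capacity is illustrative).

LOOP_CACHE_SIZE = 32  # instructions

class LoopCacheController:
    """Serves fetches from a tagless loop cache once a short backward
    branch has been seen; hit/miss is known exactly, never guessed."""
    def __init__(self):
        self.start = self.end = None      # captured loop bounds
        self.filled = False               # loop body resident?

    def notify_backward_branch(self, branch_pc, target_pc):
        length = branch_pc - target_pc + 1
        if 0 < length <= LOOP_CACHE_SIZE:
            if (target_pc, branch_pc) == (self.start, self.end):
                self.filled = True        # second trip: body now resident
            else:
                self.start, self.end = target_pc, branch_pc
                self.filled = False       # first trip fills the buffer
        else:
            self.start = self.end = None  # loop too large: give up
            self.filled = False

    def serves(self, pc):
        """True iff this fetch is guaranteed to hit in the loop cache."""
        return self.filled and self.start <= pc <= self.end
```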

Patent
03 Mar 1999
TL;DR: In this article, the authors propose a technique for automatic, transparent, distributed, scalable and robust replication of document copies in a computer network where request messages for a particular document follow paths from the clients to a home server that form a routing graph.
Abstract: A technique for automatic, transparent, distributed, scalable and robust replication of document copies in a computer network wherein request messages for a particular document follow paths from the clients to a home server that form a routing graph. Client request messages are routed up the graph towards the home server as would normally occur in the absence of caching. However, cache servers are located along the route, and may intercept requests if they can be serviced. In order to be able to service requests in this manner without departing from standard network protocols, the cache server needs to be able to insert a packet filter into the router associated with it, and needs also to proxy for the home server from the perspective of the client. Cache servers cooperate to update cache content by communicating with neighboring caches whenever information is received about invalid cache copies.

Journal ArticleDOI
TL;DR: This paper describes an algorithm for procedure placement, one type of code placement, that significantly differs from previous approaches in the type of information used to drive the placement algorithm, and gathers temporal-ordering information that summarizes the interleaving of procedures in a program trace.
Abstract: Instruction cache performance is important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate and the instruction working set size during execution. This means that the performance of an executable can be improved by applying a code-placement algorithm that minimizes instruction cache conflicts and improves spatial locality. We describe an algorithm for procedure placement, one type of code placement, that significantly differs from previous approaches in the type of information used to drive the placement algorithm. In particular, we gather temporal-ordering information that summarizes the interleaving of procedures in a program trace. Our algorithm uses this information along with cache configuration and procedure size information to better estimate the conflict cost of a potential procedure ordering. It optimizes the procedure placement for single level and multilevel caches. In addition to reducing instruction cache conflicts, the algorithm simultaneously minimizes the instruction working set size of the program. We compare the performance of our algorithm with a particularly successful procedure-placement algorithm and show noticeable improvements in the instruction cache behavior, while maintaining the same instruction working set size.
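
A greedy simplification of such a placement algorithm is sketched below: given procedure sizes (in cache sets) and pairwise interleaving weights gathered from a trace, each procedure is placed at the starting set that minimizes weighted overlap with the procedures it interleaves with. The names and the greedy order are assumptions, not the article's algorithm.

```python
# Greedy, conflict-aware procedure placement (a simplification of
# the temporal-ordering approach described above).

def place(procs, conflicts, cache_sets):
    """procs: {name: size_in_sets}; conflicts: {(a, b): weight}
    from trace interleaving. Returns {name: starting set}."""
    placed = {}
    for p in sorted(procs, key=procs.get, reverse=True):
        best_start, best_cost = 0, float("inf")
        for start in range(cache_sets):
            span = {(start + k) % cache_sets for k in range(procs[p])}
            cost = 0
            for q, s in placed.items():
                w = conflicts.get((p, q), 0) + conflicts.get((q, p), 0)
                if w:
                    q_span = {(s + k) % cache_sets
                              for k in range(procs[q])}
                    cost += w * len(span & q_span)   # weighted overlap
            if cost < best_cost:
                best_start, best_cost = start, cost
        placed[p] = best_start
    return placed
```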

Patent
David C. Stewart1
13 Oct 1999
TL;DR: A filter driver is provided to monitor writes to the disk and determine whether a cache line should be invalidated, such that a write to a sector held in the cache results in that sector being invalidated until the cache is updated.
Abstract: A computer system includes a nonvolatile memory positioned between a disk controller and a disk drive storing a boot program, in a computer system. Upon an initial boot sequence, the boot program is loaded into a cache in the nonvolatile memory. Subsequent boot sequences retrieve the boot program from the cache. Cache validity is maintained by monitoring cache misses, and/or by monitoring writes to the disk such that a write to a sector held in the cache results in the cache line for that sector being invalidated until such time as the cache is updated. A filter driver is provided to monitor writes to the disk and determine if a cache line is invalidated.

Journal ArticleDOI
TL;DR: This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity, and proposes a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior.
Abstract: The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. In numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over the cache management. In this paper, we present an approach to improving cache effectiveness, taking advantage of the growing chip area, utilizing run-time adaptive cache management techniques, optimizing both performance and cost of implementation. Specifically, we are aiming to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing instruction set architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.
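
One simple run-time mechanism in this spirit, shown below as an assumed stand-in rather than the paper's microarchitecture, tracks hit ratios per memory region and routes misses from low-reuse regions around the cache so they do not displace useful lines.

```python
# Run-time bypass decision sketch (a simplified stand-in for the
# paper's scheme; table granularity and thresholds are assumed).

REGION_BITS = 12              # classify behavior per 4 KB region
BYPASS_THRESHOLD = 0.2        # reuse fraction below which we bypass

class BypassPredictor:
    def __init__(self):
        self.stats = {}       # region -> [accesses, hits]

    def record(self, addr, was_cache_hit):
        region = addr >> REGION_BITS
        entry = self.stats.setdefault(region, [0, 0])
        entry[0] += 1
        entry[1] += was_cache_hit

    def should_bypass(self, addr):
        """On a miss: allocate in the cache, or send data straight
        to the core without displacing an existing line?"""
        entry = self.stats.get(addr >> REGION_BITS)
        if entry is None or entry[0] < 32:   # not enough history yet
            return False
        return (entry[1] / entry[0]) < BYPASS_THRESHOLD
```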

Proceedings ArticleDOI
01 May 1999
TL;DR: This paper examines the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses and shows that software page placement provided a 28% speedup and hardware page placement provided a 21% speedup on average for a superscalar processor.
Abstract: As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this paper we examine the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses. The software approach provides a color mapping at compile-time for code and data pages, which can then be used by the operating system to guide its allocation of physical pages. The hardware approach works by adding a page remap field to the TLB, which is used to allow a page to be remapped to a different color in the physically indexed cache while keeping the same physical page in memory. The results show that software page placement provided a 28% speedup and hardware page placement provided a 21% speedup on average for a superscalar processor. For a 4 processor single-chip multiprocessor, the miss rate was reduced from 8.7% down to 7.2% on average.
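
Page coloring underlies both approaches: in a physically indexed cache, a page's color is its page frame number modulo the number of page-sized cache regions, and pages of the same color contend for the same sets. A minimal sketch of color-aware allocation follows, with constants assumed purely for illustration.

```python
# Page-coloring sketch: a physically indexed cache of CACHE_BYTES
# with PAGE_BYTES pages has CACHE_BYTES // PAGE_BYTES colors; pages
# of the same color compete for the same cache sets.

CACHE_BYTES = 512 * 1024
PAGE_BYTES = 4 * 1024
NUM_COLORS = CACHE_BYTES // PAGE_BYTES      # 128 colors here

def color_of(page_frame_number):
    return page_frame_number % NUM_COLORS

def allocate_page(free_pages, wanted_color):
    """OS-side allocation honoring a compiler-provided color hint;
    falls back to any free page if the color is exhausted."""
    for i, pfn in enumerate(free_pages):
        if color_of(pfn) == wanted_color:
            return free_pages.pop(i)
    return free_pages.pop() if free_pages else None
```

The hardware variant described above achieves the same effect after the fact, by letting a TLB remap field change a page's color in the cache without moving the physical page.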

Proceedings ArticleDOI
01 May 1999
TL;DR: This work focuses on transient fault tolerance in primary cache memories and develops new architectural solutions, to maximize fault coverage when the budgeted silicon area is not sufficient for the conventional configuration of an error checking code.
Abstract: Information integrity in cache memories is a fundamental requirement for dependable computing. Conventional architectures for enhancing cache reliability using check codes make it difficult to trade between the level of data integrity and the chip area requirement. We focus on transient fault tolerance in primary cache memories and develop new architectural solutions, to maximize fault coverage when the budgeted silicon area is not sufficient for the conventional configuration of an error checking code. The underlying idea is to exploit the corollary of reference locality in the organization and management of the code. A higher protection priority is dynamically assigned to the portions of the cache that are more error-prone and have a higher probability of access. The error-prone likelihood prediction is based on the access frequency. We evaluate the effectiveness of the proposed schemes using a trace-driven simulation combined with software error injection using four different fault manifestation models. From the simulation results, we show that for most benchmarks the proposed architectures are effective and area efficient for increasing the cache integrity under all four models.

Patent
Robert Drew Major1
22 Jun 1999
TL;DR: In this article, a cache object store is organized to provide fast and efficient storage of data as cache objects organized into cache object groups, and a multi-level hierarchical storage architecture comprising a primary memory-level cache store and, optionally, a secondary disk level cache store, each of which is configured to optimize access to the cache objects groups.
Abstract: A cache object store is organized to provide fast and efficient storage of data as cache objects organized into cache object groups. The cache object store preferably embodies a multi-level hierarchical storage architecture comprising a primary memory-level cache store and, optionally, a secondary disk-level cache store, each of which is configured to optimize access to the cache object groups. These levels of the cache object store further exploit persistent and non-persistent storage characteristics of the inventive architecture.

Patent
19 Nov 1999
TL;DR: Curious caching improves upon cache snooping by allowing a snooping cache to insert data from snooped bus operations that is not currently in the cache, independent of any prior accesses to the associated memory location.
Abstract: Curious caching improves upon cache snooping by allowing a snooping cache to insert data from snooped bus operations that is not currently in the cache and independent of any prior accesses to the associated memory location. In addition, curious caching allows software to specify which data producing bus operations, e.g., reads and writes, result in data being inserted into the cache. This is implemented by specifying “memory regions of curiosity” and insertion and replacement policy actions for those regions. In column caching, the replacement of data can be restricted to particular regions of the cache. By also making the replacement address-dependent, column caching allows different regions of memory to be mapped to different regions of the cache. In a set-associative cache, a replacement policy specifies the particular column(s) of the set-associative cache in which a page of data can be stored. The column specification is made in page table entries in a TLB that translates between virtual and physical addresses. The TLB includes a bit vector, one bit per column, which indicates the columns of the cache that are available for replacement.
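
The replacement restriction at the heart of column caching can be sketched directly: the victim for a miss is chosen only among the ways whose bits are set in the page's column vector from the TLB. The helper below is illustrative; the patent's way-selection details may differ.

```python
# Column-caching replacement sketch: the TLB supplies a per-page
# bit vector naming the ways (columns) a page may replace into.

import random

def choose_victim(set_ways, column_bits):
    """set_ways: list of (tag, valid) per way; column_bits: int
    bit vector, bit w set => way w is a permitted victim."""
    allowed = [w for w in range(len(set_ways)) if column_bits >> w & 1]
    assert allowed, "page must map to at least one column"
    for w in allowed:                    # prefer an invalid way
        if not set_ways[w][1]:
            return w
    return random.choice(allowed)        # else evict within columns

# Example: a 4-way set where a streaming page is confined to way 0
# (bit vector 0b0001), so it cannot pollute ways 1-3.
ways = [("A", True), ("B", True), ("C", True), ("D", True)]
assert choose_victim(ways, 0b0001) == 0
```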

Proceedings ArticleDOI
01 Oct 1999
TL;DR: This research explores the potential of on-chip cache compression, which can reduce not only the cache miss ratio but also the miss penalty if main memory is also managed in compressed form, and suggests several techniques to reduce the decompression overhead and to manage the compressed blocks efficiently.
Abstract: This research explores the potential of on-chip cache compression, which can reduce not only the cache miss ratio but also the miss penalty if main memory is also managed in compressed form. However, the decompression time has a critical effect on the memory access time, and variable-sized compressed blocks tend to increase the design complexity of the compressed cache architecture. This paper suggests several techniques to reduce the decompression overhead and to manage the compressed blocks efficiently, including selective compression, fixed space allocation for the compressed blocks, parallel decompression, the use of a decompression buffer, and so on. Moreover, a simple compressed cache architecture based on the above techniques and its management method are proposed. The results from trace-driven simulation show that this approach can provide around a 35% decrease in the on-chip cache miss ratio as well as a 53% decrease in the data traffic over conventional memory systems. Also, a large amount of the decompression overhead can be reduced, and thus the average memory access time can be reduced by up to 20% against conventional memory systems.
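
Selective compression with fixed space allocation can be sketched as a simple rule: store a block compressed only if it fits in a fixed half-line slot, otherwise store it raw. The sketch below uses zlib as a stand-in for whatever on-chip compressor the architecture would actually use.

```python
# Selective compression with fixed space allocation (sketch): a
# block is kept compressed only if it fits in half a cache line,
# so compressed blocks occupy a fixed, predictable slot size.

import zlib                      # stand-in for the on-chip compressor

LINE = 64                        # bytes per cache line

def store_block(block: bytes):
    assert len(block) == LINE
    c = zlib.compress(block)
    if len(c) <= LINE // 2:      # fits the fixed compressed slot
        return ("compressed", c) # two such blocks share one line
    return ("raw", block)        # incompressible: stored as-is

def load_block(entry):
    kind, payload = entry
    # decompression happens only on the compressed path; a
    # decompression buffer would hide part of this latency
    return zlib.decompress(payload) if kind == "compressed" else payload

blk = bytes(64)                  # a highly compressible example block
assert load_block(store_block(blk)) == blk
```

Fixing the compressed slot size is what keeps the design complexity down: placement never has to handle arbitrary-length blocks.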

Patent
31 Mar 1999
TL;DR: Methods and systems are described for handling client requests for information stored on a server, in which cache functions are bypassed or executed depending on whether running them in an attempt to serve the information from cache is likely to slow the current request without at least some compensating reduction in the processing time of a later request for the same information.
Abstract: Methods and systems for handling requests received from a client for information stored on a server. In general, when a request for information is received, cache functions are bypassed or executed based on whether an execution of cache functions in an attempt to access the information from cache is likely to slow processing of a request for the information without at least some compensating reduction in processing time for a request for the information received at a later time. Also described is receiving information that identifies the location of a resource within a domain and selecting a cache based on the information that identifies the location of the resource within the domain.

Patent
Hubertus Franke1, Douglas J. Joseph1
29 Mar 1999
TL;DR: Fault-contained memory partitioning is provided in a cache coherent, symmetric shared-memory multiprocessor system, enabling fault-contained cache coherence domains as well as cache coherent inter-partition memory regions.
Abstract: The present invention provides fault contained memory partitioning in a cache coherent, symmetric shared memory multiprocessor system while enabling fault contained cache coherence domains as well as cache coherent inter partition memory regions. The entire system may be executed as a single coherence domain regardless of partitioning, and the general memory access and cache coherency traffic are distinguished. All memory access is intercepted and processed by the memory controller. Before data is read from or written to memory, the address is verified and the executed operation is aborted if the address is outside the memory regions assigned to the processor in use. Inter cache requests are allowed to pass, though concurrently the accessed memory address is verified in the same manner as the memory requests. During the corresponding inter cache response, a failed validity check for the request results in the stopping of the requesting processor and the repair of the potentially corrupted memory hierarchy of the responding processor.

Proceedings ArticleDOI
09 Jan 1999
TL;DR: It is shown that for the first two optimizations, instruction-based prediction using few predictor entries per node outpaces address-based schemes, and that for the producer-consumer optimization, which uses speculative execution, low misspeculation rates show promise for performance improvements.
Abstract: We propose Instruction-based Prediction as a means to optimize directory-based cache coherent NUMA shared memory. Instruction-based prediction is based on observing the behavior of load and store instructions in relation to coherent events and predicting their future behavior. Although this technique is well established in the uniprocessor world, it has not been widely applied for optimizing transparent shared memory. Typically, in this environment, prediction is based on data block access history (address-based prediction) in the form of adaptive cache coherence protocols. The advantage of instruction-based prediction is that it requires few hardware resources in the form of small prediction structures per node to match (or exceed) the performance of address-based prediction. To show the potential of instruction-based prediction we propose and evaluate three different optimizations: (i) a migratory sharing optimization, (ii) a wide sharing optimization, and (iii) a producer-consumer optimization based on speculative execution. With execution-driven simulation and a set of nine benchmarks we show that (i) for the first two optimizations, instruction-based prediction, using few predictor entries per node, outpaces address-based schemes, and (ii) for the producer-consumer optimization, which uses speculative execution, low misspeculation rates show promise for performance improvements.

Patent
10 Nov 1999
TL;DR: In this paper, a memory system having a main memory coupled with a plurality of parallel virtual access channels is described, each of which provides a set of memory access resources for controlling the main memory.
Abstract: A memory system having a main memory which is coupled to a plurality of parallel virtual access channels. Each of the virtual access channels provides a set of memory access resources for controlling the main memory. These memory access resources include cache resources (including cache chaining), burst mode operation control and precharge operation control. A plurality of the virtual access channels are cacheable virtual access channels, each of which includes a channel row cache memory for storing one or more cache entries and a channel row address register for storing corresponding cache address entries. One or more non-cacheable virtual access channels are provided by a bus bypass circuit. Each virtual access channel is addressable, such that particular memory masters can be assigned to access particular virtual access channels.

Patent
Matthias A. Blumrich1
31 Mar 1999
TL;DR: In this article, a cache memory shared among a plurality of separate, disjoint entities each having a disjointed address space, includes a cache segregator for dynamically segregating a storage space allocated to each entity of the entities such that no interference occurs with respective ones of the entity.
Abstract: A cache memory shared among a plurality of separate, disjoint entities each having a disjoint address space, includes a cache segregator for dynamically segregating a storage space allocated to each entity of the entities such that no interference occurs with respective ones of the entities. A multiprocessor system including the cache memory, a method and a signal bearing medium for storing a program embodying the method also are provided.

Proceedings ArticleDOI
17 Aug 1999
TL;DR: This work proposes, implements, and evaluates a series of run-time techniques for dynamic analysis of the program instruction access behavior, which are then used to proactively guide the access of the L0-Cache, an additional mini cache located between the instruction cache (I-Cache) and the CPU core.
Abstract: In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the instruction cache (I-Cache) and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. In this work, we propose, implement, and evaluate a series of run-time techniques for dynamic analysis of the program instruction access behavior, which are then used to proactively guide the access of the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes.

Patent
19 Feb 1999
TL;DR: A method and apparatus are presented for accessing a cache memory of a computer graphics system, the apparatus including a frame buffer memory having a graphics memory for storing pixel data for ultimate supply to a video display device.
Abstract: A method and apparatus for accessing a cache memory of a computer graphics system, the apparatus including a frame buffer memory having a graphics memory for storing pixel data for ultimate supply to a video display device, a read cache memory for storing data received from the graphics memory, and a write cache memory for storing data received externally of the frame buffer and data that is to be written into the graphics memory. Also included is a frame buffer controller for controlling access to the graphics memory and read and write cache memories. The frame buffer controller includes a cache first in, first out (FIFO) memory pipeline for temporarily storing pixel data prior to supply thereof to the cache memories.

Patent
26 Jan 1999
TL;DR: In this paper, a relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache, where requests to access a mass storage device such as a disk or tape are intercepted by a device driver that compares the access request against a directory of the contents of the user configurable cache.
Abstract: An apparatus and method for accessing data in a computer system. A relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache. Requests to access a mass storage device such as a disk or tape are intercepted by a device driver that compares the access request against a directory of the contents of the user-configurable cache. If the user-configurable cache contains the data sought to be accessed, the access request is carried out in the user-configurable cache instead of being forwarded to the device driver for the target mass storage device. Because the user-cache is implemented using memory having a dramatically shorter access time than most mechanical mass storage devices, the access request is fulfilled much more quickly than if the originally intended mass storage device was accessed. Data is preloaded and responsively cached in the user-configurable cache memory based on user preferences.

Proceedings ArticleDOI
10 Oct 1999
TL;DR: This work extends the work proposed by J. Kin et al. (1997), in which an extra, small cache is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated from the CPU.
Abstract: Energy dissipated in on-chip caches represents a substantial portion in the energy budget of today's processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. We extend the work proposed by J. Kin et al. (1997), in which an extra, small cache (called filter cache) is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated from the CPU. In our scheme, the compiler is used to generate code that exploits the new memory hierarchy and reduces the possibility of a miss in the extra cache. Experimental results across a wide range of SPEC95 benchmarks show that this cache, which we call L-Cache, has a small performance overhead with respect to the scheme without any extra caches, and provides substantial energy savings. The L-Cache is placed between the CPU and the I-Cache. The D-Cache subsystem is not modified. Since the L-Cache is much smaller, and thus, has a smaller access time than the I-Cache, this scheme can also be used for performance improvements provided that the hit rate in the L-Cache is very high. In our experimental results, we show that the L-Cache does indeed improve performance in some cases.

Patent
30 Sep 1999
TL;DR: A multiple-level cache structure and caching method are presented that distribute I/O processing loads, including caching operations, between processors to provide higher-performance I/O processing, especially in a server environment.
Abstract: This invention provides a multiple level cache structure and multiple level caching method that distributes I/O processing loads including caching operations between processors to provide higher performance I/O processing, especially in a server environment. A method of achieving optimal data throughput by taking full advantage of multiple processing resources is disclosed. A method for managing the allocation of the data caches to optimize the host access time and parity generation is disclosed. A cache allocation for RAID stripes guaranteed to provide fast access times for the XOR engine by ensuring that all cache lines are allocated from the same cache level is disclosed. Allocation of cache lines for RAID levels which do not require parity generation and are allocated in such manner as to maximize utilization of the memory bandwidth is disclosed. Parity generation which is optimized for use of the processor least utilized at the time the cache lines are allocated, thereby providing for dynamic load balancing amongst the multiple processing resources, is disclosed. An inventive cache line descriptor for maintaining information about which cache data pool the cache line resides within, and an inventive cache line descriptor which includes enhancements to allow for movement of cache data from one cache level to another is disclosed. A cache line descriptor with enhancements for tracking the cache within which RAID stripe cache lines siblings reside is disclosed. System, apparatus, computer program product, and methods to support these aspects alone and in combination are also provided.