
Showing papers on "Cache pollution published in 1998"


Proceedings ArticleDOI
01 Oct 1998
TL;DR: This paper proposes a new protocol called "Summary Cache"; each proxy keeps a summary of the URLs of cached documents of each participating proxy and checks these summaries for potential hits before sending any queries, which enables cache sharing among a large number of proxies.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless, it is not widely deployed due to the overhead of existing protocols. In this paper we propose a new protocol called "Summary Cache": each proxy keeps a summary of the URLs of cached documents of each participating proxy and checks these summaries for potential hits before sending any queries. Two factors contribute to the low overhead: the summaries are updated only periodically, and the summary representations are economical --- as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that compared to the existing Internet Cache Protocol (ICP), Summary Cache reduces the number of inter-cache messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, and eliminates between 30% and 95% of the CPU overhead, while at the same time maintaining almost the same hit ratio as ICP. Hence Summary Cache enables cache sharing among a large number of proxies.
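The "economical" summaries described above map naturally onto a Bloom filter, which is consistent with the paper's few-bits-per-entry figure. The sketch below is a minimal Python illustration of that idea, not the paper's implementation; the hash construction, sizing, and the UrlSummary/candidate_proxies names are assumptions.

```python
import hashlib

class UrlSummary:
    """Minimal Bloom-filter summary of a proxy's cached URLs (illustrative sketch)."""

    def __init__(self, expected_entries, bits_per_entry=8, num_hashes=4):
        self.size = expected_entries * bits_per_entry      # total bits in the filter
        self.num_hashes = num_hashes
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from independent slices of a single digest.
        digest = hashlib.sha1(url.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, url):
        # False means definitely absent; True means "potential hit" (may be a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

# A proxy checks its neighbours' summaries before sending any inter-cache query.
def candidate_proxies(url, summaries_by_proxy):
    return [proxy for proxy, s in summaries_by_proxy.items() if s.may_contain(url)]
```

A false positive only costs one wasted query to a peer proxy, while a true negative avoids the inter-cache message entirely, which is why a small false-positive rate is an acceptable trade-off.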

446 citations


Patent
Eric Horvitz1
06 Feb 1998
TL;DR: In this paper, the browser prefetches and stores each web page (or component thereof) in its local cache, providing a suitable and preferably visual indication, through its graphical user interface, to a user that this item has been fetched and stored.
Abstract: A technique, specifically apparatus and accompanying methods for use therein, that, through continual computation, harnesses available computer resources during periods of low processing activity and low network activity, such as idle time, for prefetching, e.g., web pages, or pre-selected portions thereof, into local cache of a client computer. As the browser prefetches and stores each web page (or component thereof) in its local cache, the browser provides a suitable and preferably visual indication, through its graphical user interface, to a user that this item has been fetched and stored. Consequently, the user can quickly and visually perceive that a particular item (i.e., a “fresh” page or portion) has just been prefetched and which (s)he can now quickly access from local cache. As such additional items are cached, the browser can change the color of the displayed hotlink associated with each of the items then stored in cache so as, through color coding, to reflect their relative latency (“aging”) in cache.

355 citations


Patent
23 Jul 1998
TL;DR: The NI Cache as discussed by the authors is a network infrastructure cache that provides proxy file services to a plurality of client workstations concurrently requesting access to file data stored on a server through a network interface.
Abstract: A network-infrastructure cache ("NI Cache") transparently provides proxy file services to a plurality of client workstations concurrently requesting access to file data stored on a server. The NI Cache includes a network interface that connects to a digital computer network. A file-request service-module of the NI Cache receives and responds to network-file-services-protocol requests from workstations through the network interface. A cache, also included in the NI Cache, stores data that is transmitted back to the workstations. A file-request generation-module, also included in the NI Cache, transmits requests for data to the server, and receives responses from the server that include data missing from the cache.
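In software terms, the NI Cache sits between the workstations and the file server: it answers file-protocol requests from its own cache and generates requests to the server only for data it is missing. The sketch below is a hedged illustration of that request path; the interfaces and names loosely follow the abstract's modules and are otherwise assumptions.

```python
def handle_file_request(request, cache, server):
    """Transparent proxy file service: serve from cache, fetch only what is missing.

    request: object with .path and .byte_range identifying the requested file data
    cache  : dict (path, byte_range) -> data already held by the NI Cache
    server : object with read(path, byte_range) -> data (the real file server)
    """
    key = (request.path, request.byte_range)
    if key not in cache:
        # File-request generation module: ask the server only for the missing data.
        cache[key] = server.read(request.path, request.byte_range)
    # File-request service module: respond to the workstation from the cache.
    return cache[key]
```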

331 citations


Proceedings ArticleDOI
31 Jan 1998
TL;DR: This proposal uses distributed caches to eliminate the latency and bandwidth problems of the ARB and conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches.
Abstract: Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the address resolution buffer (ARB) uses a centralized buffer to support speculative versions. Our proposal, called the speculative versioning cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. A preliminary evaluation for the multiscalar architecture shows that hit latency is an important factor affecting performance, and private cache solutions trade off hit rate for hit latency.

317 citations


Proceedings ArticleDOI
01 Oct 1998
TL;DR: Results show that profile driven data placement significantly reduces the data miss rate by 24% on average, and a compiler directed approach that creates an address placement for the stack, global variables, heap objects, and constants in order to reduce data cache misses is presented.
Abstract: As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal locality to different cache blocks in the virtual address space, eliminating cache conflicts. These code placement techniques can be applied directly to the problem of placing data for improved data cache performance. In this paper we present a general framework for Cache Conscious Data Placement. This is a compiler directed approach that creates an address placement for the stack (local variables), global variables, heap objects, and constants in order to reduce data cache misses. The placement of data objects is guided by a temporal relationship graph between objects generated via profiling. Our results show that profile driven data placement significantly reduces the data miss rate by 24% on average.
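To make the placement step concrete, the sketch below shows one greedy way to turn a profiled temporal relationship graph into addresses that co-locate strongly related objects within cache-block-sized regions. The heuristic, block size, and function names are illustrative assumptions, not the paper's actual framework.

```python
def greedy_placement(objects, affinity, block_size=64):
    """Assign addresses so objects with high temporal affinity share a cache block.

    objects  : dict name -> size in bytes
    affinity : dict (a, b) -> profiled co-access count (temporal relationship graph edges)
    """
    # Visit the strongest edges first and pack both endpoints into the current block.
    edges = sorted(affinity.items(), key=lambda kv: kv[1], reverse=True)
    placement, next_addr, block_free = {}, 0, 0

    def place(name):
        nonlocal next_addr, block_free
        if name in placement:
            return
        size = objects[name]
        if size > block_free:                    # start a new, block-aligned region
            next_addr = (next_addr + block_size - 1) // block_size * block_size
            block_free = block_size
        placement[name] = next_addr
        next_addr += size
        block_free -= size

    for (a, b), _count in edges:
        place(a)
        place(b)
    for name in objects:                         # anything unprofiled goes at the end
        place(name)
    return placement

# Example: x and y are accessed together far more often than z, so they share a block.
layout = greedy_placement({"x": 16, "y": 16, "z": 32},
                          {("x", "y"): 900, ("y", "z"): 3})
```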

297 citations


Patent
09 Oct 1998
TL;DR: In this article, a request can be forwarded to a cooperating cache server if the requested object cannot be found locally, and the load is balanced by shifting some or all of the forwarded requests from an overloaded cache server to a less loaded one.
Abstract: In a system including a collection of cooperating cache servers, such as proxy cache servers, a request can be forwarded to a cooperating cache server if the requested object cannot be found locally. An overload condition is detected if for example, due to reference skew, some objects are in high demand by all the clients and the cache servers that contain those hot objects become overloaded due to forwarded requests. In response, the load is balanced by shifting some or all of the forwarded requests from an overloaded cache server to a less loaded one. Both centralized and distributed load balancing environments are described.
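One way to picture the balancing decision is a forwarding function that consults the current load of every cooperating cache server holding the requested object and shifts the request away from overloaded holders. The sketch below is an illustrative, centralized version; the threshold and the least-loaded tie-breaking rule are assumptions.

```python
def pick_target(object_id, directory, loads, overload_threshold):
    """Choose which cooperating cache server should receive a forwarded request.

    directory : dict object_id -> list of servers known to hold the object
    loads     : dict server -> current forwarded-request load
    """
    holders = directory.get(object_id, [])
    if not holders:
        return None                                  # nobody caches it; go to the origin
    # Prefer a holder that is not overloaded; otherwise shift the request to the
    # least-loaded holder so hot objects do not pin all traffic on one server.
    calm = [s for s in holders if loads.get(s, 0.0) < overload_threshold]
    candidates = calm if calm else holders
    return min(candidates, key=lambda s: loads.get(s, 0.0))
```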

286 citations


Proceedings ArticleDOI
01 Oct 1998
TL;DR: This paper studies a technique for using a generational garbage collector to reorganize data structures to produce a cache-conscious data layout, in which objects with high temporal affinity are placed next to each other, so that they are likely to reside in the same cache block.
Abstract: The cost of accessing main memory is increasing. Machine designers have tried to mitigate the consequences of the processor and memory technology trends underlying this increasing gap with a variety of techniques to reduce or tolerate memory latency. These techniques, unfortunately, are only occasionally successful for pointer-manipulating programs. Recent research has demonstrated the value of a complementary approach, in which pointer-based data structures are reorganized to improve cache locality. This paper studies a technique for using a generational garbage collector to reorganize data structures to produce a cache-conscious data layout, in which objects with high temporal affinity are placed next to each other, so that they are likely to reside in the same cache block. The paper explains how to collect, with low overhead, real-time profiling information about data access patterns in object-oriented languages, and describes a new copying algorithm that utilizes this information to produce a cache-conscious object layout. Preliminary results show that this technique reduces cache miss rates by 21--42%, and improves program performance by 14--37% over Cheney's algorithm. We also compare our layouts against those produced by the Wilson-Lam-Moher algorithm, which attempts to improve program locality at the page level. Our cache-conscious object layouts reduce cache miss rates by 20--41% and improve program performance by 18--31% over their algorithm, indicating that improving locality at the page level is not necessarily beneficial at the cache level.
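The copying step can be pictured as a traversal that, unlike Cheney's strict breadth-first order, follows the hottest affinity edges first so related objects end up adjacent in to-space. The sketch below is a heavily simplified model under that assumption (it returns only a copy order and ignores forwarding pointers); it is not the paper's algorithm.

```python
def affinity_guided_copy(roots, children, affinity):
    """Copy a reachable object graph in an order that clusters hot pairs.

    roots    : iterable of root object ids
    children : dict id -> list of referenced ids
    affinity : dict id -> dict id -> profiled co-access weight
    Returns the to-space order (a list of ids); a real collector would also
    rewrite pointers and record forwarding addresses.
    """
    order, copied = [], set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in copied:
            continue
        order.append(obj)
        copied.add(obj)
        # Visit the most strongly related references first so they are copied
        # adjacent to their parent (depth-first along hot edges).
        refs = sorted(children.get(obj, []),
                      key=lambda c: affinity.get(obj, {}).get(c, 0),
                      reverse=True)
        worklist.extend(reversed(refs))   # hottest child is popped (copied) next
    return order
```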

234 citations


Proceedings ArticleDOI
16 Apr 1998
TL;DR: This paper describes an alternative approach to exploit spatial locality available in data caches called Spatial Footprint Predictor (SFP), which predicts which portions of a cache block will get used before getting evicted, and shows that the miss rate of the cache is improved by 18% in addition to a significant reduction in the bandwidth requirement.
Abstract: Modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line result in cache hits. Although this approach benefits from spatial locality, less than half of the data brought into the cache gets used before eviction. The unused portion of the cache line negatively impacts performance by wasting bandwidth and polluting the cache by replacing potentially useful data that would otherwise remain in the cache. This paper describes an alternative approach to exploit spatial locality available in data caches. On a cache miss, our mechanism, called Spatial Footprint Predictor (SFP), predicts which portions of a cache block will get used before getting evicted. The high accuracy of the predictor allows us to exploit spatial locality exhibited in larger blocks of data yielding better miss ratios without significantly impacting the memory access latencies. Our evaluation of this mechanism shows that the miss rate of the cache is improved, on average, by 18% in addition to a significant reduction in the bandwidth requirement.
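A minimal way to model the predictor is a table that remembers, per miss context, a bitmask of which words of a block were touched before eviction, and fetches only those words on the next miss. The indexing scheme, table organization, and class names below are assumptions rather than the SFP design itself.

```python
class SpatialFootprintPredictor:
    """Illustrative footprint table: predicts which words of a block will be used."""

    def __init__(self, words_per_block=8):
        self.words_per_block = words_per_block
        self.table = {}            # prediction index -> bitmask of previously used words

    def _index(self, block_addr, pc):
        return (block_addr, pc)    # assumed index; the real SFP studies several schemes

    def predict(self, block_addr, pc):
        # Default to fetching the whole block when there is no history.
        full = (1 << self.words_per_block) - 1
        return self.table.get(self._index(block_addr, pc), full)

    def record_eviction(self, block_addr, pc, used_mask):
        # On eviction, remember which words were actually referenced.
        self.table[self._index(block_addr, pc)] = used_mask

# Usage: on a miss, fetch only the words whose bits are set in predict(); while the
# block is resident, OR each referenced word into used_mask, and hand that mask back
# to record_eviction() when the block is replaced.
```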

174 citations


Patent
18 Nov 1998
TL;DR: In this paper, the cache manager attempts to free space needed for caching the next object by deleting files from the cache if no server updates are pending and if such deletion will provide the needed space.
Abstract: A system and method for managing a mobile file system cache to maximize data storage and reduce problems from cache full conditions. Cache management automatically determines when the space available in the cache falls below a user-specified threshold. The cache manager attempts to free space needed for caching the next object. Files are deleted from the cache if no server updates are pending and if such deletion will provide the needed space. If automatic deletion does not provide sufficient space, the user is prompted for action. The system user can control the cache by increasing or reducing its size and drive allocation and can explicitly evict clean files from the cache. Cache expansion can be to logical or physical storage devices different than those on which the original cache is stored. The system enables separate storage of temporary files allowing identification and deletion of such files.
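The space-management policy reduces to: when free space falls below the user-specified threshold, evict clean files (those with no pending server updates) until the next object fits, and prompt the user only if that is not enough. A minimal sketch of that decision logic, with the cache interface (free_bytes, clean_files, evict) assumed for illustration:

```python
def make_room(cache, needed_bytes, threshold_bytes, prompt_user):
    """Sketch of threshold-driven cache cleanup (names and structure are illustrative).

    cache.free_bytes()  -> currently free space
    cache.clean_files() -> files with no pending server updates, e.g. in LRU order
    cache.evict(f)      -> remove f and reclaim its space
    prompt_user(needed) -> fall back to asking the user for action
    """
    if cache.free_bytes() >= max(needed_bytes, threshold_bytes):
        return True                                # nothing to do

    for f in cache.clean_files():                  # only clean files may be auto-deleted
        if cache.free_bytes() >= needed_bytes:
            break
        cache.evict(f)

    if cache.free_bytes() >= needed_bytes:
        return True
    return prompt_user(needed_bytes)               # automatic deletion was not enough
```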

136 citations


Proceedings ArticleDOI
30 Mar 1998
TL;DR: In this article, a zero-copy message transfer with a pin-down cache technique was proposed, which reuses the pinned-down area to decrease the number of calls to pin-down and release primitives.
Abstract: The overhead of copying data through the central processor by a message passing protocol limits data transfer bandwidth. If the network interface directly transfers the user's memory to the network by issuing DMA, such data copies may be eliminated. Since the DMA facility accesses the physical memory address space, user virtual memory must be pinned down to a physical memory location before the message is sent or received. If each message transfer involves pin-down and release kernel primitives, message transfer bandwidth will decrease since those primitives are quite expensive. The authors propose a zero-copy message transfer with a pin-down cache technique, which reuses the pinned-down area to decrease the number of calls to pin-down and release primitives. The proposed facility has been implemented in the PM low-level communication library on the RWC PC Cluster II, consisting of 64 Pentium Pro 200 MHz CPUs connected by a Myricom Myrinet network and running NetBSD. PM achieves 108.8 MBytes/sec for a 100% pin-down cache hit ratio and 78.7 MBytes/sec when all accesses miss the pin-down cache. The MPI library has been implemented on top of PM. According to the NAS parallel benchmark results, an application still achieves good performance even when the pin-down cache miss ratio is very high.
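In effect, the pin-down cache keeps recently pinned regions registered with the network interface and reuses them on later transfers, releasing regions only when a pin budget is exceeded. The sketch below is an illustrative model of that idea; the class name, the exact-match lookup, and the LRU release policy are assumptions (a real implementation, such as PM's, must also cope with partially overlapping regions).

```python
from collections import OrderedDict

class PinDownCache:
    """Reuse pinned-down memory regions to avoid repeated pin/unpin kernel calls."""

    def __init__(self, pin, unpin, max_pinned_bytes):
        self.pin, self.unpin = pin, unpin          # expensive kernel primitives
        self.max_pinned_bytes = max_pinned_bytes
        self.regions = OrderedDict()               # (addr, length) -> pinned handle, LRU order
        self.pinned_bytes = 0

    def acquire(self, addr, length):
        key = (addr, length)
        if key in self.regions:                    # cache hit: no kernel call needed
            self.regions.move_to_end(key)
            return self.regions[key]
        while self.pinned_bytes + length > self.max_pinned_bytes and self.regions:
            (_old_addr, old_len), handle = self.regions.popitem(last=False)
            self.unpin(handle)                     # evict the least recently used region
            self.pinned_bytes -= old_len
        handle = self.pin(addr, length)            # cache miss: pin now, keep for reuse
        self.regions[key] = handle
        self.pinned_bytes += length
        return handle
```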

133 citations


Patent
26 Mar 1998
TL;DR: In this article, a method of a computer graphics system recirculates texture cache misses into a graphics pipeline without stalling the graphics pipeline, increasing the processing speed of the computer graphics system.
Abstract: A method of a computer graphics system recirculates texture cache misses into a graphics pipeline without stalling the graphics pipeline, increasing the processing speed of the computer graphics system. The method reads data from a texture cache memory by a read request placed in the graphics pipeline sequence, then reads the data from the texture cache memory if the data is stored in the texture cache memory and places the data in the pipeline sequence. If the data is not stored in the texture cache memory, the method recirculates the read request in the pipeline sequence by indicating in the pipeline sequence that the data is not stored in the texture cache memory, placing the read request at a subsequent, determined place in the pipeline sequence, reading the data into the texture cache memory from a main memory, and executing the read request from the subsequent, determined place and after the data has been read into the texture cache memory.
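Read in software terms, the scheme re-enqueues a texture read that misses at a later slot in the pipeline while the cache is filled from main memory, so the requests behind it keep flowing instead of stalling. The sketch below is a simplified sequential model of that recirculation; the queue distance and data structures are assumptions.

```python
from collections import deque

def run_pipeline(requests, texture_cache, main_memory, recirculate_distance=8):
    """Process texture read requests without stalling on misses (illustrative model).

    texture_cache: dict address -> texel data, filled on a miss
    main_memory  : dict address -> texel data
    """
    pipeline = deque(requests)          # each request is a texture address
    results = []
    while pipeline:
        addr = pipeline.popleft()
        if addr in texture_cache:
            results.append((addr, texture_cache[addr]))   # hit: data rejoins the pipeline flow
        else:
            # Miss: start the fill and recirculate the request to a later slot,
            # letting the requests behind it keep moving.
            texture_cache[addr] = main_memory[addr]
            reinsert_at = min(recirculate_distance, len(pipeline))
            pipeline.insert(reinsert_at, addr)
    return results
```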

Proceedings ArticleDOI
10 Aug 1998
TL;DR: It is shown that, by using buffers, energy consumption of the memory subsystem may be reduced by as much as 13% for certain data cache configurations and by as much as 23% for certain instruction cache configurations without adversely affecting processor performance or on-chip energy consumption.
Abstract: In this paper, we propose several different data and instruction cache configurations and analyze their power as well as performance implications on the processor. Unlike most existing work in low power microprocessor design, we explore a high performance processor with the latest innovations for performance. Using a detailed, architectural-level simulator, we evaluate full system performance using several different power/performance sensitive cache configurations such as increasing cache size or associativity and including buffers alongside L1 caches. We then use the information obtained from the simulator to calculate the energy consumption of the memory hierarchy of the system. As an alternative to simply increasing cache associativity or size to reduce lower-level memory energy consumption (which may have a detrimental effect on on-chip energy consumption), we show that, by using buffers, energy consumption of the memory subsystem may be reduced by as much as 13% for certain data cache configurations and by as much as 23% for certain instruction cache configurations without adversely affecting processor performance or on-chip energy consumption.

Book ChapterDOI
01 Jun 1998
TL;DR: In this paper, the authors examine how data dependence analysis and program restructuring methods to increase data locality can be used to determine worst case bounds on cache misses and present a persistence analysis on sets of possibly referenced memory locations (e.g., arrays).
Abstract: In the presence of data or combined data/instruction caches there can be memory references that may access multiple memory locations such as those used to implement array references in loops. We examine how data dependence analysis and program restructuring methods to increase data locality can be used to determine worst case bounds on cache misses. To complement these methods we present a persistence analysis on sets of possibly referenced memory locations (e.g., arrays). This analysis determines memory locations that survive in the cache thus providing effective and efficient means to compute an upper bound on the number of possible cache misses.

Patent
27 Aug 1998
TL;DR: In this article, a method and system for caching objects and replacing cached objects in an object-transfer environment maintain a dynamic indicator (Pr(f)) for each cached object, with the dynamic indicator being responsive to the frequency of requests for the object and being indicative of the time of storing the cached object relative to storing other cached objects.
Abstract: A method and system for caching objects and replacing cached objects in an object-transfer environment maintain a dynamic indicator (Pr(f)) for each cached object, with the dynamic indicator being responsive to the frequency of requests for the object and being indicative of the time of storing the cached object relative to storing other cached objects. In a preferred embodiment, the size of the object is also a factor in determining the dynamic indicator of the object. In the most preferred embodiment, the cost of obtaining the object is also a factor. A count of the frequency of requests and the use of the relative time of storage counterbalance each other with respect to maintaining a cached object in local cache. That is, a high frequency of requests favors keeping the object in cache, but a long time in the cache favors evicting the object. Thus, cache pollution is less likely to occur.
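The indicator described here grows with request frequency (and, in the preferred embodiments, with cost and inversely with size) but is anchored to an inflation clock set at insertion time, so long-resident objects age out. The sketch below assumes one concrete form, Pr(f) = L + f * cost / size, in the style of GreedyDual-Size with frequency; the exact formula and class interface are assumptions, not necessarily the patent's.

```python
class FrequencyAwareCache:
    """Sketch of a Pr(f)-style replacement policy: Pr(f) = L + freq * cost / size."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.clock = 0.0                       # L: rises to the priority of each evicted object
        self.entries = {}                      # key -> [priority, freq, size, cost, value]

    def _priority(self, freq, size, cost):
        return self.clock + freq * cost / size

    def get(self, key):
        e = self.entries.get(key)
        if e is None:
            return None
        e[1] += 1                              # another request: frequency goes up
        e[0] = self._priority(e[1], e[2], e[3])
        return e[4]

    def put(self, key, value, size, cost=1.0):
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            self.clock = self.entries[victim][0]   # aging: later insertions start higher
            self.used -= self.entries[victim][2]
            del self.entries[victim]
        self.entries[key] = [self._priority(1, size, cost), 1, size, cost, value]
        self.used += size
```

High request frequency raises an object's priority, while the rising clock means objects inserted long ago fall behind newcomers, which is exactly the counterbalancing the abstract describes.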

Proceedings ArticleDOI
01 May 1998
TL;DR: This work proposes two algorithms to compress code in a space-efficient and simple to decompress way, one which is independent of the instruction set and another which depends on the Instruction set.
Abstract: Memory is one of the most restricted resources in many modern embedded systems. Code compression can provide substantial savings in terms of size. In a compressed code CPU, a cache miss triggers the decompression of a main memory block, before it gets transferred to the cache. Because the code must be decompressible starting from any point (or at least at cache block boundaries), most file-oriented compression techniques cannot be used. We propose two algorithms to compress code in a space-efficient and simple to decompress way, one which is independent of the instruction set and another which depends on the instruction set. We perform experiments on true instruction sets, a typical RISC (MIPS) and a typical CISC (x86) and compare our results to existing file-oriented compression algorithms.

Patent
09 Oct 1998
TL;DR: A dynamically configurable replacement technique in a unified or shared cache reduces domination by a particular functional unit or an application such as unified instruction/data caching by limiting the eviction ability to selected cache regions based on over utilization of the cache.
Abstract: A dynamically configurable replacement technique in a unified or shared cache reduces domination by a particular functional unit or application, such as in unified instruction/data caching, by limiting the eviction ability to selected cache regions based on over-utilization of the cache by a particular functional unit or application. A specific application includes a highly integrated multimedia processor employing a tightly coupled shared cache between central processing and graphics units, wherein the eviction ability of the graphics unit is limited to selected cache regions when the graphics unit over-utilizes the cache. Dynamic configurability can take the form of a programmable register that enables one of a plurality of replacement modes based on captured statistics, such as measurement of cache misses by a particular functional unit or application.

Patent
31 Jul 1998
TL;DR: In this paper, a cache memory replacement algorithm replaces cache lines based on the likelihood that cache lines will not be needed soon, and the cache lines selected for replacement contain the most speculative data in the cache that is least likely to be needed.
Abstract: A cache memory replacement algorithm replaces cache lines based on the likelihood that cache lines will not be needed soon. A cache memory in accordance with the present invention includes a plurality of cache lines that are accessed associatively, with a count entry associated with each cache line storing a count value that defines a replacement class. The count entry is typically loaded with a count value when the cache line is accessed, with the count value indicating the likelihood that the contents of cache lines will be needed soon. In other words, data which is likely to be needed soon is assigned a higher replacement class, while data that is more speculative and less likely to be needed soon is assigned a lower replacement class. When the cache memory becomes full, the replacement algorithm selects for replacement those cache lines having the lowest replacement class. Accordingly, the cache lines selected for replacement contain the most speculative data in the cache that is least likely to be needed soon.
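Victim selection then reduces to picking, within the accessed set, the line whose stored count defines the lowest replacement class, i.e. the most speculative data. A minimal sketch (the set representation is assumed for illustration):

```python
def choose_victim(set_lines):
    """Pick the way whose line has the lowest replacement class (most speculative data).

    set_lines: list of (way_index, count_value) for one associative set;
    count_value is the replacement class stored alongside each cache line.
    """
    way, _lowest_class = min(set_lines, key=lambda wc: wc[1])
    return way

# Example: way 2 holds the most speculative (class 0) data, so it is replaced first.
victim = choose_victim([(0, 3), (1, 1), (2, 0), (3, 2)])   # -> 2
```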

Patent
09 Dec 1998
TL;DR: A very fast, memory efficient, highly expandable, highly efficient CCNUMA processing system based on a hardware architecture that minimizes system bus contention, maximizes processing forward progress by maintaining strong ordering and avoiding retries, and implements a full-map directory structure cache coherency protocol is implemented in this paper.
Abstract: A very fast, memory efficient, highly expandable, highly efficient CCNUMA processing system based on a hardware architecture that minimizes system bus contention, maximizes processing forward progress by maintaining strong ordering and avoiding retries, and implements a full-map directory structure cache coherency protocol. A Cache Coherent Non-Uniform Memory Access (CCNUMA) architecture is implemented in a system comprising a plurality of integrated modules each consisting of a motherboard and two daughterboards. The daughterboards, which plug into the motherboard, each contain two Job Processors (JPs), cache memory, and input/output (I/O) capabilities. Located directly on the motherboard are additional integrated I/O capabilities in the form of two Small Computer System Interfaces (SCSI) and one Local Area Network (LAN) interface. The motherboard includes main memory, a memory controller (MC) and directory DRAMs for cache coherency. The motherboard also includes GTL backpanel interface logic, system clock generation and distribution logic, and local resources including a micro-controller for system initialization. A crossbar switch connects the various logic blocks together. A fully loaded motherboard contains 2 JP daughterboards, two PCI expansion boards, and up to 512 MB of main memory. Each daughterboard contains two 50 MHz Motorola 88110 JP complexes, having an associated 88410 cache controller and 1 MB Level 2 Cache. A single 16 MB third level write-through cache is also provided and is controlled by a third level cache controller.

Patent
Michael Dean Snyder1
03 Aug 1998
TL;DR: In this article, the authors propose a mechanism for preventing DST line fetches from occupying the last available entries in a cache miss queue (50) of the data cache and MMU.
Abstract: A data processing system (10) includes a mechanism for preventing DST line fetches from occupying the last available entries in a cache miss queue (50) of the data cache and MMU (16). This is done by setting a threshold value of available cache miss queue (50) buffers over which a DST access is not allowed. This prevents the cache miss queue (50) from filling up and preventing normal load and store accesses from using the cache miss queue (50).
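The guard amounts to a single comparison: a DST (data stream touch) line fetch may allocate a miss-queue entry only while more than a reserved number of entries remain free for demand loads and stores. A trivial sketch, with the threshold semantics assumed:

```python
def may_issue_dst_fetch(miss_queue_free_entries, dst_threshold):
    """Allow a DST line fetch only if it would not consume the last free entries.

    dst_threshold: number of miss-queue entries reserved for normal loads and stores.
    """
    return miss_queue_free_entries > dst_threshold

# Example: with 2 of 8 entries free and a threshold of 2, the DST fetch waits,
# leaving room for demand loads and stores.
assert may_issue_dst_fetch(2, 2) is False
```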

Patent
09 Sep 1998
TL;DR: In this article, the authors present a method and cache management for a bridge or bridge/router providing high speed, flexible address cache management. But, it does not support a 4-way set associative cache to store the network addresses.
Abstract: A method and cache management unit for a bridge or bridge/router providing high-speed, flexible address cache management. The unit maintains a network address cache (28) and an age table (130), searches the cache (28) for layer 2 and layer 3 addresses from received frame headers, and returns address search results. The unit includes an interface (102) permitting processor manipulation of the cache (28) and age table (130), and supports a 4-way set-associative cache to store the network addresses. A cyclic redundancy code for each address to be looked up in the cache (28) is used as an index into the cache. If the cache thrash rate exceeds a predetermined threshold, CRC table values can be rewritten. Four time-sliced cache lookup units (120) are provided, each consisting of a cache lookup controller (118) for comparing a received network address to an address retrieved from an identified cache set.
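The lookup path can be pictured as: compute a CRC of the network address, use part of it as the set index into a 4-way set-associative table, and compare the full addresses stored in that set. The sketch below uses Python's zlib.crc32 as a stand-in for the unit's CRC hardware; the table size and the FIFO within-set replacement are illustrative assumptions.

```python
import zlib

class AddressCache:
    """4-way set-associative network-address cache indexed by a CRC of the address."""

    WAYS = 4

    def __init__(self, num_sets=1024):
        self.num_sets = num_sets
        self.sets = [[] for _ in range(num_sets)]   # each set: list of (address, result)

    def _index(self, address: bytes) -> int:
        return zlib.crc32(address) % self.num_sets  # CRC of the address picks the set

    def lookup(self, address: bytes):
        for stored, result in self.sets[self._index(address)]:
            if stored == address:                   # full address compare within the set
                return result
        return None                                 # miss: software would insert the entry

    def insert(self, address: bytes, result):
        ways = self.sets[self._index(address)]
        if len(ways) >= self.WAYS:
            ways.pop(0)                             # simple FIFO replacement within the set
        ways.append((address, result))
```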

Patent
01 Aug 1998
TL;DR: In this article, a cache memory is divided into cache partitions, each cache partition having a plurality of addressable storage locations for holding items in the cache memory, and a partition indicator is allocated to each process identifying which of the cache partitions is to be used for hold items for use in the execution of that process.
Abstract: A method of operating a cache memory is described in a system in which a processor is capable of executing a plurality of processes, each process including a sequence of instructions. In the method a cache memory is divided into cache partitions, each cache partition having a plurality of addressable storage locations for holding items in the cache memory. A partition indicator is allocated to each process identifying which, if any, of said cache partitions is to be used for holding items for use in the execution of that process. When the processor requests an item from main memory during execution of said current process and that item is not held in the cache memory, the item is fetched from main memory and loaded into one of the plurality of addressable storage locations in the identified cache partition.
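The method boils down to: every process carries a partition indicator, and a miss may allocate (and therefore evict) only within the cache partition that indicator names, so processes cannot pollute each other's partitions. A hedged sketch of the refill-side check, with the cache interface and the random within-partition victim choice assumed for illustration:

```python
import random

def refill(cache, main_memory, address, partition_indicator):
    """On a miss, load the item only into a storage location of the process's partition.

    cache.partitions             : dict partition_id -> list of storage-location indices
    cache.lookup(address)        : returns the cached item or None
    cache.store(loc, addr, item) : places the item in one addressable storage location
    partition_indicator          : partition id allocated to the currently executing process
                                   (None means the process may not allocate in the cache)
    """
    item = cache.lookup(address)
    if item is not None:
        return item                                   # hit: partitioning is irrelevant
    item = main_memory.read(address)
    if partition_indicator is not None:
        locations = cache.partitions[partition_indicator]
        loc = random.choice(locations)                # victim chosen inside this partition only
        cache.store(loc, address, item)               # other partitions are never disturbed
    return item
```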

Patent
08 Jun 1998
TL;DR: In this article, the authors propose a method and system for caching information objects transmitted using a computer network, where the cache engine determines directly when and where to store those objects in a memory (such as RAM) and a mass storage ( such as one or more disk drives), so as to optimally write those objects to mass storage and later read them from mass storage, without having to maintain them persistently.
Abstract: The invention provides a method and system for caching information objects transmitted using a computer network. A cache engine determines directly when and where to store those objects in a memory (such as RAM) and mass storage (such as one or more disk drives), so as to optimally write those objects to mass storage and later read them from mass storage, without having to maintain them persistently. The cache engine actively allocates those objects to memory or to disk, determines where on disk to store those objects, retrieves those objects in response to their network identifiers (such as their URLs), and determines which objects to remove from the cache so as to maintain sufficient operating space. The cache engine collects information to be written to disk in write episodes, so as to maximize efficiency when writing information to disk and when later reading that information from disk. The cache engine performs write episodes so as to atomically commit changes to disk during each write episode, so the cache engine does not fail in response to loss of power or storage, or other intermediate failure of portions of the cache. The cache engine also stores key system objects on each one of a plurality of disks, so as to keep the cache holographic in the sense that loss of any subset of the disks merely decreases the amount of available cache. The cache engine also collects information to be deleted from disk in delete episodes, so as to maximize efficiency when deleting information from disk and when later writing to the areas holding formerly deleted information. The cache engine responds to the addition or deletion of disks as the expansion or contraction of the amount of available cache.

Proceedings ArticleDOI
13 Jul 1998
TL;DR: This paper presents a comparative evaluation of two approaches that utilize reuse information for more efficiently managing the first-level cache, and shows that using effective address reuse information performs better than using program counter reuse information.
Abstract: As microprocessor speeds continue to outgrow memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. This paper presents a comparative evaluation of two approaches that utilize reuse information for more efficiently managing the first-level cache. While one approach is based on the effective address of the data being referenced, the other uses the program counter of the memory instruction generating the reference. Our evaluations show that using effective address reuse information performs better than using program counter reuse information. In addition, we show that the Victim cache performs best for multi-lateral caches with a direct-mapped main cache and high L2 cache latency, while the NTS (effective-address-based) approach performs better as the L2 latency decreases or the associativity of the main cache increases.

Proceedings ArticleDOI
28 Jul 1998
TL;DR: This work has developed a distributed Web server called Swala, in which the nodes cooperatively cache the results of CGI requests, and the cache meta-data is stored in a replicated global cache directory.
Abstract: We propose a new method for improving the average response time of Web servers by cooperatively caching the results of requests for dynamic content. The work is motivated by our recent study of access logs from the Alexandria Digital Library server at UCSB, which demonstrates that approximately a 30 percent decrease in average response time could be achieved by caching dynamically generated content. We have developed a distributed Web server called Swala, in which the nodes cooperatively cache the results of CGI requests, and the cache meta-data is stored in a replicated global cache directory. Our experiments show that the single-node performance of Swala without caching is comparable to the Netscape Enterprise server, that considerable speedups are obtained using caching, and that the cache hit ratio is substantially higher with cooperative cache than with stand-alone cache.
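The request path can be summarized as: check the local cache for the CGI result, then consult the replicated global cache directory to fetch it from a peer node, and run the CGI program only on a miss everywhere. The sketch below is an illustrative model of that flow; the function and parameter names are ours, not Swala's API.

```python
def serve_cgi(request_key, local_cache, global_directory, peers, run_cgi):
    """Cooperative caching of dynamic (CGI) results across Web-server nodes (sketch).

    global_directory: replicated map request_key -> node that cached the result
    peers           : dict node -> object with a fetch(request_key) method
    """
    if request_key in local_cache:                       # local hit
        return local_cache[request_key]
    node = global_directory.get(request_key)
    if node is not None and node in peers:               # remote hit via the directory
        result = peers[node].fetch(request_key)
    else:                                                # miss everywhere: generate dynamically
        result = run_cgi(request_key)
        global_directory[request_key] = "self"           # advertise the result to other nodes
    local_cache[request_key] = result
    return result
```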

Patent
24 Sep 1998
TL;DR: In this paper, the plurality of servers form a multi-cast hierarchy dynamically reconstructed by virtue of mutual support and communication of server status, cache directory and validation is performed on the hierarchy.
Abstract: In order to effectively make the grasp of operating conditions of a plurality of servers and a cache management in an information system without increasing a time/labor taken by an administrator, the plurality of servers forms a multi-cast hierarchy dynamically reconstructed by virtue of mutual support and the communication of server status, cache directory and validation is performed on the hierarchy. The administrator has not a need of management for cooperation between servers excepting the designation of some other servers for startup thereof. A cache between servers is shared through the exchange of a cache directory and a validation time is reduced, thereby shortening the response time for users.

Patent
24 Jul 1998
TL;DR: In this paper, a data transfer request is mapped to a cache device object and data transfer is performed based on the location and if needed, the determined storage devices, which provides the host computer with a virtual view of the storage devices.
Abstract: An input/output processor provides device virtualization "on-board" through the use of a dedicated IO cache memory. A computer system includes at least one host processor and associated main memory, each with access to a system bus. Each input/output processor is also connected to the system bus through an expansion bus. IO adapters within the input/output processor each connect at least one storage device to the expansion bus. Also connected to the expansion bus are the cache memory and a control logic. The control logic receives a data transfer request from a requesting host processor. The data transfer request is mapped to a cache device object. The cache device object has associated data maintained in the cache memory. If any storage device is required for the data transfer, the data transfer request is mapped to the storage device capable of servicing the request. A location in cache memory is determined based on the mapped cache device object. The data transfer is performed based on the location and, if needed, the determined storage devices. This provides the host computer with a virtual view of the storage devices.

Patent
David A. Luick1
03 Feb 1998
TL;DR: In this paper, the authors present a predictive instruction cache system for a VLIW processor, which includes a first cache, a real or virtual second cache, and a history look-up table for storing relations between first instructions and second instructions in the second cache.
Abstract: Disclosed is a predictive instruction cache system, and the method it embodies, for a VLIW processor. The system comprises: a first cache; a real or virtual second cache for storing a subset of the instructions in the first cache; and a real or virtual history look-up table for storing relations between first instructions and second instructions in the second cache. If a first instruction is located in a stage of the pipeline, then one of the relations will predict that a second instruction will be needed in the same stage a predetermined time later. The first cache can be physically distinct from the second cache, but preferably is not, i.e., the second cache is a virtual array. The history look-up table can also be physically distinct from the first cache, but preferably is not, i.e., the history look-up table is a virtual look-up table. The first cache is organized as entries. Each entry has a first portion for the first instruction and a second portion for a branch-to address indicator pointing to the second instruction. For a given first instruction, a new branch-to address indicator independently can be stored in the second field to replace an old branch-to address indicator and so reflect a revised prediction. Alternatively, redundant data fields in the parcels of the VLIWs are used to store the branch-to address guesses so that a physically distinct second portion can be eliminated in the entries of the first cache.

Patent
Sharad Mehrotra1
21 Jan 1998
TL;DR: In this article, a multi-level cache and method for operation thereof is presented for processing multiple cache system accesses simultaneously and handling the interactions between the queues of the cache levels, where controller logic is also provided for controlling interaction between the miss queue and the write queue.
Abstract: A multi-level cache and method for operation thereof is presented for processing multiple cache system accesses simultaneously and handling the interactions between the queues of the cache levels. The cache unit includes a non-blocking cache receiving data access requests from a functional unit in a processor, and a miss queue storing entries corresponding to data access requests not serviced by the non-blocking cache. A victim queue stores entries of the non-blocking cache which have been evicted from the non-blocking cache, while a write queue buffers write requests into the non-blocking cache. Controller logic is provided for controlling interaction between the miss queue and the victim queue. Controller logic is also provided for controlling interaction between the miss queue and the write queue. Controller logic is also provided for controlling interaction between the victim queue and the miss queue for processing cache misses.

Proceedings ArticleDOI
01 Oct 1998
TL;DR: This paper describes a technique that dynamically identifies underutilized cache frames and effectively utilizes the cache frames they occupy to more accurately approximate the global least-recently-used replacement policy while maintaining the fast access time of a direct-mapped cache.
Abstract: Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are less likely to be re-referenced before they are replaced. In this paper, we describe a technique that dynamically identifies these less-recently-used lines and effectively utilizes the cache frames they occupy to more accurately approximate the global least-recently-used replacement policy while maintaining the fast access time of a direct-mapped cache. We also explore the idea of using these underutilized cache frames to reduce cache misses through data prefetching. In the proposed design, the possible locations that a line can reside in are not predetermined. Instead, the cache is dynamically partitioned into groups of cache lines. Because both the total number of groups and the individual group associativity adapt to the dynamic reference pattern, we call this design the adaptive group-associative cache. Performance evaluation using trace-driven simulations of the TPC-C benchmark and selected programs from the SPEC95 benchmark suite shows that the group-associative cache is able to achieve a hit ratio that is consistently better than that of a 4-way set-associative cache. For some of the workloads, the hit ratio approaches that of a fully-associative cache.

Journal ArticleDOI
TL;DR: The characterization of power dissipation in on-chip cache memories reveals that the memory peripheral interface circuits and bit array dissipate comparable power.
Abstract: In this paper, we present the characterization and design of energy-efficient, on-chip cache memories. The characterization of power dissipation in on-chip cache memories reveals that the memory peripheral interface circuits and bit array dissipate comparable power. To optimize performance and power in a processor's cache, a multidivided module (MDM) cache architecture is proposed to conserve energy in the bit array as well as the memory peripheral circuits. Compared to a conventional, nondivided, 16-kB cache, the latency and power of the MDM cache are reduced by a factor of 1.9 and 4.6, respectively. Based on the MDM cache architecture, the energy efficiency of the complete memory hierarchy is analyzed with respect to cache parameters in a multilevel processor cache design. This analysis was conducted by executing the SPECint92 benchmark programs with the miss ratios for reduced instruction set computer (RISC) and complex instruction set computer (CISC) machines.