Showing papers on "Smart Cache published in 1996"


ReportDOI
22 Jan 1996
TL;DR: The design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better are discussed, and performance measurements indicate that hierarchy does not measurably increase access latency.
Abstract: This paper discusses the design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better. The design was motivated by our earlier trace-driven simulation study of Internet traffic. We challenge the conventional wisdom that the benefits of hierarchical file caching do not merit the costs, and believe the issue merits reconsideration in the Internet environment. The cache implementation supports a highly concurrent stream of requests. We present performance measurements that show that our cache outperforms other popular Internet cache implementations by an order of magnitude under concurrent load. These measurements indicate that hierarchy does not measurably increase access latency. Our software can also be configured as a Web-server accelerator; we present data showing that our httpd-accelerator is ten times faster than Netscape's Netsite and NCSA 1.4 servers. Finally, we relate our experience fitting the cache into the increasingly complex and operational world of Internet information systems, including issues related to security, transparency to cache-unaware clients, and the role of file systems in support of ubiquitous wide-area information systems.

853 citations


Proceedings ArticleDOI
02 Dec 1996
TL;DR: It is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
Abstract: As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4 kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, it is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
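
A minimal Python sketch of the trace-cache idea, assuming a simple fill-on-fetch policy and crude FIFO-style eviction (the paper's fill and replacement mechanisms are more involved): dynamic traces are keyed by their starting address so that basic blocks that are noncontiguous in memory can be delivered by a single access.

    # A trace cache stores snapshots of the dynamic instruction stream,
    # keyed by starting PC, so noncontiguous basic blocks appear contiguous.
    class TraceCache:
        def __init__(self, max_traces=256, trace_len=16):
            self.max_traces, self.trace_len = max_traces, trace_len
            self.traces = {}                     # start PC -> list of PCs

        def fill(self, dynamic_stream, start):
            """Record the trace beginning at dynamic_stream[start]."""
            trace = dynamic_stream[start:start + self.trace_len]
            if not trace:
                return
            if len(self.traces) >= self.max_traces:
                self.traces.pop(next(iter(self.traces)))   # crude FIFO eviction
            self.traces[trace[0]] = trace

        def fetch(self, pc):
            """A single access that may return several basic blocks."""
            return self.traces.get(pc)

    tc = TraceCache()
    stream = [0x10, 0x14, 0x40, 0x44, 0x48, 0x90]   # dynamic PCs, with jumps
    tc.fill(stream, 0)
    print(tc.fetch(0x10))     # whole trace, though the PCs are noncontiguous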

637 citations


Proceedings Article
03 Sep 1996
TL;DR: A semantic model for client-side caching and replacement in a client-server database system is proposed, compared to page caching and tuple caching strategies, and validated with a detailed performance study.
Abstract: We propose a semantic model for client-side caching and replacement in a client-server database system and compare this approach to page caching and tuple caching strategies. Our caching model is based on, and derives its advantages from, three key ideas. First, the client maintains a semantic description of the data in its cache, which allows for a compact specification, as a remainder query, of the tuples needed to answer a query that are not available in the cache. Second, usage information for replacement policies is maintained in an adaptive fashion for semantic regions, which are associated with collections of tuples. This avoids the high overheads of tuple caching and, unlike page caching, is insensitive to bad clustering. Third, maintaining a semantic description of cached data enables the use of sophisticated value functions that incorporate semantic notions of locality, not just LRU or MRU, for cache replacement. We validate these ideas with a detailed performance study that includes traditional workloads as well as a workload motivated by a mobile navigation application.
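
A hedged sketch of the remainder-query idea, reduced to one-dimensional range predicates: the cache keeps semantic regions as intervals, and the remainder query is the part of a new query's range not covered by any cached region. The interval representation is an illustrative simplification of the paper's semantic descriptions.

    # Semantic caching over 1-D range predicates. A cached region is a
    # half-open interval (lo, hi); the remainder query is the uncovered part.
    def remainder(query, regions):
        """query: (lo, hi); regions: list of (lo, hi). Returns uncovered gaps."""
        lo, hi = query
        gaps = []
        for rlo, rhi in sorted(regions):
            if rlo > lo:
                gaps.append((lo, min(rlo, hi)))  # gap before this region
            lo = max(lo, rhi)                    # skip the covered part
            if lo >= hi:
                break
        if lo < hi:
            gaps.append((lo, hi))                # tail not covered by any region
        return [g for g in gaps if g[0] < g[1]]

    cached = [(0, 10), (20, 30)]                 # regions already at the client
    print(remainder((5, 25), cached))            # -> [(10, 20)]: fetch only this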

610 citations


Proceedings Article
22 Jan 1996
TL;DR: Using trace-driven simulation, it is shown that a weak cache consistency protocol (the one used in the Alex ftp cache) reduces network bandwidth consumption and server load more than either time-to-live fields or an invalidation protocol and can be tuned to return stale data less than 5% of the time.
Abstract: The bandwidth demands of the World Wide Web continue to grow at a hyper-exponential rate. Given this rocketing growth, caching of web objects as a means to reduce network bandwidth consumption is likely to be a necessity in the very near future. Unfortunately, many Web caches do not satisfactorily maintain cache consistency. This paper presents a survey of contemporary cache consistency mechanisms in use on the Internet today and examines recent research in Web cache consistency. Using trace-driven simulation, we show that a weak cache consistency protocol (the one used in the Alex ftp cache) reduces network bandwidth consumption and server load more than either time-to-live fields or an invalidation protocol and can be tuned to return stale data less than 5% of the time.
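
The Alex-style weak consistency referenced above assigns each cached object a time-to-live proportional to its age, on the premise that long-unmodified objects tend to stay unmodified. A minimal sketch; the 10% factor and one-day cap are illustrative tuning parameters, not values from the paper.

    import time

    def adaptive_ttl(last_modified, now=None, factor=0.1, max_ttl=86400.0):
        """Adaptive TTL: a fraction of the object's age, capped at max_ttl.
        Objects that have not changed for a long time are trusted longer."""
        now = time.time() if now is None else now
        age = max(0.0, now - last_modified)
        return min(factor * age, max_ttl)

    # A document last modified 10 days ago may be served from cache for up
    # to a day before revalidation; a freshly changed one is rechecked soon.
    print(adaptive_ttl(last_modified=time.time() - 10 * 86400))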

342 citations


Journal ArticleDOI
TL;DR: This article presents the design, implementation, and performance of a file system that integrates application-controlled caching, prefetching, and disk scheduling and shows that this combination of techniques greatly improves the performance of the file system.
Abstract: As the performance gap between disks and microprocessors continues to increase, effective utilization of the file cache becomes increasingly important. Application-controlled file caching and prefetching can apply application-specific knowledge to improve file cache management. However, supporting application-controlled file caching and prefetching is nontrivial because caching and prefetching need to be integrated carefully, and the kernel needs to allocate cache blocks among processes appropriately. This article presents the design, implementation, and performance of a file system that integrates application-controlled caching, prefetching, and disk scheduling. We use a two-level cache management strategy. The kernel uses the LRU-SP (Least-Recently-Used with Swapping and Placeholders) policy to allocate blocks to processes, and each process integrates application-specific caching and prefetching based on the controlled-aggressive policy, an algorithm previously shown in a theoretical sense to be nearly optimal. Each process also improves its disk access latency by submitting its prefetches in batches so that the requests can be scheduled to optimize disk access performance. Our measurements show that this combination of techniques greatly improves the performance of the file system. We measured that the running time is reduced by 3% to 49% (average 26%) for single-process workloads and by 5% to 76% (average 32%) for multiprocess workloads.

249 citations


Proceedings ArticleDOI
03 Feb 1996
TL;DR: A cache design that provides the same miss rate as a two-way set associative cache but with an access time closer to that of a direct-mapped cache, and that is easier to implement than previous designs.
Abstract: In this paper we propose a cache design that provides the same miss rate as a two-way set associative cache, but with an access time closer to a direct-mapped cache. As with other designs, a traditional direct-mapped cache is conceptually partitioned into multiple banks, and the blocks in each set are probed, or examined, sequentially. Other designs either probe the set in a fixed order or add extra delay in the access path for all accesses. We use prediction sources to guide the cache examination, reducing the amount of searching and thus the average access latency. A variety of accurate prediction sources are considered, with some being available in early pipeline stages. We feel that our design offers the same or better performance and is easier to implement than previous designs.
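
A small Python model of the idea, assuming a 2-way cache whose per-set prediction names the bank to probe first: a correct prediction completes in one probe at direct-mapped speed, and a misprediction pays for a second probe. The prediction source used here (the last way that hit) is only one of the sources the paper considers.

    import random

    class PredictiveCache:
        def __init__(self, num_sets=64):
            self.tags = [[None, None] for _ in range(num_sets)]
            self.pred = [0] * num_sets          # predicted way per set
            self.num_sets = num_sets

        def access(self, addr):
            """Returns the number of probes on a hit, 0 on a miss."""
            s, tag = addr % self.num_sets, addr // self.num_sets
            first = self.pred[s]
            if self.tags[s][first] == tag:
                return 1                         # first-probe hit (fast path)
            other = 1 - first
            if self.tags[s][other] == tag:
                self.pred[s] = other             # update the prediction source
                return 2                         # second-probe hit (slower)
            self.tags[s][other] = tag            # miss: fill non-predicted way
            self.pred[s] = other
            return 0

    c = PredictiveCache()
    probes = [c.access(random.randrange(512)) for _ in range(10000)]
    hits = [p for p in probes if p > 0]
    print("first-probe hit fraction:",
          sum(p == 1 for p in hits) / max(1, len(hits)))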

233 citations


Proceedings ArticleDOI
10 Jun 1996
TL;DR: The paper describes how to incorporate the effect of instruction cache to the Response Time schedulability Analysis (RTA), an efficient analysis for preemptive fixed priority schedulers and compares the results of such an approach to both cache partitioning and CRMA.
Abstract: Cache memories are commonly avoided in real-time systems because of their unpredictable behavior. Recently, some research has been done to obtain tighter bounds on the worst case execution time (WCET) of cached programs. These techniques usually assume a non-preemptive underlying system. However, some techniques can be applied to allow the use of caches in preemptive systems. The paper describes how to incorporate the effect of the instruction cache into Response Time schedulability Analysis (RTA). RTA is an efficient analysis for preemptive fixed-priority schedulers. We also compare through simulations the results of such an approach to both cache partitioning (increasing cache predictability by assigning private cache partitions to tasks) and CRMA (Cached RMA: the cache effect is incorporated into the utilization-based rate monotonic schedulability analysis). The results show that the cached version of RTA (CRTA) clearly outperforms CRMA; however, the partitioning scheme may be better depending on the system configuration. The obtained results bound the applicability domain of each method for a variety of hardware and workload configurations. The results can be used as design guidelines.
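
CRTA extends the usual response-time fixed-point iteration; a common textbook formulation charges a cache-refill penalty for each preemption by a higher-priority task. A sketch under that assumption (the paper's exact formulation may differ), with gamma[j] the refill delay induced by a preemption of task j:

    from math import ceil

    def crta_response_time(C, T, gamma, i, max_iter=1000):
        """Response time of task i under fixed priorities (tasks 0..i-1 are
        higher priority). Each preemption by task j adds gamma[j] of cache
        refill delay. Returns None if the response exceeds the period."""
        R = C[i]
        for _ in range(max_iter):
            R_next = C[i] + sum(ceil(R / T[j]) * (C[j] + gamma[j])
                                for j in range(i))
            if R_next == R:
                return R                         # fixed point reached
            if R_next > T[i]:
                return None                      # unschedulable (deadline = period)
            R = R_next
        return None

    C, T, gamma = [1, 2, 4], [5, 12, 30], [0.2, 0.3, 0.0]
    print(crta_response_time(C, T, gamma, i=2))  # -> 8.7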

182 citations


Proceedings ArticleDOI
26 Feb 1996
TL;DR: An energy-efficient cache invalidation method, called GCORE (Grouping with COld update-set REtention), that allows a mobile computer to operate in a disconnected mode to save the battery while still retaining most of the caching benefits after a reconnection is presented.
Abstract: Caching can reduce the bandwidth requirement in a mobile computing environment. However, due to battery power limitations, a wireless mobile computer may often be forced to operate in a doze (or even totally disconnected) mode. As a result, the mobile computer may miss some cache invalidation reports broadcast by a server, forcing it to discard the entire cache contents after waking up. In this paper, we present an energy-efficient cache invalidation method, called GCORE (Grouping with COld update-set REtention), that allows a mobile computer to operate in a disconnected mode to save the battery while still retaining most of the caching benefits after a reconnection. We present an efficient implementation of GCORE and conduct simulations to evaluate its caching effectiveness. The results show that GCORE can substantially improve mobile caching by reducing the communication bandwidth (or energy consumption) for query processing.

173 citations


Proceedings Article
03 Sep 1996
TL;DR: The design of WATCHMAN, an intelligent cache manager for sets retrieved by queries that is particularly well suited to a data warehousing environment and achieves a substantial performance improvement in a decision support setting when compared to a traditional LRU replacement algorithm.
Abstract: Data warehouses store large volumes of data which are used frequently by decision support applications. Such applications involve complex queries. Query performance in such an environment is critical because decision support applications often require interactive query response time. Because data warehouses are updated infrequently, it becomes possible to improve query performance by caching sets retrieved by queries in addition to query execution plans. In this paper we report on the design of an intelligent cache manager for sets retrieved by queries called WATCHMAN, which is particularly well suited for a data warehousing environment. Our cache manager employs two novel, complementary algorithms for cache replacement and for cache admission. WATCHMAN aims at minimizing query response time, and its cache replacement policy swaps out entire retrieved sets of queries instead of individual pages. The cache replacement and admission algorithms make use of a profit metric, which considers for each retrieved set its average rate of reference, its size, and the execution cost of the associated query. We report on a performance evaluation based on the TPC-D and Set Query benchmarks. These experiments show that WATCHMAN achieves a substantial performance improvement in a decision support environment when compared to a traditional LRU replacement algorithm.
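
A hedged sketch of profit-based replacement and admission for cached query result sets, taking profit as (reference rate x recomputation cost) / size; the paper's precise metric and bookkeeping may differ. A new set is admitted only if it is more profitable than each set it would displace.

    class ResultCache:
        def __init__(self, capacity):
            self.capacity, self.used = capacity, 0
            self.sets = {}                 # query -> (size, cost, rate)

        def profit(self, size, cost, rate):
            return rate * cost / size      # benefit per unit of cache space

        def admit(self, query, size, cost, rate):
            if size > self.capacity:
                return False
            # Consider victims from least profitable upward.
            victims = sorted(self.sets, key=lambda q: self.profit(*self.sets[q]))
            freed, chosen = 0, []
            for q in victims:
                if self.used - freed + size <= self.capacity:
                    break                  # enough space has been found
                if self.profit(*self.sets[q]) >= self.profit(size, cost, rate):
                    return False           # admission control rejects the set
                freed += self.sets[q][0]
                chosen.append(q)
            for q in chosen:
                self.used -= self.sets.pop(q)[0]
            self.sets[query] = (size, cost, rate)
            self.used += size
            return True

    c = ResultCache(capacity=100)
    print(c.admit("Q1", size=60, cost=5.0, rate=2.0))   # True
    print(c.admit("Q2", size=60, cost=1.0, rate=0.1))   # False: Q1 is worth more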

165 citations


Patent
David Brian Kirk
18 Mar 1996
Abstract: The traditional computer system is modified by providing, in addition to a processor unit, a main memory and a cache memory buffer, remapping logic for remapping the cache memory buffer, and a plurality of registers for containing remapping information. In this environment the cache memory buffer is divided into segments, and the segments are one or more cache lines allocated to a task to form a partition, so as to make available (if a size is set above zero) a shared partition and a group of private partitions. The registers include count registers which contain the number of cache segments in a specific partition, a flag register, and two registers which act as cache identification number registers. The flag register has bits acting as flags, which include a non-real-time flag that allows operation without the partition system, a private-partition-permitted flag, and a private-partition-selected flag. With this system a traditional computer system can be changed to operate without the impediments of interrupts and other prior obstacles to real-time task execution. By providing cache partition areas, causing an active task always to have a pointer to a private partition, and using a size register to specify how many segments can be used by the task, real-time systems can take advantage of a cache. Thus each task can make use of a shared partition and know how many segments it can use. The system cache provides a high-speed access path to memory data, so that during execution of a task the logic and registers provide any necessary cache partitioning to assure a preempted task that its cache contents will not be destroyed by a preempting task. This permits use of a software-controlled partitioning system which allows segments of a cache to be statically allocated on a priority/benefit basis without hardware modification to the system. The cache allocation provided by the logic takes into consideration the scheduling requirements of the system's tasks in deciding the size of each cache partition. Accordingly, the cache can make use of a dynamic programming implementation of an allocation algorithm which can determine an optimal cache allocation in polynomial time.

155 citations


Proceedings ArticleDOI
12 Aug 1996
TL;DR: This paper presents a simple but efficient novel hardware design called the non-temporal streaming (NTS) cache that supplements the conventional direct-mapped cache with a parallel fully associative buffer.
Abstract: Direct-mapped caches are often plagued by conflict misses because they lack the associativity to store more than one memory block in each set. However, some blocks that have no temporal locality actually cause program execution degradation by displacing blocks that do manifest temporal behavior. In this paper, we present a simple but efficient novel hardware design called the non-temporal streaming (NTS) cache that supplements the conventional direct-mapped cache with a parallel fully associative buffer. Every cache block loaded into the main cache is monitored for temporal behavior by a hardware detection unit. Cache blocks identified as nontemporal are allocated to the buffer on subsequent requests. Our simulations show that the NTS Cache not only provides a performance improvement over the conventional direct-mapped cache, but can also save on-chip area. For some numerical programs like FFTPDE, APPSP and APPBT from the NAS benchmark suite, an integral NTS Cache of size 9 KB (i.e., 8 KB direct-mapped cache plus 1 KB NT buffer) performs as well as a 16 KB conventional direct-mapped cache.
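
A simplified Python model of the NTS idea: a direct-mapped main cache plus a small fully associative buffer, with blocks that were evicted without ever being re-referenced flagged as non-temporal and routed to the buffer on their next fetch. The detection rule below is a software stand-in for the paper's hardware detection unit.

    from collections import OrderedDict

    class NTSCache:
        def __init__(self, sets=256, buf_blocks=32):
            self.sets = sets
            self.main = {}                       # set index -> (tag, reused)
            self.buf = OrderedDict()             # block addr -> True (LRU order)
            self.nontemporal = set()             # blocks flagged non-temporal
            self.buf_blocks = buf_blocks

        def access(self, block):
            s, tag = block % self.sets, block // self.sets
            if s in self.main and self.main[s][0] == tag:
                self.main[s] = (tag, True)       # re-hit: temporal behavior
                return "hit-main"
            if block in self.buf:
                self.buf.move_to_end(block)
                return "hit-buffer"
            if block in self.nontemporal:        # NT blocks bypass the main cache
                self.buf[block] = True
                if len(self.buf) > self.buf_blocks:
                    self.buf.popitem(last=False)
                return "miss-to-buffer"
            if s in self.main and not self.main[s][1]:
                old_tag = self.main[s][0]        # victim never re-hit: flag it NT
                self.nontemporal.add(old_tag * self.sets + s)
            self.main[s] = (tag, False)
            return "miss-to-main"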

Patent
20 Dec 1996
TL;DR: In this paper, a hybrid NUMA/COMA cache architecture with a cache-coherent protocol is proposed for a computer system having a plurality of sub-systems coupled to each other via a system interconnect.
Abstract: The present invention provides a hybrid Non-Uniform Memory Architecture (NUMA) and Cache-Only Memory Architecture (COMA) caching architecture together with a cache-coherent protocol for a computer system having a plurality of sub-systems coupled to each other via a system interconnect. In one implementation, each sub-system includes at least one processor, a page-oriented COMA cache and a line-oriented hybrid NUMA/COMA cache. Such a hybrid system provides flexibility and efficiency in caching both large and small, and/or sparse and packed data structures. Each sub-system is able to independently store data in COMA mode or in NUMA mode. When caching in COMA mode, a sub-system allocates a page of memory space and then stores the data within the allocated page in its COMA cache. Depending on the implementation, while caching in COMA mode, the sub-system may also store the same data in its hybrid cache for faster access. Conversely, when caching in NUMA mode, the sub-system stores the data, typically a line of data, in its hybrid cache.

Patent
13 Nov 1996
TL;DR: In this article, an integrated processor and level two (L2) dynamic random access memory (DRAM) are fabricated on a single chip, and the L2 DRAM cache is placed on the same chip as the processor to eliminate the time needed for two chip-to-chip crossings.
Abstract: An integrated processor and level two (L2) dynamic random access memory (DRAM) are fabricated on a single chip. As an extension of this basic structure, the invention also contemplates multiprocessor "node" chips in which multiple processors are integrated on a single chip with L2 cache. By integrating the processor and L2 DRAM cache on a single chip, high on-chip bandwidth, reduced latency and higher performance are achieved. A multiprocessor system can be realized in which a plurality of processors with integrated L2 DRAM cache are connected in a loosely coupled multiprocessor system. Alternatively, the single chip technology can be used to implement a plurality of processors integrated on a single chip with an L2 DRAM cache which may be either private or shared. This approach overcomes a number of issues which limit the performance and cost of a memory hierarchy. When the L2 DRAM cache is placed on the same chip as the processor, the time needed for two chip-to-chip crossings is eliminated. Since these crossings require off-chip drivers and receivers and must be synchronized with the system clock, the time involved is substantial. This means that with the integrated L2 DRAM cache, latency is reduced.

Patent
Douglas B. Boyle
05 Jan 1996
TL;DR: Group cache look-up tables minimize requests for data items outside the groups and greatly reduce the service load on servers holding popular data items; each client in the group has access to the group cache look-up table, and any client or group can cache any data item.
Abstract: An information system and method for reducing workload load on servers in an information system network. The system defines a group of interconnected clients which have associated cache memories. The system maintains a shared group cache look-up table for the group having entries which identify data items cached by the clients within the group and identify the clients at which the data items are cached. Each client in the group has access to the group cache look-up table, and any client or group can cache any data item. The system can include a hierarchy of groups, with each group having a group cache look-up table. The group cache look-up tables minimize requests for data items outside the groups and greatly minimize the service load on servers having popular data items.
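
A minimal sketch of the group cache look-up table: clients in a group consult a shared item-to-client map before going to the origin server, so a popular item is fetched from the server only once per group. Names and the update rule are illustrative assumptions.

    class CacheGroup:
        def __init__(self, server):
            self.server = server                # fallback data source
            self.lookup = {}                    # item id -> caching client id
            self.client_caches = {}             # client id -> {item: data}

        def fetch(self, client, item):
            owner = self.lookup.get(item)
            if owner is not None:               # another group member has it
                return self.client_caches[owner][item], "peer:" + owner
            data = self.server(item)            # group-wide miss: hit the server
            self.client_caches.setdefault(client, {})[item] = data
            self.lookup[item] = client          # advertise to the whole group
            return data, "server"

    g = CacheGroup(server=lambda item: "payload-for-" + item)
    print(g.fetch("c1", "index.html"))          # -> (..., 'server')
    print(g.fetch("c2", "index.html"))          # -> (..., 'peer:c1')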

Proceedings ArticleDOI
01 Sep 1996
TL;DR: Experiments with several application programs show that the thread scheduling method can improve program performance by reducing second-level cache misses.
Abstract: This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valuable when compiler-directed tiling is not feasible. Experiments with several application programs, on two systems with different cache structures, show that our thread scheduling method can improve program performance by reducing second-level cache misses.
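
A toy sketch of hint-driven scheduling: each thread is created with a locality hint (here, the index of the data region it touches), and the scheduler runs threads in hint order so threads sharing data execute consecutively. The hint encoding is an assumption; the paper's algorithm is richer than a sort.

    class HintScheduler:
        def __init__(self):
            self.threads = []                  # (hint, function) pairs

        def spawn(self, func, hint):
            self.threads.append((hint, func))

        def run(self):
            # Sorting by hint groups threads that touch the same region, so
            # each region is brought into cache once instead of repeatedly.
            for _, func in sorted(self.threads, key=lambda t: t[0]):
                func()

    sched = HintScheduler()
    data = [[i] * 1000 for i in range(4)]      # four data "regions"
    for i in (2, 0, 3, 1, 0, 2):               # creation order is scattered
        sched.spawn(lambda i=i: sum(data[i]), hint=i)
    sched.run()                                # executes grouped: 0,0,1,2,2,3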

Patent
28 Mar 1996
TL;DR: The cache controller has two modes of operation: a first, standard mode in which read/write access to the cache memory is preceded by generation of the hit/miss signal by the comparator, and a second, accelerated mode in which access is initiated without waiting for the comparator to process the access request's address value.
Abstract: A multiprocessor computer system has data processors and a main memory coupled to a system controller. Each data processor has a cache memory. Each cache memory has a cache controller with two ports for receiving access requests. A first port receives access requests from the associated data processor and a second port receives access requests from the system controller. All cache memory access requests include an address value; access requests from the system controller also include a mode flag. A comparator in the cache controller processes the address value in each access request and generates a hit/miss signal indicating whether the data block corresponding to the address value is stored in the cache memory. The cache controller has two modes of operation, including a first standard mode of operation in which read/write access to the cache memory is preceded by generation of the hit/miss signal by the comparator, and a second accelerated mode of operation in which read/write access to the cache memory is initiated without waiting for the comparator to process the access request's address value. The first mode of operation is used for all access requests by the data processor and for system controller access requests when the mode flag has a first value. The second mode of operation is used for the system controller access requests when the mode flag has a second value distinct from the first value.

Journal ArticleDOI
01 May 1996
TL;DR: It is shown that even a small amount of main memory used as a document cache is enough to hold more than 60% of the documents requested, and that traditional file system cache management methods are inappropriate for managing Main Memory Web caches.
Abstract: An increasing amount of information is currently becoming available through World Wide Web servers. Document requests to popular Web servers arrive every few tens of milliseconds at peak rate. To reduce the overhead imposed by frequent document requests, we propose the notion of caching a World Wide Web server's documents in its main memory (which we call Main Memory Web Caching). We show that even a small amount of main memory (512 Kbytes) that is used as a document cache is enough to hold more than 60% of the documents requested. We also show that traditional file system cache management methods are inappropriate for managing Main Memory Web caches, and may result in poor performance. Based on trace-driven simulations of several server traces we quantify our claims, and propose a new cache management policy that dynamically adjusts itself to the clients' request pattern and cache size. We show that our policy is robust over a variety of parameters and results in better overall performance.

Patent
15 Apr 1996
TL;DR: In this paper, the authors propose a separate region conversion system that is capable of maintaining the hit rate of the cache at high level by simplifying the cache status and to improve the execution efficiency of the application program.
Abstract: The goal is to reduce processing time by simplifying the cache status, and to improve the execution efficiency of the application program, in a separate region conversion system that is capable of maintaining a high cache hit rate. When an access request is made to an object that is not stored in the object cache, the page containing the object is read from the database and stored in the page cache, and the object is read from the page and stored in the object cache. The page cache status, describing the state of each page stored in the page cache, is kept in the page status storage device; at the same time, the object cache status, describing the state of each object stored in the object cache, is kept in the object status storage device. A relationship is established between the page cache status and the object cache status, and if the two are not consistent, the status synchronizing device executes a synchronization process to make them consistent.

Patent
Yet-Ping Pai, Le T. Nguyen
15 Nov 1996
TL;DR: A cache control unit and a method of controlling a cache coupled to a cache accessing device, in which request identification information is assigned to each cache request and provided to the requesting device.
Abstract: A cache control unit and a method of controlling a cache. The cache is coupled to a cache accessing device. A first cache request is received from the device. Request identification information is assigned to the first cache request and provided to the requesting device, and processing of the first cache request may begin. A second cache request is then received from the cache accessing device; it is likewise assigned request identification information that is provided to the requesting device. The first and second cache requests are then fully serviced.

Patent
Brian Berliner
01 May 1996
TL;DR: A multi-tier cache system and a method for implementing it are described, in which a small cache in random access memory (RAM) is managed in a Least Recently Used (LRU) fashion.
Abstract: A multi-tier cache system and a method for implementing the multi-tier cache system is disclosed. The multi-tier cache system has a small cache in random access memory (RAM) that is managed in a Least Recently Used (LRU) fashion. The RAM cache is a subset of a much larger non-volatile cache on rotating magnetic media (e.g., a hard disk drive). The non-volatile cache is, in turn, a subset of a local CD-ROM or of a CD-ROM or mass storage device controlled by a server system. In a preferred embodiment of the invention, a heuristic technique is employed to establish a RAM cache of optimum size within the system memory. Also in a preferred embodiment, the RAM cache is made up of multiple identically-sized sub-blocks. A small amount of RAM is utilized to maintain a table which implements a Least Recently Used (LRU) RAM cache purging scheme. A hashing mechanism is employed to search for the "bucket" within the RAM cache in which the requested data may be located. If the requested data is in the RAM cache, the request is satisfied with that data. If the requested data is not in the RAM cache, the least recently used sub-block is purged from the cache if the cache is full, and the RAM cache is updated from the non-volatile cache whenever possible, and from the cached storage device when the non-volatile cache does not contain the requested data.
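
A sketch of the multi-tier lookup path described above: a small LRU-managed RAM cache backed by a larger non-volatile disk cache, backed in turn by the slow device (CD-ROM or server). Sizes, names, and the use of plain LRU at both tiers are illustrative simplifications; the patent also describes hashing into buckets and heuristic RAM-cache sizing.

    from collections import OrderedDict

    class MultiTierCache:
        def __init__(self, ram_blocks, disk_blocks, backing):
            self.ram = OrderedDict()           # block -> data, LRU order
            self.disk = OrderedDict()          # larger non-volatile tier
            self.ram_blocks, self.disk_blocks = ram_blocks, disk_blocks
            self.backing = backing             # e.g., a CD-ROM read function

        def read(self, block):
            if block in self.ram:              # fastest tier
                self.ram.move_to_end(block)
                return self.ram[block]
            if block in self.disk:             # refill RAM from the disk cache
                self.disk.move_to_end(block)
                data = self.disk[block]
            else:                              # go to the slow backing device
                data = self.backing(block)
                self.disk[block] = data
                if len(self.disk) > self.disk_blocks:
                    self.disk.popitem(last=False)
            self.ram[block] = data
            if len(self.ram) > self.ram_blocks:
                self.ram.popitem(last=False)   # purge least recently used
            return data

    c = MultiTierCache(4, 64, backing=lambda b: ("cdrom", b))
    c.read(7); c.read(7)                       # second read served from RAM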

Patent
29 Oct 1996
TL;DR: A caching logic comprising a selection logic and an admission control logic, in which the admission control logic decides whether an object not currently in the cache may be cached at all when it is accessed.
Abstract: A system and method for caching objects of non-uniform size. A caching logic includes a selection logic and an admission control logic. The admission control logic determines whether an object not currently in the cache may be cached at all when it is accessed. The admission control logic uses an auxiliary LRU stack which contains the identities and time stamps of the objects which have been recently accessed, so the memory required is relatively small. The auxiliary stack serves as a dynamic popularity list, and an object may be admitted to the cache if and only if it appears on the popularity list. The selection logic selects one or more of the objects in the cache which have to be purged when a new object enters the cache. The order of removal of the objects is prioritized based both on the size and the frequency of access of the object, and may be adjusted by a time-to-obsolescence (TTO) factor. To reduce the time required to compare the space-time product of each object in the cache, the objects may be classified into ranges having geometrically increasing intervals. Specifically, multiple LRU stacks are maintained independently, wherein each LRU stack contains only objects in a predetermined range of sizes. In order to choose candidates for replacement, only the least recently used objects in each group need be considered.
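
A compact sketch of the two mechanisms: an auxiliary LRU stack of recently seen object ids serves as the popularity list for admission, and cached objects are grouped into LRU stacks by geometrically increasing size ranges, so only each group's LRU object is a replacement candidate. The victim choice below (largest candidate first) stands in for the patent's space-time priority, and the TTO adjustment is omitted.

    import math
    from collections import OrderedDict

    class SizeAwareCache:
        def __init__(self, capacity, popularity_slots=1024):
            self.capacity, self.used = capacity, 0
            self.popular = OrderedDict()          # auxiliary LRU of object ids
            self.popularity_slots = popularity_slots
            self.stacks = {}                      # size class -> {id: size}

        def access(self, obj_id, size):
            seen = obj_id in self.popular
            self.popular[obj_id] = True
            self.popular.move_to_end(obj_id)
            if len(self.popular) > self.popularity_slots:
                self.popular.popitem(last=False)
            klass = int(math.log2(max(size, 1)))  # geometric size ranges
            stack = self.stacks.setdefault(klass, OrderedDict())
            if obj_id in stack:
                stack.move_to_end(obj_id)
                return "hit"
            if not seen:
                return "miss, not admitted"       # must be on the popularity list
            while self.used + size > self.capacity:
                lru = [(st, next(iter(st))) for st in self.stacks.values() if st]
                if not lru:
                    return "miss, too large"
                st, victim = max(lru, key=lambda c: c[0][c[1]])  # purge biggest
                self.used -= st.pop(victim)
            stack[obj_id] = size
            self.used += size
            return "miss, admitted"

    c = SizeAwareCache(capacity=100)
    print(c.access("a", 60))   # miss, not admitted (first sighting)
    print(c.access("a", 60))   # miss, admitted
    print(c.access("a", 60))   # hit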

Patent
15 Aug 1996
TL;DR: In this article, a hierarchical cache architecture that reduces traffic on a main memory bus while overcoming the disadvantages of prior systems is proposed. But it does not address the disadvantages associated with the use of store-through-type caches at level one.
Abstract: A hierarchical cache architecture that reduces traffic on a main memory bus while overcoming the disadvantages of prior systems. The architecture includes a plurality of level one caches of the store-through type; each level one cache is associated with a processor and may be incorporated into the processor. Subsets (or "clusters") of processors, along with their associated level one caches, are formed, and a level two cache is provided for each cluster. Each processor/level one cache pair within a cluster is coupled to the cluster's level two cache through a dedicated bus. By configuring the processors and caches in this manner, not only is the speed advantage normally associated with the use of cache memory realized, but the number of memory bus accesses is reduced without the disadvantages associated with the use of store-in caches at level one and without the disadvantages associated with the use of a shared cache bus.

Patent
28 Jun 1996
TL;DR: In this article, the authors propose an apparatus and method for synchronizing a cache mode in a cache memory system in a computer to protect cache operations, where cache mode is stored as metadata in the cache modules and is detected by the first controller to determine the cache mode.
Abstract: An apparatus and method for synchronizing a cache mode in a cache memory system in a computer to protect cache operations. The cache memory system has a first controller and a second controller and two cache modules and operates in a plurality of cache modes. The cache mode is stored as metadata in the cache modules and is detected by the first controller to determine the cache mode. Lock signals in the first controller are set in accordance with the cache mode detected to set the cache mode state in the first controller. The second controller copies the cache mode state from the first controller to synchronize both controllers in the same cache mode state. After a failure of the second controller, the first controller may lock access to both caches to recover data previously accessed by the second controller. The second controller restarts and copies the cache mode state from the first controller, so that both controllers return to the cache mode state prior to the failure of the second controller.

Book
31 Mar 1996
TL;DR: This book models a page server DBMS architecture, studies the performance of cache consistency algorithms, and works towards a flexible distributed DBMS architecture, showing clear trends in both client and server performance.
Abstract:
Foreword
Preface
1 Introduction
2 Client-Server Database Systems
3 Modeling a Page Server DBMS
4 Client Cache Consistency
5 Performance of Cache Consistency Algorithms
6 Global Memory Management
7 Local Disk Caching
8 Towards a Flexible Distributed DBMS Architecture
9 Conclusions
References
Index

Patent
28 Jun 1996
TL;DR: A cache memory system in a computer is enabled into one of a plurality of cache modes, with the cache memories partitioned into quadrants, two quadrants in each cache memory.
Abstract: A cache memory system in a computer is enabled into one of a plurality of cache modes. The cache memory system has a first controller and two cache memories; the cache memories are partitioned into quadrants, with two quadrants in each cache memory. A cache mode detector in the first controller detects a mirror cache mode set for the cache memory system. An address enabler in the first controller enables access to a first pair of quadrants, one quadrant in each cache memory, in response to detection of a mirror cache mode. A second controller follows the cache mode set by the cache mode detector and has its own address enabler. The address enabler in the second controller enables access to both quadrants in one cache memory in a non-mirror cache mode, and enables access to a second pair of quadrants, one quadrant in each cache memory, in response to detection of a mirror cache mode by the cache mode detector.

Patent
Millind Mittal
17 Dec 1996
TL;DR: The locality hint is used to identify the lowest level where management of cache allocation is desired, and cache memory is allocated at that level and any higher level(s).
Abstract: A computer system and method in which allocation of a cache memory is managed by utilizing a locality hint value included within an instruction. When a processor accesses a memory for transfer of data between the processor and the memory, that access can be allocated or not allocated in the cache memory. The locality hint included within the instruction controls whether the cache allocation is to be made. When a plurality of cache memories are present, they are arranged into a cache hierarchy and a locality value is assigned to each level of the cache hierarchy where allocation control is desired. The locality hint may be used to identify the lowest level where management of cache allocation is desired, and cache memory is allocated at that level and any higher level(s). The locality hint value is based on spatial and/or temporal locality for the data associated with the access. Data is recognized at each cache hierarchy level depending on the attributes associated with the data at a particular level. If the locality hint identifies a particular access for data as temporal or non-temporal with respect to a particular cache level, the particular access may be determined to be temporal or non-temporal with respect to the higher and lower cache levels.
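
One plausible reading of the allocation rule, sketched in Python: the hint names the lowest hierarchy level at which the access should be allocated, and the line is installed at that level and all higher (larger, farther) levels, bypassing the levels below. The level numbering and bypass behavior are assumptions for illustration.

    # A three-level hierarchy where level 0 is L1, closest to the CPU.
    def access(hierarchy, addr, hint_level):
        for level, cache in enumerate(hierarchy):
            if level >= hint_level:
                cache.add(addr)       # allocate at the hint level and above

    L1, L2, L3 = set(), set(), set()
    hierarchy = [L1, L2, L3]
    access(hierarchy, 0x1000, hint_level=0)   # temporal: allocate everywhere
    access(hierarchy, 0x2000, hint_level=2)   # non-temporal w.r.t. L1 and L2
    print(0x2000 in L1, 0x2000 in L3)         # False True: L1/L2 not polluted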

Proceedings ArticleDOI
01 May 1996
TL;DR: The authors' techniques for improving the bandwidth of a single cache port, using additional buffering in the processor and taking maximum advantage of a wider cache port, achieve 91% of the performance of a dual-ported cache.
Abstract: The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor, and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. Our techniques using a single-ported cache achieve 91% of the performance of a dual-ported cache.

Proceedings ArticleDOI
17 Jun 1996
TL;DR: The stream cache, proposed for the first time in this paper, has the potential to cut execution times by half with the addition of a relatively small amount of additional hardware.
Abstract: Data prefetching is a well known technique for improving cache performance. While several studies have examined prefetch strategies for scientific and commercial applications, no published work has studied the special memory requirements of multimedia applications. This paper presents data for three types of hardware prefetching schemes: stream buffers, stride prediction tables, and a hybrid combination of the two, the stream cache. Use of the stride prediction table is shown to eliminate up to 90% of the misses that would otherwise be incurred in a moderate or large sized cache with no prefetching hardware. The stream cache, proposed for the first time in this paper, has the potential to cut execution times by half with the addition of a relatively small amount of additional hardware.
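
A minimal stride prediction table, the middle scheme of the three studied: each load PC tracks its last address and last stride, and when the same nonzero stride repeats, the next address is prefetched. The single-entry-per-PC table and confirmation rule are the usual formulation, not necessarily the paper's exact design.

    class StridePredictionTable:
        def __init__(self):
            self.table = {}                # pc -> (last_addr, stride)

        def access(self, pc, addr):
            """Returns an address to prefetch, or None."""
            last, stride = self.table.get(pc, (None, 0))
            prefetch = None
            if last is not None:
                new_stride = addr - last
                if new_stride == stride and stride != 0:
                    prefetch = addr + stride   # stride confirmed: run ahead
                stride = new_stride
            self.table[pc] = (addr, stride)
            return prefetch

    spt = StridePredictionTable()
    for a in (100, 108, 116, 124):             # a load striding by 8 bytes
        print(spt.access(pc=0x40, addr=a))     # None, None, 124, 132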

Patent
Robert Yung
13 Mar 1996
TL;DR: In this article, a cache structure for a microprocessor which provides set-prediction information for a separate, second-level cache, and a method for improving cache accessing, are provided.
Abstract: A cache structure for a microprocessor which provides set-prediction information for a separate, second-level cache, and a method for improving cache accessing, are provided. In the event of a first-level cache miss, the second-level set-prediction information is used to select the set in an N-way off-chip set-associative cache. This allows a set-associative structure to be used in a second-level cache (on or off chip) without requiring a large number of traces and/or pins. Since set-prediction is used, the subsequent access time for a comparison to determine that the correct set was predicted is not in the critical timing path unless there is a mis-prediction or a miss in the second-level cache. Also, a cache memory can be partitioned into M sets, with M being chosen so that the set size is less than or equal to the page size, allowing a cache access before a TLB translation is done, further speeding the access.

Patent
Peichun Peter Liu
29 Apr 1996
TL;DR: In this article, content-addressable tag-compare arrays (CAMs) are used to select a cache line, and arbitration logic in each subarray selects a word line (cache line).
Abstract: A cache memory for a computer uses content-addressable tag-compare arrays (CAMs) to determine if a match occurs. The cache memory is partitioned into four subarrays, i.e., interleaved, providing a wide cache line (word lines) but shallow depth (bit lines). The cache can be accessed by multiple addresses, producing multiple data outputs in a given cycle. Two effective addresses and one real address are applied at one time, and if addresses match in different subarrays, or two match on the same line in a single subarray, then multiple access is permitted. The two content-addressable memories, or CAMs, are used to select a cache line, and in parallel with this, arbitration logic in each subarray selects a word line (cache line).