
Showing papers on "Smart Cache published in 1995"


18 Jul 1995
TL;DR: This work assesses the potential of proxy servers to cache documents retrieved with the HTTP protocol, and finds that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users.
Abstract: As the number of World-Wide Web users grows, so does the number of connections made to servers. This increases both network load and server load. Caching can reduce both loads by migrating copies of server files closer to the clients that use those files. Caching can either be done at a client or in the network (by a proxy server or gateway). We assess the potential of proxy servers to cache documents retrieved with the HTTP protocol. We monitored traffic corresponding to three types of educational workloads over a one semester period, and used this as input to a cache simulation. Our main findings are (1) that with our workloads a proxy has a 30-50% maximum possible hit rate no matter how it is designed; (2) that when the cache is full and a document is replaced, least recently used (LRU) is a poor policy, but simple variations can dramatically improve hit rate and reduce cache size; (3) that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users; and (4) that certain tuning configuration parameters for a cache may have little benefit.
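
The replacement-policy comparison lends itself to a small trace-driven simulation. The sketch below is a minimal illustration rather than the authors' simulator: it replays a list of (url, size) requests through a byte-limited LRU proxy cache and reports the document hit rate, giving the LRU baseline that the paper's simple variations improve on. The trace format and capacity are assumptions.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity_bytes):
    """Replay (url, size) requests through a byte-limited LRU cache.

    Returns the fraction of requests served from the cache."""
    cache = OrderedDict()   # url -> size, ordered by recency
    used = 0
    hits = 0
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)                  # mark as most recently used
        elif size <= capacity_bytes:                # skip documents larger than the cache
            while used + size > capacity_bytes:
                _, evicted_size = cache.popitem(last=False)   # evict least recently used
                used -= evicted_size
            cache[url] = size
            used += size
    return hits / len(trace) if trace else 0.0

# Hypothetical example: a 1 MB proxy cache over a tiny request trace.
trace = [("/a", 200_000), ("/b", 300_000), ("/a", 200_000),
         ("/c", 700_000), ("/a", 200_000)]
print(lru_hit_rate(trace, 1_000_000))   # 0.4 on this toy trace
```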

495 citations



Proceedings ArticleDOI
05 Jun 1995
TL;DR: The results suggest that distinguishing between documents produced locally and those produced remotely can provide useful leverage in designing caching policies, because of differences in the potential for sharing these two document types among multiple users.
Abstract: With the increasing demand for document transfer services such as the World Wide Web comes a need for better resource management to reduce the latency of documents in these systems. To address this need, we analyze the potential for document caching at the application level in document transfer services. We have collected traces of actual executions of Mosaic, reflecting over half a million user requests for WWW documents. Using those traces, we study the tradeoffs between caching at three levels in the system, and the potential for use of application-level information in the caching system. Our traces show that while a high hit rate in terms of URLs is achievable, a much lower hit rate is possible in terms of bytes, because most profitably-cached documents are small. We consider the performance of caching when applied at the level of individual user sessions, at the level of individual hosts, and at the level of a collection of hosts on a single LAN. We show that the performance gain achievable by caching at the session level (which is straightforward to implement) is nearly all of that achievable at the LAN level (where caching is more difficult to implement). However, when resource requirements are considered, LAN level caching becomes much more desirable, since it can achieve a given level of caching performance using a much smaller amount of cache space. Finally, we consider the use of organizational boundary information as an example of the potential for use of application-level information in caching. Our results suggest that distinguishing between documents produced locally and those produced remotely can provide useful leverage in designing caching policies, because of differences in the potential for sharing these two document types among multiple users.

177 citations


Journal ArticleDOI
01 Nov 1995
TL;DR: A method to maintain predictability of execution time within preemptive, cached real-time systems is introduced and the impact on compilation support for such a system is discussed.
Abstract: Cache memories have become an essential part of modern processors to bridge the increasing gap between fast processors and slower main memory. Until recently, cache memories were thought to impose unpredictable execution time behavior for hard real-time systems. But recent results show that the speedup of caches can be exploited without a significant sacrifice of predictability. These results were obtained under the assumption that real-time tasks be scheduled non-preemptively. This paper introduces a method to maintain predictability of execution time within preemptive, cached real-time systems and discusses the impact on compilation support for such a system. Preemptive systems with caches are made predictable via software-based cache partitioning. With this approach, the cache is divided into distinct portions associated with a real-time task, such that a task may only use its portion. The compiler has to support instruction and data partitioning for each task. Instruction partitioning involves non-linear control-flow transformations, while data partitioning involves code transformations of data references. The impact on execution time of these transformations is also discussed.
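
A minimal sketch of the software-based partitioning constraint, under assumed parameters (a direct-mapped cache with 256 sets and 32-byte lines, and a contiguous range of sets per task): each task's code and data must be laid out so that every address it touches maps only into its own sets. The set-index arithmetic below illustrates the condition a partitioning compiler would have to enforce; it is not the paper's algorithm.

```python
LINE_SIZE = 32    # bytes per cache line (assumed)
NUM_SETS  = 256   # sets in a direct-mapped cache (assumed)

def cache_set(addr):
    """Set index an address maps to in a direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_SETS

def in_partition(addr, first_set, num_sets_for_task):
    """True if addr falls inside the task's assigned range of cache sets."""
    s = cache_set(addr)
    return first_set <= s < first_set + num_sets_for_task

# Task A owns sets 0..63, task B owns sets 64..127 (assumed split).
# A partitioning compiler must place each task's code/data so that every
# address it references satisfies this predicate for its own partition.
print(in_partition(0x0000, 0, 64))   # True: maps to set 0
print(in_partition(0x2000, 0, 64))   # True: 0x2000/32 = 256 wraps back to set 0
print(in_partition(0x0800, 0, 64))   # False: maps to set 64, outside task A's range
```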

139 citations


Journal ArticleDOI
TL;DR: In this paper, the authors analyzed two days of queries to the NCSA Mosaic server to assess the geographic distribution of transaction requests, and analyzed the impact of caching query results within the geographic zone from which the request was sourced, in terms of the reduction in transactions with, and bandwidth volume from, the main server.
Abstract: We analyze two days of queries to the popular NCSA Mosaic server to assess the geographic distribution of transaction requests. The wide geographic diversity of query sources and popularity of a relatively small portion of the web server file set present a strong case for deployment of geographically distributed caching mechanisms to improve server and network efficiency. The NCSA web server consists of four servers in a cluster. We show time series of bandwidth and transaction demands for the server cluster and break these demands down into components according to geographical source of the query. We analyze the impact of caching the results of queries within the geographic zone from which the request was sourced, in terms of reduction of transactions with and bandwidth volume from the main server. We find that a cache document timeout even as low as 1024 seconds (about 17 minutes) during the two days that we analyzed would have saved between 40% and 70% of the bytes transferred from the central server. We investigate a range of timeouts for flushing documents from the cache, outlining the tradeoff between bandwidth savings and memory/cache management costs. We discuss the implications of this tradeoff in the face of possible future usage-based pricing of backbone services that may connect several cache sites. We also discuss other issues that caching inevitably poses, such as how to redirect queries initially destined for a central server to a preferred cache site. The preference of a cache site may be a function of not only geographic proximity, but also current load on nearby servers or network links. Such refinements in the web architecture will be essential to the stability of the network as the web continues to grow, and operational geographic analysis of queries to archive and library servers will be fundamental to its effective evolution.
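
The timeout analysis can be approximated with a simple accounting loop: cache each document at the requesting zone, serve repeat requests from that cache while the entry is younger than the timeout, and tally the bytes that never had to leave the central server. The sketch below is an illustration of that accounting rather than the authors' tooling; the trace format, the single-zone view, and unlimited cache space are assumptions.

```python
def bytes_saved(trace, timeout_s):
    """trace: list of (time_s, url, size_bytes) requests arriving at one zone cache.

    Returns (bytes_from_server, bytes_served_from_cache) under a simple
    per-document timeout (TTL) policy with unlimited cache space."""
    last_fetch = {}   # url -> time it was last fetched from the central server
    sizes = {}
    from_server = from_cache = 0
    for t, url, size in trace:
        fresh = url in last_fetch and (t - last_fetch[url]) <= timeout_s
        if fresh:
            from_cache += sizes[url]        # served locally, no central-server bytes
        else:
            last_fetch[url] = t             # (re)fetch from the central server
            sizes[url] = size
            from_server += size
    return from_server, from_cache

# Hypothetical trace: repeated requests for one popular document.
trace = [(0, "/popular.html", 10_000), (600, "/popular.html", 10_000),
         (1_500, "/popular.html", 10_000)]
print(bytes_saved(trace, timeout_s=1024))   # (20000, 10000)
```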

118 citations


Patent
31 Aug 1995
TL;DR: In this article, a data cache configured to perform store accesses in a single clock cycle is provided, where the data cache speculatively stores data within a predicted way of the cache after capturing the data currently being stored in that predicted way.
Abstract: A data cache configured to perform store accesses in a single clock cycle is provided. The data cache speculatively stores data within a predicted way of the cache after capturing the data currently being stored in that predicted way. During a subsequent clock cycle, the cache hit information for the store access validates the way prediction. If the way prediction is correct, then the store is complete. If the way prediction is incorrect, then the captured data is restored to the predicted way. If the store access hits in an unpredicted way, the store data is transferred into the correct storage location within the data cache concurrently with the restoration of data in the predicted storage location. Each store for which the way prediction is correct utilizes a single clock cycle of data cache bandwidth. Additionally, the way prediction structure implemented within the data cache bypasses the tag comparisons of the data cache to select data bytes for the output. Therefore, the access time of the associative data cache may be substantially similar to a direct-mapped cache access time. The present data cache is therefore suitable for high frequency superscalar microprocessors.

114 citations


Patent
13 Nov 1995
TL;DR: In this paper, misses to a cache (71) are tracked so that multiple misses within the same cache line can be merged or folded at reload time: when a reload cache line arrives, a matching store queue entry's data is merged with the cache line prior to storage in the cache (71), and other matching entries become active and are allowed to reaccess the cache (71).
Abstract: A data processor (40) keeps track of misses to a cache (71) so that multiple misses within the same cache line can be merged or folded at reload time. A load/store unit (60) includes a completed store queue (61) for presenting store requests to the cache (71) in order. If a store request misses in the cache (71), the completed store queue (61) requests the cache line from a lower-level memory system (90) and thereafter inactivates the store request. When a reload cache line is received, the completed store queue (61) compares the reload address to all entries. If at least one address matches the reload address, one entry's data is merged with the cache line prior to storage in the cache (71). Other matching entries become active and are allowed to reaccess the cache (71). A miss queue (80) coupled between the load/store unit (60) and the lower-level memory system (90) implements reload folding to improve efficiency.

111 citations


Patent
13 Oct 1995
TL;DR: In this paper, an adaptive read ahead cache is provided with a real cache and a virtual cache, where the real cache has a data buffer, an address buffer, and a status buffer.
Abstract: An adaptive read ahead cache is provided with a real cache and a virtual cache. The real cache has a data buffer, an address buffer, and a status buffer. The virtual cache contains only an address buffer and a status buffer. Upon receiving an address associated with the consumer's request, the cache stores the address in the virtual cache address buffer if the address is not found in the real cache address buffer and the virtual cache address buffer. Further, the cache fills the real cache data buffer with data responsive to the address from said memory if the address is found only in the virtual cache address buffer. The invention thus loads data into the cache only when sequential accesses are occurring and minimizes the overhead of unnecessarily filling the real cache when the host is accessing data in a random access mode.
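
The real/virtual split can be captured in a few lines: the virtual cache remembers addresses it has seen without holding any data, and a later hit there is taken as evidence of sequential access, triggering a fill of the real cache. The sketch below is a behavioral illustration only, not the patented controller; block-granular addressing and the one-block read-ahead on promotion are assumptions.

```python
class AdaptiveReadAhead:
    """Behavioral sketch: real cache holds data, virtual cache holds addresses only."""

    def __init__(self, backing_store):
        self.backing = backing_store   # dict: block address -> data (assumed interface)
        self.real = {}                 # address -> data
        self.virtual = set()           # addresses seen once, data deliberately not cached

    def read(self, addr):
        if addr in self.real:
            return self.real[addr]                  # real-cache hit
        if addr in self.virtual:
            # Address found only in the virtual cache: treat this as sequential
            # access and fill the real cache (here, with one block of read-ahead).
            self.virtual.discard(addr)
            for a in (addr, addr + 1):
                if a in self.backing:
                    self.real[a] = self.backing[a]
            return self.real[addr]
        # Not found in either cache: record the address only, load no data.
        self.virtual.add(addr)
        return self.backing[addr]

store = {i: f"block{i}" for i in range(8)}
cache = AdaptiveReadAhead(store)
cache.read(3)            # random-looking first touch: only the address is recorded
print(cache.read(3))     # virtual-cache hit: real cache is filled, data returned
print(4 in cache.real)   # True: the next block was read ahead
```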

96 citations


Proceedings ArticleDOI
01 Jun 1995
TL;DR: This paper explores two methods, one dynamic and one static, to reduce this overhead for virtual stack machines by caching top-of-stack values in (real machine) registers.
Abstract: An interpreter can spend a significant part of its execution time on accessing arguments of virtual machine instructions. This paper explores two methods to reduce this overhead for virtual stack machines by caching top-of-stack values in (real machine) registers. The dynamic method is based on having, for every possible state of the cache, one specialized version of the whole interpreter; the execution of an instruction usually changes the state of the cache and the next instruction is executed in the version corresponding to the new state. In the static method a state machine that keeps track of the cache state is added to the compiler. Common instructions exist in specialized versions for several states, but it is not necessary to have a version of every instruction for every cache state. Stack manipulation instructions are optimized away.
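
The idea can be illustrated with a tiny interpreter in which the top-of-stack value lives in a dedicated local variable (standing in for a machine register) rather than in the memory stack; a full implementation would generate one version of each instruction per cache state, as the paper describes. This is a one-register sketch of the concept under those simplifications, not the paper's system.

```python
def run(code):
    """Tiny stack-machine interpreter that keeps the top of stack in the local
    variable `tos` (a stand-in for a machine register); only the values below
    the top live in the memory stack."""
    stack = []    # everything except the cached top-of-stack value
    tos = None    # None means the stack is empty (sketch-level sentinel)
    for op, *args in code:
        if op == "push":
            if tos is not None:
                stack.append(tos)    # spill the old top into the memory stack
            tos = args[0]
        elif op == "add":
            tos = stack.pop() + tos  # one operand is already in the "register"
        elif op == "print":
            print(tos)
    return tos

run([("push", 2), ("push", 3), ("add",), ("print",)])   # prints 5
```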

94 citations


20 Nov 1995
TL;DR: The technique of static cache simulation is shown to address the issue of predicting cache behavior, contrary to the belief that cache memories introduce unpredictability to real-time systems that cannot be efficiently analyzed.
Abstract: This work takes a fresh look at the simulation of cache memories. It introduces the technique of static cache simulation that statically predicts a large portion of cache references. To efficiently utilize this technique, a method to perform efficient on-the-fly analysis of programs in general is developed and proved correct. This method is combined with static cache simulation for a number of applications. The application of fast instruction cache analysis provides a new framework to evaluate instruction cache memories that outperforms even the fastest techniques published. Static cache simulation is shown to address the issue of predicting cache behavior, contrary to the belief that cache memories introduce unpredictability to real-time systems that cannot be efficiently analyzed. Static cache simulation for instruction caches provides a large degree of predictability for real-time systems. In addition, an architectural modification through bit-encoding is introduced that provides fully predictable caching behavior. Even for regular instruction caches without architectural modifications, tight bounds for the execution time of real-time programs can be derived from the information provided by the static cache simulator. Finally, the debugging of real-time applications can be enhanced by displaying the timing information of the debugged program at breakpoints. The timing information is determined by simulating the instruction cache behavior during program execution and can be used, for example, to detect missed deadlines and locate time-consuming code portions. Overall, the technique of static cache simulation provides a novel approach to analyze cache memories and has been shown to be very efficient for numerous applications.

93 citations


Patent
31 Mar 1995
TL;DR: In this paper, a multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller; each data processor sub-system includes a master interface having master classes for sending memory transaction requests to the system controller.
Abstract: A multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. At least two of the sub-systems are data processors, each having a respective cache memory that stores multiple blocks of data and a respective master cache index. Each master cache index has a set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. Each data processor includes a master interface having master classes for sending memory transaction requests to the system controller. The system controller includes memory transaction request logic for processing each memory transaction request by a data processor. The system controller maintains a duplicate cache index having a set of duplicate cache tags (Dtags) for each data processor. Each data processor has a writeback buffer for storing the data block previously stored in a victimized cache line until its respective writeback transaction is completed and an Nth+1 Dtag for storing the cache state of a cache line associated with a read transaction which is executed prior to an associated writeback transaction of a read-writeback transaction pair. Accordingly, upon a cache miss, the interconnect may execute the read and writeback transactions in parallel relying on the writeback buffer or Nth+1 Dtag to accommodate any ordering of the transactions.

Proceedings ArticleDOI
01 Dec 1995
TL;DR: This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs; it 'preloads' the data that are likely to cause a cache miss before they are used, thereby hiding the cache-miss latency.
Abstract: Previous research on hiding memory latencies has tended to focus on regular numerical programs. This paper presents a latency-hiding compiler technique that is applicable to general-purpose C programs. By assuming a lock-up free cache and instruction score-boarding, our technique 'preloads' the data that are likely to cause a cache-miss before they are used, thereby hiding the cache miss latency. We have developed simple compiler heuristics to identify load instructions that are likely to cause a cache-miss. Experimentation with a set of SPEC92 benchmarks shows that our heuristics are successful in identifying 85% of cache misses. We have also developed an algorithm that flexibly schedules the selected load instruction and instructions that use the loaded data to hide memory latency. Our simulation suggests that our technique is successful in hiding memory latency and improves the overall performance.

Proceedings ArticleDOI
04 Jan 1995
TL;DR: Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs which tend to be accessed in a consecutive sequence and can achieve an order of magnitude energy reduction on caches.
Abstract: Caches usually consume a significant amount of energy in modern microprocessors (e.g. superpipelined or superscalar processors). In this paper, we examine contemporary cache design techniques and provide an analytical model for estimating cache energy consumption. We also present several novel techniques for designing energy-efficient caches, which include block buffering, cache sub-banking, and Gray code addressing. Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs which tend to be accessed in a consecutive sequence. Cache sub-banking is ideal for both instruction and data caches. Overall, these techniques can achieve an order of magnitude energy reduction on caches.
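
The Gray code argument is easy to quantify: for consecutive addresses, a Gray-coded address bus toggles exactly one bit per step, whereas a binary bus toggles several, and each bus transition costs energy. The sketch below simply counts toggles for a sequential fetch pattern (a per-transition energy cost would scale the result); it is an illustration of the effect, not the paper's analytical energy model.

```python
def gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def toggles(seq):
    """Total number of bit transitions on a bus driven with the values in seq."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

addrs = list(range(64))                      # sequential instruction fetch addresses
binary_toggles = toggles(addrs)              # 120 transitions over 63 steps
gray_toggles = toggles([gray(a) for a in addrs])   # 63: exactly one bit per step
print(binary_toggles, gray_toggles)
```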

Patent
Akio Shigeeda1
15 Mar 1995
TL;DR: In this paper, an electronic device for use in a computer system, and having a small second-level write-back cache, is disclosed, where the device may be implemented into a single integrated circuit, as a microprocessor unit, to include a microprocessor core, a memory controller circuit, and first and second level caches.
Abstract: An electronic device for use in a computer system, and having a small second-level write-back cache, is disclosed. The device may be implemented into a single integrated circuit, as a microprocessor unit, to include a microprocessor core, a memory controller circuit, and first and second level caches. In a system implementation, the device is connected to external dynamic random access memory (DRAM). The first level cache is a write-through cache, while the second level cache is a write-back cache that is much smaller than the first level cache. In operation, a write access that is a cache hit in the second level cache writes to the second level cache, rather than to DRAM, thus saving a wait state. A dirty bit is set for each modified entry in the second level cache. Upon the second level cache being full of modified data, a cache flush to DRAM is automatically performed. In addition, each entry of the second level cache is flushed to DRAM upon each of its byte locations being modified. The computer system may also include one or more additional integrated circuit devices, such as a direct memory access (DMA) circuit and a bus bridge interface circuit for bidirectional communication with the microprocessor unit. The microprocessor unit may also include handshaking control to prohibit configuration register updating when a memory access is in progress or is imminent. The disclosed microprocessor unit also includes circuitry for determining memory bank size and memory address type.

Patent
31 Aug 1995
TL;DR: In this article, a superscalar microprocessor employing a way prediction structure is provided, which predicts a way of an associative cache in which an access will hit, and causes the data bytes from the predicted way to be conveyed as the output of the cache.
Abstract: A superscalar microprocessor employing a way prediction structure is provided. The way prediction structure predicts a way of an associative cache in which an access will hit, and causes the data bytes from the predicted way to be conveyed as the output of the cache. The typical tag comparisons to the request address are bypassed for data byte selection, causing the access time of the associative cache to be substantially the access time of the direct-mapped way prediction array within the way prediction structure. Also included in the way prediction structure is a way prediction control unit configured to update the way prediction array when an incorrect way prediction is detected. The clock cycle of the superscalar microprocessor including the way prediction structure with its caches may be increased if the cache access time is limiting the clock cycle. Additionally, the associative cache may be retained in the high frequency superscalar microprocessor (which might otherwise employ a direct-mapped cache for access time reasons). Single clock cycle cache access to an associative data cache is maintained for high frequency operation.
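
A behavioral sketch of way prediction on the read path, under assumed parameters (a 2-way set-associative cache and a one-entry-per-set prediction array): data from the predicted way is returned immediately, the full tag compare follows as a check, and a misprediction both corrects the result and updates the predictor. This illustrates the control flow only, not the patented circuit or its timing.

```python
class WayPredictedCache:
    """2-way set-associative cache with a per-set way predictor (sketch)."""

    def __init__(self, num_sets):
        self.ways = [[None] * num_sets for _ in range(2)]  # each entry: (tag, data) or None
        self.predicted_way = [0] * num_sets                # way prediction array

    def read(self, set_index, tag):
        """Return (data, predicted_correctly); data is None on a cache miss."""
        pred = self.predicted_way[set_index]
        line = self.ways[pred][set_index]
        if line is not None and line[0] == tag:
            return line[1], True                 # fast path: prediction verified by the tag
        other = 1 - pred                         # slow path: check the other way
        line = self.ways[other][set_index]
        if line is not None and line[0] == tag:
            self.predicted_way[set_index] = other   # update predictor on misprediction
            return line[1], False
        return None, False                       # miss in both ways

cache = WayPredictedCache(num_sets=4)
cache.ways[1][2] = ("tagA", 0xBEEF)              # preload way 1, set 2
print(cache.read(2, "tagA"))                     # (48879, False): mispredicted, predictor updated
print(cache.read(2, "tagA"))                     # (48879, True): now predicted correctly
```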

Proceedings ArticleDOI
01 May 1995
TL;DR: This paper presents the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores; the mechanism also responds well to software support, in many cases providing better program speedups and reducing cache bandwidth requirements.
Abstract: For many programs, especially integer codes, untolerated load instruction latencies account for a significant portion of total execution time. In this paper, we present the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores. Our approach works by predicting early in the pipeline (part of) the effective address of a memory access and using this predicted address to speculatively access the data cache. If the prediction is correct, the cache access is overlapped with non-speculative effective address calculation. Otherwise, the cache is accessed again in the following cycle, this time using the correct effective address. The impact on the cache access critical path is minimal; the prediction circuitry adds only a single OR operation before cache access can commence. In addition, verification of the predicted effective address is completely decoupled from the cache access critical path. Analyses of program reference behavior and subsequent performance analysis of this approach show that this design is a good one, servicing enough accesses early enough to result in speedups for all the programs we tested. Our approach also responds well to software support, which can significantly reduce the number of mispredicted effective addresses, in many cases providing better program speedups and reducing cache bandwidth requirements.
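
The "single OR operation" trick can be shown concretely under a simplified model: when the base register has enough trailing zero bits relative to the offset, base OR offset equals base + offset, so the OR result can index the cache a cycle early and the full add only verifies it. The snippet below illustrates that idea; it is not the paper's exact prediction circuitry.

```python
def predict_effective_address(base, offset):
    """Fast prediction of base + offset using a single OR (no carry chain)."""
    return base | offset

def access(base, offset):
    predicted = predict_effective_address(base, offset)  # used to index the cache early
    actual = base + offset                               # full add, computed in parallel
    if predicted == actual:
        return actual, "speculative cache access was correct"
    return actual, "misprediction: cache accessed again with the correct address"

# Typical case: an aligned base whose zero low bits absorb the offset.
print(access(0x1000, 0x24))   # OR == ADD, prediction succeeds
# A carry out of the overlapping bits makes OR differ from ADD.
print(access(0x10FC, 0x24))   # prediction fails, access replayed
```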

Patent
Asit Dan1, Dinkar Sitaram1
31 Jul 1995
TL;DR: In this article, a system and method for caching sequential data streams in a cache storage device is presented, where, for each information stream, a determination is made as to whether its data blocks should be discarded from the cache as they are read by a consuming process.
Abstract: A system and method for caching sequential data streams in a cache storage device. For each information stream, a determination is made as to whether its data blocks should be discarded from the cache as they are read by a consuming process. Responsive to a determination that the data blocks of a stream should be discarded from the cache as they are read by the consuming process, the data blocks associated with that stream are cached in accordance with an interval caching algorithm. Alternatively, responsive to a determination that the data blocks of a stream should not be discarded from the cache storage device as they are read by the consuming process, the data blocks of that stream are cached in accordance with a segment caching algorithm.
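
A minimal sketch of the dispatch described above, with the stream classification taken as an assumed input, plus a back-of-the-envelope view of what interval caching retains (the blocks between a reader and the reader following it, so the follower's reads become cache hits). Both pieces are illustrations of the idea, not the patented algorithms.

```python
def choose_policy(discard_after_read):
    """Dispatch described above: discardable streams use interval caching,
    streams whose blocks will be revisited use segment caching."""
    return "interval" if discard_after_read else "segment"

def interval_cache_blocks(read_positions):
    """For one sequential stream read by several consumers, interval caching keeps
    the blocks between each consumer and the one following it.

    read_positions: current block offsets of concurrent readers, ascending (assumed)."""
    return sum(lead - follow for follow, lead in zip(read_positions, read_positions[1:]))

print(choose_policy(discard_after_read=True))   # 'interval'
print(interval_cache_blocks([100, 130, 220]))   # 30 + 90 = 120 blocks to retain
```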

Proceedings ArticleDOI
02 Oct 1995
TL;DR: A combination of bypassing and register caching is proposed, taking advantage of register values that are bypassed within a processor's pipeline, and supplementing the bypassed values with values supplied by a small register cache to meet a fast target cycle time.
Abstract: VLIW, multi-context, or windowed-register architectures may require one hundred or more processor registers. It can be difficult to design a register file with so many registers that meets processor cycle time requirements. We propose to resolve this problem by taking advantage of register values that are bypassed within a processor's pipeline, and supplementing the bypassed values with values supplied by a small register cache. If the register cache is sufficiently small then it can be designed to meet a fast target cycle time. We call this combination of bypassing and register caching the register scoreboard and cache. We develop a simple performance model and show by simulations that it can be effective for windowed-register architectures.

Patent
23 Oct 1995
TL;DR: In this article, a two-level cache data structure and associated methods are implemented with a RAID controller to reduce the overhead of the RAID controller in determining which blocks are present in the lower level cache.
Abstract: Methods and associated data structures operable in a RAID subsystem to improve I/O performance. A two level cache data structure and associated methods are implemented with a RAID controller. The lower level cache comprises buffers holding recently utilized blocks of the disk devices. The upper level cache records which blocks are present in the lower level cache for each stripe in the RAID level 5 configuration. The upper level cache serves to reduce the overhead processing required of the RAID controller to determine which blocks are present in the lower level cache. Having more rapid access to this information by lowering the processing overhead enables the present invention to rapidly select between different write techniques to post data and error blocks from low level cache to the disk array. A RMW write technique is used to post data and error checking blocks to disk when insufficient information resides in the lower level cache. A faster Full Write technique (also referred to as Stripe Write) is used to post data and error checking blocks to disk when all required, related blocks are resident in the lower level cache. The Full Write technique reduces the total number of I/O operations required of the disk devices to post the update as compared to the RMW technique. The two level cache of the present invention enables a rapid selection between the RMW and Full Write techniques.
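
The write-path decision reduces to a membership test against the lower-level cache, which the per-stripe upper-level cache makes cheap. The sketch below shows that selection and the resulting disk-I/O counts for an assumed RAID-5 stripe of four data blocks plus parity; it illustrates the trade-off, not the patented implementation, and the single-block RMW cost is the standard textbook figure.

```python
def choose_write_technique(stripe_blocks, cached_blocks):
    """Pick Full (Stripe) Write when every data block of the stripe is resident
    in the lower-level cache, otherwise fall back to Read-Modify-Write (RMW).

    Returns (technique, disk_ios) under the assumed stripe layout."""
    if set(stripe_blocks) <= cached_blocks:
        # Full Write: write all data blocks plus the newly computed parity block.
        return "full_write", len(stripe_blocks) + 1
    # RMW for a single updated block: read old data + old parity,
    # then write new data + new parity.
    return "rmw", 4

stripe = ["d0", "d1", "d2", "d3"]                                  # 4 data blocks + parity
print(choose_write_technique(stripe, {"d0", "d1", "d2", "d3"}))    # ('full_write', 5)
print(choose_write_technique(stripe, {"d0", "d2"}))                # ('rmw', 4)
```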

Patent
18 Apr 1995
TL;DR: In this article, the authors propose a self-recovery mechanism for errors in the associated cache directory or the shared cache itself by invalidating all the entries in the cache directory of the accessed congruence class by resetting Valid bits to "0" and setting the Parity bit to a correct value.
Abstract: A highly available shared cache memory in a tightly coupled multiprocessor system provides an error self-recovery mechanism for errors in the associated cache directory or the shared cache itself. After an error in a congruence class of the cache is indicated by an error status register, self-recovery is accomplished by invalidating all the entries in the shared cache directory means of the accessed congruence class by resetting Valid bits to '0' and by setting the Parity bit to a correct value, wherein the request for data to the main memory is not cancelled. Multiple bit failures in the cached data are recovered by setting the Valid bit in the matching column to '0'. The processor reissues the request for data, which is loaded into the processor's private cache and the shared cache as well. Further requests to this data by other processors are served by the shared cache.

Patent
31 Oct 1995
TL;DR: In this article, the authors present a method and apparatus for controlling multiple cache memories with a single cache controller using a processor to control the operation of its on-chip level one cache memory and a level two cache memory.
Abstract: A method and apparatus for controlling multiple cache memories with a single cache controller. The present invention uses a processor to control the operation of its on-chip level one (L1) cache memory and a level two (L2) cache memory. In this manner, the processor is able to send operations to be performed to the L2 cache memory, such as writing state and/or cache line status to the L2 cache memory. A dedicated bus is coupled between dice. This dedicated bus is used to send control and other signals between the processor and the L2 cache memory.

Patent
03 Mar 1995
TL;DR: In this paper, the authors propose a distributed shared cache that operates at the level of a second-level memory cache, at the third level of the memory system, and at the purely software-managed page cache level.
Abstract: A distributed-shared cache operates at the level of a second-level memory cache, at the third level of the memory system, and at the purely software-managed page cache level. On a cache miss that is local to a processor, an attempt is made to locate the data in a cache memory block on a peer memory level, before explicitly requesting the data from more distant memory. Communication support is integrated into the memory system to piggyback communication performance improvements on improvements to the memory system. In particular, the cache lines can operate in a message mode to deliver message data to interested receivers to support networking and devices. Embodiments of the invention work across all memory levels with only modest changes in detail.

Patent
07 Jun 1995
TL;DR: In this paper, a power monitoring device places a write-back cache memory into a write-through mode upon detection of a low-battery condition or a user request, which can save significant power during typical portable computer operations.
Abstract: A computer system having a power monitoring device places a write-back cache memory into a write-through mode upon detection of a low-battery condition or a user request. Under write-through mode, the cache memory need not be flushed every time a suspend mode of the computer system is entered. Thus significant power is saved during typical portable computer operations.

Patent
23 Aug 1995
TL;DR: In this paper, prefetching of cache lines is performed in a progressive manner in a data processing system implementing L1 and L2 caches and stream filters and buffers; in one mode, data may not be prefetched.
Abstract: Within a data processing system implementing L1 and L2 caches and stream filters and buffers, prefetching of cache lines is performed in a progressive manner. In one mode, data may not be prefetched. In a second mode, two cache lines are prefetched wherein one line is prefetched into the L1 cache and the next line is prefetched into a stream buffer. In a third mode, more than two cache lines are prefetched at a time. In the third mode cache lines may be prefetched to the L1 cache and not the L2 cache, resulting in no inclusion between the L1 and L2 caches. A directory field entry provides an indication of whether or not a particular cache line in the L1 cache is also included in the L2 cache.
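
The three-mode progression can be written down as a small state sketch: no prefetching until a stream is suspected, then a conservative two-line prefetch (one line toward L1, one into a stream buffer), then deeper prefetching once the stream is confirmed. This is a behavioral illustration of the modes named above; the promotion thresholds and prefetch depth are assumptions, not the patented control logic.

```python
class ProgressivePrefetcher:
    """Mode 0: no prefetch. Mode 1: prefetch 2 lines (one toward L1, one into a
    stream buffer). Mode 2: prefetch deeper. Thresholds/depth are assumptions."""

    def __init__(self, depth_in_mode2=4):
        self.mode = 0
        self.hits_on_stream = 0
        self.depth = depth_in_mode2

    def on_access(self, line_addr, continues_stream):
        """Return the list of (line, destination) prefetches to issue."""
        if continues_stream:
            self.hits_on_stream += 1
            if self.mode == 0:
                self.mode = 1                  # stream suspected: start prefetching
            elif self.mode == 1 and self.hits_on_stream >= 3:
                self.mode = 2                  # stream confirmed: prefetch deeper
        else:
            self.mode, self.hits_on_stream = 0, 0

        if self.mode == 0:
            return []                          # no prefetch
        if self.mode == 1:
            return [(line_addr + 1, "L1"), (line_addr + 2, "stream_buffer")]
        return [(line_addr + i, "L1") for i in range(1, self.depth + 1)]

pf = ProgressivePrefetcher()
for addr in [100, 101, 102, 103]:              # a sequential stream of cache lines
    issued = pf.on_access(addr, continues_stream=True)
    print(f"mode={pf.mode} prefetch={issued}")
```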

Patent
04 Aug 1995
TL;DR: In this article, a method and apparatus for instruction refetching in a processor is described, where a marker micro instruction is inserted into the processor pipeline when an instruction cache line is victimized.
Abstract: A method and apparatus for instruction refetch in a processor is provided. To ensure that a macro instruction is available for refetching after the processor has handled an event or determined a correct restart address after a branch misprediction, an instruction memory includes an instruction cache for caching macro instructions to be fetched, and a victim cache for caching victims from the instruction cache. To ensure the availability of a macro instruction for refetching, the instruction memory (the instruction cache and victim cache together) always stores a macro instruction that may need to be refetched until the macro instruction is committed to architectural state. A marker micro instruction is inserted into the processor pipeline when an instruction cache line is victimized. The marker specifies an entry in the victim cache occupied by the victimized cache line. When the marker instruction is committed to architectural state, the victim cache entry specified by the marker is deallocated in the victim cache to permit storage of other instruction cache victims.

Patent
06 Nov 1995
TL;DR: In this article, control circuitry coupled to a stream filter circuit selectively controls fetching and prefetching of data from system memory to the primary and secondary caches associated with a processor and to a stream buffer circuit.
Abstract: Within a data processing system implementing primary and secondary caches and stream filters and buffers, prefetching of cache lines is performed in a progressive manner. In one mode, data may not be prefetched. In a second mode, two cache lines are prefetched wherein one line is prefetched into the L1 cache and the next line is prefetched into a stream buffer. In a third mode, more than two cache lines are prefetched at a time. Prefetching may be performed on cache misses or hits. Cache misses on successive cache lines may allocate a stream of cache lines to the stream buffers. Control circuitry, coupled to a stream filter circuit, selectively controls fetching and prefetching of data from system memory to the primary and secondary caches associated with a processor and to a stream buffer circuit.

Patent
23 Aug 1995
TL;DR: In this article, prefetching of cache lines is performed in a progressive manner in a data processing system implementing L1 and L2 caches and stream filters and buffers; in one mode, data may not be prefetched.
Abstract: Within a data processing system implementing L1 and L2 caches and stream filters and buffers, prefetching of cache lines is performed in a progressive manner. In one mode, data may not be prefetched. In a second mode, two cache lines are prefetched wherein one line is prefetched into the L1 cache and the next line is prefetched into a stream buffer. In a third mode, more than two cache lines are prefetched at a time. In the third mode cache lines may be prefetched to the L1 cache and not the L2 cache, resulting in no inclusion between the L1 and L2 caches.

Proceedings ArticleDOI
23 Apr 1995
TL;DR: The investigation is extended to analyze the energy effects of cache parameters in a multi-level cache design, based on execution of SPECint92 benchmark programs and the miss ratios of a RISC processor.
Abstract: To optimize performance and power of a processor’s cache, a multiple-divided module (MDM) cache architecture is proposed to save power at memory peripherals as well as the bit array. For an M×B-divided MDM cache, latency is equivalent to that of the smallest module and power consumption is only 1/(M×B) of the regular, non-divided cache. Based on the architecture and given transistor budgets for on-chip processor caches, this paper extends investigation to analyze energy effects from cache parameters in a multi-level cache design. The analysis is based on execution of SPECint92 benchmark programs with miss ratios of a RISC processor.

Patent
13 Mar 1995
TL;DR: In this paper, a cache controller for a system having first and second level cache memories is presented, where a stack of registers coupled to the address pipeline is used to perform multiple line replacements of the first level cache memory without interfering with current first level look-ups.
Abstract: A cache controller for a system having first and second level cache memories. The cache controller has multiple stage address and data pipelines. A look-up system allows concurrent look-up of tag addresses in the first and second level caches using the address pipeline. The multiple stages allow a miss in the first level cache to be moved to the second stage so that the latency does not slow the look-up of a next address in the first level cache. A write data pipeline allows the look-up of data being written to the first level cache for current read operations. A stack of registers coupled to the address pipeline is used to perform multiple line replacements of the first level cache memory without interfering with current first level cache look-ups. Multiple banks associated with a multiple set associative cache are stored in a single chip, reducing the number of SRAMs required. Certain status information for the second level (L2) cache is stored with the status information of the first level cache. This enhances the speed of operations by avoiding a status look-up and modification in the L2 cache during a write operation. In addition, the L2 cache tag address and status bits are stored in a portion of one bank of the L2 data RAMs, further reducing the number of SRAMs required. Finally, the present invention also provides local read-write storage for use by the processor by reserving a number of L2 cache lines.

Journal ArticleDOI
21 May 1995
TL;DR: In this paper, the performance of 3-D-based RISC-systems is investigated and a model based on measured miss rates and on an analytical access time model is used.
Abstract: In this paper, potential performance improvements of the memory hierarchy of RISC-systems for implementations employing 3-D-technology are investigated. Relating to RISC-systems, 3-D ICs will offer the opportunity for integrating much more memory on-chip (i.e. on one IC or 3-D IC with the processor). As a result, the second-level cache may be moved on-chip. The available on-chip cache may alternatively be organized in three levels. Investigations were also performed for the case of the main memory being integrated on-chip. Current restrictions of conventional RISC-system implementations, such as limited available transistor count for on-chip caching, confined data bus width between processor-chip and the off-chip second-level cache, long access times of the second-level cache, strongly limit the achievable performance of the memory hierarchy and may be either removed or at least substantially reduced by the use of 3-D ICs. To evaluate the performance improvements of implementations employing 3-D ICs, a model based on measured miss rates and on an analytical access time model is used. The average time per-instruction is employed as the performance measure. Results of extensive case studies indicate, that substantial performance improvements depending on implementation, cache sizes, cache organization, and miss rates are achievable using 3-D ICs. A comparison of four optimized implementations all with a total cache size of approximately 1 MB yielded performance improvements in the range of 23% to 31% for the implementations employing 3-D-technology over the conventionally implemented system. It is concluded that 3-D-technology will be very attractive for future high performance RISC-systems, since the system performance depends vitally on the performance of the memory hierarchy.
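
The performance measure used above can be approximated with a textbook-style model: average time per instruction is the base (CPI times cycle time) term plus, for each cache level, the references that reach it weighted by its miss rate and the penalty of going one level further out. The sketch below uses placeholder numbers, not the paper's measured miss rates or its exact access-time model.

```python
def avg_time_per_instruction(cycle_ns, base_cpi, refs_per_instr, levels):
    """levels: list of (local_miss_rate, miss_penalty_ns) from L1 outward; each
    miss rate is relative to the accesses that reach that level.
    A simplified average-time-per-instruction model (assumed form)."""
    t = base_cpi * cycle_ns
    reach = refs_per_instr            # memory references per instruction reaching L1
    for miss_rate, penalty_ns in levels:
        t += reach * miss_rate * penalty_ns   # misses at this level pay the next level's penalty
        reach *= miss_rate                    # only the misses proceed outward
    return t

# Placeholder numbers: 5 ns cycle, base CPI 1.2, 1.3 refs/instruction,
# L1 missing 5% of accesses at 20 ns penalty, L2 missing 20% at 120 ns.
print(avg_time_per_instruction(5.0, 1.2, 1.3, [(0.05, 20.0), (0.20, 120.0)]))  # 8.86 ns
```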