
Showing papers on "Cache algorithms published in 2000"


Journal ArticleDOI
TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).
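
The directory summaries in this protocol behave like Bloom filters: a compact per-peer bit vector that can yield false positives but never false negatives, refreshed only periodically. Below is a minimal sketch of that lookup path in Python; the names (BloomSummary, fetch_from_peer) and parameters (filter size, four hash functions) are illustrative assumptions, not the paper's implementation.

    import hashlib

    class BloomSummary:
        """Compact digest of a peer's cache directory; false positives possible."""
        def __init__(self, m_bits=8 * 1024, k=4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, url):
            for i in range(self.k):
                h = hashlib.md5(f"{i}:{url}".encode()).digest()
                yield int.from_bytes(h[:4], "big") % self.m

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_contains(self, url):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

    def fetch_from_peer(peer, url):
        return None  # stub: a real proxy would issue an intercache request here

    def lookup(url, local_cache, peer_summaries):
        if url in local_cache:
            return local_cache[url]
        for peer, summary in peer_summaries.items():
            if summary.maybe_contains(url):       # query only plausible peers
                doc = fetch_from_peer(peer, url)  # may still miss: summaries are stale
                if doc is not None:
                    return doc
        return None                               # fall through to the origin server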

2,174 citations


Journal ArticleDOI
16 May 2000
TL;DR: A new indexing technique called CSB+-Trees is proposed that stores all the child nodes of any given node contiguously and keeps only the address of the first child in each node; two variants are also introduced: segmented CSB+-Trees, which reduce the copying cost when there is a split, and full CSB+-Trees, which preallocate space for the full node group to reduce the split cost.
Abstract: Previous research has shown that cache behavior is important for main memory index structures. Cache conscious index structures such as Cache Sensitive Search Trees (CSS-Trees) perform lookups much faster than binary search and T-Trees. However, CSS-Trees are designed for decision support workloads with relatively static data. Although B+-Trees are more cache conscious than binary search and T-Trees, their utilization of a cache line is low since half of the space is used to store child pointers. Nevertheless, for applications that require incremental updates, traditional B+-Trees perform well. Our goal is to make B+-Trees as cache conscious as CSS-Trees without increasing their update cost too much. We propose a new indexing technique called “Cache Sensitive B+-Trees” (CSB+-Trees). It is a variant of B+-Trees that stores all the child nodes of any given node contiguously, and keeps only the address of the first child in each node. The rest of the children can be found by adding an offset to that address. Since only one child pointer is stored explicitly, the utilization of a cache line is high. CSB+-Trees support incremental updates in a way similar to B+-Trees. We also introduce two variants of CSB+-Trees. Segmented CSB+-Trees divide the child nodes into segments. Nodes within the same segment are stored contiguously and only pointers to the beginning of each segment are stored explicitly in each node. Segmented CSB+-Trees can reduce the copying cost when there is a split since only one segment needs to be moved. Full CSB+-Trees preallocate space for the full node group and thus reduce the split cost. Our performance studies show that CSB+-Trees are useful for a wide range of applications.
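
The core trick is that child i of a node is located by arithmetic (first_child + i) rather than by a stored pointer, so a cache line holds almost only keys. A toy sketch of that addressing scheme, using a Python list to stand in for the contiguous node array (layout details are assumptions, not the paper's code):

    import bisect

    class Node:
        def __init__(self, keys, first_child=None, values=None):
            self.keys = keys                  # sorted separator keys
            self.first_child = first_child    # index of child 0; None for leaves
            self.values = values              # leaf payloads

    # Children of the root form one contiguous node group (indices 0 and 1).
    nodes = [Node([10, 20], values=["a", "b"]),   # leaf: keys < 30
             Node([30, 40], values=["c", "d"]),   # leaf: keys >= 30
             Node([30], first_child=0)]           # root

    def search(root_idx, key):
        idx = root_idx
        while nodes[idx].first_child is not None:
            i = bisect.bisect_right(nodes[idx].keys, key)
            idx = nodes[idx].first_child + i      # child found by offset, not pointer
        leaf = nodes[idx]
        j = bisect.bisect_left(leaf.keys, key)
        if j < len(leaf.keys) and leaf.keys[j] == key:
            return leaf.values[j]
        return None

    assert search(2, 40) == "d" and search(2, 15) is None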

398 citations


Patent
24 Nov 2000
TL;DR: In this paper, a preloader uses a cache manager to manage requests for retrieval, insertion, and removal of web page components in a component cache, a cache replacement manager to manage replacement within that cache, and a profile server to predict a user's next content request.
Abstract: A preloader works in conjunction with a web/app server and optionally a profile server to cache web page content elements or components for faster on-demand and anticipatory dynamic web page delivery. The preloader uses a cache manager to manage requests for retrievals, insertions, and removal of web page components in a component cache. The preloader uses a cache replacement manager to manage the replacement of components in the cache. While the cache replacement manager may utilize any cache replacement policy, a particularly effective replacement policy utilizes predictive information to make replacement decisions. Such a policy uses a profile server, which predicts a user's next content request. The components that can be cached are identified by tagging them within the dynamic scripts that generate them. The preloader caches components that are likely to be accessed next, thus improving a web site's scalability.

366 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel replacement policy, called LRV, which selects for replacement the document with the lowest relative value among those in cache, and shows how LRV outperforms least recently used (LRU) and other policies and can significantly improve the performance of the cache, especially for a small one.
Abstract: In this paper, we analyze access traces to a Web proxy, looking at statistical parameters to be used in the design of a replacement policy for documents held in the cache. In the first part of this paper, we present a number of properties of the lifetime and statistics of access to documents, derived from two large trace sets coming from very different proxies and spanning over time intervals of up to five months. In the second part, we propose a novel replacement policy, called LRV, which selects for replacement the document with the lowest relative value among those in cache. In LRV, the value of a document is computed adaptively based on information readily available to the proxy server. The algorithm has no hardwired constants, and the computations associated with the replacement policy require only a small constant time. We show how LRV outperforms least recently used (LRU) and other policies and can significantly improve the performance of the cache, especially for a small one.
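
As a rough illustration of the eviction rule, the sketch below evicts the document with the smallest value, where value grows with access count and retrieval cost and shrinks with age and size. The actual LRV value is an adaptively estimated re-access probability; this stand-in only mimics its shape.

    import time

    def relative_value(doc, now):
        age = now - doc["last_access"]
        # illustrative surrogate: frequently accessed, costly-to-fetch, small,
        # recently used documents are worth keeping
        return doc["accesses"] * doc["fetch_cost"] / (doc["size"] * (1.0 + age))

    def evict_one(cache):
        """cache: url -> {'accesses', 'fetch_cost', 'size', 'last_access'}"""
        now = time.time()
        victim = min(cache, key=lambda url: relative_value(cache[url], now))
        del cache[victim]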

338 citations


Patent
23 Mar 2000
TL;DR: In this paper, a plurality of cache servers capable of caching WWW information provided by information servers are provided in association with the wireless network; the cache servers are managed by receiving a message indicating at least the connected location of a mobile computer in the wireless network, selecting one or more cache servers located near the mobile computer according to the message, and controlling those cache servers to cache WWW information selected for the mobile computer, so as to enable faster access to that information by the mobile computer.
Abstract: In the disclosed information delivery scheme for delivering WWW information provided by information servers on the Internet to mobile computers connected to the Internet through a wireless network, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network. The cache servers can be managed by receiving a message indicating at least a connected location of a mobile computer in the wireless network from the mobile computer, selecting one or more cache servers located near the mobile computer according to the message, and controlling these one or more cache servers to cache WWW information selected for the mobile computer, so as to enable faster accesses to the selected WWW information by the mobile computer. Also, the cache servers can be managed by selecting one or more cache servers located within a geographic range defined for an information provider who provides WWW information from an information server, and controlling these one or more cache servers to cache WWW information selected for the information provider, so as to enable faster accesses to the selected WWW information by the mobile computer.

300 citations


Proceedings ArticleDOI
01 Aug 2000
TL;DR: This paper focuses on the features of the M340 cache sub-system and illustrates the effect on power and performance through benchmark analysis and actual silicon measurements.
Abstract: Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the device's components. The M·CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with programmable features was added to the M3 core. These features allow the architecture to be optimized based on the application's requirements. In this paper, we focus on the features of the M340 cache sub-system and illustrate the effect on power and performance through benchmark analysis and actual silicon measurements.

253 citations


Proceedings ArticleDOI
09 Jul 2000
TL;DR: Initial experiments on iterative data-parallel applications show that a locality-guided work-stealing algorithm, which improves the data locality of multi-threaded computations by allowing a thread to have an affinity for a processor, matches the performance of static partitioning under traditional workloads and improves performance by up to 50% over static partitioning under multiprogrammed workloads.
Abstract: This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multi-threaded computations G_n, each member of which requires Θ(n) total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multi-threaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
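
Restating the nested-parallel bound in display form, with an illustrative numeric instantiation (the numbers are chosen only to show the scale and are not from the paper):

    \[
      M_P \;\le\; M_1 \;+\; O\!\left( C \left\lceil \frac{m}{s} \right\rceil P\, T_\infty \right)
    \]
    % M_P: misses on P processors, M_1: misses on one processor, C: cache size,
    % m: miss penalty, s: steal time, T_inf: longest dependence chain.
    % E.g. C = 1024 lines, m/s = 1, P = 8, T_inf = 100 gives an additive term
    % of order 8 x 10^5 misses, independent of the total work n.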

227 citations


Proceedings ArticleDOI
01 May 2000
TL;DR: A practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement is presented.
Abstract: As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
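
A software caricature of generational replacement, assuming a hash table for the fully associative tag lookup and a few FIFO generations: referenced lines are promoted during a sweep, unreferenced ones drift toward generation 0, which supplies victims. Pool count, capacity, and sweep policy are guesses for illustration, not the paper's design.

    from collections import deque

    NUM_GENS, CAPACITY = 4, 1024
    gens = [deque() for _ in range(NUM_GENS)]   # gens[0] = lowest priority
    table = {}                                  # tag -> {"gen", "ref", "data"}

    def access(tag):
        entry = table.get(tag)
        if entry:
            entry["ref"] = True                 # promotion happens at the next sweep
            return entry["data"]
        return None                             # miss: caller fills via insert()

    def insert(tag, data):
        if sum(map(len, gens)) >= CAPACITY:
            evict()
        table[tag] = {"gen": 0, "ref": False, "data": data}
        gens[0].append(tag)

    def sweep(g):
        for _ in range(len(gens[g])):           # promote referenced, demote the rest
            tag = gens[g].popleft()
            e = table[tag]
            e["gen"] = min(g + 1, NUM_GENS - 1) if e["ref"] else max(g - 1, 0)
            e["ref"] = False
            gens[e["gen"]].append(tag)

    def evict():
        for g in gens:                          # victim from lowest non-empty generation
            if g:
                del table[g.popleft()]
                return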

224 citations


Proceedings ArticleDOI
24 Apr 2000
TL;DR: The paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors, and characterizes it in terms of instruction frequencies, computational complexity, and cache performance.
Abstract: The paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors. The benchmark applications focus on small, computationally intense program kernels typical of the network processor environment. The benchmark is composed of eight programs, four of them oriented towards packet header processing and four oriented towards data stream processing. The benchmark is defined and characteristics such as instruction frequencies, computational complexity, and cache performance are presented. These measured characteristics are compared to the standard SPEC benchmark. Three examples are presented indicating how CommBench can aid in the design of a single chip network multiprocessor.

215 citations


Patent
25 Jan 2000
TL;DR: In this paper, a relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache, where data is preloaded and responsively cached in the cache memory based on user preferences.
Abstract: An apparatus and method for caching data in a storage device (26) of a computer system (10). A relatively high-speed, intermediate-volume storage device (25) is operated as a user-configurable cache. Requests to access a mass storage device (46) such as a disk or tape (26, 28) are intercepted by a device driver (32) that compares the access request against a directory (51) of the contents of the user-configurable cache (25). If the user-configurable cache contains the data sought to be accessed, the access request is carried out in the user-configurable cache instead of being forwarded to the device driver for the target mass storage device (46). Because the user-configurable cache is implemented using memory having a dramatically shorter access time than most mechanical mass storage devices, the access request is fulfilled much more quickly than if the originally intended mass storage device had been accessed. Data is preloaded and responsively cached in the user-configurable cache memory based on user preferences.

205 citations


Proceedings ArticleDOI
01 Dec 2000
TL;DR: The design and evaluation of the compression cache (CC) is presented which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths.
Abstract: Since the area occupied by cache memories on processor chips continues to grow, an increasing percentage of power is consumed by memory. We present the design and evaluation of the compression cache (CC), a first-level cache designed so that each cache line can either hold one uncompressed line or two cache lines that have been compressed to at least half their lengths. We use a novel data compression scheme based upon encoding of a small number of values that appear frequently during memory accesses. This compression scheme preserves the ability to randomly access individual data items. We observed that the contents of 40%, 52% and 51% of the memory blocks of size 4, 8, and 16 words respectively in SPECint95 benchmarks can be compressed to at least half their sizes by encoding the top 2, 4, and 8 frequent values respectively. Compression allows greater amounts of data to be stored, leading to substantial reductions in miss rates (0-36.4%), off-chip traffic (3.9-48.1%), and energy consumed (1-27%). Traffic and energy reductions are in part derived by transferring data over external buses in compressed form.
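
The encoding idea in miniature: if every word of a line is one of the top-k frequent values, the line can be stored in a fraction of its size using log2(k)-bit codes. The dictionary below and the all-or-nothing compressibility test are simplifications; the paper's format also handles non-frequent values so that individual items stay randomly accessible.

    FREQUENT = [0, 1, 0xFFFFFFFF, 4]          # assumed top-4 frequent 32-bit values
    CODE = {v: i for i, v in enumerate(FREQUENT)}

    def compressible(line_words):
        # simplification: the line "fits in half" only if all words are frequent
        return all(w in CODE for w in line_words)

    def compress(line_words):
        return [CODE[w] for w in line_words]  # 2 bits per word instead of 32

    def decompress(codes):
        return [FREQUENT[c] for c in codes]

    assert decompress(compress([0, 0, 4, 1])) == [0, 0, 4, 1]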

Journal ArticleDOI
01 May 2000
TL;DR: Results of incorporating instruction cache predictions within pipeline simulation show that timing predictions for set-associative caches remain just as tight as predictions for direct-mapped caches.
Abstract: This paper contributes a comprehensive study of a framework to bound worst-case instruction cache performance for caches with arbitrary levels of associativity. The framework is formally introduced, operationally described and its correctness is shown. Results of incorporating instruction cache predictions within pipeline simulation show that timing predictions for set-associative caches remain just as tight as predictions for direct-mapped caches. The low cache simulation overhead allows interactive use of the analysis tool and scales well with increasing associativity. The approach taken is based on a data-flow specification of the problem and provides another step toward worst-case execution time prediction of contemporary architectures and its use in schedulability analysis for hard real-time systems.

Proceedings ArticleDOI
10 Apr 2000
TL;DR: This work presents an on-line algorithm that effectively captures and maintains an accurate popularity profile of Web objects requested through a caching proxy, together with a replacement policy built on it that is shown to be superior to a host of recently proposed and widely used algorithms via extensive trace-driven simulations and a variety of performance metrics.
Abstract: Web caching aims at reducing network traffic, server load and user-perceived retrieval delays by replicating popular content on proxy caches that are strategically placed within the network. While key to effective cache utilization, popularity information (e.g. relative access frequencies of objects requested through a proxy) is seldom incorporated directly in cache replacement algorithms. Rather, other properties of the request stream (e.g. temporal locality and content size), which are easier to capture in an online fashion, are used to indirectly infer popularity information, and hence drive cache replacement policies. Recent studies suggest that the correlation between these secondary properties and popularity is weakening, due in part to the prevalence of efficient client and proxy caches. This trend points to the need for proxy cache replacement algorithms that directly capture popularity information. We present an on-line algorithm that effectively captures and maintains an accurate popularity profile of Web objects requested through a caching proxy. We propose a novel cache replacement policy that uses such information to generalize the well-known greedy dual-size algorithm, and show the superiority of our proposed algorithm by comparing it to a host of recently-proposed and widely-used algorithms using extensive trace-driven simulations and a variety of performance metrics.
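
Classic greedy dual-size assigns each object the priority H = L + cost/size, where L is an inflating clock set to the last victim's H. One natural way to fold popularity in, sketched below, is to weight by observed access frequency (in the style of GreedyDual-Size-with-Frequency); whether this matches the paper's exact generalization is an assumption.

    import heapq

    L = 0.0          # inflation clock
    heap = []        # (H, url); stale entries are skipped lazily on eviction
    meta = {}        # url -> {"freq", "cost", "size", "H"}

    def on_access(url, cost, size):
        m = meta.setdefault(url, {"freq": 0, "cost": cost, "size": size, "H": 0.0})
        m["freq"] += 1
        m["H"] = L + m["freq"] * m["cost"] / m["size"]   # popularity-weighted utility
        heapq.heappush(heap, (m["H"], url))

    def evict():
        global L
        while heap:
            h, url = heapq.heappop(heap)
            if url in meta and meta[url]["H"] == h:      # ignore stale heap entries
                L = h                                    # inflate clock to victim's H
                del meta[url]
                return url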

Proceedings ArticleDOI
01 Dec 2000
TL;DR: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte and an instruction recoding technique is described that increases instruction cache energy savings to 18%.
Abstract: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two FO4 gate delays. Simulation results show that we can reduce total data cache energy by around 26% and instruction cache energy by around 10% for SPECint95 and MediaBench benchmarks. We also describe the use of an instruction recoding technique that increases instruction cache energy savings to 18%.
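
A software model of the stored bits, assuming one zero-indicator bit (ZIB) per byte: zero bytes contribute only their indicator bit to an access, which is where the energy saving comes from. The hardware does this inside the RAM arrays; this sketch only accounts for the bits.

    def store_word(word_bytes):
        zibs = [b == 0 for b in word_bytes]               # one indicator bit per byte
        payload = bytes(b for b in word_bytes if b != 0)  # nonzero bytes only
        return zibs, payload

    def load_word(zibs, payload):
        it = iter(payload)
        return bytes(0 if z else next(it) for z in zibs)

    def bits_accessed(zibs):
        return len(zibs) + 8 * sum(not z for z in zibs)   # 1 bit per zero byte

    assert load_word(*store_word(b"\x00\x07\x00\x00")) == b"\x00\x07\x00\x00"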

Book ChapterDOI
15 Jul 2000
TL;DR: With this application, it is shown that symbolic model checking tools originally designed for hybrid and concurrent systems can be applied successfully to a new class of infinite-state systems of practical interest.
Abstract: We propose a new method for the verification of parameterized cache coherence protocols. Cache coherence protocols are used to maintain data consistency in multiprocessor systems equipped with local fast caches. In our approach we use arithmetic constraints to model possibly infinite sets of global states of a multiprocessor system with many identical caches. In preliminary experiments using symbolic model checkers for infinite-state systems based on real arithmetic (HyTech [HHW97] and DMC [DP99]) we have automatically verified safety properties for parameterized versions of widely implemented write-invalidate and write-update cache coherence policies such as the MESI, Berkeley, Illinois, Firefly and Dragon protocols [Han93]. With this application, we show that symbolic model checking tools originally designed for hybrid and concurrent systems can be applied successfully to a new class of infinite-state systems of practical interest.

Patent
07 Dec 2000
TL;DR: In this article, a priority determination scheme was proposed to determine whether to keep or discard the transmitted digital objects in a local cache database, based on global demand data and/or local demand data.
Abstract: A system and method accelerates the distribution of digital content of a global communications network such as the Internet. A central proxy server selects popular digital objects for transmission over a communication medium to provide content filling of cache databases attendant to local proxy servers. The communication medium may comprise satellite transmission using an IP multicast protocol. The local proxy servers concurrently receive the digital objects at a high rate of speed and store the digital objects in the attendant local cache databases. The local proxy servers may utilize a localized priority determination scheme to determine whether to keep or discard the transmitted digital objects. The priority determination scheme may utilize global demand data and/or local demand data. The demand data may include hits and/or misses on digital objects and may also include quantitative data about the digital objects. The priority determination scheme may be driven by feedback regarding the needs and interests of subscribing users of the local cache database. Consequently, the priority determination scheme and ultimately, the contents of a local cache database, may be unique to that local cache database.

Patent
29 Nov 2000
TL;DR: In this article, a cache coupled with one or more web clients requests web documents from web servers on behalf of those web clients and communicates those web documents to the web clients for display.
Abstract: The invention provides a method and system for reducing latency in reviewing and presenting web documents to the user. A cache coupled to one or more web clients requests web documents from web servers on behalf of those web clients and communicates those web documents to the web clients for display. The cache parses the web documents as they are received from the web server, identifies references to any embedded objects, and determines if those embedded objects are already maintained in the cache. If those embedded objects are not in the cache, the cache automatically pre-fetches those embedded objects from the web server without need for a command from the web client. The cache maintains a two-level memory including primary memory and secondary mass storage. At the time the web document is received, the cache determines if any embedded objects are maintained in the cache but are not in primary memory. If those embedded objects are not in primary memory, the cache automatically pre-loads those embedded objects from secondary mass storage to primary memory without need for a request from the web client. Web documents maintained in the cache are periodically refreshed, so as to assure those web documents are not stale. The invention is applied both to original requests to communicate web documents and their embedded objects from the web server to the web client, and to refresh requests to communicate web documents and their embedded objects from the web server to the cache.

Patent
12 Jun 2000
TL;DR: In this paper, the authors propose a Web page cache that stores Web pages such that servers will be able to retrieve valid dynamic pages without going to a dynamic content server or origin Web server for the page every time a user requests that dynamic page.
Abstract: A Web page cache that stores Web pages such that servers will be able to retrieve valid dynamic pages without going to a dynamic content server or origin Web server for the page every time a user requests that dynamic page. The dynamic content cache receives information that defines data upon which each dynamic page is dependent, such that when the value of any dependency data item changes, the associated dynamic page is marked as invalid or deleted. The dynamic page cache stores dependency data, receives change event information, and indicates when pages in the cache are invalidated or need to be refreshed.
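
The dependency bookkeeping reduces to an inverted index from data items to cached pages. A minimal sketch (all names illustrative): caching a page registers its dependencies, and a change event on any item invalidates every dependent page.

    from collections import defaultdict

    page_cache = {}                    # url -> rendered page
    dependents = defaultdict(set)      # data item -> urls that depend on it

    def cache_page(url, html, deps):
        page_cache[url] = html
        for item in deps:
            dependents[item].add(url)

    def on_data_change(item):
        for url in dependents.pop(item, ()):
            page_cache.pop(url, None)  # invalidated; regenerated on next request

    cache_page("/account/42", "<html>...</html>", deps={"user:42:balance"})
    on_data_change("user:42:balance")  # the page is now gone from the cache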

Patent
19 Apr 2000
TL;DR: In this article, a disk drive consisting of a cache memory and a cache control system with a tag memory having a plurality of tag records, and means for allocating a tag record for responding to a host command is described.
Abstract: The present invention relates to a disk drive 10 comprising a cache memory 14 and a cache control system 12 having a tag memory 22 with a plurality of tag records, and means for allocating a tag record for responding to a host command. The cache memory has a plurality of sequentially-ordered memory clusters 46 for caching disk data stored in sectors (not shown) on disks of a disk assembly 38. Conventionally, the disk sectors are identified by logical block addresses (LBAs). The tag memory 22 and the means for allocating tag records are embedded within the cache control system 12 and thereby configured only for use in defining variable-length segments of the memory clusters 46. The segments are defined without regard to the sequential order of the memory clusters 46.

Journal Article
TL;DR: In this paper, suitable blocking strategies for both structured and unstructured grids will be introduced to improve the cache usage without changing the underlying algorithm.
Abstract: Many current computer designs employ caches and a hierarchical memory architecture. The speed of a code depends on how well the cache structure is exploited. The number of cache misses provides a better measure for comparing algorithms than the number of multiplies. In this paper, suitable blocking strategies for both structured and unstructured grids will be introduced. They improve the cache usage without changing the underlying algorithm. In particular, bitwise compatibility is guaranteed between the standard and the high performance implementations of the algorithms. This is illustrated by comparisons for various multigrid algorithms on a selection of different computers for problems in two and three dimensions. The code restructuring can yield performance improvements of factors of 2-5. This allows the modified codes to achieve a much higher percentage of the peak performance of the CPU than is usually observed with standard implementations.
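
A structured-grid example of the kind of blocking meant here: a Jacobi relaxation sweep tiled so each block of the grid stays cache-resident. Because Jacobi reads only the old array, the tiled loop performs exactly the same arithmetic per point and stays bitwise compatible with the untiled version; the block size is a tuning guess, not a value from the paper.

    B = 64  # block edge, tuned to the cache

    def jacobi_blocked(u, u_new, f, n):
        """One sweep of u_new = (sum of neighbors + f) / 4 on an n x n grid (h = 1)."""
        for ib in range(1, n - 1, B):
            for jb in range(1, n - 1, B):
                for i in range(ib, min(ib + B, n - 1)):
                    for j in range(jb, min(jb + B, n - 1)):
                        u_new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                              + u[i][j - 1] + u[i][j + 1]
                                              + f[i][j])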

Proceedings ArticleDOI
01 Jun 2000
TL;DR: This approach is the first one to measure and optimize the power consumption of a complete SOC comprising a CPU, instruction cache, data cache, main memory, data buses and address bus through code compression.
Abstract: We propose instruction code compression as an efficient method for reducing power on an embedded system. Our approach is the first one to measure and optimize the power consumption of a complete SOC (System-on-a-Chip) comprising a CPU, instruction cache, data cache, main memory, data buses and address bus through code compression. We compare the pre-cache architecture (decompressor between main memory and cache) to a novel post-cache architecture (decompressor between cache and CPU). Our simulations and synthesis results show that our methodology results in large energy savings between 22% and 82% compared to the same system without code compression. Furthermore, we demonstrate that power savings come with reduced chip area and the same or even improved performance.

Proceedings ArticleDOI
22 Oct 2000
TL;DR: A Unified Buffer Management (UBM) scheme is presented that exploits reference regularities such as sequential and looping references and yet is simple to deploy; it improves hit ratios and reduces elapsed times relative to the LRU scheme.
Abstract: In traditional file system implementations, the Least Recently Used (LRU) block replacement scheme is widely used to manage the buffer cache due to its simplicity and adaptability. However, the LRU scheme exhibits performance degradations because it does not make use of reference regularities such as sequential and looping references. In this paper, we present a Unified Buffer Management (UBM) scheme that exploits these regularities and yet is simple to deploy. The UBM scheme automatically detects sequential and looping references and stores the detected blocks in separate partitions of the buffer cache. These partitions are managed by appropriate replacement schemes based on their detected patterns. The allocation problem among the divided partitions is also tackled with the use of the notion of marginal gains. In both trace-driven simulation experiments and experimental studies using an actual implementation in the FreeBSD operating system, the performance gains obtained through the use of this scheme are substantial. The results show that the hit ratios improve by as much as 57.7% (with an average of 29.2%) and the elapsed times are reduced by as much as 67.2% (with an average of 28.7%) compared to the LRU scheme for the workloads we used.
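
A rough sketch of the detection step only: per file, consecutive block numbers extend a sequential run, and re-touching a previously recorded run is classified as looping. The threshold and bookkeeping are simplified assumptions; the real UBM also manages the per-pattern partitions and the marginal-gain allocation, which are omitted here.

    RUN_LEN = 8      # assumed threshold for calling a run "sequential"
    streams = {}     # file -> per-file detection state

    def classify(file, block):
        s = streams.setdefault(file, {"last": None, "start": block,
                                      "run": 0, "seen": set()})
        if s["last"] is not None and block == s["last"] + 1:
            s["run"] += 1
        else:
            if s["run"] >= RUN_LEN:                 # record the finished run
                s["seen"].add((s["start"], s["last"]))
            s["start"], s["run"] = block, 0
        s["last"] = block
        if any(a <= block <= b for a, b in s["seen"]):
            return "looping"                        # re-reading an old run
        return "sequential" if s["run"] >= RUN_LEN else "other"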

Proceedings ArticleDOI
08 Jan 2000
TL;DR: This work shows that an opportunity exists to close part of the gap between the OPT and the LRU algorithms, and presents a replacement algorithm based on the detection of temporal locality in lines residing in the L2 cache that improves on the second-level cache miss rate.
Abstract: Main memory accesses continue to be a significant bottleneck for applications whose working sets do not fit in second-level caches. With the trend of greater associativity in second-level caches, implementing effective replacement algorithms might become more important than reducing conflict misses. After showing that an opportunity exists to close part of the gap between the OPT and the LRU algorithms, we present a replacement algorithm based on the detection of temporal locality in lines residing in the L2 cache. Rather than always replacing the LRU line, the victim is chosen by considering both its priority in the LRU stack and whether it exhibits temporal locality or not. We consider two strategies which use this replacement algorithm: a profile-based scheme where temporal locality is detected by processing a trace from a training set of the application, and an on-line scheme, where temporal locality is detected with the assistance of a small locality table. Both schemes improve on the second-level cache miss rate over a pure LRU algorithm, by as much as 12% in the profiling case and 20% in the dynamic case.
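
The victim-selection rule in outline: among the lines nearest the LRU position, prefer one that has shown no temporal locality, and fall back to true LRU when all candidates are temporal. The locality flag below stands in for the paper's profile data or on-line locality table, and the candidate window size is an assumption.

    def choose_victim(lru_stack, window=2):
        """lru_stack: lines ordered MRU first; each line has a 'temporal' flag."""
        for line in reversed(lru_stack[-window:]):   # scan the coldest lines first
            if not line["temporal"]:
                return line                          # non-temporal line near LRU end
        return lru_stack[-1]                         # all temporal: plain LRU victim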

Patent
31 Jul 2000
TL;DR: In this paper, the authors present a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host, where a lossy state record is provided for memory segments in a cache memory.
Abstract: The present invention may be embodied in a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host. A lossy state record is provided for memory segments in a cache memory. The lossy state record allows host commands to be mixed for streaming and non-streaming data without flushing of cache data for a command mode change.

Proceedings ArticleDOI
01 May 2000
TL;DR: A generalization of time skewing for multiprocessor architectures is given, and techniques for using multilevel caches are presented that reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
Abstract: Time skewing is a compile-time optimization that can provide arbitrarily high cache hit rates for a class of iterative calculations, given a sufficient number of time steps and sufficient cache memory. Thus, it can eliminate processor idle time caused by inadequate main memory bandwidth. In this article, we give a generalization of time skewing for multiprocessor architectures, and discuss time skewing for multilevel caches. Our generalization for multiprocessors lets us eliminate processor idle time caused by any combination of inadequate main memory bandwidth, limited network bandwidth, and high network latency, given a sufficiently large problem and sufficient cache. As in the uniprocessor case, the cache requirement grows with the machine balance rather than the problem size. Our techniques for using multilevel caches reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis via a novel hardware mechanism, called column caching.
Abstract: We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to specified sets of cache “columns” or “ways.” When a region of memory is exclusively mapped to an equivalent-sized partition of the cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or each task within an application. Thus, software has much finer control of on-chip memory, providing the ability to dynamically trade off performance for on-chip memory.
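
Column caching constrains only replacement: a hit may occur in any way, but a victim is chosen from the ways a region is mapped to. A sketch with an assumed per-region bit vector over a 4-way cache:

    WAYS = 4
    region_mask = {"scratch": 0b0001, "default": 0b1110}   # illustrative mapping

    def pick_victim_way(region, lru_order):
        """lru_order: way indices, least recently used first."""
        mask = region_mask.get(region, (1 << WAYS) - 1)
        for way in lru_order:
            if mask & (1 << way):
                return way            # evict only within this region's columns
        raise ValueError("region mapped to no way")

    # With this mapping, 'scratch' data only ever displaces way 0, so way 0
    # behaves like a dedicated scratchpad while ways 1-3 serve everything else.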

Journal ArticleDOI
TL;DR: The importance of different Web proxy workload characteristics in making good cache replacement decisions is analyzed, and results indicate that higher cache hit rates are achieved using size-based replacement policies.

Patent
26 Apr 2000
TL;DR: In this paper, the cache employs one or more prefetch ways for storing prefetch cache lines and one or more non-prefetch ways for storing accessed cache lines; cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into the non-prefetch ways, keeping accessed lines separate from prefetch lines.
Abstract: A cache employs one or more prefetch ways for storing prefetch cache lines and one or more ways for storing accessed cache lines. Prefetch cache lines are stored into the prefetch way, while cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into the non-prefetch ways. Accessed cache lines are thereby maintained within the cache separately from prefetch cache lines. When a prefetch cache line is presented to the cache for storage, the prefetch cache line may displace another prefetch cache line but does not displace an accessed cache line. A cache hit in either the prefetch way or the non-prefetch ways causes the cache line to be delivered to the requesting microprocessor in a cache hit fashion. The cache is further configured to move prefetch cache lines from the prefetch way to the non-prefetch way if the prefetch cache lines are requested (i.e. they become accessed cache lines). Instruction cache lines may be moved immediately upon access, while data cache line accesses may be counted and a number of accesses greater than a predetermined threshold value may occur prior to moving the data cache line from the prefetch way to the non-prefetch way. Additionally, movement of an accessed cache line from the prefetch way to the non-prefetch way may be delayed until the accessed cache line is to be replaced by a prefetch cache line.

Patent
07 Dec 2000
TL;DR: In this paper, the cache memory is partitioned among a set of threads of a multi-threaded processor, and when a cache miss occurs, a replacement line is selected in a partition of the cache space which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.
Abstract: A method and apparatus which provides a cache management policy for use with a cache memory for a multi-threaded processor. The cache memory is partitioned among a set of threads of the multi-threaded processor. When a cache miss occurs, a replacement line is selected in a partition of the cache memory which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.
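
In outline, the policy restricts victim selection on a miss to the ways owned by the missing thread, so one thread's misses can never evict another thread's lines. The static even split below is an assumption for illustration:

    WAYS_PER_SET = 8
    thread_ways = {0: set(range(0, 4)), 1: set(range(4, 8))}  # assumed 4+4 split

    def replacement_way(thread_id, set_lru_order):
        """set_lru_order: this set's way indices, least recently used first."""
        allowed = thread_ways[thread_id]
        for way in set_lru_order:
            if way in allowed:
                return way        # victim comes only from this thread's partition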

Patent
Walter A. Hubis
25 Jan 2000
TL;DR: In this paper, a method for providing cache coherency in a RAID system (100) in which multiple RAID controllers (104) provide read/write access to shared storage devices (108) for multiple host computers (102).
Abstract: A method for providing cache coherency in a RAID system (100) in which multiple RAID controllers (104) provide read/write access to shared storage devices (108) for multiple host computers (102). Each controller includes read (114), write (116) and write mirror (118) caches and the controllers and the shared storage devices are coupled to one another via common backend buses (110). Whenever a controller receives a write command (302) from a host the controller writes the data to the shared devices, its write cache and the write mirror caches of the other controllers. Whenever a controller receives a read command (320) from a host the controller attempts to return the requested data from its write mirror cache, write cache and read cache and the storage devices, in that order.