
Showing papers on "Smart Cache published in 2004"


Proceedings ArticleDOI
29 Sep 2004
TL;DR: It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
Abstract: This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with the execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Second, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
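
As a concrete illustration of the execution-time fairness notion described above, the sketch below computes per-thread slowdowns relative to running alone and sums their pairwise differences. This is a hedged simplification for illustration only; the paper's five cache-level fairness metrics and its exact execution-time metric are defined in the paper itself, and the numbers used here are hypothetical.

```python
from itertools import combinations

def slowdowns(t_shared, t_alone):
    """Per-thread slowdown: execution time when sharing the cache divided by
    execution time when running alone."""
    return [s / a for s, a in zip(t_shared, t_alone)]

def unfairness(t_shared, t_alone):
    """Sum of pairwise differences between slowdowns; 0 means every
    co-scheduled thread was slowed down by exactly the same factor."""
    x = slowdowns(t_shared, t_alone)
    return sum(abs(a - b) for a, b in combinations(x, 2))

if __name__ == "__main__":
    # Hypothetical numbers: thread 0 barely slows down, thread 1 slows to 2x.
    print(unfairness(t_shared=[105.0, 240.0], t_alone=[100.0, 120.0]))  # ~0.95
```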

544 citations


Journal ArticleDOI
TL;DR: The results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory and can improve the total IPC significantly over the standard least recently used (LRU) replacement policy.
Abstract: This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
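
A minimal sketch of one standard way to act on such run-time miss counts is a greedy, marginal-gain allocation of cache ways, shown below. This illustrates the general idea rather than the paper's exact partitioning algorithm, and the miss curves in the example are made up.

```python
def partition_ways(miss_curves, total_ways):
    """Greedily hand out cache ways one at a time, always to the thread whose
    miss count would drop the most from one extra way.

    miss_curves[t][w] = estimated misses for thread t when given w ways
    (w = 0 .. total_ways); in a real system these come from run-time miss
    monitoring rather than being known in advance.
    """
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        best = max(
            range(len(miss_curves)),
            key=lambda t: miss_curves[t][alloc[t]] - miss_curves[t][alloc[t] + 1],
        )
        alloc[best] += 1
    return alloc

if __name__ == "__main__":
    # Hypothetical miss curves for two threads sharing an 8-way cache.
    a = [1000, 700, 500, 380, 300, 250, 220, 205, 200]
    b = [800, 500, 320, 230, 180, 150, 130, 120, 115]
    print(partition_ways([a, b], total_ways=8))   # [5, 3]
```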

402 citations


Journal ArticleDOI
02 Mar 2004
TL;DR: An adaptive policy that dynamically adapts to the costs and benefits of cache compression is developed and it is shown that compression can improve performance for memory-intensive commercial workloads by up to 17%.
Abstract: Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks, due to unnecessary decompression overhead, degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.
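
The abstract gives enough detail to sketch the core predictor: a single global saturating counter nudged up when a hit at LRU stack depth four or deeper shows that compression avoided a miss, and nudged down when a hit in the top four positions paid a needless decompression. The sketch below assumes illustrative penalty values and counter bounds; it is not the paper's implementation.

```python
class CompressionPredictor:
    """Global saturating counter deciding whether new L2 lines should be
    allocated in compressed form. A non-negative counter means compression
    has, on balance, been paying off. Bounds and penalties are illustrative."""

    def __init__(self, limit=1 << 18):
        self.counter = 0
        self.limit = limit

    def update(self, lru_depth, stored_compressed,
               decompress_penalty=5, miss_penalty=400):
        # A hit at stack depth >= 4 only fit in the set because lines were
        # compressed: credit compression with the miss it avoided.
        if lru_depth >= 4:
            self.counter = min(self.limit, self.counter + miss_penalty)
        # A hit in the top four stack positions would have hit anyway; if the
        # line was stored compressed, we paid a useless decompression latency.
        elif stored_compressed:
            self.counter = max(-self.limit, self.counter - decompress_penalty)

    def allocate_compressed(self):
        return self.counter >= 0

if __name__ == "__main__":
    pred = CompressionPredictor()
    pred.update(lru_depth=5, stored_compressed=True)   # compression saved a miss
    print(pred.allocate_compressed())                  # True
```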

304 citations


Journal ArticleDOI
Nimrod Megiddo, Dharmendra S. Modha
TL;DR: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features.
Abstract: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features. Caching, a fundamental metaphor in modern computing, finds wide application in storage systems, databases, Web servers, middleware, processors, file systems, disk drives, redundant array of independent disks controllers, operating systems, and other applications such as data compression and list updating. In a two-level memory hierarchy, a cache performs faster than auxiliary storage, but it is more expensive. Cost concerns thus usually limit cache size to a fraction of the auxiliary memory's size.
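
A deliberately simplified sketch of the balancing idea follows: two resident lists (keys seen once recently vs. seen at least twice), two ghost lists of recently evicted keys, and a target split that adapts toward whichever ghost list keeps getting hit. This is an illustration in the spirit of the algorithm, not the published ARC algorithm with its full bookkeeping.

```python
from collections import OrderedDict

class AdaptiveCache:
    """Simplified adaptive recency/frequency cache in the spirit of ARC:
    t1 holds keys seen once recently, t2 holds keys seen at least twice,
    and b1/b2 are ghost lists of keys recently evicted from each. The target
    size p of t1 grows when b1 gets hits (recency is being under-served) and
    shrinks when b2 gets hits (frequency is being under-served)."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                                  # target size of t1
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def _replace(self, prefer_t1):
        if self.t1 and (len(self.t1) > self.p or
                        (prefer_t1 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)    # evict LRU of recency list
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)    # evict LRU of frequency list
            self.b2[key] = None
        for ghost in (self.b1, self.b2):            # keep ghost lists bounded
            while len(ghost) > self.c:
                ghost.popitem(last=False)

    def access(self, key):
        """Reference a key; returns True on hit, False on miss."""
        if key in self.t1 or key in self.t2:
            self.t1.pop(key, None)
            self.t2.pop(key, None)
            self.t2[key] = None                     # promote to MRU of t2
            return True
        full = len(self.t1) + len(self.t2) >= self.c
        if key in self.b1:                          # recency ghost hit
            self.p = min(self.c, self.p + 1)
            del self.b1[key]
            if full:
                self._replace(prefer_t1=False)
            self.t2[key] = None
        elif key in self.b2:                        # frequency ghost hit
            self.p = max(0, self.p - 1)
            del self.b2[key]
            if full:
                self._replace(prefer_t1=True)
            self.t2[key] = None
        else:                                       # cold miss
            if full:
                self._replace(prefer_t1=False)
            self.t1[key] = None
        return False

if __name__ == "__main__":
    cache = AdaptiveCache(capacity=3)
    for key in ["a", "b", "a", "c", "d", "a", "e", "b"]:
        cache.access(key)
    print(sorted(cache.t1), sorted(cache.t2))       # ['e'] ['a', 'b']
```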

261 citations


Journal ArticleDOI
TL;DR: It is argued that a good caching policy adapts itself to changes in Web workload characteristics, and a qualitative comparison is made between these policies after classifying them according to the traffic properties they consider in their designs.
Abstract: The increasing demand for World Wide Web (WWW) services has made document caching a necessity to decrease download times and reduce Internet traffic. To make effective use of caching, an informative decision has to be made as to which documents are to be evicted from the cache in case of cache saturation. This is particularly important in a wireless network, where the size of the client cache at the mobile terminal (MT) is small. Several types of caching are used over the Internet, including client caching, server caching, and more recently, proxy caching. In this article we review some of the well known proxy-caching policies for the Web. We describe these policies, show how they operate, and discuss the main traffic properties they incorporate in their design. We argue that a good caching policy adapts itself to changes in Web workload characteristics. We make a qualitative comparison between these policies after classifying them according to the traffic properties they consider in their designs. Furthermore, we compare a selected subset of these policies using trace-driven simulations.

162 citations


Patent
Franklin Davis, William Beaty
08 Sep 2004
TL;DR: A system and method for smart, persistent cache management of received content within a terminal are presented, in which received content is tagged with a cache directive that lets the cache control determine which of the cache storage locations to use for storing the content.
Abstract: A system and method for smart, persistent cache management of received content within a terminal. Received content is tagged with a cache directive that allows the cache control to determine which of the cache storage locations to use for storing the content. The cache control detects the number of instances in which received content corresponds to a newer version of purged content and provides the ability to re-classify the cache persistence directive based upon that number of instances.

160 citations


Proceedings ArticleDOI
02 Apr 2004
TL;DR: The results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations, however, a relatively large gap between LRU and optimal replacement policy, of up to 50%, indicates that new research aimed to close the gap is necessary.
Abstract: Replacement policy, one of the key factors determining the effectiveness of a cache, becomes even more important with the latest technological trends toward highly associative caches. State-of-the-art processors employ various policies such as Random, Least Recently Used (LRU), Round-Robin, and PLRU (Pseudo LRU), indicating that there is no common wisdom about the best one. An optimal yet unattainable policy would replace the cache memory block whose next reference is the farthest away in the future, among all memory blocks present in the set. In our quest for a replacement policy as close to optimal as possible, we thoroughly explored the design space of existing replacement mechanisms using the SimpleScalar toolset and the SPEC CPU2000 benchmark suite, across a wide range of cache sizes and organizations. In order to better understand the behavior of different policies, we introduced new measures, such as the cumulative distribution of cache hits in the LRU stack. We also dynamically monitored the number of cache misses per 100,000 instructions. Our results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations. However, a relatively large gap between LRU and the optimal replacement policy, of up to 50%, indicates that new research aimed at closing the gap is necessary. The cumulative distribution of cache hits in the LRU stack indicates a very good potential for way prediction using LRU information, since the percentage of hits to the bottom of the LRU stack is relatively high.
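
For reference, the sketch below implements one common PLRU variant, the tree-based scheme, for a single set: one bit per internal node of a binary tree points toward the colder half, so a w-way set needs only w-1 bits instead of a full LRU ordering. The interface and the 8-way example are illustrative.

```python
class TreePLRU:
    """Tree-based pseudo-LRU for one set of a w-way cache (w a power of two).
    Each internal node of a binary tree holds one bit pointing toward the
    colder half; a victim is found by following the bits, and every access
    flips the bits on its path to point away from the line just touched."""

    def __init__(self, ways):
        assert ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        self.bits = [0] * (ways - 1)                 # internal nodes, heap layout

    def victim(self):
        node = 0
        while node < self.ways - 1:
            node = 2 * node + 1 + self.bits[node]    # follow the 'colder' child
        return node - (self.ways - 1)                # leaf index == way number

    def touch(self, way):
        node = (self.ways - 1) + way
        while node > 0:
            parent = (node - 1) // 2
            # Point the parent away from the child we just used.
            self.bits[parent] = 0 if node == 2 * parent + 2 else 1
            node = parent

if __name__ == "__main__":
    plru = TreePLRU(ways=8)
    for way in [0, 3, 5, 0]:
        plru.touch(way)
    print(plru.victim())   # 6: a way in the least recently touched subtree
```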

158 citations


Journal ArticleDOI
TL;DR: This work investigates multiple approaches to effectively manage second-level buffer caches and reports a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, and a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms.
Abstract: Buffer caches are commonly used in servers to reduce the number of slow disk accesses or network messages. These buffer caches form a multilevel buffer cache hierarchy. In such a hierarchy, second-level buffer caches have different access patterns from first-level buffer caches because accesses to a second-level buffer cache are actually misses from a first-level buffer cache. Therefore, commonly used cache management algorithms such as the least recently used (LRU) replacement algorithm that work well for single-level buffer caches may not work well for second-level buffer caches. We investigate multiple approaches to effectively manage second-level buffer caches. In particular, we report our research results in 1) second-level buffer cache access pattern characterization, 2) a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, 3) a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms, and 4) implementation and evaluation of these algorithms in a real storage system connected with commercial database servers (Microsoft SQL server and Oracle) running industrial-strength online transaction processing benchmarks.
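
A simplified sketch of a multi-queue style second-level cache is shown below: blocks are kept in one of several LRU queues selected by the logarithm of their access count, and blocks that sit idle too long are demoted one queue at a time. The ghost queue of the full MQ algorithm is omitted and the queue count and lifetime are illustrative parameters, so this is a sketch of the idea rather than the evaluated algorithm.

```python
from collections import OrderedDict
from math import log2

class MultiQueueCache:
    """Simplified multi-queue (MQ) style second-level buffer cache: blocks
    live in one of several LRU queues chosen by log2 of their access count,
    so blocks that keep being accessed (i.e. keep missing in the first level)
    climb to higher queues and survive longer, while blocks idle for more
    than `lifetime` accesses are demoted a queue at a time."""

    def __init__(self, capacity, num_queues=8, lifetime=100):
        self.capacity = capacity
        self.lifetime = lifetime
        self.queues = [OrderedDict() for _ in range(num_queues)]
        self.freq = {}      # block -> access count
        self.level = {}     # block -> queue the block currently sits in
        self.expire = {}    # block -> time after which it is demoted
        self.time = 0

    def _target_level(self, block):
        return min(int(log2(self.freq[block])), len(self.queues) - 1)

    def _demote_expired(self):
        for lvl in range(1, len(self.queues)):
            q = self.queues[lvl]
            while q:
                block = next(iter(q))               # oldest block in this queue
                if self.expire[block] >= self.time:
                    break
                del q[block]                        # demote one level down
                self.queues[lvl - 1][block] = None
                self.level[block] = lvl - 1
                self.expire[block] = self.time + self.lifetime

    def _evict(self):
        for q in self.queues:                       # lowest non-empty queue first
            if q:
                block, _ = q.popitem(last=False)
                for table in (self.freq, self.level, self.expire):
                    del table[block]
                return

    def access(self, block):
        """Reference a block; returns True on hit, False on miss."""
        self.time += 1
        hit = block in self.freq
        if hit:
            self.queues[self.level[block]].pop(block)
        else:
            if len(self.freq) >= self.capacity:
                self._evict()
            self.freq[block] = 0
        self.freq[block] += 1
        self.level[block] = self._target_level(block)
        self.expire[block] = self.time + self.lifetime
        self.queues[self.level[block]][block] = None    # MRU position
        self._demote_expired()
        return hit

if __name__ == "__main__":
    mq = MultiQueueCache(capacity=2, num_queues=4, lifetime=10)
    print([mq.access(b) for b in [1, 2, 1, 3, 1, 2]])
    # [False, False, True, False, True, False]
```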

150 citations


Proceedings ArticleDOI
14 Feb 2004
TL;DR: An in-depth analysis of the pathological behavior of cache hashing functions is presented, and two new hashing functions, prime modulo and prime displacement, are proposed that are resistant to pathological behavior and yet are able to eliminate the worst-case conflict behavior in the L2 cache.
Abstract: Using alternative cache indexing/hashing functions is a popular technique to reduce conflict misses by achieving a more uniform cache access distribution across the sets in the cache. Although various alternative hashing functions have been demonstrated to eliminate the worst case conflict behavior, no study has really analyzed the pathological behavior of such hashing functions that often result in performance slowdown. We present an in-depth analysis of the pathological behavior of cache hashing functions. Based on the analysis, we propose two new hashing functions: prime modulo and prime displacement that are resistant to pathological behavior and yet are able to eliminate the worst case conflict behavior in the L2 cache. We show that these two schemes can be implemented in fast hardware using a set of narrow add operations, with negligible fragmentation in the L2 cache. We evaluate the schemes on 23 memory intensive applications. For applications that have nonuniform cache accesses, both prime modulo and prime displacement hashing achieve an average speedup of 1.27 compared to traditional hashing, without slowing down any of the 23 benchmarks. We also evaluate using multiple prime displacement hashing functions in conjunction with a skewed associative L2 cache. The skewed associative cache achieves a better average speedup at the cost of some pathological behavior that slows down four applications by up to 7%.
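
The two hashing functions are simple enough to sketch directly: prime-modulo indexing uses a prime number of sets, while prime-displacement keeps the power-of-two set count and adds the tag multiplied by a small prime to the conventional index. The helper names, the displacement prime 17, and the stride example below are illustrative choices, not values from the paper.

```python
def largest_prime_at_most(n):
    """Largest prime <= n (tiny trial-division helper for this sketch)."""
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n -= 1
    return n

def prime_modulo_index(block_addr, num_sets):
    """Prime-modulo hashing: index with a prime number of sets (a few sets at
    the top go unused), which breaks up the power-of-two strides that cause
    pathological conflicts under conventional modulo indexing."""
    return block_addr % largest_prime_at_most(num_sets)

def prime_displacement_index(block_addr, num_sets, prime=17):
    """Prime-displacement hashing: keep the power-of-two set count but
    displace each index by the tag times a small prime, which maps to a
    narrow add in hardware. The constant 17 is just an example value."""
    set_bits = num_sets.bit_length() - 1
    tag = block_addr >> set_bits
    return (block_addr + prime * tag) % num_sets

if __name__ == "__main__":
    # A power-of-two stride that maps every block to set 0 under plain modulo:
    addrs = [i * 1024 for i in range(8)]
    print([a % 1024 for a in addrs])                        # all conflict on set 0
    print([prime_modulo_index(a, 1024) for a in addrs])     # spread across sets
    print([prime_displacement_index(a, 1024) for a in addrs])
```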

133 citations


Proceedings ArticleDOI
14 Feb 2004
TL;DR: The spatial pattern predictor (SPP) is described, a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group at runtime, and requires only a small amount of predictor memory to store the predicted patterns.
Abstract: Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%; (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation; and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.
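
A toy version of such a predictor is sketched below: a small direct-mapped table, indexed by a hash of the triggering instruction's address and the access offset within its cache line, stores a bit vector of which lines in the spatial group were actually used. The table size, group size, and interface are illustrative, not the evaluated configuration.

```python
class SpatialPatternPredictor:
    """Sketch of a spatial pattern predictor: learns, per (PC, line offset)
    pair, which cache lines of a spatial group get touched, and predicts that
    pattern the next time the same instruction triggers a group."""

    def __init__(self, entries=256, group_lines=8, line_size=64):
        self.entries = entries
        self.group_lines = group_lines          # lines per spatial group
        self.line_size = line_size
        self.table = [0] * entries              # learned bit vector per entry
        self.live = {}                          # group base -> (entry, bits so far)

    def _index(self, pc, addr):
        offset = addr % self.line_size          # offset of the access in its line
        return hash((pc, offset)) % self.entries

    def trigger(self, pc, addr):
        """First access to a spatial group: return predicted line numbers."""
        group = addr // (self.line_size * self.group_lines)
        idx = self._index(pc, addr)
        self.live[group] = (idx, 0)
        return [i for i in range(self.group_lines) if self.table[idx] >> i & 1]

    def record(self, addr):
        """Note an access so the group's real usage pattern can be learned."""
        group = addr // (self.line_size * self.group_lines)
        if group in self.live:
            idx, bits = self.live[group]
            line = (addr // self.line_size) % self.group_lines
            self.live[group] = (idx, bits | (1 << line))

    def evict(self, addr):
        """When the group leaves the cache, remember what was actually used."""
        group = addr // (self.line_size * self.group_lines)
        if group in self.live:
            idx, bits = self.live.pop(group)
            self.table[idx] = bits

if __name__ == "__main__":
    spp = SpatialPatternPredictor()
    pc = 0x400ABC
    spp.trigger(pc, 0x10000)        # first visit: nothing predicted yet
    spp.record(0x10000)
    spp.record(0x10080)
    spp.evict(0x10000)
    # Next time the same instruction opens a group, lines 0 and 2 are predicted.
    print(spp.trigger(pc, 0x20000))   # [0, 2]
```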

113 citations


Journal ArticleDOI
TL;DR: The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes and proves that Min-SAUD achieves optimal stretch under some standard assumptions.
Abstract: Data caching at mobile clients is an important technique for improving the performance of wireless data dissemination systems. However, variable data sizes, data updates, limited client resources, and frequent client disconnections make cache management a challenge. We propose a gain-based cache replacement policy, Min-SAUD, for wireless data dissemination when cache consistency must be enforced before a cached item is used. Min-SAUD considers several factors that affect cache performance, namely, access probability, update frequency, data size, retrieval delay, and cache validation cost. The paper employs stretch as the major performance metric since it accounts for the data service time and, thus, is fair when items have different sizes. We prove that Min-SAUD achieves optimal stretch under some standard assumptions. Moreover, a series of simulation experiments have been conducted to thoroughly evaluate the performance of Min-SAUD under various system configurations. The simulation results show that, in most cases, the Min-SAUD replacement policy substantially outperforms two existing policies, namely, LRU and SAIU.
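
The exact Min-SAUD gain function is derived in the paper; as a hedged illustration, the helper below combines the five factors named in the abstract (access probability, update frequency, data size, retrieval delay, and cache validation cost) into a per-byte expected-saving score and evicts the lowest-scoring items first. The formula and the example item parameters are assumptions for illustration, not the paper's policy.

```python
def eviction_score(access_prob, update_rate, size, retrieval_delay, validation_cost):
    """Illustrative per-item score for a gain-based replacement policy: keep
    the items with the largest expected saved cost per unit of cache space."""
    # Chance the cached copy is still valid at the next access (updates kill it).
    p_fresh = access_prob / (access_prob + update_rate)
    saved_cost = retrieval_delay - validation_cost
    return access_prob * p_fresh * saved_cost / size

def choose_victims(items, bytes_needed):
    """Evict lowest-scoring items until enough space has been freed."""
    victims, freed = [], 0
    for name, params in sorted(items.items(), key=lambda kv: eviction_score(**kv[1])):
        if freed >= bytes_needed:
            break
        victims.append(name)
        freed += params["size"]
    return victims

if __name__ == "__main__":
    items = {
        "weather": dict(access_prob=0.5, update_rate=0.1, size=4_000,
                        retrieval_delay=200.0, validation_cost=10.0),
        "news":    dict(access_prob=0.2, update_rate=0.5, size=20_000,
                        retrieval_delay=400.0, validation_cost=10.0),
    }
    print(choose_victims(items, bytes_needed=10_000))   # ['news']
```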

Patent
20 Feb 2004
TL;DR: In this article, the authors propose a caching mechanism for a virtual persistent heap, which divides the virtual persistent heap into cache lines, the smallest amount of heap space that can be loaded or flushed at one time.
Abstract: A caching mechanism for a virtual persistent heap. A feature of a virtual persistent heap is the method used to cache portions of the virtual persistent heap into the physical heap. The caching mechanism may be effective with small consumer and appliance devices that typically have a small amount of memory and that may be using flash devices as persistent storage. In the caching mechanism, the virtual persistent heap may be divided into cache lines. A cache line is the smallest amount of virtual persistent heap space that can be loaded or flushed at one time. Caching in and caching out operations are used to load cache lines into the heap or to flush dirty cache lines into the store. Different cache line sizes may be used for different regions of the heap. Translation between a virtual persistent heap address and the heap may be simplified by the caching mechanism.

Proceedings ArticleDOI
16 Feb 2004
TL;DR: This work introduces the two-level cache tuner, or TCaT - a heuristic for searching the huge solution space of possible configurations and shows the integrity of the heuristic across multiple memory configurations and even in the presence of hardware/software partitioning.
Abstract: The power consumed by the memory hierarchy of a microprocessor can contribute to as much as 50% of the total microprocessor system power, and is thus a good candidate for optimizations. We present an automated method for tuning two-level caches to embedded applications for reduced energy consumption. The method is applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We introduce the two-level cache tuner, or TCaT - a heuristic for searching the huge solution space of possible configurations. The heuristic interlaces the exploration of the two cache levels and searches the various cache parameters in a specific order based on their impact on energy. We show the integrity of our heuristic across multiple memory configurations and even in the presence of hardware/software partitioning -- a common optimization capable of achieving significant speedups and/or reduced energy consumption. We apply our exploration heuristic to a large set of embedded applications. Our experiments demonstrate the efficacy of our heuristic: on average the heuristic examines only 7% of the possible cache configurations, but results in cache sub-system energy savings of 53%, only 1% more than the optimal cache configuration. In addition, the configured cache achieves an average speedup of 30% over the base cache configuration due to tuning of cache line size to the application's needs.
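
A sketch of the interlaced search idea is shown below: tune one cache parameter at a time in a fixed order of energy impact, alternating between the two levels, and keep a value only if the energy estimate improves. The parameter ranges, the toy energy model, and the exact ordering are assumptions for illustration; the real TCaT ordering and configuration space are given in the paper.

```python
def tune_two_level_cache(estimate_energy,
                         sizes=(2048, 4096, 8192, 16384),
                         line_sizes=(16, 32, 64),
                         assocs=(1, 2, 4)):
    """Heuristic search in the spirit of TCaT: instead of evaluating every
    (L1, L2) pair, tune size, then line size, then associativity, interleaving
    the two cache levels, keeping a new value only if it lowers the energy
    estimate. `estimate_energy(config)` stands in for a simulation run or a
    prototype-board measurement."""
    config = {level: {"size": sizes[0], "line": line_sizes[0], "assoc": assocs[0]}
              for level in ("L1", "L2")}
    best = estimate_energy(config)
    trials = 1
    for param, choices in (("size", sizes), ("line", line_sizes), ("assoc", assocs)):
        for level in ("L1", "L2"):          # interlace exploration of L1 and L2
            for value in choices[1:]:       # the smallest value is already in place
                previous = config[level][param]
                config[level][param] = value
                energy = estimate_energy(config)
                trials += 1
                if energy < best:
                    best = energy           # keep the improvement
                else:
                    config[level][param] = previous
    return config, best, trials

if __name__ == "__main__":
    # Made-up energy model: bigger caches cost static energy, smaller ones
    # cost miss energy; the heuristic settles somewhere in between.
    def toy_energy(cfg):
        return sum(c["size"] * 0.01 + 3e7 / (c["size"] * c["assoc"]) + c["line"]
                   for c in cfg.values())
    print(tune_two_level_cache(toy_energy))
```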

Journal ArticleDOI
TL;DR: This work introduces on-chip hardware implementing an efficient cache tuning heuristic that can automatically and dynamically tune the cache to an executing program, completely transparently to the programmer.
Abstract: Memory accesses often account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in prefabricated microprocessor platforms. Tuning those caches to a program is still, however, a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of tunable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. Our heuristic seeks not only to reduce the number of configurations that must be examined, but also traverses the search space in a way that minimizes costly cache flushes. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache saves on average 40% of total memory access energy over a standard nontuned reference cache.

Proceedings ArticleDOI
19 Apr 2004
TL;DR: This paper uses trace driven simulations to compare traditional cache replacement policies with new policies that try to exploit characteristics of the P2P file-sharing traffic generated by applications using the FastTrack protocol.
Abstract: Peer-to-peer (P2P) file-sharing applications generate a large part if not most of today's Internet traffic. The large volume of this traffic (thus the high potential benefits of caching) and the large cache sizes required (thus nontrivial costs associated with caching) only underline that efficient cache replacement policies are important in this case. P2P file-sharing traffic has several characteristics that distinguish it from well studied Web traffic and that require a focused study of efficient cache management policies. This paper uses trace driven simulations to compare traditional cache replacement policies with new policies that try to exploit characteristics of the P2P file-sharing traffic generated by applications using the FastTrack protocol.

Proceedings ArticleDOI
27 Jun 2004
TL;DR: The perhaps surprising result is that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, which gives some theoretical justification for designing machines with shared caches.
Abstract: We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp. We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp ≥ C1 + pd then Mp ≤ M1. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches. We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.

Patent
24 Jan 2004
TL;DR: In this article, a method of caching commands in microprocessors having a plurality of arithmetic units and in modules having a two- or multidimensional cell arrangement is provided, which includes combining the plurality of cells and arithmetic units to form groups, assigning a cache unit to a group, and connecting the cache unit with a higher level unit via a tree structure.
Abstract: A method of caching commands in microprocessors having a plurality of arithmetic units and in modules having a two- or multidimensional cell arrangement is provided. The method includes combining a plurality of cells and arithmetic units to form a plurality of groups, assigning a cache unit to a group, and connecting the cache unit to a higher level unit via a tree structure. The cache unit may send requests for required commands to the higher level cache unit, which may return a command sequence including the required command if the higher level cache unit holds a command sequence including the required command in its local memory.

Book ChapterDOI
25 Oct 2004
TL;DR: This paper proposes a different cache architecture, intended to ease WCET analysis, where the cache stores complete methods and cache misses occur only on method invocation and return.
Abstract: Cache memories are mandatory to bridge the growing gap between CPU speed and main memory access time. Standard cache organizations improve the average execution time but are difficult to predict for worst case execution time (WCET) analysis. This paper proposes a different cache architecture, intended to ease WCET analysis. The cache stores complete methods and cache misses occur only on method invocation and return. Cache block replacement depends on the call tree, instead of instruction addresses.
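
The idea is easy to sketch: the cache tracks whole methods, so a lookup happens only at invoke and return, and every instruction fetch inside a cached method is a guaranteed hit. For simplicity the sketch below replaces whole methods in LRU order, whereas the proposed design replaces based on the call tree; sizes and names are illustrative.

```python
from collections import OrderedDict

class MethodCache:
    """Sketch of a WCET-friendly method cache: only complete methods are
    cached, so misses can occur solely at invoke and return points."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()   # method name -> code size (arbitrary units)

    def invoke_or_return(self, method, size):
        """Returns True if the method was already cached (every instruction
        fetched inside it is then a hit), False if it had to be loaded."""
        if method in self.resident:
            self.resident.move_to_end(method)
            return True
        # Evict whole methods until the new one fits.
        while self.resident and sum(self.resident.values()) + size > self.capacity:
            self.resident.popitem(last=False)
        self.resident[method] = size
        return False

if __name__ == "__main__":
    mc = MethodCache(capacity=1024)
    print(mc.invoke_or_return("main", 300))      # False: first load
    print(mc.invoke_or_return("compute", 600))   # False
    print(mc.invoke_or_return("main", 300))      # True: still resident
```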

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This work answers the question "Why does a database system incur so many instruction cache misses?" and proposes techniques to buffer database operations during query execution to avoid instruction cache thrashing.
Abstract: As more and more query processing work can be done in main memory, memory access is becoming a significant cost component of database operations. Recent database research has shown that most of the memory stalls are due to second-level cache data misses and first-level instruction cache misses. While a lot of research has focused on reducing the data cache misses, relatively little research has been done on improving the instruction cache performance of database systems. We first answer the question "Why does a database system incur so many instruction cache misses?" We demonstrate that current demand-pull pipelined query execution engines suffer from significant instruction cache thrashing between different operators. We propose techniques to buffer database operations during query execution to avoid instruction cache thrashing. We implement a new light-weight "buffer" operator and study various factors which may affect the cache performance. We also introduce a plan refinement algorithm that considers the query plan and decides whether it is beneficial to add additional "buffer" operators and where to put them. The benefit is mainly from better instruction locality and better hardware branch prediction. Our techniques can be easily integrated into current database systems without significant changes. Our experiments in a memory-resident PostgreSQL database system show that buffering techniques can reduce the number of instruction cache misses by up to 80% and improve query performance by up to 15%.
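
The buffering idea can be sketched with a toy demand-pull pipeline: a hypothetical Buffer operator drains a whole batch of tuples from its child before handing any of them upward, so each operator's code runs over many tuples in a row and stays warm in the instruction cache. This does not model PostgreSQL's actual executor interfaces; the operator names and batch size are illustrative.

```python
class Scan:
    """Toy child operator producing tuples one at a time (demand-pull)."""
    def __init__(self, rows):
        self._iter = iter(rows)

    def next(self):
        return next(self._iter, None)

class Buffer:
    """Illustrative 'buffer' operator: sits between two query operators and
    drains a full batch from the child before returning anything, so the
    child's instructions are executed many times back to back instead of
    alternating with the parent's on every tuple."""

    def __init__(self, child, batch_size=128):
        self.child = child
        self.batch_size = batch_size
        self.batch = []

    def next(self):
        if not self.batch:
            # Stay inside the child's code for a full batch of calls.
            while len(self.batch) < self.batch_size:
                row = self.child.next()
                if row is None:
                    break
                self.batch.append(row)
            self.batch.reverse()            # pop() below then yields rows in order
        return self.batch.pop() if self.batch else None

if __name__ == "__main__":
    plan = Buffer(Scan(range(5)), batch_size=3)
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    print(out)   # [0, 1, 2, 3, 4]
```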

Patent
29 Sep 2004
TL;DR: In this paper, a system and method for providing dynamic mobile cache for mobile computing devices is described, where a cache is created at a server at the time a communication session between the server and a client is initiated.
Abstract: A system and method are described for providing a dynamic mobile cache for mobile computing devices. In one embodiment, a cache is created at a server at the time a communication session between the server and a client is initiated. The server then determines whether the client requires the cache. If it is determined that the client requires the cache, the server provides the cache to the client.

Proceedings ArticleDOI
16 Feb 2004
TL;DR: This work introduces on-chip hardware implementing an efficient cache tuning heuristic that can automatically and dynamically tune the cache to an executing program, completely transparently to the programmer.
Abstract: Memory accesses can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in pre-fabricated microprocessor platforms. Tuning those caches to a program is still however a cumbersome task left for designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is power and performance costly. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.

Patent
Sailesh Kottapalli
29 Dec 2004
TL;DR: An apparatus and method for fairly accessing a cache shared by multiple resources, such as multiple cores, multiple threads, or both, are described, in which each resource is assigned a static portion of the cache along with access to a dynamically shared portion.
Abstract: An apparatus and method for fairly accessing a shared cache with multiple resources, such as multiple cores, multiple threads, or both are herein described. A resource within a microprocessor sharing access to a cache is assigned a static portion of the cache and a dynamic portion. The resource is blocked from victimizing static portions assigned to other resources, yet, allowed to victimize the static portion assigned to the resource and the dynamically shared portion. If the resource does not access the cache enough times over a period of time, the static portion assigned to the resource is reassigned to the dynamically shared portion.

01 Jan 2004
TL;DR: This paper proposes a dynamic cache partitioning method for simultaneous multithreading systems that collects the miss-rate characteristics of simultaneously executing threads at runtime, and partitions the cache among the executing threads.
Abstract: This paper proposes a dynamic cache partitioning method for simultaneous multithreading systems. We present a general partitioning scheme that can be applied to set-associative caches at any partition granularity. Furthermore, in our scheme threads can have overlapping partitions, which provides more degrees of freedom when partitioning caches with low associativity. Since memory reference characteristics of threads can change very quickly, our method collects the miss-rate characteristics of simultaneously executing threads at runtime, and partitions the cache among the executing threads. Partition sizes are varied dynamically to improve hit rates. Trace-driven simulation results show a relative improvement in the L2 hit-rate of up to 40.5% over those generated by the standard least recently used replacement policy, and IPC improvements of up to 17%. Our results show that smart cache management and scheduling is important for SMT systems to achieve high performance.

Journal ArticleDOI
TL;DR: An approach and an implementation of a dynamic proxy caching technique which combines the benefits of both proxy-based and back-end caching approaches, yet does not suffer from their above-mentioned limitations is presented.
Abstract: As Internet traffic continues to grow and websites become increasingly complex, performance and scalability are major issues for websites. Websites are increasingly relying on dynamic content generation applications to provide website visitors with dynamic, interactive, and personalized experiences. However, dynamic content generation comes at a cost---each request requires computation as well as communication across multiple components. To address these issues, various dynamic content caching approaches have been proposed. Proxy-based caching approaches store content at various locations outside the site infrastructure and can improve website performance by reducing content generation delays, firewall processing delays, and bandwidth requirements. However, existing proxy-based caching approaches either (a) cache at the page level, which does not guarantee that correct pages are served and provides very limited reusability, or (b) cache at the fragment level, which is associated with several design-level and runtime scalability issues. To address these issues, several back-end caching approaches have been proposed, including query result caching and fragment level caching. While back-end approaches guarantee the correctness of results and offer the advantages of fine-grained caching, they neither address firewall delays nor reduce bandwidth requirements. In this article, we present an approach and an implementation of a dynamic proxy caching technique which combines the benefits of both proxy-based and back-end caching approaches, yet does not suffer from their above-mentioned limitations. Our dynamic proxy caching technique allows granular, proxy-based caching in highly dynamic scenarios, accessible outside the site infrastructure. We present two possible configurations for our dynamic proxy caching technique: (1) a reverse proxy configuration, and (2) a forward proxy configuration. Analysis of the performance of our approach indicates that it is capable of providing significant reductions in bandwidth. We have deployed our proposed dynamic proxy caching technique at a major financial institution. The results of this implementation indicate that our technique is capable of providing up to 3x reductions in bandwidth and response times in real-world dynamic Web applications when compared to existing caching solutions.


Patent
06 Jul 2004
TL;DR: A database system providing methodology for extended memory support is described, which comprises creating a secondary cache in memory available to the database system, mapping a virtual address range to at least a portion of the secondary cache, replacing pages from the primary cache using the secondary cache when the primary cache is full, and, when a requested page is found in the secondary cache, swapping it with a page in the primary cache.
Abstract: A database system providing methodology for extended memory support is described. In one embodiment, for example, a method is described for extended memory support in a database system having a primary cache, the method comprises steps of: creating a secondary cache in memory available to the database system; mapping a virtual address range to at least a portion of the secondary cache; when the primary cache is full, replacing pages from the primary cache using the secondary cache; in response to a request for a particular page, searching for the particular page in the secondary cache if the particular page is not found in the primary cache; if the particular page is found in the secondary cache, determining a virtual address in the secondary cache where the particular page resides based on the mapping; and swapping the particular page found in the secondary cache with a page in the primary cache, so as to replace a page in the primary cache with the particular page from the secondary cache.

Proceedings ArticleDOI
30 Mar 2004
TL;DR: This work shows that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over 73% of its user time stalled on CPU cache misses, and explores the use of prefetching to improve its cache performance.
Abstract: Hash join algorithms suffer from extensive CPU cache stalls. We show that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over 73% of its user time stalled on CPU cache misses, and explore the use of prefetching to improve its cache performance. Applying prefetching to hash joins is complicated by the data dependencies, multiple code paths, and inherent randomness of hashing. We present two techniques, group prefetching and software-pipelined prefetching, that overcome these complications. These schemes achieve 2.0-2.9X speedups for the join phase and 1.4-2.6X speedups for the partition phase over GRACE and simple prefetching approaches. Compared with previous cache-aware approaches (i.e. cache partitioning), the schemes are at least 50% faster on large relations and do not require exclusive use of the CPU cache to be effective.
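
The staged loop structure of group prefetching can be sketched as below. The prefetch call is a no-op stub standing in for a real prefetch instruction (a compiler intrinsic in a C implementation), so only the code structure carries over: bucket addresses for a group of probe tuples are computed and prefetched in one pass, and the buckets are dereferenced in a second pass so their cache misses can overlap. Names and the group size are illustrative.

```python
def prefetch(obj):
    """Stand-in for a hardware prefetch instruction; in this sketch it does
    nothing and only marks where the prefetch would be issued."""

def probe_group_prefetching(hash_table, probe_keys, group_size=16):
    """Group prefetching structure for the hash-join probe phase: process
    probe tuples in groups, splitting the per-tuple work into stages so the
    bucket fetches of different tuples are issued before any bucket is
    actually dereferenced."""
    results = []
    for start in range(0, len(probe_keys), group_size):
        group = probe_keys[start:start + group_size]
        # Stage 1: compute each tuple's bucket and issue its prefetch.
        buckets = [hash(key) % len(hash_table) for key in group]
        for b in buckets:
            prefetch(hash_table[b])
        # Stage 2: the buckets should now be (being) fetched; do the probes.
        for key, b in zip(group, buckets):
            results.extend(row for hkey, row in hash_table[b] if hkey == key)
    return results

if __name__ == "__main__":
    # Tiny build side: 8 buckets of (key, payload) pairs.
    table = [[] for _ in range(8)]
    for key in range(20):
        table[hash(key) % 8].append((key, f"row{key}"))
    print(probe_group_prefetching(table, probe_keys=[3, 7, 19]))
    # ['row3', 'row7', 'row19']
```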

Proceedings ArticleDOI
07 Jun 2004
TL;DR: The results show the technique is accurate to within 20% of miss rate for uniprocessors and was able to reduce the die area of a multiprocessor chip by a projected 14% over a naive design by accurately sizing caches for each processor.
Abstract: As multiprocessor systems-on-chip become a reality, performance modeling becomes a challenge. To quickly evaluate many architectures, some type of high-level simulation is required, including high-level cache simulation. We propose to perform this cache simulation by defining a metric to represent memory behavior independently of cache structure and back-annotate this into the original application. While the annotation phase is complex, requiring time comparable to normal address trace based simulation, it need only be performed once per application set and thus enables simulation to be sped up by a factor of 20 to 50 over trace based simulation. This is important for embedded systems, as software is often evaluated against many input sets and many architectures. Our results show the technique is accurate to within 20% of miss rate for uniprocessors and was able to reduce the die area of a multiprocessor chip by a projected 14% over a naive design by accurately sizing caches for each processor.

Patent
27 Jan 2004
TL;DR: In this paper, a multimedia apparatus comprises a cache buffer configured to be coupled to a storage device, wherein the cache buffer stores multimedia data, including video and audio data, read from the storage device.
Abstract: Systems and methods are provided for caching media data to thereby enhance media data read and/or write functionality and performance. A multimedia apparatus comprises a cache buffer configured to be coupled to a storage device, wherein the cache buffer stores multimedia data, including video and audio data, read from the storage device. A cache manager is coupled to the cache buffer, wherein the cache manager is configured to cause the storage device to enter a reduced power consumption mode when the amount of data stored in the cache buffer reaches a first level.

Patent
30 Jun 2004
TL;DR: In this article, a method and apparatus for partitioning a shared cache of a chip multi-processor are described, which includes a request of a cache block from system memory if a cache miss within a share cache is detected according to a received request from a processor.
Abstract: A method and apparatus for partitioning a shared cache of a chip multi-processor are described. In one embodiment, the method includes requesting a cache block from system memory if a cache miss within a shared cache is detected according to a received request from a processor. Once the cache block is requested, a victim block within the shared cache is selected according to a processor identifier and a request type of the received request. In one embodiment, selection of the victim block according to a processor identifier and request type is based on a partition of a set-associative, shared cache to limit the selection of the victim block to a subset of available cache ways according to the cache partition. Other embodiments are described and claimed.
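
A simplified sketch of partition-constrained victim selection is given below: a core may victimize only its own statically assigned ways plus the dynamically shared ways, and among those the least recently used way is chosen. The static/shared split, the unused request-type hook, and the example set state are simplifications and assumptions, not the patent's claimed implementation.

```python
def select_victim_way(ways_lru_order, core_id, request_type, static_partition):
    """Pick a victim way for `core_id` in one set of a shared, set-associative
    cache. The core may evict from its own static ways and from ways not
    statically assigned to anyone (the dynamically shared portion), but never
    from another core's static ways. `request_type` is accepted only as a
    hook; this sketch does not differentiate request types."""
    allowed = set(static_partition.get(core_id, [])) | {
        w for w in range(len(ways_lru_order))
        if all(w not in ways for ways in static_partition.values())
    }
    # ways_lru_order lists way numbers from least to most recently used.
    for way in ways_lru_order:
        if way in allowed:
            return way
    raise ValueError("no way available to this core")

if __name__ == "__main__":
    # 8-way set: ways 0-1 reserved for core 0, ways 2-3 for core 1, 4-7 shared.
    partition = {0: [0, 1], 1: [2, 3]}
    lru_order = [2, 5, 0, 7, 1, 3, 4, 6]   # way 2 is globally LRU
    print(select_victim_way(lru_order, core_id=0, request_type="read",
                            static_partition=partition))   # 5, not core 1's way 2
```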