
Showing papers on "Cache algorithms" published in 2005


Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads and the most accurate model, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
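The models take each thread's isolated stack distance profile as input. As a rough illustration of how such a profile can be turned into an extra-miss estimate (deliberately not the paper's models), the sketch below assumes the co-runner simply dilutes each reuse distance in proportion to the two threads' access rates; the profile format and dilution rule are invented for the example.

```python
# Illustrative sketch (not the paper's models): estimate a thread's extra
# shared-cache misses from its isolated stack distance profile, assuming the
# co-runner's interleaved accesses push each reuse further down the LRU stack
# in proportion to the two threads' access rates.

def isolated_misses(profile, assoc):
    """profile[d] = accesses with stack distance d; profile['inf'] = cold
    accesses. An access misses if its distance is >= the associativity."""
    return profile.get('inf', 0) + sum(n for d, n in profile.items()
                                       if d != 'inf' and d >= assoc)

def shared_misses_estimate(profile, assoc, own_rate, other_rate):
    """Assume the co-runner inserts other_rate/own_rate lines between each
    reuse, i.e. scale every reuse distance by (own_rate + other_rate)/own_rate
    (a simplistic dilution rule, used here only for illustration)."""
    scale = (own_rate + other_rate) / own_rate
    misses = profile.get('inf', 0)
    for d, n in profile.items():
        if d != 'inf' and d * scale >= assoc:
            misses += n
    return misses

# Example: a thread whose reuses mostly fit within 8 ways when run alone.
profile = {0: 500, 2: 300, 5: 150, 9: 40, 'inf': 10}
alone = isolated_misses(profile, assoc=8)
shared = shared_misses_estimate(profile, assoc=8, own_rate=1.0, other_rate=1.0)
print(alone, shared, shared - alone)   # 50 200 -> 150 extra misses
```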

543 citations


Patent
31 Oct 2005
TL;DR: In this paper, the edge DNS cache servers are published as the authoritative servers for customer domains instead of the origin server, and when a request for a DNS record results in a cache miss, the edge cache servers get the information from the origin servers and cache it for use in response to future requests.
Abstract: A distributed DNS network includes a central origin server that actually controls the zone, and edge DNS cache servers configured to cache the DNS content of the origin server. The edge DNS cache servers are published as the authoritative servers for customer domains instead of the origin server. When a request for a DNS record results in a cache miss, the edge DNS cache servers get the information from the origin server and cache it for use in response to future requests. Multiple edge DNS cache servers can be deployed at multiple locations. Since an unlimited number of edge DNS cache servers can be deployed, the system is highly scalable. The disclosed techniques protect against DoS attacks, as DNS requests are not made to the origin server directly.

370 citations


Journal ArticleDOI
01 May 2005
TL;DR: This paper presents a new cache management policy, victim replication, which combines the advantages of private and shared schemes, and shows that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks.
Abstract: In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile, or 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.
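For intuition, here is a small hedged sketch of the replication decision described above: on a primary-cache eviction, the local L2 slice tries to keep the victim by overwriting, in order of preference, an invalid entry, a global line with no sharers, or an existing replica, and otherwise drops it. The state encoding and priority order paraphrase the scheme; they are not the paper's implementation.

```python
# Minimal sketch of a victim-replication-style decision, assuming each L2
# slice tracks, per line, whether the entry is invalid, a replica made for
# some tile, or "global" (home) data with or without remote sharers.

INVALID, REPLICA, GLOBAL_UNSHARED, GLOBAL_SHARED = range(4)

def choose_replica_slot(local_slice_set):
    """local_slice_set: list of line states in one set of the local L2 slice.
    Return the index to overwrite with the L1 victim's replica, or None if
    the victim should simply be dropped (plain shared-cache behaviour)."""
    for wanted in (INVALID, GLOBAL_UNSHARED, REPLICA):
        for i, state in enumerate(local_slice_set):
            if state == wanted:
                return i
    return None  # only actively shared global lines here: do not replicate

# Example: prefer the unshared global line over an existing replica.
print(choose_replica_slot([GLOBAL_SHARED, GLOBAL_UNSHARED, REPLICA]))  # 1
print(choose_replica_slot([GLOBAL_SHARED, GLOBAL_SHARED]))             # None
```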

331 citations


Patent
07 Mar 2005
TL;DR: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies as mentioned in this paper, such as write-through, write-back, and read-look-ahead.
Abstract: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies. Cache policies include write-through, write-back, and read-look-ahead. Write-through and write-back policies may improve speed. Read-look-ahead cache allows more efficient use of the bus between the buffer cache and non-volatile memory. A session command allows data to be maintained in volatile memory by guaranteeing against power loss.

256 citations


Patent
13 Jun 2005
TL;DR: A peer-to-peer name resolution protocol (PNRP) is proposed in this paper, which allows resolution of names which are mapped onto the circular number space through a hash function.
Abstract: A serverless name resolution protocol ensures convergence despite the size of the network, without requiring an ever-increasing cache and with a reasonable number of hops. This convergence is ensured through a multi-level cache and a proactive cache initialization strategy. The multi-level cache is built based on a circular number space. Each level contains information from different levels of slivers of the circular space. A mechanism is included to add a level to the multi-level cache when the node determines that the last level is full. A peer-to-peer name resolution protocol (PNRP) includes a mechanism to allow resolution of names which are mapped onto the circular number space through a hash function. Further, the PNRP may also operate with the domain name system by providing each node with an identification consisting of a domain name service (DNS) component and a unique number.

228 citations


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is demonstrated that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.

218 citations


Journal ArticleDOI
C.W. Slayman1
TL;DR: In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets, and the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability are covered.
Abstract: As the size of the SRAM cache and DRAM memory grows in servers and workstations, cosmic-ray errors are becoming a major concern for systems designers and end users. Several techniques exist to detect and mitigate the occurrence of cosmic-ray upset, such as error detection, error correction, cache scrubbing, and array interleaving. This paper covers the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability. In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets.

205 citations


Journal ArticleDOI
01 May 2005
TL;DR: The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite, which translates into an average IPC improvement of 8%.
Abstract: As processor speeds increase and memory latency becomes more critical, intelligent design and management of secondary caches becomes increasingly important. The efficiency of current set-associative caches is reduced because programs exhibit a non-uniform distribution of memory accesses across different cache sets. We propose a technique to vary the associativity of a cache on a per-set basis in response to the demands of the program. By increasing the number of tag-store entries relative to the number of data lines, we achieve the performance benefit of global replacement while maintaining the constant hit latency of a set-associative cache. The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite. This translates into an average IPC improvement of 8%.
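The sketch below illustrates the decoupling the abstract describes: a set-associative tag store with twice as many tag entries as data lines, plus a globally managed data store. Global LRU stands in for the paper's reuse-based replacement, and the sizes and tag-victim choice are illustrative assumptions, not the proposed design.

```python
# Condensed sketch of the V-Way idea: more tag entries than data lines, with
# data lines drawn from one global pool under a global replacement policy.

from collections import OrderedDict

class VWaySketch:
    def __init__(self, num_sets=4, tag_ways=8, num_data=16):   # 2x tags vs data
        self.num_sets, self.tag_ways = num_sets, tag_ways
        self.tags = [dict() for _ in range(num_sets)]   # set -> {tag: data_idx}
        self.data_lru = OrderedDict()                   # data_idx -> (set, tag)
        self.free = list(range(num_data))

    def _release(self, data_idx):
        self.data_lru.pop(data_idx, None)
        self.free.append(data_idx)

    def access(self, addr):
        s, tag = addr % self.num_sets, addr // self.num_sets
        tset = self.tags[s]
        if tag in tset:
            self.data_lru.move_to_end(tset[tag])        # hit: touch global LRU
            return True
        if len(tset) >= self.tag_ways:                  # local tag-store victim
            self._release(tset.pop(next(iter(tset))))   # (arbitrary stand-in)
        if self.free:                                   # take a free data line...
            data_idx = self.free.pop()
        else:                                           # ...or evict globally
            data_idx, (vs, vtag) = self.data_lru.popitem(last=False)
            self.tags[vs].pop(vtag, None)
        tset[tag] = data_idx
        self.data_lru[data_idx] = (s, tag)
        return False                                    # miss

cache = VWaySketch()
print([cache.access(a) for a in (0, 4, 0)])             # [False, False, True]
```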

204 citations


Proceedings ArticleDOI
Matteo Frigo1, Volker Strumpen1
20 Jun 2005
TL;DR: This work presents a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Abstract: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
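For readers who want the shape of the algorithm, below is a compact Python rendering of the 1D trapezoidal space-time recursion usually presented for cache-oblivious stencils: wide trapezoids are cut in space along a slope of -1, tall ones are cut in time. The 3-point averaging kernel, fixed boundaries, and two-buffer storage are illustrative assumptions rather than the paper's code.

```python
# Sketch of the 1D trapezoidal recursion behind cache-oblivious stencil codes.

def walk1(t0, t1, x0, dx0, x1, dx1, grids, n):
    """Traverse the trapezoid covering time steps [t0, t1) whose left/right
    space edges start at x0/x1 and move with slopes dx0/dx1 per time step."""
    dt = t1 - t0
    if dt == 1:
        cur, nxt = grids[t0 % 2], grids[(t0 + 1) % 2]
        for x in range(x0, x1):
            if x == 0 or x == n - 1:
                nxt[x] = cur[x]                      # fixed boundary values
            else:
                nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # wide trapezoid: cut in space through the centre with slope -1
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk1(t0, t1, x0, dx0, xm, -1, grids, n)
            walk1(t0, t1, xm, -1, x1, dx1, grids, n)
        else:
            # tall trapezoid: cut in time halfway up
            s = dt // 2
            walk1(t0, t0 + s, x0, dx0, x1, dx1, grids, n)
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1, grids, n)

# Advance a 32-point line by 16 time steps; grids[t % 2] holds time step t.
n, T = 32, 16
grids = [[float(i) for i in range(n)], [0.0] * n]
walk1(0, T, 0, 0, n, 0, grids, n)
result = grids[T % 2]
```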

196 citations


Patent
22 Nov 2005
TL;DR: A cache server includes a media serving engine capable of distributing media content and a cache engine, coupled to the media serving engine, that caches media content; a set of cache policies accessible by the cache engine defines its operation.
Abstract: A cache server includes a media serving engine that is capable of distributing media content. A cache engine is coupled to the media serving engine and capable of caching media content. A set of cache policies is accessible by the cache engine to define the operation of the cache engine. The cache server can be configured to operate as either a cache server or an origin server. The cache server also includes a data communication interface coupled to the cache engine and the media serving engine to allow the cache engine to receive media content across a network and to allow the media serving engine to distribute media content across the network. The cache policies include policies for distributing media content from the media server, policies for handling cache misses, and policies for prefetching media content.

165 citations


Journal ArticleDOI
TL;DR: A GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables and contains a more complete picture of cache miss history, which reduces stale table data, improving accuracy and reducing memory traffic.
Abstract: Over the past couple of decades, trends in both microarchitecture and underlying semiconductor technology have significantly reduced microprocessor clock periods. These trends have significantly increased relative main-memory latencies as measured in processor clock cycles. To avoid large performance losses caused by long memory access delays, microprocessors rely heavily on a hierarchy of cache memories. But cache memories are not always effective, either because they are not large enough to hold a program's working set, or because memory access patterns don't exhibit behavior that matches a cache memory's demand-driven, line-structured organization. To partially overcome cache memories' limitations, we organize data cache prefetch information in a new way: a GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables. It reduces stale table data, improving accuracy and reducing memory traffic. It contains a more complete picture of cache miss history and is smaller than conventional tables.
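As a concrete illustration of that structure, the sketch below implements a GHB-style, PC-localized stride prefetcher: an index table points at the most recent miss from each PC, and each buffer entry links to the previous miss from the same PC. The table sizes, three-entry history walk, and stride rule are illustrative choices, not the article's exact configuration.

```python
# Minimal GHB-style prefetcher sketch (PC-localized stride flavour).

from collections import namedtuple

Entry = namedtuple("Entry", "addr prev")     # prev: index of older entry, or None

class GHBPrefetcher:
    def __init__(self, ghb_size=256, depth=2):
        self.ghb = [None] * ghb_size         # circular global history buffer
        self.head = 0                        # total entries ever inserted
        self.index = {}                      # PC -> index of most recent entry
        self.depth = depth                   # prefetches issued per prediction

    def miss(self, pc, addr):
        """Record a cache miss; return a list of addresses to prefetch."""
        self.ghb[self.head % len(self.ghb)] = Entry(addr, self.index.get(pc))
        self.index[pc] = self.head
        self.head += 1
        # Walk the per-PC linked list to collect the last few miss addresses.
        history, idx = [], self.index[pc]
        while idx is not None and self.head - idx <= len(self.ghb) and len(history) < 3:
            entry = self.ghb[idx % len(self.ghb)]
            history.append(entry.addr)
            idx = entry.prev
        if len(history) < 3:
            return []
        d1, d2 = history[0] - history[1], history[1] - history[2]
        if d1 == d2 and d1 != 0:             # constant stride detected
            return [addr + d1 * (i + 1) for i in range(self.depth)]
        return []

pf = GHBPrefetcher()
for a in (0x1000, 0x1040, 0x1080):
    hints = pf.miss(pc=0x400123, addr=a)
print(hints)                                 # [4288, 4352], i.e. 0x10C0, 0x1100
```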

Patent
31 Jan 2005
TL;DR: Data-aware cache as discussed by the authors is a method and system directed to improve effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage.
Abstract: A method and system directed to improving the effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage. The data-aware cache differentiates and manages data using a state machine having certain states. The data-aware cache may use data pattern and traffic statistics to retain frequently used data in cache longer by transitioning it into Sticky or StickyDirty states. The data-aware cache may also use content or application related attributes to differentiate and retain certain data in cache longer. Further, the data-aware cache may provide cache status and statistics information to a data-aware data flow manager, thus assisting the data-aware data flow manager in determining which data to cache and which data to pipe directly through, or in switching cache policies dynamically, thus avoiding some of the overhead associated with caches. The data-aware cache may also place clean and dirty data in separate states, enabling more efficient cache mirroring and flushing, thus improving system reliability and performance.
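Purely as an illustration of the state-machine idea, the sketch below moves blocks between Clean, Dirty, Sticky, and StickyDirty states based on access counts; the transition triggers, threshold, and eviction preference are assumptions for the example, not the patent's definitions.

```python
# Hypothetical per-block state machine in the spirit of a "data-aware" cache.

CLEAN, DIRTY, STICKY, STICKY_DIRTY = "Clean", "Dirty", "Sticky", "StickyDirty"

def next_state(state, event, access_count, hot_threshold=4):
    """event is 'read', 'write', or 'flush'; access_count is how often the
    block has been touched recently (tracked elsewhere)."""
    hot = access_count >= hot_threshold            # frequently used block
    if event == "flush":                           # dirty data written back
        return STICKY if state in (STICKY, STICKY_DIRTY) else CLEAN
    if event == "write":
        return STICKY_DIRTY if hot or state in (STICKY, STICKY_DIRTY) else DIRTY
    # read
    if state in (DIRTY, STICKY_DIRTY):
        return STICKY_DIRTY if hot else state
    return STICKY if hot or state == STICKY else CLEAN

# Evict cold clean blocks first; retain sticky (frequently used) data longest.
EVICTION_ORDER = [CLEAN, DIRTY, STICKY, STICKY_DIRTY]

print(next_state(CLEAN, "write", access_count=1))   # Dirty
print(next_state(DIRTY, "read", access_count=5))    # StickyDirty
```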

Proceedings ArticleDOI
04 Apr 2005
TL;DR: A new attack against a software implementation of the Advanced Encryption Standard, aimed at flushing elements of the SBOX from the cache, thus inducing a cache miss during the encryption phase, which can be used to recover part of the secret key.
Abstract: This paper presents a new attack against a software implementation of the Advanced Encryption Standard. The attack aims at flushing elements of the SBOX from the cache, thus inducing a cache miss during the encryption phase. The power trace is then used to detect when the cache miss occurs; if the miss happens in the first round of AES, then the information can be used to recover part of the secret key. The attack has been simulated using the Wattch simulation framework and a simple software implementation of AES (using a single table for the SBOX). The attack can be easily extended to more sophisticated versions of AES with more than one table. Finally, we present a simple countermeasure which does not require randomization.

Proceedings Article
30 Aug 2005
TL;DR: A notion of XPath query/view answerability is described, which allows us to reduce tree operations to string operations for matching a query/view pair, and it is shown how to store and maintain the cached views in relational tables, so that cache lookup is very efficient.
Abstract: In this paper, we propose a method for maintaining a semantic cache of materialized XPath views. The cached views include queries that have been previously asked, and additional selected views. The cache can be stored inside or outside the database. We describe a notion of XPath query/view answerability, which allows us to reduce tree operations to string operations for matching a query/view pair. We show how to store and maintain the cached views in relational tables, so that cache lookup is very efficient. We also describe a technique for view selection, given a warm-up workload. We experimentally demonstrate the efficiency of our caching techniques, and performance gains obtained by employing such a cache.

Journal ArticleDOI
TL;DR: An offline energy-optimal cache replacement algorithm using dynamic programming, which minimizes the disk energy consumption and an offline power-aware greedy algorithm that is more energy-efficient than Belady's offline algorithm (which minimizes cache misses only).
Abstract: Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest energy consumers. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This article proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an offline energy-optimal cache replacement algorithm using dynamic programming, which minimizes the disk energy consumption. We also present an offline power-aware greedy algorithm that is more energy-efficient than Belady's offline algorithm (which minimizes cache misses only). We also propose two online power-aware algorithms, PA-LRU and PB-LRU. Simulation results with both a real system and synthetic workloads show that, compared to LRU, our online algorithms can save up to 22 percent more disk energy and provide up to 64 percent better average response time. We have also investigated the effects of four storage cache write policies on disk energy consumption.

Proceedings Article
Binny S. Gill1, Dharmendra S. Modha1
10 Apr 2005
TL;DR: A self-tuning, low overhead, simple to implement, locally adaptive, novel cache management policy SARC is designed that dynamically and adaptively partitions the cache space amongst sequential and random streams so as to reduce the read misses.
Abstract: Sequentiality of reference is a ubiquitous access pattern dating back at least to Multics. Sequential workloads lend themselves to highly accurate prediction and prefetching. In spite of the simplicity of the workload, the design and analysis of a good sequential prefetching algorithm and associated cache replacement policy turn out to be surprisingly intricate. As a first contribution, we uncover and remedy an anomaly (akin to the famous Belady's anomaly) that plagues sequential prefetching when integrated with caching. Typical workloads contain a mix of sequential and random streams. As a second contribution, we design a self-tuning, low overhead, simple to implement, locally adaptive, novel cache management policy SARC that dynamically and adaptively partitions the cache space amongst sequential and random streams so as to reduce the read misses. As a third contribution, we implemented SARC along with two popular state-of-the-art LRU variants on hardware for IBM's flagship storage controller Shark. On Shark hardware with 8 GB cache and 16 RAID-5 arrays that is serving a workload akin to Storage Performance Council's widely adopted SPC-1 benchmark, SARC consistently and dramatically outperforms the two LRU variants, shifting the throughput-response time curve to the right and thus fundamentally increasing the capacity of the system. As anecdotal evidence, at the peak throughput, SARC has an average response time of 5.18ms as compared to 33.35ms and 8.92ms for the two LRU variants.
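The core of such a policy is deciding how much cache each stream type deserves. The hedged sketch below adapts a target size for the sequential list by comparing hit rates near the LRU end of the sequential and random lists; the window size, step, and utility estimate are illustrative stand-ins for SARC's actual adaptation rule.

```python
# Rough sketch of adaptive partitioning between a sequential (SEQ) and a
# random (RANDOM) LRU list: grow whichever list shows the higher marginal
# utility, estimated from hits near its LRU end. Constants are illustrative.

class SarcLikePartition:
    def __init__(self, cache_size, bottom_fraction=0.05, step=1):
        self.cache_size = cache_size
        self.desired_seq = cache_size // 2        # target size of the SEQ list
        self.window = max(1, int(cache_size * bottom_fraction))
        self.step = step
        self.seq_bottom_hits = 0
        self.rand_bottom_hits = 0

    def record_hit(self, which_list, depth_from_lru_end):
        """Call on every hit; depth_from_lru_end is the hit's distance from
        the LRU end of its list (0 = would have been evicted next)."""
        if depth_from_lru_end < self.window:
            if which_list == "seq":
                self.seq_bottom_hits += 1
            else:
                self.rand_bottom_hits += 1

    def adapt(self):
        """Call periodically: grow the list whose bottom is getting more hits."""
        if self.seq_bottom_hits > self.rand_bottom_hits:
            self.desired_seq = min(self.cache_size, self.desired_seq + self.step)
        elif self.rand_bottom_hits > self.seq_bottom_hits:
            self.desired_seq = max(0, self.desired_seq - self.step)
        self.seq_bottom_hits = self.rand_bottom_hits = 0
        return self.desired_seq   # replacement evicts from SEQ iff it exceeds this

part = SarcLikePartition(cache_size=1000)
part.record_hit("seq", depth_from_lru_end=10)     # sequential data barely survived
print(part.adapt())                               # 501: give SEQ a little more space
```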

Journal ArticleDOI
01 Jul 2005
TL;DR: An out-of-core multilevel minimization algorithm is designed and implemented, and its performance is tested on unstructured meshes composed of tens to hundreds of millions of triangles; the resulting layouts can significantly reduce the number of cache misses.
Abstract: We present a novel method for computing cache-oblivious layouts of large meshes that improve the performance of interactive visualization and geometric processing algorithms. Given that the mesh is accessed in a reasonably coherent manner, we assume no particular data access patterns or cache parameters of the memory hierarchy involved in the computation. Furthermore, our formulation extends directly to computing layouts of multi-resolution and bounding volume hierarchies of large meshes. We develop a simple and practical cache-oblivious metric for estimating cache misses. Computing a coherent mesh layout is reduced to a combinatorial optimization problem. We designed and implemented an out-of-core multilevel minimization algorithm and tested its performance on unstructured meshes composed of tens to hundreds of millions of triangles. Our layouts can significantly reduce the number of cache misses. We have observed 2-20 times speedups in view-dependent rendering, collision detection, and isocontour extraction without any modification of the algorithms or runtime applications.

Proceedings ArticleDOI
06 Jul 2005
TL;DR: A conservative polynomial algorithm is presented that extends real-time scheduling analysis to account for cache effects due to both the preempted and the preempting task, bounding the preemption delay for each task accurately.
Abstract: Accurate timing analysis is key to efficient embedded system synthesis and integration. Caches are needed to increase processor performance, but they are hard to use because of their complex behaviour, especially in preemptive scheduling. Current approaches use simplified assumptions or propose exponentially complex scheduling analysis algorithms to bound the cache-related preemption delay at a context switch. We present a conservative polynomial algorithm that extends real-time scheduling analysis to consider cache effects due to both the preempted and the preempting task when bounding the preemption delay. Dataflow analysis at the task level is combined with real-time scheduling analysis to determine the response time, including cache-related preemption delay, for each task accurately. The experiments show significant improvement in analysis precision over previous polynomial approaches for typical embedded benchmarks.
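A common way to bound this delay, shown in the sketch below, is to intersect the cache sets the preempted task can still reuse with the sets the preempting task may touch and charge one miss penalty per set in the intersection. This illustrates the idea of combining both tasks' cache footprints; it is not the paper's dataflow-based analysis, and the example numbers are hypothetical.

```python
# Set-intersection bound on cache-related preemption delay (CRPD): only cache
# sets that the preempted task may reuse ("useful" sets) and that the
# preempting task may touch ("evicting" sets) can add misses.

def crpd_bound(useful_sets, evicting_sets, miss_penalty_cycles):
    """useful_sets / evicting_sets: sets of cache-set indices from a per-task
    analysis; returns an upper bound on the delay of one preemption."""
    return len(useful_sets & evicting_sets) * miss_penalty_cycles

# Example with hypothetical analysis results for a 256-set cache:
ucb_preempted = {3, 4, 5, 17, 18, 200}      # sets the preempted task reuses
ecb_preempting = set(range(0, 32))          # sets the preempting task touches
print(crpd_bound(ucb_preempted, ecb_preempting, miss_penalty_cycles=30))  # 150
```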

Proceedings Article
10 Apr 2005
TL;DR: To use the L2 cache as efficiently as possible, an L2-conscious scheduling algorithm is proposed; using it, it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.
Abstract: We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermine the ability to hide memory latency on these processors. To use the L2 cache as efficiently as possible, we propose an L2-conscious scheduling algorithm and quantify its performance potential. Using this algorithm it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.

Patent
Mason Cabot1
20 Dec 2005
TL;DR: In this article, a cache line eviction mechanism is proposed that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions.
Abstract: A method and apparatus to enable programmatic control of cache line eviction policies. A mechanism is provided that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions. Corresponding cues to assist in effecting the cache eviction policies associated with given priority levels are embedded in machine code generated from source- and/or assembly-level code. Cache architectures are provided that partition cache space into multiple pools, each pool being assigned a different priority. In response to execution of a memory access instruction, an appropriate cache pool is selected and searched based on information contained in the instruction's cue. On a cache miss, a cache line is selected from that pool to be evicted using a cache eviction policy associated with the pool. Implementations of the mechanism are described for both n-way set-associative caches and fully-associative caches.
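The sketch below illustrates the pool-partitioned behaviour described above: a hint carried with the access selects a pool, hits refresh that pool's LRU state, and evictions are confined to the selected pool. The pool layout, hint encoding, and per-pool LRU policy are assumptions for the example, not the patent's design.

```python
# Illustrative pool-partitioned cache with per-pool LRU eviction.

from collections import OrderedDict

class PooledCache:
    def __init__(self, pool_sizes):               # e.g. {0: 64, 1: 192} lines
        self.pools = {p: OrderedDict() for p in pool_sizes}
        self.sizes = dict(pool_sizes)

    def access(self, addr, priority_hint):
        pool = self.pools[priority_hint]
        if addr in pool:
            pool.move_to_end(addr)                # hit: refresh LRU position
            return True
        if len(pool) >= self.sizes[priority_hint]:
            pool.popitem(last=False)              # miss: evict within this pool only
        pool[addr] = True
        return False

cache = PooledCache({0: 2, 1: 4})                 # tiny pools for illustration
for a, hint in [(1, 0), (2, 0), (3, 0), (1, 0)]:
    hit = cache.access(a, hint)
print(hit)  # False: address 1 was already evicted from the small priority-0 pool
```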

Proceedings ArticleDOI
20 Mar 2005
TL;DR: A new method to accurately estimate the reliability of cache memories is presented and three different techniques are presented to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude.
Abstract: Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
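The headline MTTF figure follows from simple FIT arithmetic, reproduced below under the assumption that every bit is equally vulnerable (no derating for masked errors).

```python
# Back-of-the-envelope MTTF for a 16 KB cache at 0.002 FIT/bit
# (FIT = failures per 10^9 device-hours).

bits = 16 * 1024 * 8                 # 131,072 bits in a 16 KB cache
fit_per_bit = 0.002
total_fit = bits * fit_per_bit       # ~262 failures per 10^9 hours
mttf_hours = 1e9 / total_fit
mttf_years = mttf_hours / (24 * 365)
print(round(mttf_years))             # ~435 years, consistent with "at least 400 years"
```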

Proceedings ArticleDOI
06 Jun 2005
TL;DR: StatCache is presented, a performance tool based on a statistical cache model that has a small run-time overhead while providing much of the flexibility of simulator-based tools and demonstrates how the flexibility can be used to better understand the characteristics of cache-related performance problems.
Abstract: Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insights and flexibility and model many different cache configurations, but have the drawback of large run-time overhead. We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view. We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
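To give a flavour of what a statistical cache model looks like, the sketch below solves a fixed-point equation for the miss ratio from sampled reuse distances, assuming a random-replacement cache of L lines. It is a simplified stand-in for StatCache's actual model and sampling machinery.

```python
# Fixed-point estimate of miss ratio from sampled reuse distances (number of
# memory references between two uses of the same cache line), assuming a
# random-replacement cache with `cache_lines` lines.

def estimate_miss_ratio(reuse_distances, cache_lines, iters=100):
    # Probability that a line has been evicted after n intervening misses.
    evict = lambda n: 1.0 - (1.0 - 1.0 / cache_lines) ** n
    m = 0.5                                     # initial guess
    for _ in range(iters):
        # Each sampled reuse sees about m * r intervening misses on average.
        m = sum(evict(m * r) for r in reuse_distances) / len(reuse_distances)
    return m

# Example: mostly short reuses plus a tail of long ones, 4096-line cache.
samples = [10] * 80 + [100000] * 20
print(round(estimate_miss_ratio(samples, cache_lines=4096), 3))   # ~0.199
```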

Patent
Josh Sacks1
11 Aug 2005
TL;DR: In this paper, the authors proposed a cache-based approach to cache pre-computed map images (e.g., map tiles) to minimize the latency of a mapping application on high-latency and low-throughput networks.
Abstract: Techniques are disclosed that enable users to access and use digital mapping systems with constrained-resource services and/or mobile devices (e.g., cell phones and PDAs). In particular, latency of a mapping application on high-latency and low-throughput networks is minimized. One embodiment utilizes volatile and non-volatile storage of the mobile device to cache pre-computed map images (e.g., map tiles). An asynchronous cache can be used to prevent delays caused by potentially slow non-volatile storage. Meta-data about each map image and usage patterns can be stored and used by the cache to optimize hit rates.

Proceedings ArticleDOI
06 Jun 2005
TL;DR: This paper shows that kernel prefetching can have a significant impact on the relative performance in terms of the number of actual disk I/Os of many well-known replacement algorithms; it can not only narrow the performance gap but also change the relative performance benefits of different algorithms.
Abstract: A fundamental challenge in improving the file system performance is to design effective block replacement algorithms to minimize buffer cache misses. Despite the well-known interactions between prefetching and caching, almost all buffer cache replacement algorithms have been proposed and studied comparatively without taking into account file system prefetching which exists in all modern operating systems. This paper shows that such kernel prefetching can have a significant impact on the relative performance in terms of the number of actual disk I/Os of many well-known replacement algorithms; it can not only narrow the performance gap but also change the relative performance benefits of different algorithms. These results demonstrate the importance for buffer caching research to take file system prefetching into consideration.

Patent
22 Dec 2005
TL;DR: In this article, the authors present a method for run-time cache optimization based on profiling a program code during a runtime execution, logging the performance for producing a cache log, and rearranging a portion of program code in view of the cache log for producing rearranged portion.
Abstract: A method (400) and system (106) is provided for run-time cache optimization. The method includes profiling (402) the performance of program code during run-time execution, logging (408) the performance to produce a cache log, and rearranging (410) a portion of the program code in view of the cache log to produce a rearranged portion. The rearranged portion is supplied to a memory management unit (240) for managing at least one cache memory (110-140). The cache log can be collected during real-time operation of a communication device and is fed back to a linking process (244) to maximize cache locality at compile time. The method further includes loading a saved profile corresponding with a run-time operating mode, and reprogramming a new code image associated with the saved profile.

Patent
30 Sep 2005
TL;DR: In this paper, instruction-assisted cache management for efficient use of cache and memory is discussed, where hints (e.g., modifiers) are added to read and write memory access instructions to identify that a memory access is for temporal data.
Abstract: Instruction-assisted cache management for efficient use of cache and memory. Hints (e.g., modifiers) are added to read and write memory access instructions to identify that the memory access is for temporal data. In view of such hints, alternative cache and allocation policies are implemented that minimize cache and memory accesses. Under one policy, a write cache miss may result in a write of data to a partial cache line without a memory read/write cycle to fill the remainder of the line. Under another policy, a read cache miss may result in a read from memory without allocating or writing the read data to a cache line. A cache line soft-lock mechanism is also disclosed, wherein cache lines may be selectably soft-locked to indicate a preference for keeping those cache lines over non-locked lines.

Patent
30 Dec 2005
TL;DR: In this article, a method and system for providing granular timed invalidation of dynamically generated objects stored in a cache is presented, which can cache objects with expiry times down to very small intervals of time.
Abstract: The present invention is directed towards a method and system for providing granular timed invalidation of dynamically generated objects stored in a cache. The techniques of the present invention incorporate the ability to configure the expiration time of objects stored by the cache to fine granular time intervals, such as the granularity of time intervals provided by a packet processing timer of a packet processing engine. As such, the present invention can cache objects with expiry times down to very small intervals of time. This characteristic is referred to as "invalidation granularity." By providing this fine granularity in expiry time, the cache of the present invention can cache and serve objects that frequently change, sometimes even many times within a second. One technique is to leverage the packet processing timers used by the device of the present invention, which are able to operate at time increments on the order of milliseconds, to permit invalidation or expiry granularity down to 10 ms or less.
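In code, the essential behaviour is an object cache whose entries carry expiry times with millisecond resolution and are invalidated on the first access after expiry. The sketch below is illustrative only; the timer source and data structures are not the appliance's implementation.

```python
# Minimal object cache with millisecond-granularity expiry times.

import time

class TimedCache:
    def __init__(self):
        self.store = {}                       # key -> (value, expires_at)

    def put(self, key, value, ttl_ms):
        self.store[key] = (value, time.monotonic() + ttl_ms / 1000.0)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:    # invalidate on first access past expiry
            del self.store[key]
            return None
        return value

cache = TimedCache()
cache.put("/quotes/ACME", "42.17", ttl_ms=10)  # hypothetical object, valid for 10 ms
print(cache.get("/quotes/ACME"))               # "42.17" if read immediately
time.sleep(0.02)
print(cache.get("/quotes/ACME"))               # None: object has expired
```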

Journal ArticleDOI
TL;DR: This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches that outperforms all previous state-of-the-art techniques.
Abstract: As technology evolves, power dissipation increases and cooling systems become more complex and expensive. There are two main sources of power dissipation in a processor: dynamic power and leakage. Dynamic power has been the most significant factor, but leakage will become increasingly significant in the future. It is predicted that leakage will shortly be the most significant cost, as it grows at about a 5× rate per generation. Thus, reducing leakage is essential for future processor design. Since large caches occupy most of the area, they are one of the leakiest structures in the chip and hence a main source of energy consumption for future processors. This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches. IATAC dynamically adapts the cache size to the program requirements, turning off cache lines whose content is not likely to be reused. Our evaluation shows that this approach outperforms all previous state-of-the-art techniques. IATAC turns off 65% of the cache lines across different L2 cache configurations with a very small performance degradation of around 2%.
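The sketch below captures the decay-style intuition: a line is gated off once it has been idle for longer than a per-line prediction of its reuse interval. The way the threshold is derived from observed inter-access times here is an assumption for illustration, not IATAC's exact heuristic.

```python
# Illustrative leakage controller: turn off lines whose idle time exceeds a
# margin times their typical (estimated) inter-access interval.

class LeakageController:
    def __init__(self, margin=2.0, default_interval=10000):
        self.lines = {}        # line -> [last_access_cycle, avg_interval, accesses]
        self.margin = margin
        self.default = default_interval

    def on_access(self, line, now):
        last, avg, count = self.lines.get(line, [now, self.default, 0])
        if count > 0:
            avg = 0.5 * avg + 0.5 * (now - last)     # running inter-access time
        self.lines[line] = [now, avg, count + 1]

    def lines_to_turn_off(self, now):
        """Lines idle for longer than margin x their typical reuse interval."""
        victims = [l for l, (last, avg, _) in self.lines.items()
                   if now - last > self.margin * avg]
        for l in victims:
            del self.lines[l]                        # content is lost once gated off
        return victims

ctl = LeakageController()
for t in (0, 100, 200, 300):
    ctl.on_access("A", now=t)       # A shows a short reuse interval
ctl.on_access("B", now=50)          # B accessed once; no history yet
print(ctl.lines_to_turn_off(now=5000))   # ['A']: idle far longer than its reuse interval
```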

Journal Article
TL;DR: In this article, the complexity of finding an optimal placement of objects (or code) in the memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum, was investigated.
Abstract: We investigate the complexity of finding an optimal placement of objects (or code) in memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum. We show that this problem is one of the toughest amongst the interesting NP optimization problems in computer science. In particular, suppose one is given a sequence of memory accesses and one has to place the data in memory so as to minimize the number of cache misses for this sequence. We show that if P ≠ NP, then one cannot efficiently approximate the optimal solution even up to a very liberal approximation ratio. Thus, this problem joins the family of extremely inapproximable optimization problems. Two famous members of this family are minimum graph coloring and maximum clique. In light of this harsh lower bound, only mild approximation ratios can be obtained. We provide an algorithm that can map arbitrary access sequences within such a mild ratio. Next, we study the information loss when compressing the access sequence, keeping only pairwise relations. We show that the reduced information hides the optimal solution and highlights solutions that are far from optimal. Furthermore, we show that even if one restricts his attention to pairwise information, finding a good placement is computationally difficult.

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This work proposes and analyzes a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory, which achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone.
Abstract: The memory system's large and growing contribution to system performance motivates more aggressive approaches to improving its efficiency. We propose and analyze a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory. This scheme simultaneously increases the effective on-chip cache capacity, off-chip bandwidth, and main memory size, while avoiding compression and decompression overheads between levels. Simulations of the SPEC CPU2000 benchmarks using a 1MB cache and 128-byte blocks show an average speedup of 19%, while degrading performance by no more than 5%. The combined scheme achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone. The compressed system generally provides even better performance as the block size is increased to 512 bytes.