Showing papers on "Cache pollution" published in 2008


Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper proposes Thread-Aware Dynamic Insertion Policy (TADIP), an adaptive insertion policy that can take into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.
Abstract: Chip Multiprocessors (CMPs) allow different applications to concurrently execute on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size is greater than the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working set of applications that can co-exist in the cache and preserving some portion of the working set of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed dynamic insertion policy (DIP) is inadequate for shared caches since DIP is unaware of the characteristics of individual applications. We propose Thread-Aware Dynamic Insertion Policy (TADIP) that can take into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs shows that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16%, respectively (on average 14%, 18%, 15%, and 17%), over the baseline LRU policy. The performance benefit of TADIP is 2.6x compared to DIP and 1.3x compared to the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similarly to LRU for LRU-friendly workloads.
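To make the insertion-policy idea concrete, here is a minimal Python sketch of a thread-aware insertion scheme in the spirit of TADIP: per-thread policy counters, trained by misses in a few dedicated leader sets, choose between MRU insertion and mostly-LRU (bimodal) insertion for each thread's follower sets. The leader-set layout, the unsaturated counters, and the bimodal probability are illustrative assumptions, not parameters from the paper.

import random

class SharedCacheTADIP:
    def __init__(self, num_sets=1024, ways=16, num_threads=4, bip_epsilon=1/32):
        self.ways = ways
        self.bip_epsilon = bip_epsilon
        # Each set holds (tag, owner) pairs ordered from MRU to LRU.
        self.sets = [[] for _ in range(num_sets)]
        # One policy selector per thread (a plain signed counter here for simplicity).
        self.psel = [0] * num_threads
        # A few dedicated leader sets per thread (assumed layout).
        self.lru_leaders = {t: set(range(t, num_sets, 64)) for t in range(num_threads)}
        self.bip_leaders = {t: set(range(t + 32, num_sets, 64)) for t in range(num_threads)}

    def access(self, thread, addr):
        idx, tag = addr % len(self.sets), addr // len(self.sets)
        s = self.sets[idx]
        for i, (btag, owner) in enumerate(s):
            if btag == tag:
                s.insert(0, s.pop(i))            # hit: promote to MRU
                return True
        # Miss: leader sets train the thread's selector and keep their fixed policy.
        if idx in self.lru_leaders[thread]:
            self.psel[thread] += 1               # LRU-insertion leader missed
            use_bimodal = False
        elif idx in self.bip_leaders[thread]:
            self.psel[thread] -= 1               # bimodal-insertion leader missed
            use_bimodal = True
        else:
            use_bimodal = self.psel[thread] > 0  # followers copy the better-performing policy
        if len(s) >= self.ways:
            s.pop()                              # evict the LRU block
        if use_bimodal and random.random() > self.bip_epsilon:
            s.append((tag, thread))              # insert at LRU: leaves quickly unless reused
        else:
            s.insert(0, (tag, thread))           # insert at MRU: classic LRU behavior
        return False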

321 citations


Journal ArticleDOI
01 Jun 2008
TL;DR: Two architectural techniques are proposed that enable microprocessor caches (L1 and L2) to operate at low voltages despite very high memory cell failure rates, and that enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%.
Abstract: One of the most effective techniques to reduce a processor's power consumption is to reduce supply voltage. However, reducing voltage in the context of manufacturing-induced parameter variations can cause many types of memory circuits to fail. As a result, voltage scaling is limited by a minimum voltage, often called Vccmin, beyond which circuits may not operate reliably. Large memory structures (e.g., caches) typically set Vccmin for the whole processor. In this paper, we propose two architectural techniques that enable microprocessor caches (L1 and L2) to operate at low voltages despite very high memory cell failure rates. The Word-disable scheme combines two consecutive cache lines to form a single cache line where only non-failing words are used. The Bit-fix scheme uses a quarter of the ways in a cache set to store positions and fix bits for failing bits in other ways of the set. During high-voltage operation, both schemes allow use of the entire cache. During low-voltage operation, they sacrifice cache capacity by 50% and 25%, respectively, to reduce Vccmin below 500mV. Compared to current designs with a Vccmin of 825mV, our schemes enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%.
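As a rough illustration of the word-disable idea (not the paper's exact circuit or word mapping), the sketch below pairs two consecutive physical lines, keeps a map of which words passed low-voltage test, and assembles one logical line from the surviving words; the 8-word line size and the defect maps are made-up values.

WORDS_PER_LINE = 8   # assumed logical line width for the example

def build_word_map(fail_a, fail_b):
    """Pick one working physical word (line, index) for each logical word,
    or return None if the pair cannot supply enough working words."""
    good = [(0, i) for i in range(WORDS_PER_LINE) if not fail_a[i]] + \
           [(1, i) for i in range(WORDS_PER_LINE) if not fail_b[i]]
    if len(good) < WORDS_PER_LINE:
        return None                      # pair unusable at this voltage
    return good[:WORDS_PER_LINE]

def read_logical_line(line_a, line_b, word_map):
    physical = (line_a, line_b)
    return [physical[which][idx] for which, idx in word_map]

# Example defect maps: words 1 and 6 fail in line A, word 3 fails in line B.
fail_a = [False, True, False, False, False, False, True, False]
fail_b = [False, False, False, True, False, False, False, False]
wmap = build_word_map(fail_a, fail_b)
line_a = [f"A{i}" for i in range(WORDS_PER_LINE)]
line_b = [f"B{i}" for i in range(WORDS_PER_LINE)]
print(read_logical_line(line_a, line_b, wmap))   # one logical line from two physical lines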

289 citations


Proceedings ArticleDOI
08 Nov 2008
TL;DR: The results show that the proposed cache architecture has low miss rates comparable to a highly associative cache and short access times and power efficiency close to that of a direct-mapped cache, and can thwart cache-based software side-channel attacks, providing both legacy and security-enhanced software a much higher degree of security.
Abstract: Caches ideally should have low miss rates and short access times, and should be power efficient at the same time. Such design goals are often contradictory in practice. Recent findings on efficient attacks based on information leakage in caches have also brought the security issue up front. Design for security introduces even more restrictions and typically leads to significant performance degradation. This paper presents a novel cache architecture that can simultaneously achieve the above goals. Specifically, cache miss rates are reduced with dynamic remapping and longer cache indices, access-time overhead overcome with astute low-level circuit design, and information leakage thwarted by a security-aware cache replacement algorithm together with the performance enhancing mechanisms. We present both theoretical analysis and experimental results, using the SPEC2000 suite to evaluate the cache miss behavior, and CACTI and HSPICE to validate the circuit design. Our results show that the proposed cache architecture has low miss rates comparable to a highly associative cache and short access times and power efficiency close to that of a direct-mapped cache. At the same time it can thwart cache-based software side-channel attacks, providing both legacy and security-enhanced software a much higher degree of security. Additional benefits that the proposed cache architecture can bring, like fault tolerance and hot-spot mitigation, are also discussed briefly.

278 citations


Journal ArticleDOI
TL;DR: A new counter-based approach to deal with cache pollution is proposed: lines that have become dead are predicted and replaced early from the L2 cache, and never-reaccessed lines are identified and made to bypass the L2; to this end, each L2 line is augmented with an event counter that is incremented when an event of interest, such as certain cache accesses, occurs.
Abstract: Recent studies have shown that, in highly associative caches, the performance gap between the least recently used (LRU) and the theoretical optimal replacement algorithms is large, motivating the design of alternative replacement algorithms to improve cache performance. In LRU replacement, a line, after its last use, remains in the cache for a long time until it becomes the LRU line. Such dead lines unnecessarily reduce the cache capacity available for other lines. In addition, in multilevel caches, temporal reuse patterns are often inverted, showing in the L1 cache but, due to the filtering effect of the L1 cache, not showing in the L2 cache. At the L2, these lines appear to be brought into the cache but are never reaccessed until they are replaced. These lines unnecessarily pollute the L2 cache. This paper proposes a new counter-based approach to deal with the above problems. For the former problem, we predict lines that have become dead and replace them early from the L2 cache. For the latter problem, we identify never-reaccessed lines, bypass the L2 cache, and place them directly in the L1 cache. Both techniques are achieved through a single counter-based mechanism. In our approach, each line in the L2 cache is augmented with an event counter that is incremented when an event of interest, such as certain cache accesses, occurs. When the counter reaches a threshold, the line "expires" and becomes replaceable. Each line's threshold is unique and is dynamically learned. We propose and evaluate two new replacement algorithms: access interval predictor (AIP) and live-time predictor (LvP). AIP and LvP speed up 10 capacity-constrained SPEC2000 benchmarks by up to 48 percent and 15 percent on average (7 percent on average for the whole 21 SPEC2000 benchmarks). Cache bypassing further reduces L2 cache pollution and improves the average speedups to 17 percent (8 percent for the whole 21 SPEC2000 benchmarks).
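A minimal sketch of the counter-and-threshold mechanism follows (one set only, in Python): each line counts accesses to its set since its own last use, the threshold is crudely learned from the line's previous access interval, and an "expired" line is preferred as the victim. The threshold-learning rule and default values are assumptions for illustration, not the AIP/LvP algorithms themselves.

class CounterExpiringSet:
    def __init__(self, ways=8):
        self.ways = ways
        self.lines = {}       # tag -> {"count": events since last use, "threshold": expiry}
        self.learned = {}     # tag -> last observed access interval (crude history)

    def access(self, tag):
        hit = tag in self.lines
        if hit:
            line = self.lines[tag]
            self.learned[tag] = max(1, line["count"])   # remember how long we waited
            line["threshold"] = self.learned[tag]       # reuse it as the new expiry threshold
            line["count"] = 0
        else:
            if len(self.lines) >= self.ways:
                self._evict()
            self.lines[tag] = {"count": 0, "threshold": self.learned.get(tag, 4)}
        for other, line in self.lines.items():          # every set access ages the other lines
            if other != tag:
                line["count"] += 1
        return hit

    def _evict(self):
        # Prefer a line whose counter has passed its threshold (it has "expired").
        expired = [t for t, l in self.lines.items() if l["count"] >= l["threshold"]]
        victim = expired[0] if expired else max(self.lines, key=lambda t: self.lines[t]["count"])
        del self.lines[victim]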

230 citations


Proceedings ArticleDOI
08 Nov 2008
TL;DR: This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block, and evaluates three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching.
Abstract: Data caches in general-purpose microprocessors often contain mostly dead blocks and are thus used inefficiently. To improve cache efficiency, dead blocks should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed; however, these schemes yield lower prediction accuracy and coverage. Instead, we find that predicting the death of a block when it just moves out of the MRU position gives the best tradeoff between timeliness and prediction accuracy/coverage. Furthermore, the individual reference history of a block in the L1 cache can be irregular because of data/control dependence. This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block. A cache burst begins when a block becomes MRU and ends when it becomes non-MRU. Cache bursts are more predictable than individual references because they hide the irregularity of individual references. When used at the L1 cache, the best burst-based predictor can identify 96% of the dead blocks with a 96% accuracy. With the improved dead-block predictors, we evaluate three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching. The most effective approach, prefetching into dead blocks, increases the average L1 efficiency from 8% to 17% and the L2 efficiency from 17% to 27%. This increased cache efficiency translates into higher overall performance: prefetching into dead blocks outperforms the same prefetch scheme without dead-block prediction by 12% at the L1 and by 13% at the L2.
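The sketch below illustrates the burst notion for a single set: consecutive hits while a block stays in the MRU position count as one burst, and a block is predicted dead once it has received as many bursts as it did in its previous lifetime. Keeping the history per tag (rather than per program-counter signature, as real predictors do) is a simplification.

class BurstTrackerSet:
    def __init__(self, ways=8):
        self.ways = ways
        self.stack = []                 # tags in MRU ... LRU order
        self.bursts = {}                # tag -> bursts seen in the current lifetime
        self.predicted_bursts = {}      # tag -> bursts observed in the previous lifetime

    def access(self, tag):
        if self.stack and self.stack[0] == tag:
            return True                 # still MRU: same burst, nothing new to count
        if tag in self.stack:
            self.stack.remove(tag)
            hit = True
        else:
            if len(self.stack) >= self.ways:
                victim = self.stack.pop()
                self.predicted_bursts[victim] = self.bursts.pop(victim, 1)
            self.bursts[tag] = 0
            hit = False
        self.stack.insert(0, tag)       # becoming MRU starts a new burst
        self.bursts[tag] += 1
        return hit

    def is_predicted_dead(self, tag):
        # Dead once the block has used up the bursts it received last time around.
        expected = self.predicted_bursts.get(tag)
        return expected is not None and self.bursts.get(tag, 0) >= expected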

196 citations


Patent
14 Jul 2008
TL;DR: Cache delete priority assignment is performed from the position where a user finished playback, based on whether the user intends to view the content later, and a cache delete inhibit span is determined based on the playback stop position or the normal-speed playback time.
Abstract: In a cache control assuming plural user terminals accessing identical content, cache delete priority assignment is performed from the position where a user finished playback, based on whether the user intends to view the content later. A cache control server is provided, and a cache delete inhibit span is determined based on a playback stop position or a normal-speed playback time. A cache server deletes the cache based on the delete inhibit span received from the cache control server. Core-network traffic due to re-caching can thus be reduced.

162 citations


Patent
28 Jul 2008
TL;DR: In this article, a method, system and program are described for accelerating data storage in a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files in a multi-rank flash DIMM cache memory by pipelining multiple page write and page program operations to different flash memory ranks, thereby improving write speeds.
Abstract: A method, system and program are disclosed for accelerating data storage in a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files in a multi-rank flash DIMM cache memory by pipelining multiple page write and page program operations to different flash memory ranks, thereby improving write speeds to the flash DIMM cache memory.

132 citations


Proceedings ArticleDOI
08 Nov 2008
TL;DR: ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3 GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.
Abstract: It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reusable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, requiring no change to underlying hardware and no change to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring based technique, by dedicating a small slice of the last-level cache to store non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3 GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.
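The page-coloring arithmetic behind a pollute buffer is easy to sketch: the set-index bits of the last-level cache that lie above the page offset define a page color, and dedicating one color confines whatever is mapped there to a small slice of the cache. The cache geometry, the single reserved color, and the miss-rate cutoff below are assumed values, not the paper's configuration.

CACHE_SIZE = 1 << 21          # 2 MiB last-level cache (assumed)
WAYS       = 8
LINE_SIZE  = 64
PAGE_SIZE  = 4096
SETS       = CACHE_SIZE // (WAYS * LINE_SIZE)     # 4096 sets
COLORS     = (SETS * LINE_SIZE) // PAGE_SIZE      # 64 page colors
POLLUTE_COLOR = COLORS - 1    # dedicate one color, i.e. 1/64 of the cache

def choose_color(page_miss_rate, free_frames_by_color):
    """Pick a physical-page color: pages profiled as cache-unfriendly go to the
    pollute slice, everything else to the normal color with the most free frames."""
    if page_miss_rate > 0.8:                      # assumed "non-reusable" cutoff
        return POLLUTE_COLOR
    normal = {c: n for c, n in free_frames_by_color.items() if c != POLLUTE_COLOR}
    return max(normal, key=normal.get)

free = {c: 100 for c in range(COLORS)}
print(choose_color(0.95, free), choose_color(0.10, free))   # pollute color vs. a normal color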

127 citations


Patent
05 Aug 2008
TL;DR: In this article, the authors dynamically analyze lookup requests from a cache look-up algorithm to look up data block tags corresponding to blocks of data previously inserted into a cache memory, to determine a cache related parameter.
Abstract: The present invention includes dynamically analyzing look-up requests from a cache look-up algorithm to look up data block tags corresponding to blocks of data previously inserted into a cache memory, to determine a cache-related parameter. After analysis of a specific look-up request, a block of data corresponding to the tag looked up by the look-up request may be accessed from the cache memory or from a mass storage device.

111 citations


Proceedings ArticleDOI
25 Aug 2008
TL;DR: In this work, the cache partitioning problem is presented as an optimization problem whose solution sets the size of each cache partition and assigns tasks to partitions such that system worst-case utilization is minimized, thus increasing real-time schedulability.
Abstract: Cache partitioning techniques have been proposed in the past as a solution for the cache interference problem. Due to qualitative differences with general-purpose platforms, real-time embedded systems need to minimize task real-time utilization (a function of execution time and period) instead of only minimizing the number of cache misses. In this work, the partitioning problem is presented as an optimization problem whose solution sets the size of each cache partition and assigns tasks to partitions such that system worst-case utilization is minimized, thus increasing real-time schedulability. Since the problem is NP-hard, a genetic algorithm is presented to find a near-optimal solution. A case study and experiments show that in a typical real-time embedded system, the proposed algorithm is able to reduce the worst-case utilization by 15% (on average) compared to the case when the system uses a shared cache or a proportional cache-partitioned environment.
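For concreteness, the fitness function that such a genetic algorithm would minimize can be sketched as below: each candidate solution fixes the partition sizes and a task-to-partition assignment, and its score is the worst-case utilization, the sum of WCET_i/T_i where each task's WCET depends on the cache it receives. The WCET-versus-cache-size curves here are invented placeholders, not data from the paper.

def worst_case_utilization(tasks, partition_sizes, assignment):
    """tasks: list of dicts with 'period' and 'wcet', a function of cache size in KB.
    partition_sizes: list of partition sizes in KB.
    assignment: task index -> partition index."""
    total = 0.0
    for i, task in enumerate(tasks):
        cache_kb = partition_sizes[assignment[i]]
        total += task["wcet"](cache_kb) / task["period"]
    return total            # schedulability improves as this value shrinks

# Hypothetical tasks: WCET falls as the partition grows, then flattens out.
tasks = [
    {"period": 10.0, "wcet": lambda kb: 4.0 + 8.0 / (1 + kb / 32)},
    {"period": 20.0, "wcet": lambda kb: 6.0 + 2.0 / (1 + kb / 8)},
    {"period": 50.0, "wcet": lambda kb: 9.0 + 30.0 / (1 + kb / 128)},
]
# Two candidate solutions a genetic algorithm might compare:
print(worst_case_utilization(tasks, [128, 128], [0, 1, 1]))
print(worst_case_utilization(tasks, [192, 64],  [0, 1, 0]))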

109 citations


Patent
16 Jan 2008
TL;DR: In this paper, a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies provides low-latency access and redundancy in responding to both read and write requests for cached files, thereby improving access time to the data stored on the disk-based NAS filer (group).
Abstract: A method, system and program are disclosed for accelerating data storage by providing non-disruptive storage caching using clustered cache appliances with packet inspection intelligence. A cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies provides low-latency access and redundancy in responding to both read and write requests for cached files, thereby improving access time to the data stored on the disk-based NAS filer (group).

Patent
30 May 2008
TL;DR: Various technologies and techniques are disclosed for providing a bounded transactional memory application that accesses cache metadata in a cache of a central processing unit; the application can also interrogate a cache line metadata eviction summary to determine whether a transaction is doomed and then take an appropriate action.
Abstract: Various technologies and techniques are disclosed for providing a bounded transactional memory application that accesses cache metadata in a cache of a central processing unit. When performing a transactional read from the bounded transactional memory application, a cache line metadata transaction-read bit is set. When performing a transactional write from the bounded transactional memory application, a cache line metadata transaction-write bit is set and a conditional store is performed. At commit time, if any lines marked with the transaction-read bit or the transaction-write bit were evicted or invalidated, all speculatively written lines are discarded. The application can also interrogate a cache line metadata eviction summary to determine whether a transaction is doomed and then take an appropriate action.

Proceedings ArticleDOI
30 Nov 2008
TL;DR: This paper proposes a safe static instruction cache analysis method for multi-level non-inclusive caches, and shows that in all cases WCET estimations are much tighter when considering the cache hierarchy than when considering only the L1 cache.
Abstract: With the advent of increasingly complex hardware in real-time embedded systems (processors with performance enhancing features such as pipelines, cache hierarchy, multiple cores), many processors now have a set-associative L2 cache. Thus, there is a need for considering cache hierarchies when validating the temporal behavior of real-time systems, in particular when estimating tasks' worst-case execution times (WCETs). In this paper, we propose a safe static instruction cache analysis method for multi-level non-inclusive caches. The proposed method is experimented on medium-size and large programs. We show that the method is reasonably tight. We further show that in all cases WCET estimations are much tighter when considering the cache hierarchy than when considering only the L1 cache. An evaluation of the analysis time is conducted, demonstrating that analyzing the cache hierarchy has a reasonable computation time.

Proceedings ArticleDOI
31 Oct 2008
TL;DR: This paper analyzes two new cache designs to defeat cache-based side channel attacks by eliminating/obfuscating cache interferences and identifies significant vulnerabilities and shortcomings.
Abstract: Software cache-based side channel attacks present a serious threat to computer systems. Previously proposed countermeasures were either too costly for practical use or only effective against particular attacks. Thus, a recent work identified cache interferences in general as the root cause and proposed two new cache designs, namely partition-locked cache (PLcache) and random permutation cache (RPcache), to defeat cache-based side channel attacks by eliminating/obfuscating cache interferences. In this paper, we analyze these new cache designs and identify significant vulnerabilities and shortcomings of those new cache designs. We also propose possible solutions and improvements over the original new cache designs to overcome the identified shortcomings.

Patent
16 Jan 2008
TL;DR: In this paper, a method, system and program are disclosed for accelerating data storage in a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies.
Abstract: A method, system and program are disclosed for accelerating data storage in a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies which populate the storage cache using behavioral adaptive policies that are based on analysis of client-filer transaction patterns and network utilization, thereby improving access time to the data stored on the disk-based NAS filer (group) for predetermined applications.

Journal ArticleDOI
TL;DR: This article views "retention time" in a new way by using statistical populations more appropriate for caches, and suggests uses of a cache's inherent error-control mechanisms to reduce refresh rates by several orders of magnitude.
Abstract: Caches use data very differently than main memory does, so DRAM caches can have dramatically different refresh requirements. Making canonical assumptions about retention times in DRAM can be drastic overkill within the cache context. Using standard refresh rates may be unnecessary, and can be a significant waste of cache utilization and power. In this article, we view "retention time" in a new way by using statistical populations more appropriate for caches, and we suggest uses of a cache's inherent error-control mechanisms to reduce refresh rates by several orders of magnitude.

Patent
Naoki Moritoki
02 Jan 2008
TL;DR: A storage apparatus and its data management method are presented that can prevent the loss of data retained in a volatile cache memory even during an unexpected power shutdown; the storage apparatus includes a cache memory configured from volatile and nonvolatile memory.
Abstract: Provided are a storage apparatus and its data management method capable of preventing the loss of data retained in a volatile cache memory even during an unexpected power shutdown. This storage apparatus includes a cache memory configured from a volatile and nonvolatile memory. The volatile cache memory caches data according to a write request from a host system and data staged from a disk drive, and the nonvolatile cache memory only caches data staged from a disk drive. Upon an unexpected power shutdown, the storage apparatus immediately backs up the dirty data and other information cached in the volatile cache memory to the nonvolatile cache memory.

Proceedings Article
Binny S. Gill
26 Feb 2008
TL;DR: This work proposes a dramatically better performing alternative called PROMOTE, which provides exclusive caching in multi-level cache hierarchies without demotions or any of the overheads inherent in DEMOTE, and discovers theoretical bounds for optimal multi-level cache performance.
Abstract: Multi-level cache hierarchies have become very common; however, most cache management policies result in duplicating the same data redundantly on multiple levels. The state-of-the-art exclusive caching techniques used to mitigate this wastage in multi-level cache hierarchies are the DEMOTE technique and its variants. While these achieve good hit ratios, they suffer from significant I/O and computational overheads, making them unsuitable for deployment in real-life systems. We propose a dramatically better performing alternative called PROMOTE, which provides exclusive caching in multi-level cache hierarchies without demotions or any of the overheads inherent in DEMOTE. PROMOTE uses an adaptive probabilistic filtering technique to decide which pages to "promote" to caches closer to the application. While both DEMOTE and PROMOTE provide the same aggregate hit ratios, PROMOTE achieves more hits in the highest cache levels leading to better response times. When inter-cache bandwidth is limited, PROMOTE convincingly outperforms DEMOTE by being 2x more efficient in bandwidth usage. For example, in a trace from a real-life scenario, PROMOTE provided an average response time of 3.42ms as compared to 5.05ms for DEMOTE on a two-level hierarchy of LRU caches, and 5.93ms as compared to 7.57ms on a three-level cache hierarchy. We also discover theoretical bounds for optimal multi-level cache performance. We devise two offline policies, called OPT-UB and OPT-LB, that provably serve as upper and lower bounds on the theoretically optimal performance of multi-level cache hierarchies. For a series of experiments on a wide gamut of traces and cache sizes, OPT-UB and OPT-LB ran within 2.18% and 2.83% of each other for two-cache and three-cache hierarchies, respectively. These close bounds will help evaluate algorithms and guide future improvements in the field of multi-level caching.
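A stripped-down sketch of probabilistic promotion between two read caches follows: on a miss in the upper cache, the page is "promoted" into it only with some probability, otherwise it is kept in the lower cache, so each page tends to live at a single level without explicit demotions. The fixed promotion probability and plain LRU lists are simplifications; PROMOTE adapts the probability online.

import random
from collections import OrderedDict

class ReadCache:
    def __init__(self, size):
        self.size = size
        self.pages = OrderedDict()          # LRU order: oldest first

    def hit(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)
            return True
        return False

    def insert(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)
            return
        if len(self.pages) >= self.size:
            self.pages.popitem(last=False)
        self.pages[page] = True

def read(page, upper, lower, promote_prob=0.5):
    if upper.hit(page):
        return "upper hit"
    status = "lower hit" if lower.hit(page) else "disk read"
    if random.random() < promote_prob:
        upper.insert(page)                  # promoted toward the application
        lower.pages.pop(page, None)         # keep the two levels exclusive
    else:
        lower.insert(page)                  # cached only at the lower level
    return status

upper, lower = ReadCache(4), ReadCache(16)
for p in [1, 2, 3, 1, 2, 4, 5, 1]:
    print(p, read(p, upper, lower))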

Proceedings ArticleDOI
04 May 2008
TL;DR: The experimental results show that the PRAM based cache architectures achieve close to 80% reduction in the leakage energy consumption of a L1-L2 cache hierarchy.
Abstract: Sub-threshold leakage in SRAM based cache memories is becoming a predominant source of power consumption in deep-sub micron CMOS designs. Phase Change Random Access Memory (PRAM), a high density, fast access, non-volatile memory is being considered as a candidate for future universal memory technologies. In this paper, we investigate the architectural challenges in integrating a PRAM based memory into the conventional cache hierarchy. First, we develop PRAM cache delay and energy models. We then propose a hybrid PRAM architecture for L1 instruction caches on embedded processors. We also propose a PRAM based unified cache architecture for L2 caches on high-end microprocessors. Finally, we evaluate the proposed architectures, in terms of area, performance, and energy. The experimental results show that the PRAM based cache architectures achieve close to 80% reduction in the leakage energy consumption of a L1-L2 cache hierarchy.

Patent
06 Jun 2008
TL;DR: In this article, a shared code caching engine receives native code comprising at least a portion of a single module of the application program, and stores runtime data corresponding to the native code in a cache data file in the non-volatile memory.
Abstract: Computer code from an application program comprising a plurality of modules that each comprise a separately loadable file is code cached in a shared and persistent caching system. A shared code caching engine receives native code comprising at least a portion of a single module of the application program, and stores runtime data corresponding to the native code in a cache data file in the non-volatile memory. The engine then converts the cache data file into a code cache file and enables the code cache file to be pre-loaded as a runtime code cache. These steps are repeated to store a plurality of separate code cache files at different locations in non-volatile memory.

Proceedings ArticleDOI
06 Apr 2008
TL;DR: This work proposes a method to prefetch irregular references accessed through a software cache that is built upon hardware such as Cell, and finds that when applicable, this prefetching can improve the performance of some benchmarks by 2 times on average, and by close to 4 times in the best case.
Abstract: The IBM Single Source Research Compiler for the Cell processor (the SSC Research Compiler) was developed to manage the complexity of programming the heterogeneous multicore Cell processor. The compiler accepts conventional source programs as input, and automatically generates binaries that execute on both the PPU and SPU cores available on a Cell chip. The compiler uses a software cache and direct buffers to manage data in the small local memory of SPUs. However, irregular references, such as a[ind[i]], often become performance bottlenecks. These references are accessed through the software cache, usually with high miss rates. To solve this problem, we propose a method to prefetch irregular references accessed through a software cache that is built upon hardware such as Cell. This method includes code transformation in the compiler and a runtime library component for the software cache. Our design simplifies the synchronization required when prefetching into a software cache, overlaps DMA operations for misses, and avoids frequent context switching to the miss handler. It also minimizes the cache pollution caused by prefetching, by looking both forwards and backwards through the sequence of addresses to be prefetched. We evaluated our prefetching method using the NAS benchmarks. We found that when applicable, our prefetching can improve the performance of some benchmarks by 2 times on average, and by close to 4 times in the best case. We also present data to show the impact of different configurations and optimizations when prefetching in a software cache.
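A simplified sketch of the idea (in Python rather than the compiler-generated C/SPU code) is shown below: before a block of iterations touches a[ind[i]], the runtime scans that block's index values and fetches every software-cache line it will need, so the subsequent loads hit. The line size, block size, and the dma_fetch stand-in are assumptions; the real system overlaps asynchronous DMA and avoids miss-handler context switches.

LINE = 16                                  # elements per software-cache line (assumed)

def dma_fetch(backing, line_no):
    lo = line_no * LINE                    # stands in for an asynchronous DMA transfer
    return backing[lo:lo + LINE]

class SoftwareCache:
    def __init__(self):
        self.lines = {}                    # line number -> cached elements

    def line_of(self, i):
        return i // LINE

    def prefetch_block(self, ind, start, count, backing):
        # Look ahead over the whole block of indices; fetch each needed line once.
        needed = {self.line_of(ind[j]) for j in range(start, min(start + count, len(ind)))}
        for ln in needed - self.lines.keys():
            self.lines[ln] = dma_fetch(backing, ln)

    def load(self, backing, i):
        ln = self.line_of(i)
        if ln not in self.lines:           # miss handler (slow path)
            self.lines[ln] = dma_fetch(backing, ln)
        return self.lines[ln][i % LINE]

# Usage: process a[ind[i]] in blocks, prefetching each block's lines up front.
a   = list(range(1000))
ind = [7, 530, 12, 511, 900, 3, 514, 901]
sc  = SoftwareCache()
BLOCK = 4
total = 0
for start in range(0, len(ind), BLOCK):
    sc.prefetch_block(ind, start, BLOCK, a)
    for j in range(start, min(start + BLOCK, len(ind))):
        total += sc.load(a, ind[j])
print(total)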

Proceedings Article
26 Feb 2008
TL;DR: TaP is a storage cache sequential prefetching and caching technique that improves the read-ahead cache hit rate and system response time; its unique feature is the use of a table to detect sequential access patterns in the I/O workload and to dynamically determine the optimum prefetch cache size.
Abstract: TaP is a storage cache sequential prefetching and caching technique to improve the read-ahead cache hit rate and system response time. A unique feature of TaP is the use of a table to detect sequential access patterns in the I/O workload and to dynamically determine the optimum prefetch cache size. When compared to some popular prefetching techniques, TaP gives a better hit rate and response time while using a read cache that is often an order of magnitude smaller than that needed by other techniques. TaP is especially efficient when the I/O workload consists of interleaved requests from various applications, where only some of the applications are accessing their data sequentially. For example, TaP achieves the same hit rate as the other techniques with a cache length that is 100 times smaller than the cache needed by other techniques when the interleaved workload consists of 10% sequential application data and 90% random application data.
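The core of the table idea can be sketched as follows: the detector stores only the next expected address of recent requests, so a request that matches a stored entry exposes a sequential stream and triggers read-ahead, while random requests never occupy the data cache. The table size, eviction rule, and prefetch degree are assumed values; TaP additionally uses the table's statistics to size the prefetch cache dynamically.

from collections import OrderedDict

class SequentialDetector:
    def __init__(self, table_entries=256, prefetch_degree=4):
        self.table = OrderedDict()          # next expected LBA -> observed stream length
        self.table_entries = table_entries
        self.prefetch_degree = prefetch_degree

    def on_request(self, lba, length):
        """Return a list of LBAs to prefetch (empty if no stream is detected)."""
        if lba in self.table:               # this request continues a previous one
            streak = self.table.pop(lba) + 1
            next_lba = lba + length
            self.table[next_lba] = streak
            # Detected a sequential stream: read ahead past the current request.
            return [next_lba + i * length for i in range(self.prefetch_degree)]
        # Unknown request: remember where it ended, evicting the oldest entry if full.
        if len(self.table) >= self.table_entries:
            self.table.popitem(last=False)
        self.table[lba + length] = 1
        return []

det = SequentialDetector()
for lba in (100, 1000, 104, 108, 5000):     # 100, 104, 108 form a stream of 4-block reads
    print(lba, det.on_request(lba, 4))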

Journal ArticleDOI
31 Aug 2008
TL;DR: It is discovered that the miss stream from one single cache is approximated well by the superposition of a number of asymptotically independent renewal processes, which is likely to enable the development of a rigorous analysis of the tandem cache performance.
Abstract: Renewed interest in caching systems stems from their widespread use for reducing the document download latency over the Internet. Since caches are usually organized in a hierarchical manner, it is important to study the performance properties of tandem caches. The first step in understanding this problem is to characterize the miss stream from a single cache since it represents the input to the next-level cache. In this regard, we discover that the miss stream from a single cache is approximated well by the superposition of a number of asymptotically independent renewal processes. Interestingly, when this weakly correlated miss sequence is fed into another cache, this barely observable correlation can lead to measurably different caching performance when compared to the independent reference model. This result is likely to enable the development of a rigorous analysis of the tandem cache performance.

Journal ArticleDOI
01 Aug 2008
TL;DR: This paper introduces Ferdinand, the first proxy-based cooperative query result cache with fully distributed consistency management, and implements a fully functioning Ferdinand prototype and evaluates its performance compared to several alternative query-caching approaches, showing that high cache hit rate and consistency management are both critical for Ferdinand's performance gains over existing systems.
Abstract: The backend database system is often the performance bottleneck when running web applications. A common approach to scale the database component is query result caching, but it faces the challenge of maintaining a high cache hit rate while efficiently ensuring cache consistency as the database is updated. In this paper we introduce Ferdinand, the first proxy-based cooperative query result cache with fully distributed consistency management. To maintain a high cache hit rate, Ferdinand uses both a local query result cache on each proxy server and a distributed cache. Consistency management is implemented with a highly scalable publish/subscribe system. We implement a fully functioning Ferdinand prototype and evaluate its performance compared to several alternative query-caching approaches, showing that our high cache hit rate and consistency management are both critical for Ferdinand's performance gains over existing systems.
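A toy sketch of the consistency scheme: each cached query result subscribes to channels for the tables it read, and every database update publishes an invalidation on the tables it wrote. Mapping whole tables to channels is a simplification (Ferdinand works with finer-grained query/update groups), and the distributed cache layer above the local caches is omitted here.

from collections import defaultdict

class QueryResultCache:
    def __init__(self):
        self.results = {}                        # sql -> cached result rows
        self.subscriptions = defaultdict(set)    # channel (table) -> set of cached sql

    def get(self, sql, tables_read, run_query):
        if sql in self.results:
            return self.results[sql]             # local cache hit
        rows = run_query(sql)                    # miss: go to the database
        self.results[sql] = rows
        for table in tables_read:                # subscribe to every table this query read
            self.subscriptions[table].add(sql)
        return rows

    def on_publish(self, table):
        """Invalidation message received on a table's channel."""
        for sql in self.subscriptions.pop(table, set()):
            self.results.pop(sql, None)

def apply_update(sql_update, tables_written, run_update, publish):
    run_update(sql_update)
    for table in tables_written:                 # notify every subscribed proxy
        publish(table)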

Journal ArticleDOI
TL;DR: This work uses a cycle-based processor simulator, enhanced with the required modifications, in order to evaluate the use of the Cache Decay approach as an aid to guard against cache-based side channel attacks, and shows that the technique can be used effectively to protect against such attacks.
Abstract: Side channel cryptanalysis has received significant attention lately, because it provides a low-cost and facile way to reveal the secret information held on a secure computing system. One particular type of side channel attacks, called cache-based side channel attacks, aims to deduce information about the state of a cryptographic algorithm or its key by observing the data-dependent behavior of a microprocessor's cache memory. These attacks have been proven successful and very hard to protect against. In this paper, we introduce the use of the Cache Decay approach as an aid to guard against cache-based side channel attacks. Cache Decay controls the lifetime (called decay interval) of the cache items and was initially proposed for cache power leakage savings. By randomly selecting the decay interval of the cache, we actually create caches with non-deterministic behavior in regard to their statistics. Thus, as we demonstrate, multiple runs of the same algorithm (performing on the same input) will result in different cache statistics, defending against the attacker and reinforcing the protection offered by the system. In our work, we use a cycle-based processor simulator, enhanced with the required modifications, in order to evaluate our proposal and show that our technique can be used effectively to protect against cache-based side channel attacks.
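A compact sketch of the mechanism for one cache set: a coarse clock ages every line, a line left idle for the whole decay interval is invalidated, and the interval itself is drawn at random per run so that repeated runs on the same input produce different cache statistics. The counter handling, interval range, and eviction rule are illustrative assumptions.

import random

class DecayingCacheSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = {}                                 # tag -> idle ticks
        self.decay_interval = random.randint(16, 64)    # randomized once per run

    def tick(self):
        """Called by the coarse global decay clock."""
        for tag in list(self.lines):
            self.lines[tag] += 1
            if self.lines[tag] >= self.decay_interval:
                del self.lines[tag]                     # decayed: line turned off

    def access(self, tag):
        hit = tag in self.lines
        if not hit and len(self.lines) >= self.ways:
            victim = max(self.lines, key=self.lines.get)   # evict the most idle line
            del self.lines[victim]
        self.lines[tag] = 0                             # reset the idle counter
        return hit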

Proceedings ArticleDOI
25 Oct 2008
TL;DR: A hierarchical, hybrid software-cache architecture is proposed that classifies memory accesses at compile time into two classes, high-locality and irregular, and with the resulting optimizations the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
Abstract: Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies at compile time memory accesses in two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that improvements due to the optimized software-cache structures combined with the proposed code-optimizations translate into 3.5 to 8.4 speedup factors, compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.

Patent
14 Aug 2008
TL;DR: A state machine model of managing cache blocks in a storage controller's cache memory maintains those blocks in a new state until verification is received that they have been successfully stored on the persistent storage media of the affected disk drives.
Abstract: Methods and associated structures for utilizing write-back cache management modes for local cache memory of disk drives coupled to a storage controller while maintaining data integrity of the data transferred to the local cache memories of affected disk drives. In one aspect hereof, a state machine model of managing cache blocks in a storage controller cache memory maintains blocks in the storage controller's cache memory in a new state until verification is sensed that the blocks have been successfully stored on the persistent storage media of the affected disk drives. Responsive to failure or other reset of the disk drive, the written cache blocks may be re-written from the copy maintained in the cache memory of the storage controller. In another aspect, an alternate controller's cache memory may also be used to mirror the cache blocks from the primary storage controller's cache memory as additional data integrity assurance.

Proceedings ArticleDOI
08 Nov 2008
TL;DR: It is shown how proper sizing of memory cells can guarantee that memory cell reliability in the near-threshold supply voltage region matches that of a standard memory cell; however, this robustness comes with a significant area cost.
Abstract: Battery life is an important concern for modern embedded processors. Supply voltage scaling techniques can provide an order of magnitude reduction in energy. Current commercial memory technologies have been limited in the degree of supply voltage scaling that can be performed if they are to meet yield and reliability constraints. This has limited designers from exploring the near-threshold operating regions for embedded processors. Summarizing prior work, we show how proper sizing of memory cells can guarantee that the memory cell reliability in the near-threshold supply voltage region matches that of a standard memory cell. However, this robustness comes with a significant area cost. We show how to employ these cells to build cache architectures that greatly reduce energy consumption. We propose an embedded processor based on these new cache architectures that operates in a low power mode, with minimal impact on full-performance runtime. The proposed cache uses near-threshold tolerant cache ways to reduce access energy combined with traditional cache ways to maintain performance. The access policy of the cache ways is then dynamically reconfigured to obtain energy-efficient performance while minimally impacting the high performance mode runtime. Using near-threshold cache architectures we show an energy reduction of 53% over a traditional filter cache. For the MiBench embedded benchmarks we show on average an 86% (7.3x) reduction in energy while in low power (10 MHz) mode with only an average 2% increase in runtime in high performance (400 MHz) mode. And for SpecInt applications we show a 77% (4.4x) reduction in energy in low power mode with only an average 4.8% increase in runtime for high performance mode. In addition we show that these trends hold from 130 nm to 45 nm technology nodes.

Patent
19 Feb 2008
TL;DR: In this article, the authors propose an apparatus and method to dynamically allocate cache in a SAN controller between a first fixed cache comprising traditional RAID cache comprised of RAM and a second scalable RAID cache comprising of SSDs (Solid State Devices).
Abstract: An apparatus and method to dynamically allocate cache in a SAN controller between a first fixed cache comprising traditional RAID cache comprised of RAM and a second, scalable RAID cache comprised of SSDs (Solid State Devices). The method is dynamic and switches between the first and second cache depending on I/O demand.

Patent
Daniel Ellard
18 Mar 2008
TL;DR: A network storage server includes a main buffer cache to buffer writes requested by clients before committing them to primary persistent storage, and a secondary cache, implemented as low-cost, solid-state memory such as flash memory, to store data evicted from the main buffer cache or read from the primary persistent storage.
Abstract: A network storage server includes a main buffer cache to buffer writes requested by clients before committing them to primary persistent storage. The server further uses a secondary cache, implemented as low-cost, solid-state memory, such as flash memory, to store data evicted from the main buffer cache or data read from the primary persistent storage. To prevent bursts of writes to the secondary cache, data is copied from the main buffer cache to the secondary cache speculatively, before there is a need to evict data from the main buffer cache. Data can be copied to the secondary cache as soon as the data is marked as clean in the main buffer cache. Data can be written to secondary cache at a substantially constant rate, which can be at or close to the maximum write rate of the secondary cache.
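A minimal sketch of the rate-limited, speculative copying described above: buffers are queued for the secondary cache as soon as they are marked clean, and a background loop drains the queue within a fixed byte budget per time slice so writes to the flash arrive at a roughly constant rate. The rate value, queue policy, and function names are assumptions, not the patent's design.

import collections

class SecondaryCacheWriter:
    def __init__(self, max_bytes_per_sec=50 * 1024 * 1024):
        self.queue = collections.deque()
        self.max_bytes_per_sec = max_bytes_per_sec

    def buffer_marked_clean(self, block_id, data):
        # Eligible for the secondary cache immediately; no need to wait for eviction.
        self.queue.append((block_id, data))

    def drain_for(self, seconds, write_to_flash):
        budget = int(self.max_bytes_per_sec * seconds)   # byte budget for this time slice
        while self.queue and budget > 0:
            block_id, data = self.queue.popleft()
            write_to_flash(block_id, data)
            budget -= len(data)

# Usage: a background thread would call drain_for(0.1, flash_write) every 100 ms,
# where flash_write(block_id, data) is whatever routine commits data to the flash device.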