Showing papers on "Cache invalidation published in 2008"

PDF

Open Access

Proceedings Article•DOI•

Adaptive insertion policies for managing shared caches

[...]

Aamer Jaleel¹, William C. Hasenplaugh¹, Moinuddin K. Qureshi², Julien Sebot¹, Simon C. Steely¹, Joel Emer¹ - Show less +2 more•Institutions (2)

Intel¹, IBM²

25 Oct 2008

TL;DR: This paper proposes Thread-Aware Dynamic Insertion Policy (TADIP), a adaptive insertion policy that can take into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.

...read moreread less

Abstract: Chip Multiprocessors (CMPs) allow different applications to concurrently execute on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size is greater than the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working set of applications that can co-exist in the cache and preserving some portion of the working set of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed dynamic insertion policy (DIP) is inadequate for shared caches since DIP is unaware of the characteristics of individual applications. We propose Thread-Aware Dynamic Insertion Policy (TADIP) that can take into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs show that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x compared to DIP and 1.3x compared to the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similar to LRU for LRU friendly workloads.

...read moreread less

321 citations

Proceedings Article•DOI•

A novel cache architecture with enhanced performance and security

[...]

Zhenghong Wang¹, Ruby B. Lee¹•Institutions (1)

Princeton University¹

08 Nov 2008

TL;DR: The results show that the proposed cache architecture has low miss rates comparable to a highly associative cache and short access times and power efficiency close to that of a direct-mapped cache, and can thwart cache-based software side-channel attacks, providing both legacy and security-enhanced software a much higher degree of security.

...read moreread less

Abstract: Caches ideally should have low miss rates and short access times, and should be power efficient at the same time. Such design goals are often contradictory in practice. Recent findings on efficient attacks based on information leakage in caches have also brought the security issue up front. Design for security introduces even more restrictions and typically leads to significant performance degradation. This paper presents a novel cache architecture that can simultaneously achieve the above goals. Specifically, cache miss rates are reduced with dynamic remapping and longer cache indices, access-time overhead overcome with astute low-level circuit design, and information leakage thwarted by a security-aware cache replacement algorithm together with the performance enhancing mechanisms. We present both theoretical analysis and experimental results, using the SPEC2000 suite to evaluate the cache miss behavior, and CACTI and HSPICE to validate the circuit design. Our results show that the proposed cache architecture has low miss rates comparable to a highly associative cache and short access times and power efficiency close to that of a direct-mapped cache. At the same time it can thwart cache-based software side-channel attacks, providing both legacy and security-enhanced software a much higher degree of security. Additional benefits that the proposed cache architecture can bring, like fault tolerance and hot-spot mitigation, are also discussed briefly.

...read moreread less

278 citations

Journal Article•DOI•

Counter-Based Cache Replacement and Bypassing Algorithms

[...]

Mazen Kharbutli¹, Yan Solihin²•Institutions (2)

Jordan University of Science and Technology¹, North Carolina State University²

01 Apr 2008-IEEE Transactions on Computers

TL;DR: A new counter-based approach to deal with cache pollution, predicting lines that have become dead and replacing them early from the L2 cache and identifying never-reaccessed lines, which is augmented with an event counter that is incremented when an event of interest such as certain cache accesses occurs.

...read moreread less

Abstract: Recent studies have shown that, in highly associative caches, the performance gap between the least recently used (LRU) and the theoretical optimal replacement algorithms is large, motivating the design of alternative replacement algorithms to improve cache performance. In LRU replacement, a line, after its last use, remains in the cache for a long time until it becomes the LRU line. Such deadlines unnecessarily reduce the cache capacity available for other lines. In addition, in multilevel caches, temporal reuse patterns are often inverted, showing in the L1 cache but, due to the filtering effect of the L1 cache, not showing in the L2 cache. At the L2, these lines appear to be brought in the cache but are never reaccessed until they are replaced. These lines unnecessarily pollute the L2 cache. This paper proposes a new counter-based approach to deal with the above problems. For the former problem, we predict lines that have become dead and replace them early from the L2 cache. For the latter problem, we identify never-reaccessed lines, bypass the L2 cache, and place them directly in the L1 cache. Both techniques are achieved through a single counter-based mechanism. In our approach, each line in the L2 cache is augmented with an event counter that is incremented when an event of interest such as certain cache accesses occurs. When the counter reaches a threshold, the line ";expires"; and becomes replaceable. Each line's threshold is unique and is dynamically learned. We propose and evaluate two new replacement algorithms: Access interval predictor (AIP) and live-time predictor (LvP). AIP and LvP speed up 10 capacity-constrained SPEC2000 benchmarks by up to 48 percent and 15 percent on average (7 percent on average for the whole 21 Spec2000 benchmarks). Cache bypassing further reduces L2 cache pollution and improves the average speedups to 17 percent (8 percent for the whole 21 Spec2000 benchmarks).

...read moreread less

230 citations

Patent•

Targeted Caching to Reduce Bandwidth Consumption

[...]

Alan L. Glasser¹•Institutions (1)

AT&T¹

27 Aug 2008

TL;DR: In this article, a system includes a name server, an edge cache server and a local cache server, which is configured to respond to the anycast IP address and a unicast IP address.

...read moreread less

Abstract: A system includes a name server, an edge cache server, and a local cache server. The name server is configured to provide an anycast IP address in response to a request for an IP address of an origin hostname from a client system. The edge cache server is configured to respond to the anycast IP address and a unicast IP address and to retrieve content from an origin. The local cache server includes a storage and is configured to respond to the anycast IP address, to retrieve content from the edge cache server, and provide the content to a client system.

...read moreread less

221 citations

Proceedings Article•DOI•

Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency

[...]

Haiming Liu¹, Michael Ferdman², Jaehyuk Huh³, Doug Burger⁴•Institutions (4)

University of Texas at Austin¹, Carnegie Mellon University², Advanced Micro Devices³, Microsoft⁴

08 Nov 2008

TL;DR: This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block, and evaluates three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching.

...read moreread less

Abstract: Data caches in general-purpose microprocessors often contain mostly dead blocks and are thus used inefficiently. To improve cache efficiency, dead blocks should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed; however, these schemes yield lower prediction accuracy and coverage. Instead, we find that predicting the death of a block when it just moves out of the MRU position gives the best tradeoff between timeliness and prediction accuracy/coverage. Furthermore, the individual reference history of a block in the L1 cache can be irregular because of data/control dependence. This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block. A cache burst begins when a block becomes MRU and ends when it becomes non-MRU. Cache bursts are more predictable than individual references because they hide the irregularity of individual references. When used at the L1 cache, the best burst-based predictor can identify 96% of the dead blocks with a 96% accuracy. With the improved dead-block predictors, we evaluate three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching. The most effective approach, prefetching into dead blocks, increases the average L1 efficiency from 8% to 17% and the L2 efficiency from 17% to 27%. This increased cache efficiency translates into higher overall performance: prefetching into dead blocks outperforms the same prefetch scheme without dead-block prediction by 12% at the L1 and by 13% at the L2.

...read moreread less

196 citations

Patent•

Content delivery system, cache server, and cache control server

[...]

Kenji Fujihira¹, Daisuke Matsubara¹, Yukiko Takeda¹•Institutions (1)

Hitachi¹

14 Jul 2008

TL;DR: In this paper, cache delete priority assignment is performed from a position where a user finished playback based on whether the user intends to view the content later, and a cache delete inhibit span is determined based on a playback stop position or a normal speed playback time.

...read moreread less

Abstract: In a cache control assuming plural user terminals accessing identical content, cache delete priority assignment is performed from a position where a user finished playback based on whether the user intends to view the content later. A cache control server is provided, and a cache delete inhibit span is determined based on a playback stop position or a normal speed playback time. A cache server deletes the cache based on the delete inhibit span received from the cache control server. Traffic of the core network due to re-cache can thus be reduced.

...read moreread less

162 citations

Proceedings Article•

Provably good multicore cache performance for divide-and-conquer algorithms

[...]

Guy E. Blelloch¹, Rezaul Chowdhury², Phillip B. Gibbons³, Vijaya Ramachandran², Shimin Chen³, Michael Kozuch³ - Show less +2 more•Institutions (3)

Carnegie Mellon University¹, University of Texas at Austin², Intel³

20 Jan 2008

TL;DR: It is shown that a separator-based algorithm for sparse-matrix-dense-vector-multiply achieves provably good cache performance in the multicore-cache model, as well as in the well-studied sequential cache-oblivious model.

...read moreread less

Abstract: This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divide-and-conquer algorithms and present a new on-line scheduler, CONTROLLED-PDF, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicore-cache model under our new scheduler is within a constant factor of the sequential cache complexity for both L1 and L2, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptotically-optimal results for any multicore model. Finally, we show that a separator-based algorithm for sparse-matrix-dense-vector-multiply achieves provably good cache performance in the multicore-cache model, as well as in the well-studied sequential cache-oblivious model.

...read moreread less

127 citations

Proceedings Article•DOI•

Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer

[...]

Livio Soares¹, David Tam¹, Michael Stumm¹•Institutions (1)

University of Toronto¹

08 Nov 2008

TL;DR: ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3 GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.

...read moreread less

Abstract: It is well recognized that LRU cache-line replacement can be ineffective for applications with large working sets or non-localized memory access patterns. Specifically, in last-level processor caches, LRU can cause cache pollution by inserting non-reuseable elements into the cache while evicting reusable ones. The work presented in this paper addresses last-level cache pollution through a dynamic operating system mechanism, called ROCS, requiring no change to underlying hardware and no change to applications. ROCS employs hardware performance counters on a commodity processor to characterize application cache behavior at run-time. Using this online profiling, cache unfriendly pages are dynamically mapped to a pollute buffer in the cache, eliminating competition between reusable and non-reusable cache lines. The operating system implements the pollute buffer through a page-coloring based technique, by dedicating a small slice of the last-level cache to store non-reusable pages. Measurements show that ROCS, implemented in the Linux 2.6.24 kernel and running on a 2.3 GHz PowerPC 970FX, improves performance of memory-intensive SPEC CPU 2000 and NAS benchmarks by up to 34%, and 16% on average.

...read moreread less

127 citations

Patent•

Dynamic optimization of cache memory

[...]

Naveen Bali, Naresh M. Patel, Yasuhiro Endo

05 Aug 2008

TL;DR: In this article, the authors dynamically analyze lookup requests from a cache look-up algorithm to look up data block tags corresponding to blocks of data previously inserted into a cache memory, to determine a cache related parameter.

...read moreread less

Abstract: The present invention includes dynamically analyzing look-up requests from a cache look-up algorithm to look-up data block tags corresponding to blocks of data previously inserted into a cache memory, to determine a cache related parameter. After analysis of a specific look-up request, a block of data corresponding to the tag looked up by the look-up request may be accessed from the cache memory or from a mass storage device.

...read moreread less

111 citations

Proceedings Article•DOI•

Impact of Cache Partitioning on Multi-tasking Real Time Embedded Systems

[...]

Bach D. Bui¹, Marco Caccamo¹, Lui Sha¹, J. Martinez²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Lockheed Martin Corporation²

25 Aug 2008

TL;DR: In this work, the cache partitioning problem is presented as an optimization problem whose solution sets the size of each cache partition and assigns tasks to partitions such that system worst-case utilization is minimized thus increasing real-time schedulability.

...read moreread less

Abstract: Cache partitioning techniques have been proposed in the past as a solution for the cache interference problem. Due to qualitative differences with general purpose platforms, real-time embedded systems need to minimize task real-time utilization (function of execution time and period) instead of only minimizing the number of cache misses. In this work, the partitioning problem is presented as an optimization problem whose solution sets the size of each cache partition and assigns tasks to partitions such that system worst-case utilization is minimized thus increasing real-time schedulability. Since the problem is NP-Hard, a genetic algorithm is presented to find a near optimal solution. A case study and experiments show that in a typical real-time embedded system, the proposed algorithm is able to reduce the worst-case utilization by 15% (on average) if compared to the case when the system uses a shared cache or a proportional cache partitioned environment.

...read moreread less

109 citations

Patent•

Clustered cache appliance system and methodology

[...]

Joaquin J. Aviles, Mark U. Cree, Gregory Dahl¹•Institutions (1)

NetApp¹

16 Jan 2008

TL;DR: In this paper, a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies provides low-latency access and redundancy in responding to both read and write requests for cached files, thereby improving access time to the data stored on the disk-based NAS filer (group).

...read moreread less

Abstract: A method, system and program are disclosed for accelerating data storage by providing non-disruptive storage caching using clustered cache appliances with packet inspection intelligence. A cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies provides low-latency access and redundancy in responding to both read and write requests for cached files, thereby improving access time to the data stored on the disk-based NAS filer (group).

...read moreread less

Patent•

Cache metadata for implementing bounded transactional memory

[...]

Jan Gray¹, Tim Harris¹, James R. Larus¹, Burton Smith¹•Institutions (1)

Microsoft¹

30 May 2008

TL;DR: In this paper, various technologies and techniques are disclosed for providing a bounded transactional memory application that accesses cache metadata in a cache of a central processing unit. And the application can also interrogate a cache line metadata eviction summary to determine whether a transaction is doomed and then take an appropriate action.

...read moreread less

Abstract: Various technologies and techniques are disclosed for providing a bounded transactional memory application that accesses cache metadata in a cache of a central processing unit. When performing a transactional read from the bounded transactional memory application, a cache line metadata transaction-read bit is set. When performing a transactional write from the bounded transactional memory application, a cache line metadata transaction-write bit is set and a conditional store is performed. At commit time, if any lines marked with the transaction-read bit or the transaction-write bit were evicted or invalidated, all speculatively written lines are discarded. The application can also interrogate a cache line metadata eviction summary to determine whether a transaction is doomed and then take an appropriate action.

...read moreread less

Patent•

Smart Data Caching Using Data Mining

[...]

Jo Arao Ramos¹, John Baxter Rollins¹, David G. Wilhite¹•Institutions (1)

IBM¹

07 Jan 2008

TL;DR: In this article, a data mining algorithm is applied to the collected data requests to predict a set of data that is likely to be requested during an upcoming time period, and it is determined whether the complete set of predicted data exists in the data cache.

...read moreread less

Abstract: Methods and apparatus, including computer program products, implementing and using techniques for populating a data cache on a server. Data requests received by the server are collected in a repository. A data mining algorithm is applied to the collected data requests to predict a set of data that is likely to be requested during an upcoming time period. It is determined whether the complete set of predicted data exists in the data cache. If the complete set of predicted data does not exist in the data cache, the missing data is retrieved from a database and added to the data cache.

...read moreread less

Proceedings Article•DOI•

WCET Analysis of Multi-level Non-inclusive Set-Associative Instruction Caches

[...]

Damien Hardy, Isabelle Puaut

30 Nov 2008

TL;DR: This paper proposes a safe static instruction cache analysis method for multi-level non-inclusive caches, and shows that in all cases WCET estimations are much tighter when considering the cache hierarchy than when considering only the L1 cache.

...read moreread less

Abstract: With the advent of increasingly complex hardware in real-time embedded systems (processors with performance enhancing features such as pipelines, cache hierarchy, multiple cores), many processors now have a set-associative L2 cache. Thus, there is a need for considering cache hierarchies when validating the temporal behavior of real-time systems, in particular when estimating tasks' worst-case execution times (WCETs). In this paper, we propose a safe static instruction cache analysis method for multi-level non-inclusive caches. The proposed method is experimented on medium-size and large programs. We show that the method is reasonably tight. We further show that in all cases WCET estimations are much tighter when considering the cache hierarchy than when considering only the L1 cache. An evaluation of the analysis time is conducted, demonstrating that analyzing the cache hierarchy has a reasonable computation time.

...read moreread less

Proceedings Article•DOI•

Deconstructing new cache designs for thwarting software cache-based side channel attacks

[...]

Jingfei Kong¹, Onur Aciicmez², Jean-Pierre Seifert², Huiyang Zhou¹•Institutions (2)

University of Central Florida¹, Samsung²

31 Oct 2008

TL;DR: This paper analyzes two new cache designs to defeat cache-based side channel attacks by eliminating/obfuscating cache interferences and identifies significant vulnerabilities and shortcomings.

...read moreread less

Abstract: Software cache-based side channel attacks present a serious tthreat to computer systems. Previously proposed countermeasures were either too costly for practical use or only effective against particular attacks. Thus, a recent work identified cache interferences in general as the root cause and proposed two new cache designs, namely partition-locked cache (PLcache) and random permutation cache(RPcache), to defeat cache-based side channel attacks by eliminating/obfuscating cache interferences. In this paper, we analyze these new cache designs and identify significant vulnerabilities and shortcomings of those new cache designs. We also propose possible solutions and improvements over the original new cache designs to overcome the identified shortcomings.

...read moreread less

Proceedings Article•DOI•

Cache-efficient dynamic programming algorithms for multicores

[...]

Rezaul Chowdhury¹, Vijaya Ramachandran¹•Institutions (1)

University of Texas at Austin¹

14 Jun 2008

TL;DR: This work develops a generic CMP algorithm with an associated tiling sequence and provides a parallel schedule that results in a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm.

...read moreread less

Abstract: We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), Gaussian Elimination Paradigm (GEP), and parenthesis problem.For each class of problems, we develop a generic CMP algorithm with an associated tiling sequence. We then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm.We present experimental results on an 8-core Opteron for two sequence alignment problems that are important examples of LDDP. Our experimental results show good speed-ups for simple versions of our algorithms.

...read moreread less

Patent•

System and method for populating a cache using behavioral adaptive policies

[...]

Joaquin J. Aviles¹, Mark U. Cree¹, Gregory Dahl¹•Institutions (1)

NetApp¹

16 Jan 2008

TL;DR: In this paper, a method, system and program are disclosed for accelerating data storage in a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies.

...read moreread less

Abstract: A method, system and program are disclosed for accelerating data storage in a cache appliance cluster that transparently monitors NFS and CIFS traffic between clients and NAS subsystems and caches files using dynamically adjustable cache policies which populate the storage cache using behavioral adaptive policies that are based on analysis of clients-filers transaction patterns and network utilization, thereby improving access time to the data stored on the disk-based NAS filer (group) for predetermined applications.

...read moreread less

Proceedings Article•

On multi-level exclusive caching: offline optimality and why promotions are better than demotions

[...]

Binny S. Gill¹•Institutions (1)

IBM¹

26 Feb 2008

TL;DR: This work proposes a dramatically better performing alternative called PROMOTE, which provides exclusive caching in multi-level cache hierarchies without demotions or any of the overheads inherent in DEMOTE, and discovers theoretical bounds for optimal multi- level cache performance.

...read moreread less

Abstract: Multi-level cache hierarchies have become very common; however, most cache management policies result in duplicating the same data redundantly on multiple levels. The state-of-the-art exclusive caching techniques used to mitigate this wastage in multi-level cache hierarchies are the DEMOTE technique and its variants. While these achieve good hit ratios, they suffer from significant I/O and computational overheads, making them unsuitable for deployment in real-life systems. We propose a dramatically better performing alternative called PROMOTE, which provides exclusive caching in multi-level cache hierarchies without demotions or any of the overheads inherent in DEMOTE. PROMOTE uses an adaptive probabilistic filtering technique to decide which pages to "promote" to caches closer to the application. While both DEMOTE and PROMOTE provide the same aggregate hit ratios, PROMOTE achieves more hits in the highest cache levels leading to better response times. When inter-cache bandwidth is limited, PROMOTE convincingly outperforms DEMOTE by being 2x more efficient in bandwidth usage. For example, in a trace from a real-life scenario, PROMOTE provided an average response time of 3.42ms as compared to 5.05ms for DEMOTE on a two-level hierarchy of LRU caches, and 5.93ms as compared to 7.57ms on a three-level cache hierarchy. We also discover theoretical bounds for optimal multi-level cache performance. We devise two offline policies, called OPT-UB and OPT-LB, that provably serve as upper and lower bounds on the theoretically optimal performance of multi-level cache hierarchies. For a series of experiments on a wide gamut of traces and cache sizes OPT-UB and OPTLB ran within 2.18% and 2.83% of each other for two-cache and three-cache hierarchies, respectively. These close bounds will help evaluate algorithms and guide future improvements in the field of multi-level caching.

...read moreread less

Patent•

Sharing and persisting code caches

[...]

Derek L. Bruening¹, Vladimir Kiriansky¹•Institutions (1)

VMware¹

06 Jun 2008

TL;DR: In this article, a shared code caching engine receives native code comprising at least a portion of a single module of the application program, and stores runtime data corresponding to the native code in a cache data file in the non-volatile memory.

...read moreread less

Abstract: Computer code from an application program comprising a plurality of modules that each comprise a separately loadable file is code cached in a shared and persistent caching system. A shared code caching engine receives native code comprising at least a portion of a single module of the application program, and stores runtime data corresponding to the native code in a cache data file in the non-volatile memory. The engine then converts cache data file into a code cache file and enables the code cache file to be pre-loaded as a runtime code cache. These steps are repeated to store a plurality of separate code cache files at different locations in non-volatile memory.

...read moreread less

Proceedings Article•DOI•

Prefetching irregular references for software cache on cell

[...]

Tong Chen¹, Tao Zhang¹, Zehra Sura¹, Mar Gonzales Tallada¹•Institutions (1)

IBM¹

06 Apr 2008

TL;DR: This work proposes a method to prefetch irregular references accessed through a software cache that is built upon hardware such as Cell, and finds that when applicable, this prefetching can improve the performance of some benchmarks by 2 times on average, and by close to 4 times in the best case.

...read moreread less

Abstract: The IBM Single Source Research Compiler for the Cell processor (the SSC Research Compiler) was developed to manage the complexity of programming the heterogeneous multicore Cell processor. The compiler accepts conventional source programs as input, and automatically generates binaries that execute on both the PPU and SPU cores available on a Cell chip. The compiler uses a software cache and direct buffers to manage data in the small local memory of SPUs. However, irregular references, such as a[ind[i]], often become performance bottle-necks. These references are accessed through software cache, usually with high miss rates. To solve this problem, we propose a method to prefetch irregular references accessed through a software cache that is built upon hardware such as Cell. This method includes code transformation in the compiler and a runtime library component for the software cache. Our design simplifies the synchronization required when prefetching into software cache, overlaps DMA operations for misses, and avoids frequent context switching to the miss handler. It also minimizes the cache pollution caused by prefetching, by looking both forwards and backwards through the sequence of addresses to be prefetched. We evaluated our prefetching method using the NAS benchmarks. We found that when applicable, our prefetching can improve the performance of some benchmarks by 2 times on average, and by close to 4 times in the best case. We also present data to show the impact of different configurations and optimizations when prefetching in a software cache.

...read moreread less

Proceedings Article•

TaP: table-based prefetching for storage caches

[...]

Mingju Li¹, Elizabeth Varki¹, Swapnil Bhatia¹, Arif Merchant²•Institutions (2)

University of New Hampshire¹, Hewlett-Packard²

26 Feb 2008

TL;DR: TaP is a storage cache sequential prefetching and caching technique to improve the read-ahead cache hit rate and system response time and the use of a table to detect sequential access patterns in the I/O workload and to dynamically determine the optimum prefetch cache size.

...read moreread less

Abstract: TaP is a storage cache sequential prefetching and caching technique to improve the read-ahead cache hit rate and system response time. A unique feature of TaP is the use of a table to detect sequential access patterns in the I/O workload and to dynamically determine the optimum prefetch cache size. When compared to some popular prefetching techniques, TaP gives a better hit rate and response time while using a read cache that is often an order of magnitude smaller than that needed by other techniques. TaP is especially efficient when the I/O workload consists of interleaved requests from various applications, where only some of the applications are accessing their data sequentially. For example, TaP achieves the same hit rate as the other techniques with a cache length that is 100 times smaller than the cache needed by other techniques when the interleaved workload consists of 10% sequential application data and 90% random application data.

...read moreread less

Patent•

High bandwidth cache-to-processing unit communication in a multiple processor/cache system

[...]

Perry H. Pelley¹, Michael B. McShane•Institutions (1)

Freescale Semiconductor¹

05 Feb 2008

TL;DR: A processor/cache assembly has a processor die coupled to a cache die as discussed by the authors, and the processor die has a plurality of processor units arranged in an array, each cache set is in contact with one corresponding processor set.

...read moreread less

Abstract: A processor/cache assembly has a processor die coupled to a cache die. The processor die has a plurality of processor units arranged in an array. There is a plurality of processor sets of contact pads on the processor units, one processor set for each processor unit. Similarly, the cache die has a plurality of cache units arranged in an array. There is a plurality of cache sets of contact pads on the cache die, one cache set for each cache unit. Each cache set is in contact with one corresponding processor set.

...read moreread less

Proceedings Article•DOI•

An OS-based alternative to full hardware coherence on tiled CMPs

[...]

Christian Fensch¹, Marcelo Cintra¹•Institutions (1)

University of Edinburgh¹

24 Oct 2008

TL;DR: This paper proposes a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware maintained cache coherence and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles.

...read moreread less

Abstract: The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. This paper proposes a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware maintained cache coherence. The proposed mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. It allows only some controlled migration and replication of data and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles. We evaluate the proposed tiled CMP architecture on the Splash-2 scientific benchmarks and ALPBench multimedia benchmarks against one with private caches and a distributed directory cache coherence mechanism. Experimental results show that the performance degradation is as little as 0%, and 16% on average, compared to the cache coherent architecture across all benchmarks for 16 and 32 processors.

...read moreread less

Patent•

Hiding conflict, coherence completion and transaction ID elements of a coherence protocol

[...]

Benjamin Tsien¹•Institutions (1)

Intel¹

14 Oct 2008

TL;DR: In this paper, an apparatus having one or more cache agents and a protocol agent is disclosed, and the protocol agent includes a structure to handle conflict resolution, while the cache operation events are handled by a hierarchy of agents.

...read moreread less

Abstract: According to one embodiment of the invention, an apparatus having one or more cache agents and a protocol agent is disclosed. The protocol agent is coupled to the one or more cache agents to receive events corresponding to cache operations from the one or more cache agents to maintain ordering with respect to the cache operation events. The protocol agent includes a structure to handle conflict resolution.

...read moreread less

Journal Article•DOI•

Characterizing the miss sequence of the LRU cache

[...]

Predrag R. Jelenković¹, Xiaozhu Kang¹•Institutions (1)

Columbia University¹

31 Aug 2008

TL;DR: It is discovered that the miss stream from one single cache is approximated well by the superposition of a number of asymptotically independent renewal processes, which is likely to enable the development of a rigorous analysis of the tandem cache performance.

...read moreread less

Abstract: Renewed interest in caching systems stems from their wide-spread use for reducing the document download latency over the Internet. Since caches are usually organized in a hierarchical manner, it is important to study the performance properties of tandem caches. The first step in understanding this problem is to characterize the miss stream from one single cache since it represents the input to the next level cache. In this regard, we discover that the miss stream from one single cache is approximated well by the superposition of a number of asymptotically independent renewal processes. Interestingly, when this weakly correlated miss sequence is fed into another cache, this barely observable correlation can lead to measurably different caching performance when compared to the independent reference model. This result is likely to enable the development of a rigorous analysis of the tandem cache performance.

...read moreread less

Journal Article•DOI•

Scalable query result caching for web applications

[...]

Charles Garrod¹, Amit Manjhi², Anastasia Ailamaki¹, Bruce M. Maggs¹, Todd C. Mowry¹, Christopher Olston³, Anthony Tomasic¹ - Show less +3 more•Institutions (3)

Carnegie Mellon University¹, Google², Yahoo!³

01 Aug 2008

TL;DR: This paper introduces Ferdinand, the first proxy-based cooperative query result cache with fully distributed consistency management, and implements a fully functioning Ferdinand prototype and evaluates its performance compared to several alternative query-caching approaches, showing that high cache hit rate and consistency management are both critical for Ferdinand's performance gains over existing systems.

...read moreread less

Abstract: The backend database system is often the performance bottleneck when running web applications. A common approach to scale the database component is query result caching, but it faces the challenge of maintaining a high cache hit rate while efficiently ensuring cache consistency as the database is updated. In this paper we introduce Ferdinand, the first proxy-based cooperative query result cache with fully distributed consistency management. To maintain a high cache hit rate, Ferdinand uses both a local query result cache on each proxy server and a distributed cache. Consistency management is implemented with a highly scalable publish/subscribe system. We implement a fully functioning Ferdinand prototype and evaluate its performance compared to several alternative query-caching approaches, showing that our high cache hit rate and consistency management are both critical for Ferdinand's performance gains over existing systems.

...read moreread less

Journal Article•DOI•

Non deterministic caches: a simple and effective defense against side channel attacks

[...]

Georgios Keramidas¹, A. Antonopoulos¹, Dimitrios Serpanos¹, Stefanos Kaxiras¹•Institutions (1)

University of Patras¹

01 Sep 2008-Design Automation for Embedded Systems

TL;DR: This work uses a cycle-based processor simulator, enhanced with the required modifications, in order to evaluate the use of the Cache Decay approach as an aid to guard against cache-based side channel attacks, and shows that the technique can be used effectively to protect against Cache Decay attacks.

...read moreread less

Abstract: Side channel cryptanalysis has received significant attention lately, because it provides a low-cost and facile way to reveal the secret information held on a secure computing system. One particular type of side channel attacks, called cache-based side channel attacks, aims to deduce information about the state of a cryptographic algorithm or its key by observing the data-dependent behavior of a microprocessor's cache memory. These attacks have been proven successful and very hard to protect against. In this paper, we introduce the use of the Cache Decay approach as an aid to guard against cache-based side channel attacks. Cache Decay controls the lifetime (called decay interval) of the cache items and was initially proposed for cache power leakage savings. By randomly selecting the decay interval of the cache, we actually create caches with non-deterministic behavior in regard to their statistics. Thus, as we demonstrate, multiple runs of the same algorithm (performing on the same input) will result in different cache statistics, defending against the attacker and reinforcing the protection offered by the system. In our work, we use a cycle-based processor simulator, enhanced with the required modifications, in order to evaluate our proposal and show that our technique can be used effectively to protect against cache-based side channel attacks.

...read moreread less

Proceedings Article•DOI•

Hybrid access-specific software cache techniques for the cell BE architecture

[...]

Marc Gonzalez¹, Nikola Vujic¹, Xavier Martorell¹, Eduard Ayguadé¹, Alexandre E. Eichenberger², Tong Chen², Zehra Sura², Tao Zhang², Kevin O'Brien², Kathryn M. O'Brien² - Show less +6 more•Institutions (2)

Polytechnic University of Catalonia¹, IBM²

25 Oct 2008

TL;DR: A hierarchical, hybrid software-cache architecture that classifies at compile time memory accesses in two classes, high-locality and irregular, is proposed that can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.

...read moreread less

Abstract: Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies at compile time memory accesses in two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that improvements due to the optimized software-cache structures combined with the proposed code-optimizations translate into 3.5 to 8.4 speedup factors, compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.

...read moreread less

Patent•

Methods and structure for improved storage system performance with write-back caching for disk drives

[...]

Donald R. Humlicek¹•Institutions (1)

LSI Corporation¹

14 Aug 2008

TL;DR: In this paper, a state machine model of managing cache blocks in a storage controller cache memory maintains blocks in the storage controller's cache memory in a new state until verification is sensed that the blocks have been successfully stored on the persistent storage media of the affected disk drives.

...read moreread less

Abstract: Methods and associated structures for utilizing write-back cache management modes for local cache memory of disk drives coupled to a storage controller while maintaining data integrity of the data transferred to the local cache memories of affected disk drives. In one aspect hereof, a state machine model of managing cache blocks in a storage controller cache memory maintains blocks in the storage controller's cache memory in a new state until verification is sensed that the blocks have been successfully stored on the persistent storage media of the affected disk drives. Responsive to failure or other reset of the disk drive, the written cache blocks may be re-written from the copy maintained in the cache memory of the storage controller. In another aspect, an alternate controller's cache memory may also be used to mirror the cache blocks from the primary storage controller's cache memory as additional data integrity assurance.

...read moreread less

Patent•

Method to handle demand based dynamic cache allocation between SSD and RAID cache

[...]

Mahmoud K. Jibbe¹, Senthil Kannan•Institutions (1)

LSI Corporation¹

19 Feb 2008

TL;DR: In this article, the authors propose an apparatus and method to dynamically allocate cache in a SAN controller between a first fixed cache comprising traditional RAID cache comprised of RAM and a second scalable RAID cache comprising of SSDs (Solid State Devices).

...read moreread less

Abstract: An apparatus and method to dynamically allocate cache in a SAN controller between a first fixed cache comprising traditional RAID cache comprised of RAM and a second, scalable RAID cache comprising of SSDs (Solid State Devices). The method is dynamic and switches between the first and second cache depending on IO demand.

...read moreread less

Collapse