Showing papers on "Smart Cache published in 2007"

PDF

Open Access

Proceedings Article•DOI•

Adaptive insertion policies for high performance caching

[...]

Moinuddin K. Qureshi¹, Aamer Jaleel², Yale N. Patt¹, Simon C. Steely², Joel Emer² - Show less +1 more•Institutions (2)

09 Jun 2007

TL;DR: A Dynamic Insertion Policy (DIP) is proposed to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses, and shows that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.

...read moreread less

Abstract: The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits.We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP) which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in close to optimal hitrate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.

...read moreread less

722 citations

Proceedings Article•DOI•

New cache designs for thwarting software cache-based side channel attacks

[...]

Zhenghong Wang¹, Ruby B. Lee¹•Institutions (1)

Princeton University¹

09 Jun 2007

TL;DR: The results show that the new cache designs with built-in security can defend against cache-based side channel attacks in general-rather than only specific attacks on a given cryptographic algorithm-with very little performance degradation and hardware cost.

...read moreread less

Abstract: Software cache-based side channel attacks are a serious new class of threats for computers. Unlike physical side channel attacks that mostly target embedded cryptographic devices, cache-based side channel attacks can also undermine general purpose systems. The attacks are easy to perform, effective on most platforms, and do not require special instruments or excessive computation power. In recently demonstrated attacks on software implementations of ciphers like AES and RSA, the full key can be recovered by an unprivileged user program performing simple timing measurements based on cache misses.We first analyze these attacks, identifying cache interference as the root cause of these attacks. We identify two basic mitigation approaches: the partition-based approach eliminates cache interference whereas the randomization-based approach randomizes cache interference so that zero information can be inferred. We present new security-aware cache designs, the Partition-Locked cache (PLcache) and Random Permutation cache (RPcache), analyze and prove their security, and evaluate their performance. Our results show that our new cache designs with built-in security can defend against cache-based side channel attacks in general-rather than only specific attacks on a given cryptographic algorithm-with very little performance degradation and hardware cost.

...read moreread less

594 citations

Journal Article•DOI•

A NUCA Substrate for Flexible CMP Cache Sharing

[...]

Jaehyuk Huh¹, Changkyu Kim, Hazim Shafi², Lixin Zhang³, Doug Burger, Stephen W. Keckler - Show less +2 more•Institutions (3)

Advanced Micro Devices¹, Microsoft², IBM³

01 Aug 2007-IEEE Transactions on Parallel and Distributed Systems

TL;DR: It is demonstrated that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied.

...read moreread less

Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

...read moreread less

319 citations

Proceedings Article•DOI•

Cooperative cache partitioning for chip multiprocessors

[...]

Jichuan Chang¹, Gurindar S. Sohi¹•Institutions (1)

University of Wisconsin-Madison¹

17 Jun 2007

TL;DR: For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions, and provides the best results on almost all evaluation metrics for different cache sizes.

...read moreread less

Abstract: This paper presents Cooperative Cache Partitioning (CCP) to allocate cache resources among threads concurrently running on CMPs. Unlike cache partitioning schemes that use a single spatial partition repeatedly throughout a stable program phase, CCP resolves cache contention with multiple time-sharing partitions. Timesharing cache resources among partitions allows each thrashing thread to speed up dramatically in at least one partition by unfairly shrinking other threads' capacity allocations, while improving fairness by giving different partitions equal chance to execute. Quality-of-Service (QoS) is guaranteed over the long term by orchestrating the shrink and expansion of each thread's capacity across partitions to bound the average slowdown. Time-sharing based cache partitioning is further integrated with CMP cooperative caching [6] to exploit the benefits of LRU-based latency optimizations, which leads to a simplified partitioning algorithm and better performance for workloads that do not benefit from cache partitioning. We evaluate the effectiveness of CCP by simulating a 4-core CMP running all combinations of 7 representative SPEC2000 benchmarks. For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions. Overall, CCP provides the best results on almost all evaluation metrics for different cache sizes.

...read moreread less

280 citations

Patent•

Methods and Systems for Caching Content at Multiple Levels

[...]

Chris King¹, Steve Mullaney¹, Jamshid Mahdavi¹, Ravikumar Venkata Duvvuri¹•Institutions (1)

Blue Coat Systems¹

23 Mar 2007

TL;DR: A cache includes an object cache layer and a byte cache layer, each configured to store information to storage devices included in the cache appliance as mentioned in this paper, and an application proxy layer may also be included.

...read moreread less

Abstract: A cache includes an object cache layer and a byte cache layer, each configured to store information to storage devices included in the cache appliance. An application proxy layer may also be included. In addition, the object cache layer may be configured to identify content that should not be cached by the byte cache layer, which itself may be configured to compress contents of the object cache layer. In some cases the contents of the byte cache layer may be stored as objects within the object cache.

...read moreread less

208 citations

Proceedings Article•DOI•

Scheduling threads for constructive cache sharing on CMPs

[...]

Shimin Chen¹, Phillip B. Gibbons¹, Michael Kozuch¹, Vasileios Liaskovitis², Anastassia Ailamaki², Guy E. Blelloch², Babak Falsafi², Limor Fix¹, Nikos Hardavellas², Todd C. Mowry², Christopher B. Wilkerson¹ - Show less +7 more•Institutions (2)

Intel¹, Carnegie Mellon University²

09 Jun 2007

TL;DR: This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

...read moreread less

Abstract: In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

...read moreread less

174 citations

Proceedings Article•DOI•

An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors

[...]

Haakon Dybdahl¹, Per Stenström²•Institutions (2)

Norwegian University of Science and Technology¹, Chalmers University of Technology²

10 Feb 2007

TL;DR: This work proposes a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically and shows that this scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.

...read moreread less

Abstract: The significant speed-gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme estimates, continuously, the effect of increasing/decreasing the shared partition size on the overall performance. We show that our scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions

...read moreread less

153 citations

Proceedings Article•DOI•

Cache replacement based on reuse-distance prediction

[...]

Georgios Keramidas¹, Pavlos Petoumenos¹, Stefanos Kaxiras¹•Institutions (1)

University of Patras¹

01 Oct 2007

TL;DR: This work proposes to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations and evaluates the reusedistance based replacement policy of the L2 cache using a subset of the most memory intensive SPEC2000.

...read moreread less

Abstract: Several cache management techniques have been proposed that indirectly try to base their decisions on cacheline reuse-distance, like Cache Decay which is a postdiction of reuse-distances: if a cacheline has not been accessed for some ldquodecay intervalrdquo we know that its reuse-distance is at least as large as this decay interval. In this work, we propose to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between the LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reusedistance based replacement policy using a subset of the most memory intensive SPEC2000 and our results show significant benefits across the board.

...read moreread less

139 citations

Journal Article•DOI•

GroCoca: group-based peer-to-peer cooperative caching in mobile environment

[...]

Chi-Yin Chow¹, Hong Va Leong², Alvin T. S. Chan²•Institutions (2)

University of Minnesota¹, Hong Kong Polytechnic University²

01 Jan 2007-IEEE Journal on Selected Areas in Communications

TL;DR: In GroCoca, a GROup-based COoperative CAching scheme, a family of algorithms is proposed to discover and maintain all TCGs dynamically and two cooperative cache management protocols are designed to control data replicas and improve data accessibility in TCGs.

...read moreread less

Abstract: In a mobile cooperative caching environment, we observe the need for cooperating peers to cache useful data items together, so as to improve cache hit from peers. This could be achieved by capturing the data requirement of individual peers in conjunction with their mobility pattern, for which we realized via a GROup-based COoperative CAching scheme (GroCoca). In GroCoca, we define a tightly-coupled group (TCG) as a collection of peers that possess similar mobility pattern and display similar data affinity. A family of algorithms is proposed to discover and maintain all TCGs dynamically. Furthermore, two cooperative cache management protocols, namely, cooperative cache admission control and replacement, are designed to control data replicas and improve data accessibility in TCGs. A cache signature scheme is also adopted in GroCoca in order to provide information for the mobile clients to determine whether their TCG members are likely caching their desired data items and to perform cooperative cache replacement Experimental results show that GroCoca outperforms the conventional caching scheme and standard COoperative CAching scheme (COCA) in terms of access latency and global cache hit ratio. However, GroCoca generally incurs higher power consumption.

...read moreread less

122 citations

Proceedings Article•DOI•

Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache

[...]

Stephen Hines¹, David Whalley¹, Gary Tyson¹•Institutions (1)

Florida State University¹

01 Dec 2007

TL;DR: The tagless hit instruction cache (TH-IC) is proposed, a technique for completely eliminating the performance penalty associated with filter caches, as well as a further reduction in energy consumption due to not having to access the tag array on cache hits.

...read moreread less

Abstract: Very small instruction caches have been shown to greatly reduce fetch energy. However, for many appli- cations the use of a small filter cache can lead to an unacceptable increase in execution time. In this paper, we propose the Tagless Hit Instruction Cache (TH-IC), a technique for completely eliminating the performance penalty associated with filter caches, as well as a fur- ther reduction in energy consumption due to not having to access the tag array on cache hits. Using a few meta- data bits per line, we are able to more efficiently track the cache contents and guarantee when hits will occur in our small TH-IC. When a hit is not guaranteed, we can instead fetch directly from the L1 instruction cache, eliminating any additional cycles due to a TH-IC miss. Experimental results show that the overall processor en- ergy consumption can be significantly reduced due to the faster application running time and the elimination of tag comparisons for most of the accesses.

...read moreread less

122 citations

Journal Article•DOI•

A 65-nm Dual-Core Multithreaded Xeon® Processor With 16-MB L3 Cache

[...]

Stefan Rusu¹, Simon M. Tam¹, Harry Muljono¹, David J. Ayers¹, J. Chang¹, B. Cherkauer¹, J. Stinson¹, John Benoit¹, Raj Varada¹, Justin Leung¹, Rahul Limaye¹, Sujal Vora¹ - Show less +8 more•Institutions (1)

Intel¹

01 Jan 2007

TL;DR: This paper describes a dual-core 64-b Xeon MP processor implemented in a 65-nm eight-metal process that implements both sleep and shut-off leakage reduction modes and employs multiple voltage and clock domains to reduce power.

...read moreread less

Abstract: This paper describes a dual-core 64-b Xeon MP processor implemented in a 65-nm eight-metal process. The 435-mm2 die has 1.328-B transistors. Each core has two threads and a unified 1-MB L2 cache. The 16-MB shared, 16-way set-associative L3 cache implements both sleep and shut-off leakage reduction modes. Long channel transistors are used to reduce subthreshold leakage in cores and uncore (all portions of the die that are outside the cores) control logic. Multiple voltage and clock domains are employed to reduce power

...read moreread less

Patent•

Method for selectively enabling and disabling read caching in a storage subsystem

[...]

Lee Charles La Frese¹, Joshua Douglas Martin¹, Justin Thomson Miller¹, Vernon W. Miller¹, James Russell Thompson¹, Yan Xu¹, Olga Yiparaki¹ - Show less +3 more•Institutions (1)

IBM¹

30 Jul 2007

TL;DR: In this paper, a mechanism for selectively disabling and enabling read caching based on past performance of the cache and current read/write requests is proposed to improve overall performance by using an autonomic algorithm to disable read caching.

...read moreread less

Abstract: A mechanism for selectively disabling and enabling read caching based on past performance of the cache and current read/write requests. The system improves overall performance by using an autonomic algorithm to disable read caching for regions of backend disk storage (i.e., the backstore) that have had historically low cache hit ratios. The result is that more cache becomes available for workloads with larger hit ratios, and less time and machine cycles are spent searching the cache for data that is unlikely to be there.

...read moreread less

Patent•

Caching in multicore and multiprocessor architectures

[...]

Anant Agarwal¹, Ian Rudolf Bratt¹, Matthew Mattina¹•Institutions (1)

Tilera¹

25 May 2007

TL;DR: In this article, a multicore processor comprises a plurality of cache memories, each associated with one cache memory, and each of the cache memories is configured to maintain at least a portion of cache memory in which each cache line is dynamically managed as either local to the associated processor core or shared among multiple processor cores.

...read moreread less

Abstract: A multicore processor comprises a plurality of cache memories, and a plurality of processor cores, each associated with one of the cache memories. Each of at least some of the cache memories is configured to maintain at least a portion of the cache memory in which each cache line is dynamically managed as either local to the associated processor core or shared among multiple processor cores.

...read moreread less

Proceedings Article•DOI•

Interconnect design considerations for large NUCA caches

[...]

Naveen Muralimanohar, Rajeev Balasubramonian¹•Institutions (1)

University of Utah¹

09 Jun 2007

TL;DR: This work extends the widely-used CACTI cache modeling tool to take network design parameters into account and proposes novel cache access optimizations that introduce heterogeneity within the inter-bank network to alleviate the interconnect delay bottleneck.

...read moreread less

Abstract: The ever increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures--NUCA) Most studies on NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search Since wire/router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power We extend the widely-used CACTI cache modeling tool to take network design parameters into account With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network The careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement

...read moreread less

Journal Article•DOI•

The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series

[...]

J. Chang¹, Ming Huang¹, Jonathan Shoemaker¹, John Benoit¹, Szu-Liang Chen¹, Wei Chen¹, Siufu Chiu¹, Raghuraman Ganesan¹, G. Leong¹, Venkata Lukka¹, Stefan Rusu¹, Durgesh Srivastava¹ - Show less +8 more•Institutions (1)

Intel¹

26 Mar 2007-IEEE Journal of Solid-state Circuits

TL;DR: The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624 mum2 cell in a 65-nm 8-metal technology to minimize both leakage and dynamic power.

...read moreread less

Abstract: The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624 mum2 cell in a 65-nm 8-metal technology. Low power techniques are implemented in the L3 cache to minimize both leakage and dynamic power. Sleep transistors are used in the SRAM array and peripherals, reducing the cache leakage by more than 2X. Only 0.8% of the cache is powered up for a cache access. Dynamic cache line disable (Intel Cache Safe Technology) with a history buffer protects the cache from latent defects and infant mortality failures

...read moreread less

Proceedings Article•DOI•

Emulating Optimal Replacement with a Shepherd Cache

[...]

Kaushik Rajan¹, Ramaswamy Govindarajan¹•Institutions (1)

Indian Institute of Science¹

01 Dec 2007

TL;DR: This work proposes a novel replacement strategy that mimics the replacement decisions of OPT and can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup.

...read moreread less

Abstract: The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly higher misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with a simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and guiding the replacement decisions in MC. Our pro- posed organization can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup. Comparison with the dynamic insertion policy, a victim buffer, a V-Way cache and an LRU based fully associative cache demonstrates that our scheme performs better than all these strategies.

...read moreread less

Proceedings Article•

Karma: know-it-all replacement for a multilevel cache

[...]

Gala Yadgar¹, Michael Factor², Assaf Schuster¹•Institutions (2)

Technion – Israel Institute of Technology¹, IBM²

13 Feb 2007

TL;DR: Karma is presented, a global non-centralized, dynamic and informed management policy for multiple levels of cache that leverages application hints to make informed allocation and replacement decisions in all cache levels, preserving exclusive caching and adjusting to changes in access patterns.

...read moreread less

Abstract: Multilevel caching, common in many storage configurations, introduces new challenges to traditional cache management: data must be kept in the appropriate cache and replication avoided across the various cache levels Some existing solutions focus on avoiding replication across the levels of the hierarchy, working well without information about temporal locality-information missing at all but the highest level of the hierarchy Others use application hints to influence cache contents We present Karma, a global non-centralized, dynamic and informed management policy for multiple levels of cache Karma leverages application hints to make informed allocation and replacement decisions in all cache levels, preserving exclusive caching and adjusting to changes in access patterns We show the superiority of Karma through comparison to existing solutions including LRU, 2Q, ARC, MultiQ, LRU-SP, and Demote, demonstrating better cache performance than all other solutions and up to 85% better performance than LRU on representative workloads

...read moreread less

Proceedings Article•DOI•

Exploring DRAM cache architectures for CMP server platforms

[...]

Li Zhao¹, Ravi Iyer¹, Ramesh Illikkal¹, Donald Newell¹•Institutions (1)

Intel¹

01 Oct 2007

TL;DR: This paper investigates the impact of introducing a low latency, large capacity and high bandwidth DRAM-based cache between the last level SRAM cache and memory subsystem, and identifies the most efficient DRAM cache organization.

...read moreread less

Abstract: As dual-core and quad-core processors arrive in the marketplace, the momentum behind CMP architectures continues to grow strong. As more and more cores/threads are placed on-die, the pressure on the memory subsystem is rapidly increasing. To address this issue, we explore DRAM cache architectures for CMP platforms. In this paper, we investigate the impact of introducing a low latency, large capacity and high bandwidth DRAM-based cache between the last level SRAM cache and memory subsystem. We first show the potential benefits of large DRAM caches for key commercial server workloads. As the primary hurdle to achieving these benefits with DRAM caches is the tag space overheads associated with them, we identify the most efficient DRAM cache organization and investigate various options. Our results show that the combination of 8-bit partial tags and 2-way sectoring achieves the highest performance (20% to 70%) with the lowest tag space (<25%) overhead.

...read moreread less

Proceedings Article•DOI•

Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines

[...]

Moinuddin K. Qureshi, M.A. Suleman, Yale N. Patt

10 Feb 2007

TL;DR: This work proposes line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line, and proposes distill cache, a cache organization to utilize the capacity created by LDIS.

...read moreread less

Abstract: Caches are organized at a line-size granularity to exploit spatial locality. However, when spatial locality is low, many words in the cache line are not used. Unused words occupy cache space but do not contribute to cache hits. Filtering these words can allow the cache to store more cache lines. We show that unused words in a cache line are unlikely to be accessed in the less recent part of the LRU stack. We propose line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line. We also propose distill cache, a cache organization to utilize the capacity created by LDIS. Our experiments with 16 memory-intensive benchmarks show that LDIS reduces the average misses for a 1MB 8-way L2 cache by 30% and improves the average IPC by 12%

...read moreread less

Proceedings Article•DOI•

Compile-time decided instruction cache locking using worst-case execution paths

[...]

Heiko Falk, Sascha Plazar, Henrik Theiling

30 Sep 2007

TL;DR: Techniques exploring the lockdown of instruction caches at compile-time to minimize WCETs are presented, which explicitly take the worst-case execution path into account during each step of the optimization procedure.

...read moreread less

Abstract: Caches are notorious for their unpredictability. It is difficult or even impossible to predict if a memory access results in a definite cache hit or miss. This unpredictability is highly undesired for real-time systems. The Worst-Case Execution Time (WCET) of a software running on an embedded processor is one of the most important metrics during real-time system design. The WCET depends to a large extent on the total amount of time spent for memory accesses. In the presence of caches, WCET analysis must always assume a memory access to be a cache miss if it can not be guaranteed that it is a hit. Hence, WCETs for cached systems are imprecise due to the overestimation caused by the caches. Modern caches can be controlled by software. The software can load parts of its code or of its data into the cache and lock the cache afterwards. Cache locking prevents the cache's contents from being flushed by deactivating the replacement. A locked cache is highly predictable and leads to very precise WCET estimates, because the uncertainty caused by the replacement strategy is eliminated completely. This paper presents techniques exploring the lockdown of instruction caches at compile-time to minimize WCETs. In contrast to the current state of the art in the area of cache locking, our techniques explicitly take the worst-case execution path into account during each step of the optimization procedure. This way, we can make sure that always those parts of the code are locked in the I-cache that lead to the highest WCET reduction. The results demonstrate that WCET reductions from 54% up to 73% can be achieved with an acceptable amount of CPU seconds required for the optimization and WCET analyses themselves.

...read moreread less

Proceedings Article•DOI•

A Novel Cooperative Caching Scheme for Wireless Ad Hoc Networks: GroupCaching

[...]

Yi-Wei Ting¹, Yeim-Kuan Chang¹•Institutions (1)

National Cheng Kung University¹

29 Jul 2007

TL;DR: By using the proposed GroupCaching, the caching space in MHs can be efficiently utilized and thus the redundancy of cached data is decreased and the average access latency is reduced.

...read moreread less

Abstract: In the mobile ad hoc network, a mobile host can communicate with others anywhere and anytime. Cooperative caching scheme can improve the accessibility of data objects. However, the cache hit ratio is reduced and access latency becomes longer significantly due to the mobility of MHs, energy consumption in battery, and limited wireless bandwidth. In this paper, we propose a novel cooperative caching scheme called GroupCaching (GC) which allows each MH and its 1-hop neighbors form a group. The caching status is exchanged and maintained periodically in a group. By using the proposed GroupCaching, the caching space in MHs can be efficiently utilized and thus the redundancy of cached data is decreased and the average access latency is reduced. We evaluate the performance of the GroupCaching by using NS2 and compare it with the existing schemes such as CacheData and ZoneCooperative. The experimental results show that the cache hit ratio is increased by about 3%~30% and the average latency is reduced by about 5%~25% compared with other schemes.

...read moreread less

Proceedings Article•DOI•

A self-tuning configurable cache

[...]

Ann Gordon-Ross¹, Frank Vahid¹•Institutions (1)

University of California, Riverside¹

04 Jun 2007

TL;DR: A self-tuning cache is introduced that performs transparent runtime cache tuning, thus relieving the application designer and/or compiler from predetermining an application's cache configuration.

...read moreread less

Abstract: The memory hierarchy of a system can consume up to 50% of microprocessor system power. Previous work has shown that tuning a configurable cache to a particular application can reduce memory subsystem energy by 62% on average. We introduce a self-tuning cache that performs transparent runtime cache tuning, thus relieving the application designer and/or compiler from predetermining an application's cache configuration. The self-tuning cache applies tuning at a determined tuning interval. A good interval balances tuning process energy overhead against the energy overhead of running in a sub-optimal cache configuration, which we show wastes much energy. We present a self-tuning cache that dynamically varies the tuning interval, resulting in average energy reduction of as much as 29%, falling within 13% of an oracle-based optimal method.

...read moreread less

Journal Article•DOI•

Miss Rate Prediction Across Program Inputs and Cache Configurations

[...]

Yutao Zhong¹, Steven Dropsho, Xipeng Shen², A. Studer³, Chen Ding⁴ - Show less +1 more•Institutions (4)

George Mason University¹, College of William & Mary², Carnegie Mellon University³, IEEE Computer Society⁴

01 Mar 2007-IEEE Transactions on Computers

TL;DR: An interactive visualization tool that uses a three-dimensional plot to show miss rate changes across program data sizes and cache sizes and its use in evaluating compiler transformations and other uses of this visualization tool include assisting machine and benchmark-set design.

...read moreread less

Abstract: Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This paper uses locality analysis to generate a parameterized model of program cache behavior. Given a cache size and associativity, this model predicts the miss rate for arbitrary data input set sizes. This model also identifies critical data input sizes where cache behavior exhibits marked changes. Experiments show this technique is within 2 percent of the hit rate for set associative caches on a set of floating-point and integer programs using array and pointer-based data structures. Building on the new model, this paper presents an interactive visualization tool that uses a three-dimensional plot to show miss rate changes across program data sizes and cache sizes and its use in evaluating compiler transformations. Other uses of this visualization tool include assisting machine and benchmark-set design. The tool can be accessed on the Web at http://www.cs.rochester.edu/research/locality

...read moreread less

Journal Article•DOI•

Cooperative Caching Strategy in Mobile Ad Hoc Networks Based on Clusters

[...]

Narottam Chand¹, Ramesh C. Joshi¹, Manoj Misra¹•Institutions (1)

Indian Institute of Technology Roorkee¹

01 Oct 2007-Wireless Personal Communications

TL;DR: Simulation experiments show that CC caching mechanism achieves significant improvements in cache hit ratio and average query latency in comparison with other caching strategies.

...read moreread less

Abstract: In this paper, we present a scheme, called Cluster Cooperative (CC) for caching in mobile ad hoc networks. In CC scheme, the network topology is partitioned into non-overlapping clusters based on the physical network proximity. For a local cache miss, each client looks for data item in the cluster. If no client inside the cluster has cached the requested item, the request is forwarded to the next client on the routing path towards server. A cache replacement policy, called Least Utility Value with Migration (LUV-Mi) is developed. The LUV-Mi policy is suitable for cooperation in clustered ad hoc environment because it considers the performance of an entire cluster along with the performance of local client. Simulation experiments show that CC caching mechanism achieves significant improvements in cache hit ratio and average query latency in comparison with other caching strategies.

...read moreread less

Patent•

System and method for implementing a dynamic cache for a data storage system

[...]

Paul Yuedong Mu

31 Oct 2007

TL;DR: In this article, a cache controller is coupled to the data path for a data storage system and can be implemented as a filter in a filter framework, allowing an application to control caching of data to permit optimization of data flow for the particular application.

...read moreread less

Abstract: A dynamic cache system is configured to flexibly respond to changes in operating parameters of a data storage and retrieval system. A cache controller in the system implements a caching policy describing how and what data should be cached. The policy can provide different caching behavior based on exemplary parameters such as a user ID, a specified application or a given workload. The cache controller is coupled to the data path for a data storage system and can be implemented as a filter in a filter framework. The cache memory for storing cached data can be local or remote from the cache controller. The policies implemented in the cache controller permit an application to control caching of data to permit optimization of data flow for the particular application.

...read moreread less

Journal Article•DOI•

Data cache locking for tight timing calculations

[...]

Xavier Vera¹, Björn Lisper¹, Jingling Xue²•Institutions (2)

Mälardalen University College¹, University of New South Wales²

12 Dec 2007-ACM Transactions in Embedded Computing Systems

TL;DR: This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems and combines static cache analysis and cache-locking mechanisms to ensure that all intratask conflicts, and consequently, memory access times, are exactly predictable.

...read moreread less

Abstract: Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. In addition, caches are a source of unpredictability, resulting in programs sometimes behaving in a different way than expected. Detailed information about the number of cache misses and their causes allows us to predict cache behavior and to detect bottlenecks. Small modifications in the source code may change memory patterns, thereby altering the cache behavior. Code transformations, which take the cache behavior into account, might result in a high cache performance improvement. However, cache memory behavior is very hard to predict, thus making the task of optimizing and timing cache behavior very difficult. This article proposes and evaluates a new compiler framework that times cache behavior for multitasking systems. Our method explores the use of cache partitioning and dynamic cache locking to provide worst-case performance estimates in a safe and tight way for multitasking systems. We use cache partitioning, which divides the cache among tasks to eliminate intertask cache interferences. We combine static cache analysis and cache-locking mechanisms to ensure that all intratask conflicts, and consequently, memory access times, are exactly predictable. The results of our experiments demonstrate the capability of our framework to describe cache behavior at compile time. We compare our timing approach with a system equipped with a nonpartitioned, but statically, locked data cache. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, demonstrating that our fully predictable scheme does not compromise the performance of the transformed programs.

...read moreread less

Book Chapter•DOI•

An analytical model for time-driven cache attacks

[...]

Kris Tiri¹, Onur Aciicmez¹, Michael Neve¹, Flemming Andersen¹•Institutions (1)

Intel¹

26 Mar 2007

TL;DR: An analytical model for time-driven cache attacks that accurately forecasts the strength of a symmetric key cryptosystem based on 3 simple parameters: the number of lookup tables, the size of the lookup tables; and the length of the microprocessor's cache line is presented.

...read moreread less

Abstract: Cache attacks exploit side-channel information that is leaked by a microprocessor's cache. There has been a significant amount of research effort on the subject to analyze and identify cache side-channel vulnerabilities since early 2002. Experimental results support the fact that the effectiveness of a cache attack depends on the particular implementation of the cryptosystem under attack and on the cache architecture of the device this implementation is running on. Yet, the precise effect of the mutual impact between the software implementation and the cache architecture is still an unknown. In this manuscript, we explain the effect and present an analytical model for time-driven cache attacks that accurately forecasts the strength of a symmetric key cryptosystem based on 3 simple parameters: (1) the number of lookup tables; (2) the size of the lookup tables; (3) and the length of the microprocessor's cache line. The accuracy of the model has been experimentally verified on 3 different platforms with different implementations of the AES algorithm attacked by adversaries with different capabilities.

...read moreread less

Journal Article•DOI•

Cooperative caching in mobile ad hoc networks based on data utility

[...]

Narottam Chand, Ramesh C. Joshi¹, Manoj Misra¹•Institutions (1)

Indian Institute of Technology Roorkee¹

01 Jan 2007-Mobile Information Systems

TL;DR: A utility based cache replacement policy, least utility value (LUV), is proposed, to improve the data availability and reduce the local cache miss ratio and simulation results show that, LUV replacement policy substantially outperforms the LRU policy.

...read moreread less

Abstract: Cooperative caching, which allows sharing and coordination of cached data among clients, is a potential technique to improve the data access performance and availability in mobile ad hoc networks. However, variable data sizes, frequent data updates, limited client resources, insufficient wireless bandwidth and client's mobility make cache management a challenge. In this paper, we propose a utility based cache replacement policy, least utility value (LUV), to improve the data availability and reduce the local cache miss ratio. LUV considers several factors that affect cache performance, namely access probability, distance between the requester and data source/cache, coherency and data size. A cooperative cache management strategy, Zone Cooperative (ZC), is developed that employs LUV as replacement policy. In ZC one-hop neighbors of a client form a cooperation zone since the cost for communication with them is low both in terms of energy consumption and message exchange. Simulation experiments have been conducted to evaluate the performance of LUV based ZC caching strategy. The simulation results show that, LUV replacement policy substantially outperforms the LRU policy.

...read moreread less

Proceedings Article•DOI•

Performance of Graceful Degradation for Cache Faults

[...]

Hyunjin Lee¹, Sangyeun Cho¹, Bruce R. Childers¹•Institutions (1)

University of Pittsburgh¹

09 Mar 2007

TL;DR: This paper considers defects in cache memory and studies their impact on program performance using a fault degradable cache model, and proposes an efficient cache set remapping scheme to recover lost performance due to failed sets.

...read moreread less

Abstract: In sub-90nm technologies, more frequent hard faults pose a serious burden on processor design and yield control. In addition to manufacturing-time chip repair schemes, microarchitectural techniques to make processor components resilient to hard faults become increasingly important. This paper considers defects in cache memory and studies their impact on program performance using a fault degradable cache model. We first describe how defects at the circuit level in cache manifest themselves at the microarchitecture level. We then examine several strategies for masking faults, by disabling faulty resources, such as lines, sets, ways, ports, or even the whole cache. We also propose an efficient cache set remapping scheme to recover lost performance due to failed sets. Using a new simulation tool, called CAFE, we study how the cache faults impact program performance under the various masking schemes

...read moreread less

Patent•

Horizontally-shared cache victims in multiple core processors

[...]

Sanjay Vishin¹•Institutions (1)

MIPS Technologies¹

02 Mar 2007

TL;DR: Cache priority rules can be based on cache coherency data, load balancing schemes, and architectural characteristics of the processor as discussed by the authors, and the processor evaluates cache priority rules to determine whether victim lines are discarded, written back to system memory, or stored in other processor core units' caches.

...read moreread less

Abstract: A processor includes multiple processor core units, each including a processor core and a cache memory. Victim lines evicted from a first processor core unit's cache may be stored in another processor core unit's cache, rather than written back to system memory. If the victim line is later requested by the first processor core unit, the victim line is retrieved from the other processor core unit's cache. The processor has low latency data transfers between processor core units. The processor transfers victim lines directly between processor core units' caches or utilizes a victim cache to temporarily store victim lines while searching for their destinations. The processor evaluates cache priority rules to determine whether victim lines are discarded, written back to system memory, or stored in other processor core units' caches. Cache priority rules can be based on cache coherency data, load balancing schemes, and architectural characteristics of the processor.

...read moreread less

Collapse