Showing papers on "Cache pollution published in 2007"


Proceedings ArticleDOI
09 Jun 2007
TL;DR: A Dynamic Insertion Policy (DIP) is proposed to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses, and shows that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
Abstract: The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits. We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP), which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in a close-to-optimal hit rate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
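A minimal sketch of the LIP/BIP insertion idea described above, using a software LRU list to stand in for a single cache set; the epsilon value and class names are illustrative assumptions, not the paper's hardware parameters.

```python
import random
from collections import OrderedDict

class InsertionPolicyCacheSet:
    """One cache set managed with LRU, LIP, or BIP insertion."""

    def __init__(self, num_ways, policy="BIP", bip_epsilon=1 / 32):
        self.num_ways = num_ways
        self.policy = policy            # "LRU", "LIP", or "BIP"
        self.bip_epsilon = bip_epsilon  # fraction of BIP fills placed at MRU (assumed value)
        self.lines = OrderedDict()      # ordered from LRU (front) to MRU (back)

    def access(self, tag):
        if tag in self.lines:                      # hit: promote to MRU under all policies
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.num_ways:
            self.lines.popitem(last=False)         # evict the line at the LRU position
        self.lines[tag] = None                     # inserted at MRU by default
        insert_at_mru = (self.policy == "LRU" or
                         (self.policy == "BIP" and random.random() < self.bip_epsilon))
        if not insert_at_mru:                        # LIP and most BIP fills go to the LRU
            self.lines.move_to_end(tag, last=False)  # position, so a streaming workload
        return False                                 # cannot flush the retained working set
```

DIP, per the abstract, then chooses between BIP and traditional LRU based on which policy is incurring fewer misses; that selection logic is not shown here.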

722 citations


Book
10 Sep 2007
TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Abstract: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem. The book tells you everything you need to know about the logical design and operation, physical design and operation, performance characteristics and resulting design trade-offs, and the energy consumption of modern memory hierarchies. You learn how to tackle the challenging optimization problems that result from the side-effects that can appear at any point in the entire hierarchy. As a result you will be able to design and emulate the entire memory hierarchy. Understand all levels of the system hierarchy: cache, DRAM, and disk. Evaluate the system-level effects of all design choices. Model performance and energy consumption for each component in the memory hierarchy.

659 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: The results show that the new cache designs with built-in security can defend against cache-based side channel attacks in general, rather than only against specific attacks on a given cryptographic algorithm, with very little performance degradation and hardware cost.
Abstract: Software cache-based side channel attacks are a serious new class of threats for computers. Unlike physical side channel attacks that mostly target embedded cryptographic devices, cache-based side channel attacks can also undermine general purpose systems. The attacks are easy to perform, effective on most platforms, and do not require special instruments or excessive computation power. In recently demonstrated attacks on software implementations of ciphers like AES and RSA, the full key can be recovered by an unprivileged user program performing simple timing measurements based on cache misses. We first analyze these attacks, identifying cache interference as their root cause. We identify two basic mitigation approaches: the partition-based approach eliminates cache interference, whereas the randomization-based approach randomizes cache interference so that zero information can be inferred. We present new security-aware cache designs, the Partition-Locked cache (PLcache) and Random Permutation cache (RPcache), analyze and prove their security, and evaluate their performance. Our results show that our new cache designs with built-in security can defend against cache-based side channel attacks in general, rather than only against specific attacks on a given cryptographic algorithm, with very little performance degradation and hardware cost.
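A hedged sketch of the randomization idea behind an RPcache-style design: each protected process indexes the cache through its own random permutation of set indices, so the set contention an attacker observes reveals little about the victim's addresses. Table management, re-permutation on detected interference, and replacement details of the actual design are omitted; all names and parameters here are illustrative.

```python
import random

class PermutedIndexer:
    """Per-process permutation of cache set indices (illustrative only)."""

    def __init__(self, num_sets, seed=None):
        rng = random.Random(seed)
        self.table = list(range(num_sets))   # permutation table, one entry per set
        rng.shuffle(self.table)
        self.num_sets = num_sets

    def set_index(self, address, line_bytes=64):
        natural = (address // line_bytes) % self.num_sets
        return self.table[natural]           # the set actually probed

# Two processes with different tables map the same address to unrelated sets,
# which is what breaks the attacker's address-to-set inference.
victim, attacker = PermutedIndexer(512, seed=1), PermutedIndexer(512, seed=2)
addr = 0x7F3AC000
print(victim.set_index(addr), attacker.set_index(addr))
```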

594 citations


Proceedings ArticleDOI
10 Feb 2007
TL;DR: This paper proposes a hardware transactional memory system called LogTM Signature Edition (LogTM-SE), which uses signatures to summarize a transaction's read and write sets and detects conflicts on coherence requests (eager conflict detection), and allows cache victimization, unbounded nesting, thread context switching and migration, and paging.
Abstract: This paper proposes a hardware transactional memory (HTM) system called LogTM Signature Edition (LogTM-SE). LogTM-SE uses signatures to summarize a transaction's read and write sets and detects conflicts on coherence requests (eager conflict detection). Transactions update memory "in place" after saving the old value in a per-thread memory log (eager version management). Finally, a transaction commits locally by clearing its signature, resetting the log pointer, etc., while aborts must undo the log. LogTM-SE achieves two key benefits. First, signatures and logs can be implemented without changes to highly-optimized cache arrays because LogTM-SE never moves cached data, changes a block's cache state, or flash-clears bits in the cache. Second, transactions are more easily virtualized because signatures and logs are software accessible, allowing the operating system and runtime to save and restore this state. In particular, LogTM-SE allows cache victimization, unbounded nesting (both open and closed), thread context switching and migration, and paging.
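A minimal software sketch of signature-based conflict detection in the spirit of the abstract: read and write sets are summarized in fixed-size bit vectors (a Bloom-filter-like structure), and an incoming coherence request is tested against them, with false positives being safe but false negatives impossible. The hash functions and signature size are illustrative assumptions.

```python
class Signature:
    """Bit-vector summary of a transaction's read or write set."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.vector = 0

    def _hashes(self, block_addr):
        # Two cheap hashes over the block address; a real design would pick
        # hardware-friendly hash functions.
        yield block_addr % self.bits
        yield (block_addr * 2654435761) % self.bits

    def insert(self, block_addr):
        for h in self._hashes(block_addr):
            self.vector |= 1 << h

    def might_contain(self, block_addr):
        return all((self.vector >> h) & 1 for h in self._hashes(block_addr))

    def clear(self):
        self.vector = 0    # commit: clearing the signature is all that is needed

read_sig, write_sig = Signature(), Signature()
write_sig.insert(0x4000 >> 6)              # this transaction wrote block 0x4000
incoming_write_block = 0x4000 >> 6         # coherence request from another core
conflict = (read_sig.might_contain(incoming_write_block) or
            write_sig.might_contain(incoming_write_block))
print(conflict)                            # True: eager conflict detection fires
```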

384 citations


Proceedings ArticleDOI
10 Feb 2007
TL;DR: Results show that feedback-directed prefetching eliminates the large negative performance impact incurred on some benchmarks due to prefetching, and it is applicable to stream-based prefetchers, global-history-buffer based delta correlation prefetchers, and PC-based stride prefetchers.
Abstract: High performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others. Also, prefetching can significantly increase memory bandwidth requirements. This paper proposes a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce the negative performance and bandwidth impact of prefetching. Our mechanism estimates prefetcher accuracy, prefetcher timeliness, and prefetcher-caused cache pollution to adjust the aggressiveness of the data prefetcher dynamically. We introduce a new method to track cache pollution caused by the prefetcher at run-time. We also introduce a mechanism that dynamically decides where in the LRU stack to insert the prefetched blocks in the cache based on the cache pollution caused by the prefetcher. Using the proposed dynamic mechanism improves average performance by 6.5% on 17 memory-intensive benchmarks in the SPEC CPU2000 suite compared to the best-performing conventional stream-based data prefetcher configuration, while it consumes 18.7% less memory bandwidth. Compared to a conventional stream-based data prefetcher configuration that consumes a similar amount of memory bandwidth, feedback-directed prefetching provides 13.6% higher performance. Our results show that feedback-directed prefetching eliminates the large negative performance impact incurred on some benchmarks due to prefetching, and it is applicable to stream-based prefetchers, global-history-buffer based delta correlation prefetchers, and PC-based stride prefetchers.
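A sketch of the feedback loop the abstract describes: at the end of each sampling interval, measure prefetch accuracy and prefetcher-caused pollution and raise or lower the stream prefetcher's aggressiveness (distance and degree). The thresholds, level table, and counter names below are illustrative assumptions rather than the paper's exact parameters, and the timeliness term is folded out for brevity.

```python
class FeedbackPrefetcherControl:
    # (prefetch distance, prefetch degree) from least to most aggressive (assumed values)
    LEVELS = [(4, 1), (8, 1), (16, 2), (32, 4), (64, 4)]

    def __init__(self):
        self.level = 2
        self.useful = self.issued = 0                      # prefetches issued / later demanded
        self.pollution_evictions = self.demand_misses = 0  # demand misses caused by prefetches

    def end_of_interval(self):
        accuracy = self.useful / self.issued if self.issued else 0.0
        pollution = (self.pollution_evictions / self.demand_misses
                     if self.demand_misses else 0.0)
        if accuracy > 0.75 and pollution < 0.05:
            self.level = min(self.level + 1, len(self.LEVELS) - 1)  # be more aggressive
        elif accuracy < 0.40 or pollution > 0.25:
            self.level = max(self.level - 1, 0)                     # throttle down
        self.useful = self.issued = 0
        self.pollution_evictions = self.demand_misses = 0
        return self.LEVELS[self.level]            # configuration for the next interval
```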

342 citations


Journal ArticleDOI
TL;DR: It is demonstrated that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

319 citations


Proceedings ArticleDOI
21 Mar 2007
TL;DR: The design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units are described and reductions in cross-chip cache accesses are demonstrated.
Abstract: The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses. In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we are able to demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.

289 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions, and provides the best results on almost all evaluation metrics for different cache sizes.
Abstract: This paper presents Cooperative Cache Partitioning (CCP) to allocate cache resources among threads concurrently running on CMPs. Unlike cache partitioning schemes that use a single spatial partition repeatedly throughout a stable program phase, CCP resolves cache contention with multiple time-sharing partitions. Time-sharing cache resources among partitions allows each thrashing thread to speed up dramatically in at least one partition by unfairly shrinking other threads' capacity allocations, while improving fairness by giving different partitions an equal chance to execute. Quality-of-Service (QoS) is guaranteed over the long term by orchestrating the shrink and expansion of each thread's capacity across partitions to bound the average slowdown. Time-sharing based cache partitioning is further integrated with CMP cooperative caching [6] to exploit the benefits of LRU-based latency optimizations, which leads to a simplified partitioning algorithm and better performance for workloads that do not benefit from cache partitioning. We evaluate the effectiveness of CCP by simulating a 4-core CMP running all combinations of 7 representative SPEC2000 benchmarks. For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions. Overall, CCP provides the best results on almost all evaluation metrics for different cache sizes.
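A minimal sketch of the time-sharing idea summarized above: instead of holding one static way-partition, the controller rotates through several spatial partitions so each thread periodically gets an enlarged share. The partition vectors and epoch handling below are made-up assumptions; the real CCP derives them from per-thread speedup estimates and QoS constraints.

```python
from itertools import cycle

# Three partitions of a 16-way cache among threads t0..t2; each epoch favors one
# thread so that a thrashing thread can speed up in at least one partition.
PARTITIONS = cycle([
    {"t0": 10, "t1": 3, "t2": 3},
    {"t0": 3, "t1": 10, "t2": 3},
    {"t0": 3, "t1": 3, "t2": 10},
])

def next_epoch_allocation(total_ways=16):
    alloc = next(PARTITIONS)
    assert sum(alloc.values()) == total_ways
    return alloc    # fed to a way-partitioned cache for the duration of one epoch

print(next_epoch_allocation())
```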

280 citations


Patent
06 Dec 2007
TL;DR: In this article, an apparatus, system, and method are described for solid-state storage as cache for high-capacity, non-volatile storage, built around a cache front-end module and a cache back-end module.
Abstract: An apparatus, system, and method are disclosed for solid-state storage as cache for high-capacity, non-volatile storage. The apparatus, system, and method are provided with a plurality of modules including a cache front-end module and a cache back-end module. The cache front-end module manages data transfers associated with a storage request. The data transfers between a requesting device and solid-state storage function as cache for one or more HCNV storage devices, and the data transfers may include one or more of data, metadata, and metadata indexes. The solid-state storage may include an array of non-volatile, solid-state data storage elements. The cache back-end module manages data transfers between the solid-state storage and the one or more HCNV storage devices.

238 citations


Proceedings ArticleDOI
15 Sep 2007
TL;DR: A new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP) by ensuring that the application runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated.
Abstract: We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application's performance is determined by the behaviour of its co-runners, i.e., other applications simultaneously running with it. This performance dependency is caused by unfair, co-runner-dependent cache allocation on CMPs. Poor performance isolation interferes with the operating system's control over priority enforcement and hinders QoS provisioning. Previous solutions required modifications to the hardware. We present a new software solution. Our cache-fair algorithm ensures that the application runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated. If the thread executes fewer instructions per cycle than it would under fair cache allocation, the scheduler increases that thread's CPU time slice. This way, the thread's overall performance does not suffer because it is allowed to use the CPU longer. We describe our implementation of the algorithm in Solaris 10, and show that it significantly improves performance isolation for SPEC CPU, SPEC JBB and TPC-C.
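A sketch of the compensation step described above: if a thread's measured IPC falls below an estimate of its IPC under fair cache allocation, grant it a proportionally longer CPU time slice. The fair-cache IPC estimation (the hard part of the paper) is taken as an input here, and the clamping bounds are illustrative assumptions.

```python
def adjust_time_slice(base_slice_ms, measured_ipc, fair_cache_ipc):
    """Return the CPU slice a cache-fair scheduler might grant this thread."""
    if fair_cache_ipc <= 0 or measured_ipc <= 0:
        return base_slice_ms
    ratio = fair_cache_ipc / measured_ipc     # >1 means the thread is being short-changed
    ratio = min(max(ratio, 0.5), 2.0)         # cap compensation so co-runners are not starved
    return base_slice_ms * ratio

# A thread achieving 0.8 IPC when it would get 1.0 IPC under fair cache
# allocation receives a 25% longer slice.
print(adjust_time_slice(10.0, measured_ipc=0.8, fair_cache_ipc=1.0))   # 12.5
```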

232 citations


Patent
23 Mar 2007
TL;DR: A cache includes an object cache layer and a byte cache layer, each configured to store information to storage devices included in the cache appliance as mentioned in this paper, and an application proxy layer may also be included.
Abstract: A cache includes an object cache layer and a byte cache layer, each configured to store information to storage devices included in the cache appliance. An application proxy layer may also be included. In addition, the object cache layer may be configured to identify content that should not be cached by the byte cache layer, which itself may be configured to compress contents of the object cache layer. In some cases the contents of the byte cache layer may be stored as objects within the object cache.

Journal ArticleDOI
TL;DR: This work presents the first quantitative, analytical results for the predictability of replacement policies, and introduces three metrics, evict, fill, and mls that capture aspects of cache-state predictability.
Abstract: Hard real-time systems must obey strict timing constraints. Therefore, one needs to derive guarantees on the worst-case execution times of a system's tasks. In this context, predictable behavior of system components is crucial for the derivation of tight and thus useful bounds. This paper presents results about the predictability of common cache replacement policies. To this end, we introduce three metrics, evict, fill, and mls that capture aspects of cache-state predictability. A thorough analysis of the LRU, FIFO, MRU, and PLRU policies yields the respective values under these metrics. To the best of our knowledge, this work presents the first quantitative, analytical results for the predictability of replacement policies. Our results support empirical evidence in static cache analysis.

Journal ArticleDOI
TL;DR: This work shows that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over 73% of its user time stalled on CPU cache misses, and explores the use of prefetching to improve its cache performance.
Abstract: Hash join algorithms suffer from extensive CPU cache stalls. This article shows that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over 80% of its user time stalled on CPU cache misses, and explores the use of CPU cache prefetching to improve its cache performance. Applying prefetching to hash joins is complicated by the data dependencies, multiple code paths, and inherent randomness of hashing. We present two techniques, group prefetching and software-pipelined prefetching, that overcome these complications. These schemes achieve 1.29--4.04X speedups for the join phase and 1.37--3.49X speedups for the partition phase over GRACE and simple prefetching approaches. Moreover, compared with previous cache-aware approaches (i.e. cache partitioning), the schemes are at least 36% faster on large relations and do not require exclusive use of the CPU cache to be effective. Finally, comparing the elapsed real times when disk I/Os are in the picture, our cache prefetching schemes achieve 1.12--1.84X speedups for the join phase and 1.06--1.60X speedups for the partition phase over the GRACE hash join algorithm.
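A hedged sketch of the group prefetching idea: process probe tuples in small groups and issue the prefetch for every tuple's hash bucket before any bucket is visited, so the cache misses within a group overlap. Python has no prefetch instruction, so the prefetch stage below is only a placeholder access; in C it would be a __builtin_prefetch on the bucket address. The data layout is an illustrative assumption.

```python
def probe_with_group_prefetching(probe_keys, buckets, group_size=8):
    """buckets: list of lists of (key, value) pairs forming a chained hash table."""
    results = []
    for start in range(0, len(probe_keys), group_size):
        group = probe_keys[start:start + group_size]
        slots = [hash(k) % len(buckets) for k in group]
        for s in slots:
            _ = buckets[s]          # stage 1: stands in for prefetch(&buckets[s])
        for key, s in zip(group, slots):
            # stage 2: by now each bucket would be cache-resident in the real code
            results.extend(v for (k, v) in buckets[s] if k == key)
    return results

table = [[] for _ in range(16)]
for k, v in [(1, "a"), (17, "b"), (2, "c")]:
    table[hash(k) % 16].append((k, v))
print(probe_with_group_prefetching([1, 2, 17, 5], table))   # ['a', 'c', 'b']
```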

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
Abstract: In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

Patent
22 Feb 2007
TL;DR: In this article, cache coherency circuitry ensures that data accessed by each processing unit is up-to-date and has snoop indication circuitry whose content is derived from the already-provided segment filtering data.
Abstract: Each of plural processing units has a cache, and each cache has indication circuitry containing segment filtering data. The indication circuitry responds to an address specified by an access request from an associated processing unit to reference the segment filtering data to indicate whether the data is either definitely not stored or is potentially stored in that segment. Cache coherency circuitry ensures that data accessed by each processing unit is up-to-date and has snoop indication circuitry whose content is derived from the already-provided segment filtering data. For certain access requests, the cache coherency circuitry initiates a coherency operation during which the snoop indication circuitry determines whether any of the caches requires a snoop operation. For each cache that does, the cache coherency circuitry issues a notification to that cache identifying the snoop operation to be performed.

Proceedings Article
17 Jun 2007
TL;DR: The design shows that when the VM evicts pages in the LRU order, the employment of the hypervisor cache does not introduce any additional I/O overhead in the system and the proposed scheme was implemented on the Xen para-virtualization platform.
Abstract: Virtual machine (VM) memory allocation and VM consolidation can benefit from the prediction of VM page miss rate at each candidate memory size. Such prediction is challenging for the hypervisor (or VM monitor) due to a lack of knowledge on VM memory access pattern. This paper explores the approach that the hypervisor takes over the management for part of the VM memory and thus all accesses that miss the remaining VM memory can be transparently traced by the hypervisor. For online memory access tracing, its overhead should be small compared to the case that all allocated memory is directly managed by the VM. To save memory space, the hypervisor manages its memory portion as an exclusive cache (i.e., containing only data that is not in the remaining VM memory). To minimize I/O overhead, evicted data from a VM enters its cache directly from VM memory (as opposed to entering from the secondary storage). We guarantee the cache correctness by only caching memory pages whose current contents provably match those of corresponding storage locations. Based on our design, we show that when the VM evicts pages in the LRU order, the employment of the hypervisor cache does not introduce any additional I/O overhead in the system. We implemented the proposed scheme on the Xen para-virtualization platform. Our experiments with microbenchmarks and four real data-intensive services (SPECweb99, index searching, TPC-C, and TPC-H) illustrate the overhead of our hypervisor cache and the accuracy of cache-driven VM page miss rate prediction. We also present the results on adaptive VM memory allocation with performance assurance.
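The cache-driven miss-rate prediction above relies on observing, in LRU order, the accesses that miss the VM's remaining memory. A compact way to illustrate the underlying bookkeeping is the classic LRU stack-distance construction sketched below: each traced reference records its reuse distance, and the miss count at a candidate size S is the number of references with distance of at least S plus the cold misses. This is a generic illustration under that LRU assumption, not the paper's implementation.

```python
from collections import OrderedDict

def miss_counts_at_sizes(page_trace, sizes):
    """Predict LRU miss counts at several memory sizes from one pass over a trace."""
    stack = OrderedDict()      # LRU stack, most recently used page at the end
    histogram = {}             # reuse distance -> number of references
    cold_misses = 0
    for page in page_trace:
        if page in stack:
            distance = len(stack) - list(stack).index(page) - 1   # pages above it
            histogram[distance] = histogram.get(distance, 0) + 1
            stack.move_to_end(page)
        else:
            cold_misses += 1
            stack[page] = None
    return {s: cold_misses + sum(n for d, n in histogram.items() if d >= s)
            for s in sizes}

# With 4 pages of memory this trace misses 4 times; with only 2 pages, 8 times.
print(miss_counts_at_sizes([1, 2, 3, 1, 2, 3, 4, 1], sizes=[2, 3, 4]))
```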

Proceedings ArticleDOI
10 Feb 2007
TL;DR: This work proposes a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically and shows that this scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.
Abstract: The significant speed-gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme continuously estimates the effect of increasing or decreasing the shared partition size on the overall performance. We show that our scheme outperforms a private and a shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper proposes OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache, and introduces the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible.
Abstract: Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a general-purpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex. This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system. The permissions-only cache efficiently maintains the coherence permissions, but not the data, for blocks read or written transactionally that have been evicted from the processor's caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. OneTM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In OneTM-Serialized, an overflowed transaction simply stalls all other threads in the application. In OneTM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining OneTM's core simplifying assumption.

Proceedings ArticleDOI
01 Oct 2007
TL;DR: This work proposes to directly predict reuse distances via instruction-based (PC) prediction and use this information for cache-level optimizations, and evaluates the reuse-distance based replacement policy of the L2 cache using a subset of the most memory-intensive SPEC2000 benchmarks.
Abstract: Several cache management techniques have been proposed that indirectly try to base their decisions on cacheline reuse-distance, like Cache Decay, which is a postdiction of reuse-distances: if a cacheline has not been accessed for some "decay interval" we know that its reuse-distance is at least as large as this decay interval. In this work, we propose to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between the LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reuse-distance based replacement policy using a subset of the most memory-intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
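A sketch of instruction-based (PC-indexed) reuse-distance prediction as the abstract describes it: each load PC keeps a small predictor of the reuse distance of the lines it touches, and replacement evicts the line whose predicted next reuse lies furthest in the future. The table organization, averaging update, and the tiny example values are illustrative assumptions.

```python
from collections import namedtuple

CacheLine = namedtuple("CacheLine", ["tag", "pc", "last_access_time"])

class PCReuseDistancePredictor:
    def __init__(self):
        self.table = {}                       # load PC -> predicted reuse distance

    def train(self, pc, observed_distance):
        old = self.table.get(pc, observed_distance)
        self.table[pc] = (3 * old + observed_distance) // 4   # slow-moving average

    def predict(self, pc):
        return self.table.get(pc, 0)

def choose_victim(lines, predictor, now):
    # Evict the line whose predicted next reuse lies furthest in the future
    # (last access time plus the reuse distance predicted for its filling PC).
    return max(lines, key=lambda ln: ln.last_access_time + predictor.predict(ln.pc) - now)

pred = PCReuseDistancePredictor()
pred.train(pc=0x400100, observed_distance=8)
pred.train(pc=0x400200, observed_distance=4000)    # streaming load: distant reuse
lines = [CacheLine("A", 0x400100, 90), CacheLine("B", 0x400200, 99)]
print(choose_victim(lines, pred, now=100).tag)     # "B": predicted to be reused latest
```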

Proceedings ArticleDOI
01 Dec 2007
TL;DR: The tagless hit instruction cache (TH-IC) is proposed, a technique for completely eliminating the performance penalty associated with filter caches, as well as a further reduction in energy consumption due to not having to access the tag array on cache hits.
Abstract: Very small instruction caches have been shown to greatly reduce fetch energy. However, for many applications the use of a small filter cache can lead to an unacceptable increase in execution time. In this paper, we propose the Tagless Hit Instruction Cache (TH-IC), a technique for completely eliminating the performance penalty associated with filter caches, as well as a further reduction in energy consumption due to not having to access the tag array on cache hits. Using a few metadata bits per line, we are able to more efficiently track the cache contents and guarantee when hits will occur in our small TH-IC. When a hit is not guaranteed, we can instead fetch directly from the L1 instruction cache, eliminating any additional cycles due to a TH-IC miss. Experimental results show that the overall processor energy consumption can be significantly reduced due to the faster application running time and the elimination of tag comparisons for most of the accesses.

Proceedings ArticleDOI
09 Jun 2007
TL;DR: Overall, the results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed, and it is shown that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations.
Abstract: There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.

Patent
30 Jul 2007
TL;DR: In this paper, a mechanism for selectively disabling and enabling read caching based on past performance of the cache and current read/write requests is proposed to improve overall performance by using an autonomic algorithm to disable read caching.
Abstract: A mechanism for selectively disabling and enabling read caching based on past performance of the cache and current read/write requests. The system improves overall performance by using an autonomic algorithm to disable read caching for regions of backend disk storage (i.e., the backstore) that have had historically low cache hit ratios. The result is that more cache becomes available for workloads with larger hit ratios, and less time and machine cycles are spent searching the cache for data that is unlikely to be there.
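A sketch of the idea in this patent abstract: track per-region read hit ratios and stop read-caching regions of the backstore whose historical hit ratio stays below a threshold, so the cache space goes to workloads with larger hit ratios. Region granularity, the threshold, and the minimum sample count are illustrative assumptions; re-enabling after workload behavior changes is omitted.

```python
class RegionReadCacheGovernor:
    def __init__(self, low_hit_ratio=0.05, min_lookups=1000):
        self.low_hit_ratio = low_hit_ratio
        self.min_lookups = min_lookups
        self.stats = {}            # region id -> (hits, lookups)
        self.disabled = set()

    def record_lookup(self, region, hit):
        hits, lookups = self.stats.get(region, (0, 0))
        self.stats[region] = (hits + (1 if hit else 0), lookups + 1)

    def should_cache_reads(self, region):
        hits, lookups = self.stats.get(region, (0, 0))
        if lookups >= self.min_lookups and hits / lookups < self.low_hit_ratio:
            self.disabled.add(region)     # historically low hit ratio: stop read caching
        return region not in self.disabled
```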

Patent
25 May 2007
TL;DR: A multicore processor comprises a plurality of cache memories and a plurality of processor cores, each core associated with one of the cache memories; at least some of the cache memories maintain a portion in which each cache line is dynamically managed as either local to the associated processor core or shared among multiple processor cores.
Abstract: A multicore processor comprises a plurality of cache memories, and a plurality of processor cores, each associated with one of the cache memories. Each of at least some of the cache memories is configured to maintain at least a portion of the cache memory in which each cache line is dynamically managed as either local to the associated processor core or shared among multiple processor cores.

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This work extends the widely-used CACTI cache modeling tool to take network design parameters into account and proposes novel cache access optimizations that introduce heterogeneity within the inter-bank network to alleviate the interconnect delay bottleneck.
Abstract: The ever-increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures, or NUCA). Most studies on NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search. Since wire/router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power. We extend the widely-used CACTI cache modeling tool to take network design parameters into account. With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies. To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network. The careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC, and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement.

Patent
27 Dec 2007
TL;DR: In this paper, a method and apparatus for providing priority aware and consumption guided dynamic probabilistic allocation for a cache memory is described, where allocation probabilities for each priority level are updated based on the measured consumption/utilization.
Abstract: A method and apparatus are herein described for providing priority-aware and consumption-guided dynamic probabilistic allocation for a cache memory. Utilization of a sample size of a cache memory is measured for each priority level of a computer system. Allocation probabilities for each priority level are updated based on the measured consumption/utilization, i.e. allocation is reduced for priority levels consuming too much of the cache and allocation is increased for priority levels consuming too little of the cache. In response to an allocation request, the request is assigned a priority level. An allocation probability associated with the priority level is compared with a randomly generated number. If the number is less than the allocation probability, then a fill to the cache is performed normally. In contrast, a spatially or temporally limited fill is performed if the random number is greater than the allocation probability.
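A sketch of the allocation decision the abstract describes: each priority level carries an allocation probability that is periodically adjusted from measured cache consumption, and a fill is admitted normally only if a random draw falls below that probability. Target shares and the adjustment step are illustrative assumptions.

```python
import random

class PriorityAwareAllocator:
    def __init__(self, target_share, step=0.05):
        self.target = dict(target_share)        # priority level -> desired cache fraction
        self.prob = {lvl: 1.0 for lvl in self.target}
        self.step = step

    def update_from_sample(self, measured_share):
        for lvl, share in measured_share.items():
            if share > self.target[lvl]:        # consuming too much: allocate less often
                self.prob[lvl] = max(0.0, self.prob[lvl] - self.step)
            else:                               # consuming too little: allocate more often
                self.prob[lvl] = min(1.0, self.prob[lvl] + self.step)

    def admit_normally(self, level):
        # True: perform a normal fill; False: fall back to a spatially or
        # temporally limited fill, as the abstract describes.
        return random.random() < self.prob[level]

alloc = PriorityAwareAllocator({"high": 0.6, "low": 0.4})
alloc.update_from_sample({"high": 0.3, "low": 0.7})
print(alloc.prob, alloc.admit_normally("low"))
```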

Journal ArticleDOI
TL;DR: The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624 μm² cell in a 65-nm 8-metal technology to minimize both leakage and dynamic power.
Abstract: The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624 μm² cell in a 65-nm 8-metal technology. Low power techniques are implemented in the L3 cache to minimize both leakage and dynamic power. Sleep transistors are used in the SRAM array and peripherals, reducing the cache leakage by more than 2X. Only 0.8% of the cache is powered up for a cache access. Dynamic cache line disable (Intel Cache Safe Technology) with a history buffer protects the cache from latent defects and infant mortality failures.

Patent
Menahem Lasser
10 Jan 2007
TL;DR: A flash memory device includes a storage area having a main memory portion and a cache memory portion storing at least one bit per cell less than the main memory, and a controller that manages data transfer between the cache memory and main memory according to a caching command received from a host.
Abstract: A flash memory device includes a storage area having a main memory portion and a cache memory portion storing at least one bit per cell less than the main memory portion; and a controller that manages data transfer between the cache memory portion and the main memory portion according to at least one caching command received from a host. The management of data transfer, by the controller, includes transferring new data from the host to the cache memory portion, copying the data from the cache memory portion to the main memory portion, and controlling (enabling/disabling) the scheduling of cache cleaning operations.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: This work proposes a novel replacement strategy that mimics the replacement decisions of OPT and can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup.
Abstract: The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly higher misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with a simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and guiding the replacement decisions in MC. Our proposed organization can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup. Comparison with the dynamic insertion policy, a victim buffer, a V-Way cache and an LRU based fully associative cache demonstrates that our scheme performs better than all these strategies.

Proceedings Article
13 Feb 2007
TL;DR: Karma is presented, a global non-centralized, dynamic and informed management policy for multiple levels of cache that leverages application hints to make informed allocation and replacement decisions in all cache levels, preserving exclusive caching and adjusting to changes in access patterns.
Abstract: Multilevel caching, common in many storage configurations, introduces new challenges to traditional cache management: data must be kept in the appropriate cache and replication avoided across the various cache levels. Some existing solutions focus on avoiding replication across the levels of the hierarchy, working well without information about temporal locality (information missing at all but the highest level of the hierarchy). Others use application hints to influence cache contents. We present Karma, a global non-centralized, dynamic and informed management policy for multiple levels of cache. Karma leverages application hints to make informed allocation and replacement decisions in all cache levels, preserving exclusive caching and adjusting to changes in access patterns. We show the superiority of Karma through comparison to existing solutions including LRU, 2Q, ARC, MultiQ, LRU-SP, and Demote, demonstrating better cache performance than all other solutions and up to 85% better performance than LRU on representative workloads.

Proceedings ArticleDOI
01 Oct 2007
TL;DR: This paper investigates the impact of introducing a low latency, large capacity and high bandwidth DRAM-based cache between the last level SRAM cache and memory subsystem, and identifies the most efficient DRAM cache organization.
Abstract: As dual-core and quad-core processors arrive in the marketplace, the momentum behind CMP architectures continues to grow strong. As more and more cores/threads are placed on-die, the pressure on the memory subsystem is rapidly increasing. To address this issue, we explore DRAM cache architectures for CMP platforms. In this paper, we investigate the impact of introducing a low latency, large capacity and high bandwidth DRAM-based cache between the last level SRAM cache and memory subsystem. We first show the potential benefits of large DRAM caches for key commercial server workloads. As the primary hurdle to achieving these benefits with DRAM caches is the tag space overheads associated with them, we identify the most efficient DRAM cache organization and investigate various options. Our results show that the combination of 8-bit partial tags and 2-way sectoring achieves the highest performance (20% to 70%) with the lowest tag space (<25%) overhead.
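A hedged sketch of why partial tags shrink the on-die tag overhead for a large DRAM cache, as evaluated above: only a few low-order tag bits are kept on die, so a mismatch proves a miss cheaply, while a partial match still needs the full tag (kept with the data in the DRAM cache) to confirm the hit. The structure below is an illustrative single set and ignores sectoring.

```python
class PartialTagSet:
    def __init__(self, ways=8, partial_bits=8):
        self.mask = (1 << partial_bits) - 1
        self.partial = [None] * ways      # small on-die partial tags (the filter)
        self.full = [None] * ways         # stands in for full tags kept in DRAM

    def fill(self, way, tag):
        self.partial[way] = tag & self.mask
        self.full[way] = tag

    def lookup(self, tag):
        for way, p in enumerate(self.partial):
            if p is not None and p == (tag & self.mask):
                # Candidate hit: confirm against the full tag (a DRAM access).
                return "hit" if self.full[way] == tag else "false match"
        return "miss"                     # no partial match: miss proven on-die

s = PartialTagSet()
s.fill(0, 0x12345)
print(s.lookup(0x12345), s.lookup(0xAB345), s.lookup(0x777))  # hit, false match, miss
```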