
Showing papers on "Smart Cache published in 2010"


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP) that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server, and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10%, respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9%, respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
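
To make the replacement policy concrete, here is a minimal Python sketch of SRRIP for a single cache set, assuming the paper's 2-bit re-reference prediction values (RRPV); the class layout and hit-promotion-to-zero are illustrative choices, not the authors' hardware design.

```python
# Illustrative SRRIP simulation for one cache set (not the authors' code).
# Each block carries a 2-bit re-reference prediction value (RRPV):
# 0 = re-reference predicted soon, 3 = predicted in the distant future.

class SRRIPSet:
    MAX_RRPV = 3  # 2-bit counter

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = {}  # tag -> RRPV

    def access(self, tag):
        if tag in self.blocks:
            self.blocks[tag] = 0  # hit: predict near-immediate re-reference
            return True
        if len(self.blocks) >= self.num_ways:
            # Age all blocks until one predicts a distant re-reference.
            while not any(v == self.MAX_RRPV for v in self.blocks.values()):
                for t in self.blocks:
                    self.blocks[t] += 1
            victim = next(t for t, v in self.blocks.items() if v == self.MAX_RRPV)
            del self.blocks[victim]
        # Insert with a long (but not distant) prediction; scanned-once
        # blocks therefore age out without displacing the working set.
        self.blocks[tag] = self.MAX_RRPV - 1
        return False

s = SRRIPSet(num_ways=4)
for t in ["a", "b", "a", "c", "d", "e", "a"]:
    print(t, "hit" if s.access(t) else "miss")
```

For thrash resistance, DRRIP dynamically chooses between this static insertion policy and a bimodal variant (the paper uses set dueling for that choice).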

715 citations


Journal ArticleDOI
TL;DR: Power7 is IBM's seventh-generation Power chip: a balanced multicore design using eDRAM technology and four-way SMT (SMT4), delivering greater than 4X the performance of the previous generation in the same power envelope.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations


Proceedings ArticleDOI
04 Dec 2010
TL;DR: The zcache is presented, a cache design that allows much higher associativity than the number of physical ways, and it is shown that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
Abstract: The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that makes it possible to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
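
As a rough illustration of how associativity can exceed the number of ways, the sketch below models a zcache-style lookup and one level of replacement-candidate expansion; the hash functions, sizes, and single expansion level are assumptions for illustration, not the paper's design.

```python
# Illustrative zcache-style structure (one candidate-expansion level only;
# a real zcache walks further levels off the critical path).
import hashlib

class ZCache:
    def __init__(self, ways=4, sets_per_way=64):
        self.ways = ways
        self.sets = sets_per_way
        self.array = [[None] * sets_per_way for _ in range(ways)]

    def _hash(self, way, addr):
        # Stand-in for the per-way hardware hash functions.
        digest = hashlib.sha256(f"{way}:{addr}".encode()).digest()
        return int.from_bytes(digest[:4], "little") % self.sets

    def lookup(self, addr):
        # Hits, the common case, probe one location per way, like a
        # conventional low-way-count lookup.
        return any(self.array[w][self._hash(w, addr)] == addr
                   for w in range(self.ways))

    def replacement_candidates(self, addr):
        # On a miss, the blocks in the addressed locations are candidates,
        # and so is every block reachable by relocating one of them,
        # cuckoo-style, to its alternative locations.
        candidates = set()
        for w in range(self.ways):
            blk = self.array[w][self._hash(w, addr)]
            if blk is None:
                continue
            candidates.add(blk)
            for w2 in range(self.ways):
                other = None if w2 == w else self.array[w2][self._hash(w2, blk)]
                if other is not None:
                    candidates.add(other)
        return candidates
```

With 4 ways, a single expansion level already yields up to 4 + 4x3 = 16 replacement candidates, while hits still probe only 4 locations.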

203 citations


Patent
07 Jul 2010
TL;DR: In this paper, a security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content, using rules such as one determining whether the cache content has a modified time set in the future and one determining whether the cache content was created in a low-security environment.
Abstract: A security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content. Cache content is identified as suspicious according to security rules, such as a rule determining whether the cache content has a modified time set in the future, and a rule determining whether the cache content was created in a low-security environment. The security module may establish an out-of-band connection with the websites from which the cache content originated through a high-security access network to receive responses from the websites, and use the responses to determine whether the cache content is suspicious cache content. Suspicious cache content is removed from the network cache to prevent the suspicious cache content from carrying out malicious activities.
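
A hypothetical sketch of the rule-based inspection the patent describes; the rule set, entry fields, and network labels here are invented for illustration.

```python
# Hypothetical rule checks over network-cache entries (illustrative only).
import time

LOW_SECURITY_NETWORKS = {"public-wifi"}  # assumed label for such environments

def is_suspicious(entry):
    # Rule: a modified time set in the future.
    if entry["modified_time"] > time.time():
        return True
    # Rule: content created in a low-security environment.
    if entry["origin_network"] in LOW_SECURITY_NETWORKS:
        return True
    return False

cache = [
    {"url": "http://example.com/a.js", "modified_time": time.time() + 86400,
     "origin_network": "corporate"},
    {"url": "http://example.com/b.js", "modified_time": time.time() - 100,
     "origin_network": "public-wifi"},
]
# Remove suspicious content so it cannot carry out malicious activities.
cache = [e for e in cache if not is_suspicious(e)]
print(len(cache), "entries remain")  # 0: both entries trip a rule
```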

168 citations


Patent
25 Mar 2010
TL;DR: In this article, a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network is described, where object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored.
Abstract: There is described a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network. User requests for data objects are received at caches in the cache domain. A notification is sent from each cache at which a request is received to a cache manager. The notification reports the user request and identifies the requested data object. At the cache manager, object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored. At the cache manager, objects for distribution within the cache domain are identified on the basis of the object information. Instructions are sent from the cache manager to the caches to distribute data objects stored in those caches between themselves. The objects are classified into classes according to popularity, the classes including a high popularity class comprising objects which should be distributed to all caches in the cache domain, a medium popularity class comprising objects which should be distributed to a subset of the caches in the cache domain, and a low popularity class comprising objects which should not be distributed.
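
The three-class split lends itself to a short sketch; the thresholds and class labels below are assumptions, since the patent does not fix concrete values.

```python
# Illustrative popularity classification at the cache manager
# (thresholds are assumed, not from the patent).
def classify_objects(request_counts, high=100, low=5):
    plan = {}
    for obj, count in request_counts.items():
        if count >= high:
            plan[obj] = "distribute_to_all_caches"   # high popularity
        elif count >= low:
            plan[obj] = "distribute_to_subset"       # medium popularity
        else:
            plan[obj] = "do_not_distribute"          # low popularity
    return plan

print(classify_objects({"video1": 250, "video2": 20, "video3": 2}))
```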

137 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: A new, cost-effective architecture - SieveStore - which enables the use of solid-state media to significantly filter access to storage ensembles and which achieves significantly higher hit ratios while using only 1/7th the number of SSD drives.
Abstract: Emerging solid-state storage media can significantly improve storage performance and energy. However, the high cost-per-byte of solid-state media has hindered wide-spread adoption in servers. This paper proposes a new, cost-effective architecture - SieveStore - which enables the use of solid-state media to significantly filter access to storage ensembles. Our paper makes three key contributions. First, we make a case for highly-selective, storage-ensemble-level disk-block caching based on the highly-skewed block popularity distribution and based on the dynamic nature of the popular block set. Second, we identify the problem of allocation-writes and show that selective cache allocation to reduce allocation-writes - sieving - is fundamental to enable efficient ensemble-level disk-caching. Third, we propose two practical variants of SieveStore. Based on week-long block access traces from a storage ensemble of 13 servers, we find that the two components (sieving and ensemble-level caching) each contribute to SieveStore's cost-effectiveness. Compared to unsieved, ensemble-level disk-caches, SieveStore achieves significantly higher hit ratios (35%-50% more, on average) while using only 1/7th the number of SSD drives. Further, ensemble-level caching is strictly better in cost-performance compared to per-server caching.
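
The sieving idea, admitting a block into the SSD cache only once it has proven popular so cold blocks never incur allocation-writes, can be sketched as follows; the counter table and threshold are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sieve in front of an SSD block cache (not the paper's code).
from collections import Counter

class SieveCache:
    def __init__(self, capacity, sieve_threshold=3):
        self.capacity = capacity
        self.threshold = sieve_threshold
        self.access_counts = Counter()  # the sieve: a small counter table
        self.cache = set()              # blocks resident on the SSD

    def access(self, block):
        if block in self.cache:
            return "hit"
        self.access_counts[block] += 1
        # Only blocks that cross the popularity threshold earn an
        # allocation-write to the SSD.
        if (self.access_counts[block] >= self.threshold
                and len(self.cache) < self.capacity):
            self.cache.add(block)
        return "miss"

c = SieveCache(capacity=1000)
print([c.access("blk7") for _ in range(4)])  # miss, miss, miss, hit
```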

129 citations


Patent
30 Jul 2010
TL;DR: In this paper, an apparatus, system, and method for redundant write caching are described. The authors do not specify a hardware implementation of the redundancy mechanism, only a plurality of modules, including a write request module, a first cache write module, a second cache write module, and a trim module.
Abstract: An apparatus, system, and method are disclosed for redundant write caching. The apparatus, system, and method are provided with a plurality of modules including a write request module, a first cache write module, a second cache write module, and a trim module. The write request module detects a write request to store data on a storage device. The first cache write module writes data of the write request to a first cache. The second cache write module writes the data to a second cache. The trim module trims the data from one of the first cache and the second cache in response to an indicator that the storage device stores the data. The data remains available in the other of the first cache and the second cache to service read requests.
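
A minimal sketch of the module flow the patent describes: mirror the write into two caches, then trim one copy once the backing store holds the data. All names here are invented for illustration.

```python
# Illustrative redundant write caching (names are hypothetical).
class RedundantWriteCache:
    def __init__(self, storage):
        self.first = {}    # first cache
        self.second = {}   # second (redundant) cache
        self.storage = storage

    def write(self, key, data):
        # Data is held redundantly until the storage device stores it.
        self.first[key] = data
        self.second[key] = data
        self.storage[key] = data      # synchronous here; async in practice
        self.trim(key)                # indicator: the storage has the data

    def trim(self, key):
        # Drop one copy; the other remains available to service reads.
        self.second.pop(key, None)

    def read(self, key):
        if key in self.first:
            return self.first[key]
        if key in self.second:
            return self.second[key]
        return self.storage[key]

backing = {}
cache = RedundantWriteCache(backing)
cache.write("block0", b"payload")
print(cache.read("block0"))
```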

128 citations


Patent
Kenneth P. Der1
28 Sep 2010
TL;DR: In this article, the authors propose a technique to protect host data by storing the block of data as a dirty cache block in a local cache of the local computerized node and performing a set of external caching operations to cache the set of sub-blocks.
Abstract: A technique protects host data. The technique involves receiving, at a local computerized node, a block of data from a host computer, the block of data including data sub-blocks. The technique further involves storing the block of data, as a dirty cache block, in a local cache of the local computerized node. The technique further involves performing a set of external caching operations to cache a set of sub-blocks in a set of external computerized nodes in communication with the local computerized node. Each external caching operation caches a respective sub-block of the set of sub-blocks in a cache of a respective external computerized node. The set of sub-blocks includes (i) the data sub-blocks of the block of data from the host and (ii) a set of checksums derived from the data sub-blocks of the block of data from the host.

124 citations


Proceedings ArticleDOI
09 Jan 2010
TL;DR: This paper presents a systematic measurement of the influence of CMP cache sharing on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, and considering a number of potentially important factors at the program, OS, and architecture levels; the measurement shows some surprising results.
Abstract: Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.

123 citations


Patent
12 Aug 2010
TL;DR: In this paper, a distributed data cache is used to expedite the retrieval of data for application execution by a server in a content delivery network, where the distributed cache is distributed across computer-readable storage media included in a plurality of servers.
Abstract: A distributed data cache included in a content delivery network expedites retrieval of data for application execution by a server in a content delivery network. The distributed data cache is distributed across computer-readable storage media included in a plurality of servers in the content delivery network. When an application generates a query for data, a server in the content delivery network determines whether the distributed data cache includes data associated with the query. If data associated with the query is stored in the distributed data cache, the data is retrieved from the distributed data cache. If the distributed data cache does not include data associated with the query, the data is retrieved from a database and the query and associated data are stored in the distributed data cache to expedite subsequent retrieval of the data when the application issues the same query.
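
The query flow described here is essentially the cache-aside pattern; the sketch below uses an in-process dict where the patent distributes the cache across servers, and the function names are illustrative.

```python
# Cache-aside lookup as described: check the distributed cache first,
# fall back to the database, then populate the cache for next time.
def query_with_cache(query, cache, database):
    if query in cache:
        return cache[query]          # served from the distributed data cache
    result = database(query)         # cache miss: go to the database
    cache[query] = result            # store query + data for reuse
    return result

db_calls = []
def fake_db(q):
    db_calls.append(q)
    return f"rows for {q!r}"

cache = {}
query_with_cache("SELECT name FROM users", cache, fake_db)
query_with_cache("SELECT name FROM users", cache, fake_db)
print(len(db_calls))  # 1: the repeated query never reached the database
```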

120 citations


Proceedings ArticleDOI
28 Mar 2010
TL;DR: CAMP estimates the performance degradation due to cache contention of processes running on CMPs and provides an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system modification, or additional hardware.
Abstract: The ongoing move to chip multiprocessors (CMPs) permits greater sharing of last-level cache by processor cores, but this sharing aggravates the cache contention problem, potentially undermining performance improvements. Accurately modeling the impact of inter-process cache contention on performance and power consumption is required for optimized process assignment. However, techniques based on exhaustive consideration of process-to-processor mappings and cycle-accurate simulation are inefficient or intractable for CMPs, which often permit a large number of potential assignments. This paper proposes CAMP, a fast and accurate shared-cache-aware performance model for multi-core processors. CAMP estimates the performance degradation due to cache contention of processes running on CMPs. It uses reuse distance histograms, cache access frequencies, and the relationship between the throughput and cache miss rate of each process to predict its effective cache size when running concurrently and sharing cache with other processes, allowing instruction throughput estimation. We also provide an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system (OS) modification, or additional hardware. We tested the accuracy of CAMP using 55 different combinations of 10 SPEC CPU2000 benchmarks on a dual-core CMP machine. The average throughput prediction error was 1.57%.
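
Models like CAMP build on the textbook relation between reuse distance and misses: in a fully associative LRU cache of C blocks, a reference with reuse distance d hits iff d < C. The sketch below shows only that base relation, with invented histogram values; the paper's full model additionally predicts each process's effective cache size under contention.

```python
# Miss-rate estimate from a reuse distance histogram (base relation only).
def estimated_miss_rate(reuse_histogram, effective_cache_blocks):
    # reuse_histogram maps reuse distance -> reference count;
    # distance -1 marks cold (first-touch) references.
    total = sum(reuse_histogram.values())
    misses = sum(count for dist, count in reuse_histogram.items()
                 if dist < 0 or dist >= effective_cache_blocks)
    return misses / total

histogram = {-1: 10, 2: 50, 8: 30, 64: 10}      # invented profile
print(estimated_miss_rate(histogram, effective_cache_blocks=16))  # 0.2
```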

Patent
15 Dec 2010
TL;DR: In this paper, a multi-tiered cache manager and methods for managing multi-tier cache are described, which causes cached data to be initially stored in RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements.
Abstract: A multi-tiered cache manager and methods for managing multi-tiered cache are described. The multi-tiered cache manager causes cached data to be initially stored in the RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements. Each flash element is organized as a plurality of write blocks having a block size, and a predefined maximum number of writes is permitted to each write block. The portions of the cached data may be selected based on a maximum write rate calculated from the maximum number of writes allowed for the flash device and a specified lifetime of the cache system.
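
The write-rate calculation in the abstract is a simple endurance budget; the worked example below uses invented numbers to show the shape of the formula.

```python
# Maximum sustainable RAM->flash migration rate from flash endurance and
# the cache system's specified lifetime (all concrete values are assumed).
max_writes_per_block = 10_000          # permitted writes per write block
num_write_blocks = 1_000_000           # write blocks across flash elements
lifetime_seconds = 5 * 365 * 86_400    # specified 5-year cache lifetime

max_write_rate = max_writes_per_block * num_write_blocks / lifetime_seconds
print(f"budget: {max_write_rate:.1f} write blocks/second to flash")
```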

Journal ArticleDOI
TL;DR: The design and implementation of cooperative cache in wireless P2P networks are presented, and a novel asymmetric cooperative cache approach is proposed, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data.
Abstract: Some recent studies have shown that cooperative cache can improve the system performance in wireless P2P networks such as ad hoc networks and mesh networks. However, all these studies are at a very high level, leaving many design and implementation issues unanswered. In this paper, we present our design and implementation of cooperative cache in wireless P2P networks, and propose solutions to find the best place to cache the data. We propose a novel asymmetric cooperative cache approach, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data. This solution not only reduces the overhead of copying data between the user space and the kernel space, it also allows data pipelines to reduce the end-to-end delay. We also study the effects of different MAC layers, such as 802.11-based ad hoc networks and multi-interface-multichannel-based mesh networks, on the performance of cooperative cache. Our results show that the asymmetric approach outperforms the symmetric approach in traditional 802.11-based ad hoc networks by removing most of the processing overhead. In mesh networks, the asymmetric approach can significantly reduce the data access delay compared to the symmetric approach due to data pipelines.

Patent
20 Dec 2010
TL;DR: Cache management techniques for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content, are described in this paper.
Abstract: Cache management techniques are described for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content. A preferred cache size may be calculated for one or more cache devices in the CDN, for example, based on a maximum cache memory size, a bandwidth availability associated with the CDN, and a title dispersion calculation determined by the user requests within the CDN. After establishing the cache with a set of assets (e.g., video content), an asset replacement algorithm may be executed at one or more cache devices in the CDN. When a determination is made that a new asset should be added to a full cache, a multi-factor comparative analysis may be performed on the assets currently residing in the cache, comparing the popularity and size of assets and combinations of assets, along with other factors, to determine which assets should be replaced in the cache device.

Proceedings ArticleDOI
13 Apr 2010
TL;DR: DProf helps programmers understand cache miss costs by attributing misses to data types instead of code, and introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads.
Abstract: Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output. DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16-57% throughput improvement on a range of memcached and Apache workloads.

Patent
Scott Howard Davis1
26 Apr 2010
TL;DR: In this article, the content aware cache filter component of the hypervisor of the server is implemented in an I/O virtualization layer that sits above a file system layer, such that any file system protocol may be implemented in the file system layer.
Abstract: A server supporting the implementation of virtual machines includes a local memory used for caching, such as a solid state device drive. During I/O intensive processes, such as a boot storm, a “content aware” cache filter component of the hypervisor of the server first accesses a cache structure in a content cache device to determine whether data blocks have been stored in the cache structure prior to requesting the data blocks from a networked disk array via a standard I/O stack of the hypervisor. The content aware cache filter component is implemented in an I/O virtualization layer of the standard I/O stack that sits above a file system layer of the standard I/O stack, such that any file system protocol may be implemented in the file system layer.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: Increasing cache efficiency can improve performance by reducing miss rate, or alternately, improve power and energy by allowing a smaller cache with the same miss rate.
Abstract: Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing miss rate, or alternately, improve power and energy by allowing a smaller cache with the same miss rate. This paper proposes using predicted dead blocks to hold blocks evicted from other sets. When these evicted blocks are referenced again, the access can be satisfied from the other set, avoiding a costly access to main memory. The pool of predicted dead blocks can be thought of as a virtual victim cache. For a set of memory-intensive single-threaded workloads, a virtual victim cache in a 16-way set associative 2MB L2 cache reduces misses by 26%, yields a geometric mean speedup of 12.1%, and improves cache efficiency by 27% on average, where cache efficiency is defined as the average time during which cache blocks contain live information. This virtual victim cache yields a lower average miss rate than a fully-associative LRU cache of the same capacity. For a set of multi-core workloads, the virtual victim cache improves throughput performance by 4% over LRU while improving cache efficiency by 62%. Alternately, a 1.7MB virtual victim cache achieves about the same performance as a larger 2MB L2 cache, reducing the number of SRAM cells required by 16%, thus maintaining performance while reducing power and area.
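
The mechanism can be sketched compactly: a block evicted from one set is parked in a frame of another set that a dead-block predictor has flagged. The predictor below is a stub flag; the paper derives it from a trace-based prediction scheme.

```python
# Illustrative virtual-victim-cache placement (predictor is stubbed out).
class CacheSet:
    def __init__(self, ways):
        self.tags = [None] * ways
        self.predicted_dead = [False] * ways  # set by a dead-block predictor

def park_evicted_block(evicted_tag, partner_set):
    # Put the victim into a predicted-dead frame of the partner set so a
    # later reference can be served without a costly memory access.
    for i, dead in enumerate(partner_set.predicted_dead):
        if dead:
            partner_set.tags[i] = evicted_tag
            partner_set.predicted_dead[i] = False
            return True
    return False  # no dead frame available: normal eviction to memory

s = CacheSet(ways=4)
s.predicted_dead[2] = True
print(park_evicted_block("0xBEEF", s))  # True: victim retained on chip
```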

Patent
Wenchu Cen1
19 May 2010
TL;DR: In this article, the cache cluster is configurable in an active cluster configuration mode, where the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality.
Abstract: Processing cache data includes sending a cache processing request to a master cache service node in a cache cluster that includes a plurality of cache service nodes, the cache cluster being configurable in an active cluster configuration mode wherein the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality of cache service nodes, or in a standby cluster configuration mode, wherein the master cache service node is the only node among the plurality of cache service nodes that is in working state. It further includes waiting for a response from the master cache service node, determining whether the master cache service node has failed; and in the event that the master cache service node has failed, selecting a backup cache service node.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper proposes an optimal algorithm and a heuristic approach that use a temporal reuse profile to determine the most beneficial memory blocks to lock in the cache; the heuristic provides significant improvement over the state-of-the-art locking algorithm in terms of both performance and efficiency.
Abstract: The performance of most embedded systems is critically dependent on the average memory access latency. Improving the cache hit rate can have significant positive impact on the performance of an application. Modern embedded processors often feature cache locking mechanisms that allow memory blocks to be locked in the cache under software control. Cache locking was primarily designed to offer timing predictability for hard real-time applications. Hence, the compiler optimization techniques focus on employing cache locking to improve worst-case execution time. However, cache locking can be quite effective in improving the average-case execution time of general embedded applications as well. In this paper, we explore static instruction cache locking to improve average-case program performance. We introduce the temporal reuse profile to accurately and efficiently model the cost and benefit of locking memory blocks in the cache. We propose an optimal algorithm and a heuristic approach that use the temporal reuse profile to determine the most beneficial memory blocks to be locked in the cache. Experimental results show that the locking heuristic achieves close to optimal results and can improve the cache miss rate by up to 24% across a suite of real-world benchmarks. Moreover, our heuristic provides significant improvement compared to the state-of-the-art locking algorithm in terms of both performance and efficiency.
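
In the spirit of the paper's heuristic, though much simplified, one can rank blocks by the misses a lock would avoid according to the temporal reuse profile and lock the top candidates; the profile numbers below are invented, and the real algorithms also account for the interference a locked way causes.

```python
# Greedy lock selection from a temporal reuse profile (simplified sketch).
def choose_blocks_to_lock(reuse_benefit, lockable_ways):
    # reuse_benefit: memory block -> estimated misses avoided if locked.
    ranked = sorted(reuse_benefit.items(), key=lambda kv: kv[1], reverse=True)
    return [blk for blk, benefit in ranked[:lockable_ways] if benefit > 0]

profile = {"blkA": 1200, "blkB": 40, "blkC": 900, "blkD": 0}
print(choose_blocks_to_lock(profile, lockable_ways=2))  # ['blkA', 'blkC']
```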

Patent
17 Sep 2010
TL;DR: In this paper, a cache device comprising a non-volatile storage device performs cache functions for a backing store, and a modified cache policy is implemented for the cache device in response to a risk of data loss exceeding a threshold risk level.
Abstract: Apparatuses, systems, and methods are disclosed for implementing a cache policy. A method may include determining a risk of data loss on a cache device. The cache device may comprise a non-volatile storage device configured to perform cache functions for a backing store. The cache device may implement a cache policy. A method may include determining that the risk of data loss on the cache device exceeds a threshold risk level. A method may include implementing a modified cache policy for the cache device in response to the risk of data loss exceeding the threshold risk level. The modified cache policy may reduce the risk of data loss below the threshold level.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points, and shows that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior.
Abstract: This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transaction-based benchmarks. The results show that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70% for per-core caches and an average of 90% for shared caches.
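
One of the paper's adjustments can be sketched directly: when another thread's write invalidates an address, remove it from the reuse stack so the next reference is treated as a miss. The list-based stack below is an illustrative simplification of the analysis data structures.

```python
# Reuse stack that accounts for coherence invalidations (simplified).
class ReuseStack:
    def __init__(self):
        self.stack = []  # most recently used address at the end

    def reference(self, addr):
        if addr in self.stack:
            # Reuse distance = distinct addresses touched since last use.
            distance = len(self.stack) - 1 - self.stack.index(addr)
            self.stack.remove(addr)
        else:
            distance = None  # cold reference, or a previously invalidated one
        self.stack.append(addr)
        return distance

    def invalidate(self, addr):
        # Another core wrote addr: its next reuse here must count as a miss.
        if addr in self.stack:
            self.stack.remove(addr)

rs = ReuseStack()
rs.reference("x"); rs.reference("y")
rs.invalidate("x")
print(rs.reference("x"))  # None: the invalidation turned the reuse into a miss
```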

Patent
27 Jan 2010
TL;DR: In this paper, a processor may include several processor cores, each including a respective higher-level cache; a lower-level cache including several tag units, each including several controllers, where each controller corresponds to a respective cache bank configured to store data, and where the controllers are concurrently operable to access their respective cache banks.
Abstract: A processor may include several processor cores, each including a respective higher-level cache; a lower-level cache including several tag units each including several controllers, where each controller corresponds to a respective cache bank configured to store data, and where the controllers are concurrently operable to access their respective cache banks; and an interconnect network configured to convey data between the cores and the lower-level cache. The controllers may share access to an interconnect egress port coupled to the interconnect network, and may generate multiple concurrent requests to convey data via the shared port, where each of the requests is destined for a corresponding core, and where a datapath width of the port is less than a combined width of the multiple requests. The given tag unit may arbitrate among the controllers for access to the shared port, such that the requests are transmitted to corresponding cores serially rather than concurrently.

Patent
Jiang Lin1, Lixin Zhang1
19 Aug 2010
TL;DR: In this paper, a mechanism is provided in a virtual machine monitor for providing cache partitioning in virtualized environments, which assigns a virtual identification (ID) to each virtual machine in the virtualized environment.
Abstract: A mechanism is provided in a virtual machine monitor for providing cache partitioning in virtualized environments. The mechanism assigns a virtual identification (ID) to each virtual machine in the virtualized environment. The processing core stores the virtual ID of the virtual machine in a special register. The mechanism also creates an entry for the virtual machine in a partition table. The mechanism may partition a shared cache using a vertical (way) partition and/or a horizontal partition. The entry in the partition table includes a vertical partition control and a horizontal partition control. For each cache access, the virtual machine passes the virtual ID along with the address to the shared cache. If the cache access results in a miss, the shared cache uses the partition table to select a victim cache line for replacement.
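
Victim selection under the patent's partition table can be sketched as follows; the two-VM way split and the LRU ordering are illustrative assumptions.

```python
# Way-partitioned victim selection keyed by virtual machine ID (sketch).
partition_table = {
    0: {0, 1, 2, 3},   # VM 0 may evict only ways 0-3 (vertical partition)
    1: {4, 5, 6, 7},   # VM 1 may evict only ways 4-7
}

def select_victim(virtual_id, lru_order):
    # lru_order lists way indices from least- to most-recently used.
    allowed = partition_table[virtual_id]
    for way in lru_order:
        if way in allowed:
            return way  # LRU way within this VM's partition
    raise ValueError("virtual ID has no ways assigned")

print(select_victim(1, lru_order=[2, 5, 0, 7, 4, 1, 3, 6]))  # way 5
```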

Proceedings ArticleDOI
01 Apr 2010
TL;DR: ESP-NUCA synergistically integrates victims and replicas, making it possible to take advantage of multiple readers for shared data and to maximize cache usage under unbalanced core utilization, and leads to stable behavior within the whole system across a broad spectrum of working scenarios.
Abstract: This paper introduces a cost-effective cache architecture called Enhanced Shared-Private Non-Uniform Cache Architecture (ESP-NUCA), which is suitable for high-performance Chip MultiProcessors (CMPs). This architecture enhances system stability by combining the advantages of private and shared caches. Starting from a shared NUCA, ESP-NUCA introduces a low-cost mechanism to dynamically allocate private cache blocks closer to their owner processor. In this way, average on-chip access latency is reduced and inter-core interference minimized. ESP-NUCA synergistically integrates victims and replicas, thus making it possible to take advantage of multiple readers for shared data, and to maximize cache usage under unbalanced core utilization. This architecture leads to stable behavior within the whole system across a broad spectrum of working scenarios. ESP-NUCA not only outperforms architectures with similar implementation costs such as private and shared caches by up to 20% and 40% respectively, but even outperforms much costlier architectures such as D-NUCA [13] by up to 28%, Adaptive Selective Replication [3] by up to 19%, and Cooperative Caching [5] by up to 15%. Moreover, performance variance throughout the set of benchmarks is 37% lower than with ASR, 87% lower than with D-NUCA, and 43% lower than with Cooperative Caching.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: A classification of applications into four cache usage categories is introduced; the paper discusses how applications from different categories affect each other's performance indirectly through cache sharing, and devises a scheme to optimize such sharing.
Abstract: Contention for shared cache resources has been recognized as a major bottleneck for multicores--especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy. We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance up to 35% on x86 multicore hardware.

Proceedings Article
01 Jan 2010
TL;DR: This work infirms the assumption that the current cache design performs well when cores run completely independent tasks, showing that even then there exist dependencies arising from running on the same chip and using the same cache, which cause the standard caching algorithms to underperform.
Abstract: Almost all of the modern computers use multiple cores, and the number of cores is expected to increase as hardware prices go down and Moore's law fails to hold. Most of the theoretical algorithmic work so far has focused on the setting where multiple cores are performing the same task. Indeed, one is tempted to assume that when the cores are independent then the current design performs well. This work infirms this assumption by showing that even when the cores run completely independent tasks, there exist dependencies arising from running on the same chip, and using the same cache. These dependencies cause the standard caching algorithms to underperform. To address the new challenge, we revisit some aspects of the classical caching design. More specifically, we focus on the page replacement policy of the first cache shared between all the cores (usually the L2 cache). We make the simplifying assumption that since the cores are running independent tasks, they are accessing disjoint memory locations (in particular, this means that maintaining coherency is not an issue). We show that, even under this simplifying assumption, the multicore case is fundamentally different than the single-core case. In particular: (1) LRU performs poorly, even with resource augmentation; and (2) the offline version of the caching problem is NP-complete. Any attempt to design an efficient cache for a multicore machine in which the cores may access the same memory has to perform well also in this simpler setting. We provide some intuition as to what an efficient solution could look like by (1) partly characterizing the offline solution, showing that it is determined by the part of the cache which is devoted to each core at every timestep, and (2) presenting a PTAS for the offline problem, for some range of the parameters. In recent years, multicore caching has been the subject of extensive experimental research. The conclusions of some of these works are that LRU is inefficient in practice. The heuristics which they propose to replace it are based on dividing the cache between cores, and handling each part independently. Our work can be seen as a theoretical explanation of the results of these experiments.

Patent
18 Feb 2010
TL;DR: In this paper, the authors propose a method and apparatus for filtering memory probe activity for writes in a distributed shared memory computer, where the cache data block is evicted and stored in a remote cache.
Abstract: A method and apparatus for filtering memory probe activity for writes in a distributed shared memory computer. In one embodiment, the method may include assigning an uncached directory state to a cache data block in response to evicting the cache data block. In another embodiment, the method may include assigning a remote directory state to a cache data block in response to evicting the cache data block and storing it in a remote cache. In a third embodiment, the method may include assigning a pairwise-shared directory state in response to a second processor node initiating a load operation to a cache data block in a modified cache state in a first processor node. In a fourth embodiment, the method may include assigning a migratory directory state in response to a processor node initiating a store operation to a cache data block in a pairwise-shared cache state.

Journal ArticleDOI
TL;DR: DiCo-CMP is presented, a novel cache coherence protocol especially suited to future many-core tiled CMP architectures that reduces the miss latency compared to a directory protocol by sending requests directly to the cache that provides the block in a cache miss.
Abstract: Future many-core CMP designs that will integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make it impractical to use a bus or a crossbar as the on-chip interconnection network, and tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make it impractical to rely on broadcasts (as, for example, Token-CMP does) or any other brute-force method for keeping cache coherence, and directory-based cache coherence protocols are currently being employed. Unfortunately, directory protocols introduce indirection to access directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future many-core tiled CMP architectures. In DiCo-CMP, the task of storing up-to-date sharing information and ensuring ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. Therefore, DiCo-CMP reduces the miss latency compared to a directory protocol by sending requests directly to the cache that provides the block in a cache miss. These latency reductions result in improvements in execution time of up to 6 percent, on average, over a directory protocol. In comparison with Token-CMP, our protocol sends only one request message for each cache miss and is thus able to reduce network traffic by 43 percent.

Patent
07 Sep 2010
TL;DR: In this paper, techniques for using an intermediate cache between the shared cache of an application and the non-volatile storage of a storage system are described, where the caching policies used to populate the intermediate cache are intelligent, taking into account factors that may include which object an item belongs to, the item type, a characteristic of the item, or the type of operation in which the item is involved.
Abstract: Techniques are provided for using an intermediate cache between the shared cache of an application and the non-volatile storage of a storage system. The application may be any type of application that uses a storage system to persistently store data. The intermediate cache may be local to the machine upon which the application is executing, or may be implemented within the storage system. In one embodiment where the application is a database server, the database system includes both a DB server-side intermediate cache and a storage-side intermediate cache. The caching policies used to populate the intermediate cache are intelligent, taking into account factors that may include which object an item belongs to, the item's type, a characteristic of the item, or the type of operation in which the item is involved.

Journal ArticleDOI
TL;DR: This work proposes a technique which leverages configurable data caches to address the problem of energy inefficiency and intertask interference in multitasking embedded systems, and introduces a profile-based, off-line algorithm, which identifies a beneficial cache partitioning.
Abstract: We propose a technique that leverages configurable data caches to address the problem of energy inefficiency and intertask interference in multitasking embedded systems. Data caches are often necessary to provide the required memory bandwidth. However, caches introduce two important problems for embedded systems. Caches contribute to a significant amount of power as they typically occupy a large part of the chip and are accessed frequently. In nanometer technologies, such large structures contribute significantly to the total leakage power as well. Additionally, cache outcomes in multitasking environments are notoriously difficult to predict, if not impossible, thus resulting in poor real-time guarantees. We study the effect of multiprogramming workloads on the data cache in a preemptive multitasking environment, and propose a technique which leverages configurable cache architectures to not only eliminate intertask cache interference, but also to significantly reduce both dynamic and leakage power. By mapping tasks to different cache partitions, interference is completely eliminated. Dynamic and leakage power are significantly reduced as only a subset of the cache is active at any moment. We introduce a profile-based, off-line algorithm, which identifies a beneficial cache partitioning. The OS configures the data cache during context-switch by activating the corresponding partition. Our experiments on a large set of multitasking benchmarks demonstrate that our technique not only efficiently eliminates intertask interference, but also significantly reduces both dynamic and leakage power.