
Showing papers on "Cache pollution published in 2010"


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP) that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
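
For illustration, the 2-bit mechanism can be sketched as follows. This is a minimal sketch of SRRIP for one cache set, assuming 16-way associativity and the hit/insertion behavior described in the abstract; it is not the authors' implementation.

#include <array>
#include <cstdint>

// Minimal sketch of 2-bit SRRIP for a single cache set (illustrative only).
// RRPV = Re-Reference Prediction Value; kMaxRrpv means "predicted distant re-reference".
struct SrripSet {
    static constexpr int kWays = 16;        // assumed associativity
    static constexpr uint8_t kMaxRrpv = 3;  // 2-bit counter

    std::array<uint64_t, kWays> tag{};
    std::array<bool, kWays> valid{};
    std::array<uint8_t, kWays> rrpv{};

    // On a hit, predict a near-immediate re-reference.
    void onHit(int way) { rrpv[way] = 0; }

    // Victim selection: pick a block predicted to be re-referenced in the distant
    // future (RRPV == max); if none exists, age every block and retry.
    int findVictim() {
        for (;;) {
            for (int w = 0; w < kWays; ++w)
                if (!valid[w] || rrpv[w] == kMaxRrpv) return w;
            for (int w = 0; w < kWays; ++w) ++rrpv[w];
        }
    }

    // On insertion, predict a "long" re-reference interval (max - 1), so blocks
    // brought in by scans that are never reused age out quickly.
    void insert(uint64_t newTag) {
        int victim = findVictim();
        tag[victim] = newTag;
        valid[victim] = true;
        rrpv[victim] = kMaxRrpv - 1;
    }
};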

715 citations


Journal ArticleDOI
TL;DR: The seventh-generation Power chip combines a balanced multi-core design, eDRAM technology, and four-way simultaneous multithreading (SMT4) to deliver greater than 4X the performance of the previous generation in the same power envelope.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations


Proceedings ArticleDOI
04 Dec 2010
TL;DR: This paper presents the zcache, a cache design that allows much higher associativity than the number of physical ways, and shows that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
Abstract: The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows us to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
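
A minimal sketch of the lookup and candidate-expansion idea follows, assuming 4 ways, one hash function per way, and a one-level expansion of candidates; the hash functions, sizes, and data layout are illustrative and are not the paper's design.

#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch of zcache-style lookup and candidate expansion.
// As in a skew-associative cache, each way has its own hash function.
struct ZCacheSketch {
    static constexpr int kWays = 4;
    static constexpr int kSetsPerWay = 1024;  // assumed sizing
    // array[way][index] holds a block address (0 = empty slot).
    std::vector<std::vector<uint64_t>> array =
        std::vector<std::vector<uint64_t>>(kWays, std::vector<uint64_t>(kSetsPerWay, 0));

    size_t hashOf(int way, uint64_t addr) const {
        static const uint64_t mult[kWays] = {0x9E3779B97F4A7C15ull, 0xC2B2AE3D27D4EB4Full,
                                             0x165667B19E3779F9ull, 0x27D4EB2F165667C5ull};
        return static_cast<size_t>((addr * mult[way]) % kSetsPerWay);
    }

    // A hit needs only a single parallel lookup: one position per way.
    bool lookup(uint64_t addr) const {
        for (int w = 0; w < kWays; ++w)
            if (array[w][hashOf(w, addr)] == addr) return true;
        return false;
    }

    // On a miss, the replacement candidates are the kWays blocks the address maps to,
    // plus (off the critical path) the positions those blocks could be relocated to in
    // their other ways, yielding many more candidates than physical ways.
    std::vector<std::pair<int, size_t>> replacementCandidates(uint64_t addr) const {
        std::vector<std::pair<int, size_t>> cands;
        for (int w = 0; w < kWays; ++w) {
            size_t idx = hashOf(w, addr);
            cands.push_back({w, idx});
            uint64_t resident = array[w][idx];
            if (resident == 0) continue;
            for (int w2 = 0; w2 < kWays; ++w2)
                if (w2 != w) cands.push_back({w2, hashOf(w2, resident)});
        }
        return cands;  // up to 4 first-level + 12 second-level candidates here
    }
};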

203 citations


Patent
08 Sep 2010
TL;DR: An apparatus, system, and method are disclosed for caching data using a solid-state storage device, where metadata indicates what data in the cache is valid, as well as what data from the nonvolatile cache has been stored in the backing store.
Abstract: An apparatus, system, and method are disclosed for caching data using a solid-state storage device. The solid-state storage device maintains metadata pertaining to cache operations performed on the solid-state storage device, as well as storage operations of the solid-state storage device. The metadata indicates what data in the cache is valid, as well as information about what data in the nonvolatile cache has been stored in the backing store. A backup engine works through units in the nonvolatile cache device and backs up the valid data to the backing store. During grooming operations, the groomer determines whether the data is valid and whether the data is discardable. Data that is both valid and discardable may be removed during the grooming operation. The groomer may also determine whether the data is cold in determining whether to remove the data from the cache device. The cache device may present to clients a logical space that is the same size as the backing store. The cache device may be transparent to the clients.

196 citations


Patent
07 Jul 2010
TL;DR: A security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content, such as a rule determining whether the cache content has a modified-time set in the future and a rule determining whether the cache content was created in a low-security environment.
Abstract: A security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content. Cache content is identified as suspicious according to security rules, such as a rule determining whether the cache content has a modified-time set in the future and a rule determining whether the cache content was created in a low-security environment. The security module may establish an out-of-band connection with the websites from which the cache content originated through a high security access network to receive responses from the websites, and use the responses to determine whether the cache content is suspicious cache content. Suspicious cache content is removed from the network cache to prevent the suspicious cache content from carrying out malicious activities.
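
A minimal sketch of the kind of rules described is shown below; the entry fields, rule names, and removal logic are hypothetical illustrations of the patent's idea, not its claimed implementation.

#include <chrono>
#include <string>
#include <vector>

// Hypothetical cache entry with just the attributes the rules above inspect.
struct CacheEntry {
    std::string url;
    std::chrono::system_clock::time_point modifiedTime;
    bool lowSecurityOrigin;  // true if the content was cached on a low-security network
};

// Rule: a modified-time set in the future is suspicious.
bool hasFutureModifiedTime(const CacheEntry& e) {
    return e.modifiedTime > std::chrono::system_clock::now();
}

// Rule: content created in a low-security environment is suspicious.
bool fromLowSecurityEnvironment(const CacheEntry& e) {
    return e.lowSecurityOrigin;
}

// Entries flagged by any rule are removed from the network cache.
std::vector<CacheEntry> removeSuspicious(const std::vector<CacheEntry>& cache) {
    std::vector<CacheEntry> kept;
    for (const auto& e : cache)
        if (!hasFutureModifiedTime(e) && !fromLowSecurityEnvironment(e))
            kept.push_back(e);
    return kept;
}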

168 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: This work presents a technique for using solid-state storage as a caching layer between RAM and hard disks in database management systems; by caching frequently accessed data, disk I/O is reduced.
Abstract: High-end solid state disks (SSDs) provide much faster access to data compared to conventional hard disk drives. We present a technique for using solid-state storage as a caching layer between RAM and hard disks in database management systems. By caching data that is accessed frequently, disk I/O is reduced. For random I/O, the potential performance gains are particularly significant. Our system continuously monitors the disk access patterns to identify hot regions of the disk. Temperature statistics are maintained at the granularity of an extent, i.e., 32 pages, and are kept current through an aging mechanism. Unlike prior caching methods, once the SSD is populated with pages from warm regions, cold pages are not admitted into the cache, leading to low levels of cache pollution. Simulations based on DB2 I/O traces and a prototype implementation within DB2 both show substantial performance improvements.
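
The extent-level temperature bookkeeping could look roughly like the following sketch; the decay factor and warm threshold are assumptions, while the 32-page extent granularity comes from the abstract.

#include <cstdint>
#include <unordered_map>

// Sketch of extent-level temperature tracking with aging and a warm-only admission policy.
class TemperatureTracker {
public:
    static constexpr uint64_t kPagesPerExtent = 32;

    // Record a disk access: bump the temperature of the page's extent.
    void recordAccess(uint64_t pageNumber) {
        temperature_[pageNumber / kPagesPerExtent] += 1.0;
    }

    // Periodic aging keeps the statistics current by decaying old counts.
    void age(double decayFactor = 0.5) {
        for (auto& kv : temperature_) kv.second *= decayFactor;
    }

    // Admission policy: once the SSD cache is populated, only pages from warm
    // extents are admitted, which limits cache pollution from cold pages.
    bool admit(uint64_t pageNumber, double warmThreshold) const {
        auto it = temperature_.find(pageNumber / kPagesPerExtent);
        return it != temperature_.end() && it->second >= warmThreshold;
    }

private:
    std::unordered_map<uint64_t, double> temperature_;
};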

164 citations


Journal ArticleDOI
TL;DR: This work presents a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular, and reduces the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.
Abstract: Microprocessor designers have been torn between tight constraints on the amount of on-chip cache memory and the high latency of off-chip memory, such as dynamic random access memory. Accessing off-chip memory generally takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. It is not possible to determine whether compression at levels of the memory hierarchy closest to the processor is beneficial without understanding its costs. Furthermore, as we show in this paper, raw compression ratio is not always the most important metric. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation. Experiments comparing our work to previous work are described.

161 citations


01 Jan 2010
TL;DR: The key idea of the proposed technique is to aggressively send out writeback requests that are expected to hit in DRAM row buffers before they would normally be evicted by the last-level cache replacement policy and have the DRAM controller service as many writes as possible together.
Abstract: Read and write requests from a processor contend for the main memory data bus. System performance depends heavily on when read requests are serviced since they are required for an application’s forward progress whereas writes do not need to be performed immediately. However, writes eventually have to be written to memory because the storage required to buffer them on-chip is limited. In modern high bandwidth DDR (Double Data Rate)-based memory systems write requests significantly interfere with the servicing of read requests by delaying the more critical read requests and by causing the memory bus to become idle when switching between the servicing of a write and read request. This interference significantly degrades overall system performance. We call this phenomenon write-caused interference. To reduce write-caused interference, this paper proposes a new last-level cache writeback policy, called DRAM-aware writeback. The key idea of the proposed technique is to aggressively send out writeback requests that are expected to hit in DRAM row buffers before they would normally be evicted by the last-level cache replacement policy and have the DRAM controller service as many writes as possible together. Doing so not only reduces the amount of time to service writes by improving their row buffer locality but also reduces the idle bus cycles wasted due to switching between the servicing of a write and a read request. DRAM-aware writeback improves system performance by 7.1% and 12.8% on single and 4-core systems respectively. The performance benefits of the mechanism increase in systems with prefetching since such systems have higher contention between reads and writes in the DRAM system.
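
The row-buffer-locality idea can be sketched as follows; the DRAM row size, the flat view of the last-level cache, and the function names are assumptions for illustration rather than the paper's mechanism as implemented.

#include <cstdint>
#include <vector>

// Sketch of the core idea: when one dirty line is written back, also send writebacks
// for other dirty cached lines mapping to the same DRAM row, so the controller can
// service them together as row-buffer hits.
struct CacheLine {
    uint64_t address;
    bool dirty;
};

constexpr uint64_t kRowSizeBytes = 8 * 1024;  // assumed DRAM row size
inline uint64_t rowOf(uint64_t addr) { return addr / kRowSizeBytes; }

// Collect the addresses to write back together: the evicted line plus its dirty row-mates.
std::vector<uint64_t> dramAwareWritebacks(const CacheLine& evicted,
                                          const std::vector<CacheLine>& llcLines) {
    std::vector<uint64_t> writebacks{evicted.address};
    for (const auto& line : llcLines)
        if (line.dirty && line.address != evicted.address &&
            rowOf(line.address) == rowOf(evicted.address))
            writebacks.push_back(line.address);  // expected to hit in the open DRAM row
    return writebacks;
}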

145 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome, and shows that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved.
Abstract: In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC cpu2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.

143 citations


Proceedings ArticleDOI
08 Mar 2010
TL;DR: This paper models the timing, energy, endurance, and area of PCM caches, integrates the models into a PCM cache simulator to evaluate the proposed techniques, and shows that the design can achieve 8% energy savings and a lifetime of 3.8 years compared with a baseline PCM cache.
Abstract: Phase change memory (PCM) is one of the most promising technologies among emerging non-volatile random access memory technologies. Implementing a cache memory using PCM provides many benefits such as high density, non-volatility, low leakage power, and high immunity to soft error. However, its disadvantages such as high write latency, high write energy, and limited write endurance prevent it from being used as a drop-in replacement of an SRAM cache. In this paper, we study a set of techniques to design an energy- and endurance-aware PCM cache. We also modeled the timing, energy, endurance, and area of PCM caches and integrated them into a PCM cache simulator to evaluate the techniques. Experiments show that our PCM cache design can achieve 8% energy savings and a lifetime of 3.8 years, compared with a baseline PCM cache having less than an hour of lifetime.

140 citations


Patent
25 Mar 2010
TL;DR: In this article, a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network is described, where object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored.
Abstract: There is described a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network. User requests for data objects are received at caches in the cache domain. A notification is sent from each cache at which a request is received to a cache manager. The notification reports the user request and identifies the requested data object. At the cache manager, object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored. At the cache manager, objects for distribution within the cache domain are identified on the basis of the object information. Instructions are sent from the cache manager to the caches to distribute data objects stored in those caches between themselves. The objects are classified into classes according to popularity, the classes including a high popularity class comprising objects which should be distributed to all caches in the cache domain, a medium popularity class comprising objects which should be distributed to a subset of the caches in the cache domain, and a low popularity class comprising objects which should not be distributed.
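
A minimal sketch of the three-way popularity classification follows; the thresholds are hypothetical, since the patent does not specify concrete values.

#include <cstdint>

// Sketch of popularity-based classification for distribution within the cache domain.
enum class PopularityClass { High, Medium, Low };

PopularityClass classify(uint64_t requestFrequency,
                         uint64_t highThreshold = 1000,
                         uint64_t mediumThreshold = 100) {
    if (requestFrequency >= highThreshold) return PopularityClass::High;      // distribute to all caches
    if (requestFrequency >= mediumThreshold) return PopularityClass::Medium;  // distribute to a subset
    return PopularityClass::Low;                                              // do not distribute
}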

Patent
02 Feb 2010
TL;DR: In this article, a persistent cache is implemented in a flash memory that includes a journal section that stores metadata and a low frequency section and a high frequency section that store data entries.
Abstract: A persistent cache is implemented in a flash memory that includes a journal section that stores metadata, and a low frequency section and a high frequency section that store data entries. Writing new metadata to the persistent cache includes sequentially advancing to a next sector containing an invalid metadata entry, saving a working copy of the sector in RAM, writing metadata corresponding to one or more new data entries in the working copy, and overwriting the sector in the flash memory containing the invalid entry with the working copy. Writes to the low frequency and high frequency sections occur sequentially in the current locations of a low frequency section pointer and a high frequency section pointer, respectively. In a persistent cache, the reconstruction of a non-persistent cache utilizes the metadata entry that has the most recent timestamp.

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This work proposes Temporal Locality Aware (TLA) cache management policies to allow an inclusive LLC to be aware of the temporal locality of lines in the core caches and shows that these policies improve inclusive cache performance without requiring any additional hardware structures.
Abstract: Inclusive caches are commonly used by processors to simplify cache coherence. However, the trade-off has been lower performance compared to non-inclusive and exclusive caches. Contrary to conventional wisdom, we show that the limited performance of inclusive caches is mostly due to inclusion victims—lines that are evicted from the core caches to satisfy the inclusion property—and not the reduced cache capacity of the hierarchy due to the duplication of data. These inclusion victims are incorrectly chosen for replacement because the last-level cache (LLC) is unaware of the temporal locality of lines in the core caches. We propose Temporal Locality Aware (TLA) cache management policies to allow an inclusive LLC to be aware of the temporal locality of lines in the core caches. We propose three TLA policies: Temporal Locality Hints (TLH), Early Core Invalidation (ECI), and Query Based Selection (QBS). All three policies improve inclusive cache performance without requiring any additional hardware structures. In fact, QBS performs similar to a non-inclusive cache hierarchy.
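
As a rough illustration of Query Based Selection, the victim-selection step could be sketched as follows; the data structures and the fallback to the LRU way are simplifying assumptions, not the paper's exact policy.

#include <cstdint>
#include <functional>
#include <vector>

// Rough sketch of QBS for one LLC set: before evicting a candidate, the LLC queries
// the core caches and skips lines that are still resident there, so the eviction
// creates no inclusion victim for a line with temporal locality in the core caches.
int selectVictimQBS(const std::vector<uint64_t>& waysLruToMru,  // line addresses, LRU first
                    const std::function<bool(uint64_t)>& residentInAnyCoreCache) {
    for (size_t i = 0; i < waysLruToMru.size(); ++i)
        if (!residentInAnyCoreCache(waysLruToMru[i]))
            return static_cast<int>(i);  // evicting this way creates no inclusion victim
    return 0;  // all candidates are resident in core caches; fall back to the LRU way
}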

Patent
30 Jul 2010
TL;DR: In this paper, an apparatus, system, and method for redundant write caching is described. But the authors do not specify a hardware implementation of the redundancy mechanism, only a plurality of modules, including a write request module, a first cache write module, and a second cache write modules.
Abstract: An apparatus, system, and method are disclosed for redundant write caching. The apparatus, system, and method are provided with a plurality of modules including a write request module, a first cache write module, a second cache write module, and a trim module. The write request module detects a write request to store data on a storage device. The first cache write module writes data of the write request to a first cache. The second cache write module writes the data to a second cache. The trim module trims the data from one of the first cache and the second cache in response to an indicator that the storage device stores the data. The data remains available in the other of the first cache and the second cache to service read requests.

Patent
Kenneth P. Der1
28 Sep 2010
TL;DR: In this article, the authors propose a technique to protect host data by storing the block of data as a dirty cache block in a local cache of the local computerized node and performing a set of external caching operations to cache the set of sub-blocks.
Abstract: A technique protects host data. The technique involves receiving, at a local computerized node, a block of data from a host computer, the block of data including data sub-blocks. The technique further involves storing the block of data, as a dirty cache block, in a local cache of the local computerized node. The technique further involves performing a set of external caching operations to cache a set of sub-blocks in a set of external computerized nodes in communication with the local computerized node. Each external caching operation caches a respective sub-block of the set of sub-blocks in a cache of a respective external computerized node. The set of sub-blocks includes (i) the data sub-blocks of the block of data from the host and (ii) a set of checksums derived from the data sub-blocks of the block of data from the host.

Proceedings ArticleDOI
09 Jan 2010
TL;DR: A systematic measurement of the influence of CMP cache sharing, conducted on two kinds of commodity CMP machines using a recently released CMP benchmark suite, PARSEC, and considering a number of potentially important factors at the program, OS, and architecture levels, shows some surprising results.
Abstract: Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.

Proceedings ArticleDOI
28 Mar 2010
TL;DR: CAMP estimates the performance degradation due to cache contention of processes running on CMPs and provides an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system modification, or additional hardware.
Abstract: The ongoing move to chip multiprocessors (CMPs) permits greater sharing of last-level cache by processor cores but this sharing aggravates the cache contention problem, potentially undermining performance improvements. Accurately modeling the impact of inter-process cache contention on performance and power consumption is required for optimized process assignment. However, techniques based on exhaustive consideration of process-to-processor mappings and cycle-accurate simulation are inefficient or intractable for CMPs, which often permit a large number of potential assignments. This paper proposes CAMP, a fast and accurate shared cache aware performance model for multi-core processors. CAMP estimates the performance degradation due to cache contention of processes running on CMPs. It uses reuse distance histograms, cache access frequencies, and the relationship between the throughput and cache miss rate of each process to predict its effective cache size when running concurrently and sharing cache with other processes, allowing instruction throughput estimation. We also provide an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system (OS) modification, or additional hardware. We tested the accuracy of CAMP using 55 different combinations of 10 SPEC CPU2000 benchmarks on a dual-core CMP machine. The average throughput prediction error was 1.57%.
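
To illustrate the role of reuse distance histograms in such a model, here is a simplified sketch of the histogram-based miss-rate step under an assumed LRU approximation; the full CAMP model's iteration over throughput, access frequencies, and effective cache size is not shown.

#include <cstddef>
#include <vector>

// Simplified sketch: references whose reuse distance exceeds the process's effective
// cache size (in blocks) are predicted to miss.
double predictMissRate(const std::vector<size_t>& reuseDistanceHistogram,
                       size_t effectiveCacheBlocks) {
    size_t total = 0, misses = 0;
    for (size_t d = 0; d < reuseDistanceHistogram.size(); ++d) {
        total += reuseDistanceHistogram[d];
        if (d >= effectiveCacheBlocks) misses += reuseDistanceHistogram[d];
    }
    return total ? static_cast<double>(misses) / total : 0.0;
}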

Patent
15 Dec 2010
TL;DR: In this paper, a multi-tiered cache manager and methods for managing multi-tier cache are described, which causes cached data to be initially stored in RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements.
Abstract: A multi-tiered cache manager and methods for managing multi-tiered cache are described. The multi-tiered cache manager causes cached data to be initially stored in the RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements. Each flash element is organized as a plurality of write blocks having a block size, and a predefined maximum number of writes is permitted to each write block. The portions of the cached data may be selected based on a maximum write rate calculated from the maximum number of writes allowed for the flash device and a specified lifetime of the cache system.
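
A minimal sketch of the write-rate bound implied by the abstract follows; the formula and names are an assumption about how such a rate could be derived, not the patent's claimed calculation.

// Sketch: the flash tolerates a fixed number of writes per block, so the total
// allowable write volume is spread evenly over the cache system's specified lifetime.
double maxWriteBytesPerSecond(double flashCapacityBytes,
                              double maxWritesPerBlock,
                              double specifiedLifetimeSeconds) {
    return (flashCapacityBytes * maxWritesPerBlock) / specifiedLifetimeSeconds;
}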

Journal ArticleDOI
TL;DR: The design and implementation of cooperative cache in wireless P2P networks are presented, and a novel asymmetric cooperative cache approach is proposed, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data.
Abstract: Some recent studies have shown that cooperative cache can improve the system performance in wireless P2P networks such as ad hoc networks and mesh networks. However, all these studies are at a very high level, leaving many design and implementation issues unanswered. In this paper, we present our design and implementation of cooperative cache in wireless P2P networks, and propose solutions to find the best place to cache the data. We propose a novel asymmetric cooperative cache approach, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data. This solution not only reduces the overhead of copying data between the user space and the kernel space, it also allows data pipelines to reduce the end-to-end delay. We also study the effects of different MAC layers, such as 802.11-based ad hoc networks and multi-interface-multichannel-based mesh networks, on the performance of cooperative cache. Our results show that the asymmetric approach outperforms the symmetric approach in traditional 802.11-based ad hoc networks by removing most of the processing overhead. In mesh networks, the asymmetric approach can significantly reduce the data access delay compared to the symmetric approach due to data pipelines.

Patent
20 Dec 2010
TL;DR: Cache management techniques for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content, are described in this paper.
Abstract: Cache management techniques are described for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content. A preferred cache size may be calculated for one or more cache devices in the CDN, for example, based on a maximum cache memory size, a bandwidth availability associated with the CDN, and a title dispersion calculation determined by the user requests within the CDN. After establishing the cache with a set of assets (e.g., video content), an asset replacement algorithm may be executed at one or more cache devices in the CDN. When a determination is made that a new asset should be added to a full cache, a multi-factor comparative analysis may be performed on the assets currently residing in the cache, comparing the popularity and size of assets and combinations of assets, along with other factors, to determine which assets should be replaced in the cache device.

Proceedings ArticleDOI
13 Apr 2010
TL;DR: DProf helps programmers understand cache miss costs by attributing misses to data types instead of code, and introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads.
Abstract: Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output. DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16-57% throughput improvement on a range of memcached and Apache workloads.

Patent
Scott Howard Davis1
26 Apr 2010
TL;DR: In this article, the content aware cache filter component of the hypervisor of the server is implemented in an I/O virtualization layer that sits above a file system layer, such that any file system protocol may be implemented in the file system level.
Abstract: A server supporting the implementation of virtual machines includes a local memory used for caching, such as a solid state device drive. During I/O intensive processes, such as a boot storm, a “content aware” cache filter component of the hypervisor of the server first accesses a cache structure in a content cache device to determine whether data blocks have been stored in the cache structure prior to requesting the data blocks from a networked disk array via a standard I/O stack of the hypervisor. The content aware cache filter component is implemented in an I/O virtualization layer of the standard I/O stack that sits above a file system layer of the standard I/O stack, such that any file system protocol may be implemented in the file system layer.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: Increasing cache efficiency can improve performance by reducing miss rate, or alternately, improve power and energy by allowing a smaller cache with the same miss rate.
Abstract: Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing miss rate, or alternately, improve power and energy by allowing a smaller cache with the same miss rate. This paper proposes using predicted dead blocks to hold blocks evicted from other sets. When these evicted blocks are referenced again, the access can be satisfied from the other set, avoiding a costly access to main memory. The pool of predicted dead blocks can be thought of as a virtual victim cache. For a set of memory-intensive single-threaded workloads, a virtual victim cache in a 16-way set associative 2MB L2 cache reduces misses by 26%, yields a geometric mean speedup of 12.1% and improves cache efficiency by 27% on average, where cache efficiency is defined as the average time during which cache blocks contain live information. This virtual victim cache yields a lower average miss rate than a fully-associative LRU cache of the same capacity. For a set of multi-core workloads, the virtual victim cache improves throughput performance by 4% over LRU while improving cache efficiency by 62%. Alternately, a 1.7MB virtual victim cache achieves about the same performance as a larger 2MB L2 cache, reducing the number of SRAM cells required by 16%, thus maintaining performance while reducing power and area.
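
A simplified sketch of the virtual-victim-cache idea follows, with the dead-block predictor abstracted away and a single partner set assumed; this illustrates the mechanism rather than the paper's design.

#include <array>
#include <cstdint>

// A block evicted from its home set is kept in a predicted-dead frame of a partner set,
// and lookups that miss in the home set also probe the partner set.
struct Frame {
    uint64_t tag = 0;
    bool valid = false;
    bool predictedDead = false;  // set by a dead-block predictor (not modeled here)
};

struct VictimPairSketch {
    static constexpr int kWays = 16;
    std::array<Frame, kWays> homeSet{};
    std::array<Frame, kWays> partnerSet{};

    bool lookup(uint64_t tag) const {
        for (const auto& f : homeSet)
            if (f.valid && f.tag == tag) return true;
        for (const auto& f : partnerSet)  // "victim hit": served without going to memory
            if (f.valid && f.tag == tag) return true;
        return false;
    }

    // On eviction from the home set, try to keep the block alive in a dead frame.
    void onEvict(const Frame& evicted) {
        for (auto& f : partnerSet) {
            if (!f.valid || f.predictedDead) {
                f = evicted;
                f.predictedDead = false;
                return;
            }
        }
        // No dead frame available: the block is truly evicted.
    }
};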

Patent
Wenchu Cen1
19 May 2010
TL;DR: In this article, the cache cluster is configurable in an active cluster configuration mode, where the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality.
Abstract: Processing cache data includes sending a cache processing request to a master cache service node in a cache cluster that includes a plurality of cache service nodes, the cache cluster being configurable in an active cluster configuration mode wherein the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality of cache service nodes, or in a standby cluster configuration mode, wherein the master cache service node is the only node among the plurality of cache service nodes that is in working state. It further includes waiting for a response from the master cache service node, determining whether the master cache service node has failed; and in the event that the master cache service node has failed, selecting a backup cache service node.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: An optimal algorithm and a heuristic approach use the temporal reuse profile to determine the most beneficial memory blocks to lock in the cache; the heuristic provides significant improvements over the state-of-the-art locking algorithm in both performance and efficiency.
Abstract: The performance of most embedded systems is critically dependent on the average memory access latency. Improving the cache hit rate can have significant positive impact on the performance of an application. Modern embedded processors often feature cache locking mechanisms that allow memory blocks to be locked in the cache under software control. Cache locking was primarily designed to offer timing predictability for hard real-time applications. Hence, the compiler optimization techniques focus on employing cache locking to improve worst-case execution time. However, cache locking can be quite effective in improving the average-case execution time of general embedded applications as well. In this paper, we explore static instruction cache locking to improve average-case program performance. We introduce the temporal reuse profile to accurately and efficiently model the cost and benefit of locking memory blocks in the cache. We propose an optimal algorithm and a heuristic approach that use the temporal reuse profile to determine the most beneficial memory blocks to be locked in the cache. Experimental results show that the locking heuristic achieves close to optimal results and can improve the cache miss rate by up to 24% across a suite of real-world benchmarks. Moreover, our heuristic provides significant improvement compared to the state-of-the-art locking algorithm both in terms of performance and efficiency.
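
A greedy selection step of the kind such a heuristic might use can be sketched as follows; the net-benefit scoring and the per-way lock budget are illustrative assumptions, not the paper's algorithm.

#include <algorithm>
#include <cstdint>
#include <vector>

// Greedy sketch: rank blocks by net benefit (hits gained by locking minus misses added
// by shrinking the unlocked cache), derived from the temporal reuse profile, and lock
// the best ones.
struct BlockProfile {
    uint64_t blockAddr;
    long long netBenefit;
};

std::vector<uint64_t> chooseBlocksToLock(std::vector<BlockProfile> profiles,
                                         size_t lockableWays) {
    std::sort(profiles.begin(), profiles.end(),
              [](const BlockProfile& a, const BlockProfile& b) {
                  return a.netBenefit > b.netBenefit;
              });
    std::vector<uint64_t> locked;
    for (const auto& p : profiles) {
        if (locked.size() == lockableWays || p.netBenefit <= 0) break;
        locked.push_back(p.blockAddr);
    }
    return locked;
}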

Patent
17 Sep 2010
TL;DR: A cache device comprises a non-volatile storage device configured to perform cache functions for a backing store; a modified cache policy is implemented for the cache device in response to the risk of data loss exceeding a threshold risk level.
Abstract: Apparatuses, systems, and methods are disclosed for implementing a cache policy. A method may include determining a risk of data loss on a cache device. The cache device may comprise a non-volatile storage device configured to perform cache functions for a backing store. The cache device may implement a cache policy. A method may include determining that the risk of data loss on the cache device exceeds a threshold risk level. A method may include implementing a modified cache policy for the cache device in response to the risk of data loss exceeding the threshold risk level. The modified cache policy may reduce the risk of data loss below the threshold level.
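
A minimal sketch of the policy switch follows; the write-back/write-through choice and the threshold comparison are assumptions used for illustration, since the patent leaves the modified policy abstract.

// Sketch: when the estimated risk of data loss on the cache device exceeds the
// threshold, fall back to a policy that keeps less dirty data in the cache.
enum class CachePolicy { WriteBack, WriteThrough };

CachePolicy selectPolicy(double riskOfDataLoss, double thresholdRisk) {
    // Write-through keeps the backing store current, reducing exposure to loss.
    return riskOfDataLoss > thresholdRisk ? CachePolicy::WriteThrough
                                          : CachePolicy::WriteBack;
}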

Patent
06 Dec 2010
TL;DR: In this article, the authors propose a migration policy to determine how long to cache the data in the write cache before migrating the data to the SSD, using one or more migration triggers that cause the contents of the cache to be flushed to the SSD.
Abstract: A hybrid storage device uses a write cache such as a hard disk drive, for example, to cache data to a solid state drive (SSD). Data is logged sequentially to the write cache and later migrated to the SSD. The SSD is a primary storage that stores data permanently. The write cache is a persistent durable cache that may store data of disk write operations temporarily in a log structured fashion. A migration policy may be used to determine how long to cache the data in the write cache before migrating the data to the SSD. The migration policy may be implemented using one or more migration triggers that cause the contents of the write cache to be flushed to the SSD. Migration triggers may include a timeout trigger, a read threshold trigger, and a migration size trigger, for example.
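
The three migration triggers named in the abstract could be combined roughly as in the sketch below; all constants are assumptions.

#include <cstdint>

// Sketch of combining the timeout, read threshold, and migration size triggers.
struct WriteCacheState {
    uint64_t secondsSinceLastMigration;
    uint64_t readsServedFromWriteCache;
    uint64_t loggedBytes;
};

bool shouldMigrate(const WriteCacheState& s) {
    constexpr uint64_t kTimeoutSeconds = 300;             // timeout trigger
    constexpr uint64_t kReadThreshold = 1000;             // read threshold trigger
    constexpr uint64_t kMigrationSizeBytes = 1ull << 30;  // migration size trigger
    return s.secondsSinceLastMigration >= kTimeoutSeconds ||
           s.readsServedFromWriteCache >= kReadThreshold ||
           s.loggedBytes >= kMigrationSizeBytes;
}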

Patent
01 Dec 2010
TL;DR: In this article, the authors propose an embodiment of a virtual address cache memory including a TLB virtual page memory, a data memory, and a cache state memory for the cache data stored in the data memory.
Abstract: An embodiment provides a virtual address cache memory including: a TLB virtual page memory configured to, when a rewrite to a TLB occurs, rewrite entry data; a data memory configured to hold cache data using a virtual page tag or a page offset as a cache index; a cache state memory configured to hold a cache state for the cache data stored in the data memory, in association with the cache index; a first physical address memory configured to, when the rewrite to the TLB occurs, rewrite a held physical address; and a second physical address memory configured to, when the cache data is written to the data memory after the occurrence of the rewrite to the TLB, rewrite a held physical address.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points, and shows that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior.
Abstract: This paper presents and validates methods to extend reuse distance analysis of application locality characteristics to shared-memory multicore platforms by accounting for invalidation-based cache-coherence and inter-core cache sharing. Existing reuse distance analysis methods track the number of distinct addresses referenced between reuses of the same address by a given thread, but do not model the effects of data references by other threads. This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points. These methods are evaluated against a Simics-based coherent cache simulator running several OpenMP and transaction-based benchmarks. The results show that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior, reducing the error in miss ratio prediction (relative to cache simulation for a specific cache size) by an average of 70% for per-core caches and an average of 90% for shared caches.
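
A simplified sketch of a per-thread reuse stack extended with invalidation handling follows; a real implementation would use a balanced tree for efficiency, and the cache-sharing case (a merged stack across cores) is not shown.

#include <cstdint>
#include <list>
#include <optional>

// Simplified per-thread reuse stack with invalidation handling.
class ReuseStack {
public:
    // Returns the reuse distance of this access, or nullopt for a cold access or an
    // access whose previous copy was invalidated by another core.
    std::optional<size_t> access(uint64_t addr) {
        size_t depth = 0;
        for (auto it = stack_.begin(); it != stack_.end(); ++it, ++depth) {
            if (*it == addr) {
                stack_.erase(it);
                stack_.push_front(addr);  // move to most-recently-used position
                return depth;
            }
        }
        stack_.push_front(addr);
        return std::nullopt;
    }

    // Invalidation caused by a write on another core: the next access here must miss,
    // so the address is removed from the stack.
    void invalidate(uint64_t addr) { stack_.remove(addr); }

private:
    std::list<uint64_t> stack_;
};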

Patent
27 Jan 2010
TL;DR: In this paper, a processor may include several processor cores, each including a respective higher-level cache; a lower level cache including several tag units each including several controllers, where each controller corresponds to a respective cache bank configured to store data, and where the controllers are concurrently operable to access their respective cache banks.
Abstract: A processor may include several processor cores, each including a respective higher-level cache; a lower-level cache including several tag units each including several controllers, where each controller corresponds to a respective cache bank configured to store data, and where the controllers are concurrently operable to access their respective cache banks; and an interconnect network configured to convey data between the cores and the lower-level cache. The controllers may share access to an interconnect egress port coupled to the interconnect network, and may generate multiple concurrent requests to convey data via the shared port, where each of the requests is destined for a corresponding core, and where a datapath width of the port is less than a combined width of the multiple requests. The given tag unit may arbitrate among the controllers for access to the shared port, such that the requests are transmitted to corresponding cores serially rather than concurrently.