
Showing papers on "Cache coloring published in 2011"


Proceedings ArticleDOI
03 Dec 2011
TL;DR: This paper proposes a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature, and finds that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals.
Abstract: The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, however, is how to best predict the re-reference pattern of an incoming cache line. This paper shows that cache performance can be improved by correlating the re-reference behavior of a cache line with a unique signature. We investigate the use of memory region, program counter, and instruction sequence history based signatures. We also propose a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature. Overall, we find that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals. On average, SHiP improves sequential and multiprogrammed application performance by roughly 10% and 12% over LRU replacement, respectively. Compared to recent replacement policy proposals such as Seg-LRU and SDBP, SHiP nearly doubles the performance gains while requiring less hardware overhead.
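
As a rough illustration of the signature idea, the sketch below models a table of saturating counters indexed by a hash of the requesting instruction's PC; counters are trained up on hits and down on evictions of never-reused lines, and consulted at insertion time. Table size, counter width, and the hash are illustrative assumptions, not the parameters used in the paper.

```python
SHCT_SIZE = 16384        # signature history counter table entries (assumed size)
COUNTER_MAX = 3          # 2-bit saturating counters

shct = [1] * SHCT_SIZE   # start weakly biased toward "will be reused"

def signature(pc):
    """Hash the requesting instruction's PC into a table index."""
    return (pc ^ (pc >> 14)) % SHCT_SIZE

class Line:
    def __init__(self, pc):
        self.pc = pc
        self.was_reused = False

def on_hit(line):
    """A hit proves reuse for this signature: train the counter up."""
    line.was_reused = True
    i = signature(line.pc)
    shct[i] = min(COUNTER_MAX, shct[i] + 1)

def on_evict(line):
    """Evicting a line that was never re-referenced trains the counter down."""
    if not line.was_reused:
        i = signature(line.pc)
        shct[i] = max(0, shct[i] - 1)

def predict_reuse(pc):
    """On insertion: a zero counter predicts a distant re-reference, so the
    replacement policy should insert this line with low retention priority."""
    return shct[signature(pc)] > 0

ln = Line(pc=0x400812)
print(predict_reuse(ln.pc))   # True with the optimistic initial counters
on_evict(ln)                  # a dead-on-arrival line trains its signature down
```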

235 citations


Proceedings ArticleDOI
03 Dec 2011
TL;DR: The experiments show that, on average, the proposed multi-retention-level STT-RAM cache reduces total energy by 30∼70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
Abstract: Spin-transfer torque random access memory (STT-RAM) has received increasing attention because of its attractive features: good scalability, zero standby power, non-volatility and radiation hardness. The use of STT-RAM technology in last-level on-chip caches has been proposed, as it minimizes cache leakage power as technology scales down. Furthermore, the cell area of STT-RAM is only 1/9 ~ 1/3 that of SRAM. This allows for a much larger cache with the same die footprint, improving overall system performance by reducing cache misses. However, deploying STT-RAM technology in L1 caches is challenging because of the long and power-consuming write operations. In this paper, we propose both L1 and lower-level cache designs that use STT-RAM. In particular, our designs use STT-RAM cells with various data retention times and write performance, made possible by different magnetic tunneling junction (MTJ) designs. For the fast STT-RAM bits with reduced data retention time, a counter-controlled dynamic refresh scheme is proposed to maintain data validity. Our dynamic scheme saves more than 80% of refresh energy compared to the simple refresh scheme proposed in previous works. An L1 cache built with ultra-low-retention STT-RAM coupled with our proposed dynamic refresh scheme can achieve a 9.2% performance improvement, and saves up to 30% of the total energy when compared to one that uses traditional SRAM. For lower-level caches with relatively large capacity, we propose a data migration scheme that moves data between portions of the cache with different retention characteristics so as to maximize the performance and power benefits. Our experiments show that, on average, our proposed multi-retention-level STT-RAM cache reduces total energy by 30~70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
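
A minimal sketch of what a counter-controlled refresh for low-retention STT-RAM lines could look like: each valid line tracks its age in retention ticks, a demand write resets the clock, and only lines about to expire are rewritten. The tick length and threshold are assumptions for illustration, not the paper's design.

```python
RETENTION_TICKS = 4        # assumed number of ticks a cell reliably holds data

class SttLine:
    def __init__(self):
        self.valid = False
        self.age = 0           # ticks since the last write or refresh

    def write(self, data):
        self.data = data
        self.valid = True
        self.age = 0           # a demand write resets the retention clock

def refresh_tick(lines):
    """Called once per retention tick; refresh only lines about to expire."""
    refreshed = 0
    for line in lines:
        if not line.valid:
            continue
        line.age += 1
        if line.age >= RETENTION_TICKS - 1:
            line.write(line.data)   # in-place rewrite restores retention
            refreshed += 1
    return refreshed

if __name__ == "__main__":
    cache = [SttLine() for _ in range(8)]
    cache[0].write("A")
    for _ in range(5):
        print("refreshed:", refresh_tick(cache))
```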

234 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This work presents Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions.
Abstract: Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support few partitions, and reduce cache associativity, hurting performance. Hence, these techniques can only be applied to CMPs with 2-4 cores, but fail to scale to tens of cores. We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g. 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller. We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.

229 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper proposes a novel cache architecture that uses variable-strength error-correcting codes (VS-ECC), which significantly reduces power and energy, avoids significant reductions in cache capacity, incurs little area overhead, and avoids large increases in latency and bandwidth.
Abstract: Voltage scaling is one of the most effective mechanisms to improve microprocessors' energy efficiency. However, processors cannot operate reliably below a minimum voltage, Vccmin, since hardware structures may fail. Cell failures in large memory arrays (e.g., caches) typically determine Vccmin for the whole processor. We observe that most cache lines exhibit zero or one failures at low voltages. However, a few lines, especially in large caches, exhibit multi-bit failures and increase Vccmin. Previous solutions either significantly reduce cache capacity to enable uniform error correction across all lines, or significantly increase latency and bandwidth overheads when amortizing the cost of error-correcting codes (ECC) over large lines. In this paper, we propose a novel cache architecture that uses variable-strength error-correcting codes (VS-ECC). In the common case, lines with zero or one failures use a simple and fast ECC. A small number of lines with multi-bit failures use a strong multi-bit ECC that requires some additional area and latency. We present a novel dynamic cache characterization mechanism to determine which lines will exhibit multi-bit failures. In particular, we use multi-bit correction to protect a fraction of the cache after switching to low voltage, while dynamically testing the remaining lines for multi-bit failures. Compared to prior multi-bit-correcting proposals, VS-ECC significantly reduces power and energy, avoids significant reductions in cache capacity, incurs little area overhead, and avoids large increases in latency and bandwidth.
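
The sketch below illustrates the variable-strength assignment step under assumed parameters: after low-voltage characterization, lines with at most one failing bit keep a fast code, a small pool of strong-ECC entries covers the rare multi-bit lines, and lines beyond the pool are disabled. The labels, threshold, and pool size are not taken from the paper.

```python
STRONG_ECC_ENTRIES = 8     # assumed size of the strong-ECC pool

def assign_ecc(fail_bits_per_line):
    """fail_bits_per_line: failing-bit counts from low-voltage characterization."""
    assignment = {}
    strong_pool = STRONG_ECC_ENTRIES
    for line, fails in enumerate(fail_bits_per_line):
        if fails <= 1:
            assignment[line] = "SECDED"        # common case: simple, fast code
        elif strong_pool > 0:
            assignment[line] = "STRONG_ECC"    # rare multi-bit-failure lines
            strong_pool -= 1
        else:
            assignment[line] = "DISABLED"      # out of strong entries: disable line
    return assignment

print(assign_ecc([0, 1, 0, 3, 0, 2, 0, 0]))
```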

167 citations


Patent
30 Nov 2011
TL;DR: A distributed caching system for storing and serving information modeled as a graph that includes nodes and edges that define associations or relationships between nodes that the edges connect in the graph is described in this article.
Abstract: A distributed caching system for storing and serving information modeled as a graph that includes nodes and edges that define associations or relationships between nodes that the edges connect in the graph.

167 citations


Proceedings ArticleDOI
27 Jun 2011
TL;DR: This paper shows that stealing crypto keys in a virtualized cloud may be a real threat by evaluating a cache-based side-channel attack against an encryption process, and proposes an approach that leverages dynamic cache coloring: when an application is performing security-sensitive operations, the VMM is notified to swap the associated data to a safe and isolated cache line.
Abstract: The multi-tenant cloud, which offers utility-like computing resources to tenants in a “pay-as-you-go” style, has been commercially popular for years. As a primary purpose of such a cloud is maximizing resource usage to increase revenue, it usually uses virtualization to consolidate VMs from different and even mutually malicious tenants atop a powerful physical machine. This, however, also enables a malicious tenant to steal security-critical information such as crypto keys from victims, due to shared physical resources such as caches. In this paper, we show that stealing crypto keys in a virtualized cloud may be a real threat by evaluating a cache-based side-channel attack against an encryption process. To mitigate such attacks without notably degrading performance, we propose an approach that leverages dynamic cache coloring: when an application is performing security-sensitive operations, the VMM is notified to swap the associated data to a safe and isolated cache line. This approach can eliminate cache-based side channels for security-critical operations, yet ensures efficient resource sharing during normal operations. We demonstrate its applicability with a preliminary implementation based on Xen and report its performance overhead.
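
Since the listing topic is cache coloring, a short sketch of the underlying page-color arithmetic may help: a page's color is its page number modulo the number of colors, and isolating security-sensitive data amounts to placing it on pages of a reserved color. The cache geometry and reserved color below are illustrative assumptions, not the values of the Xen prototype.

```python
PAGE_SIZE     = 4096
LINE_SIZE     = 64
NUM_SETS      = 8192                      # e.g. a 4 MB, 8-way last-level cache
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE    # contiguous sets covered by one page
NUM_COLORS    = NUM_SETS // SETS_PER_PAGE # 128 colors in this configuration

def page_color(phys_addr):
    """Color of the page containing phys_addr."""
    page_number = phys_addr // PAGE_SIZE
    return page_number % NUM_COLORS

RESERVED_COLOR = 0   # assumed color set aside for security-critical pages

def is_isolated(phys_addr):
    """True if the address lands in the reserved, isolated cache region."""
    return page_color(phys_addr) == RESERVED_COLOR

print(NUM_COLORS, page_color(0x12345000), is_isolated(0x0000_0000))
```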

164 citations


Patent
01 Nov 2011
TL;DR: In this paper, a cache-defeat detection system and methods for caching of content addressed by identifiers intended to defeat cache are further disclosed, which can detect a data request to a content source for which content received is stored as cache elements in a local cache on the mobile device, determining, from an identifier of the data request, that a cache defeating mechanism is used by the content source, and/or retrieving content from the cached elements in the local cache to respond to the request.
Abstract: Systems and methods for cache defeat detection are disclosed. Moreover, systems and methods for caching of content addressed by identifiers intended to defeat cache are further disclosed. In one aspect, embodiments of the present disclosure include a method, which may be implemented on a system, of resource management in a wireless network by caching content on a mobile device. The method can include detecting a data request to a content source for which content received is stored as cache elements in a local cache on the mobile device, determining, from an identifier of the data request, that a cache defeating mechanism is used by the content source, and/or retrieving content from the cache elements in the local cache to respond to the data request.
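
One plausible reading of the normalization step is sketched below: if an identifier varies only in a parameter that looks like a timestamp or random token, that parameter is stripped before the URL is used as a cache key. The suspect-parameter list and heuristic are assumptions for illustration, not the patent's claims.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

SUSPECT_PARAMS = {"ts", "timestamp", "rand", "nocache", "_"}  # assumed list

def cache_key(url):
    """Strip parameters suspected of defeating the cache, then rebuild the URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SUSPECT_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(kept))))

# Two requests that defeat a naive cache collapse to one key here:
print(cache_key("http://example.com/feed?id=7&ts=1699999999"))
print(cache_key("http://example.com/feed?id=7&ts=1700000001"))
```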

161 citations


Proceedings ArticleDOI
Zoltan Majo, Thomas R. Gross
04 Jun 2011
TL;DR: This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.
Abstract: Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance up to 32%, and 7% on average, over the default setup in current Linux implementations.

110 citations


Patent
12 Aug 2011
TL;DR: In this paper, a storage request module detects an input/output (I/O) request for a storage device cached by solid-state storage media of a cache, and a direct mapping module references a single mapping structure to determine that the cache comprises data of the I/O request.
Abstract: An apparatus, system, and method are disclosed for caching data. A storage request module detects an input/output (“I/O”) request for a storage device cached by solid-state storage media of a cache. A direct mapping module references a single mapping structure to determine that the cache comprises data of the I/O request. The single mapping structure maps each logical block address of the storage device directly to a logical block address of the cache. The single mapping structure maintains a fully associative relationship between logical block addresses of the storage device and physical storage addresses on the solid-state storage media. A cache fulfillment module satisfies the I/O request using the cache in response to the direct mapping module determining that the cache comprises at least one data block of the I/O request.
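
A toy version of the single mapping structure is sketched below: one map from storage logical block addresses to cache logical block addresses, standing in for whatever fully associative structure the patent actually uses. Class and method names are illustrative.

```python
class DirectMap:
    """One mapping structure: storage LBA -> cache LBA, fully associative."""

    def __init__(self):
        self.lba_to_cache = {}

    def contains(self, storage_lba):
        return storage_lba in self.lba_to_cache

    def insert(self, storage_lba, cache_lba):
        self.lba_to_cache[storage_lba] = cache_lba

    def lookup(self, storage_lba):
        return self.lba_to_cache.get(storage_lba)

m = DirectMap()
m.insert(storage_lba=1048576, cache_lba=42)
print(m.contains(1048576), m.lookup(1048576))   # True 42
```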

106 citations


Proceedings ArticleDOI
01 Aug 2011
TL;DR: A low-overhead, fully-hardware technique is utilized to detect write-intensive data blocks of the working set and place them into SRAM lines, while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation.
Abstract: In this paper, we propose a run-time strategy for managing writes onto the last-level cache in chip multiprocessors where STT-RAM memory is used as the baseline technology. To this end, we assume that each cache set is decomposed into a limited number of SRAM lines and a large number of STT-RAM lines. SRAM lines are the target of frequently-written data, while rarely-written or read-only data are pushed into STT-RAM. As a novel contribution, a low-overhead, fully-hardware technique is utilized to detect write-intensive data blocks of the working set and place them into SRAM lines, while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation. Therefore, the achieved cache architecture has large capacity and consumes near-zero leakage energy using the STT-RAM array, while acceptable dynamic write energy, write latency, and long lifetime are guaranteed via the SRAM array. Results of full-system simulation for a quad-core CMP running the PARSEC-2 benchmark suite confirm an average 49x improvement in cache lifetime and more than 50% reduction in cache power consumption when compared to baseline configurations.
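
A hedged sketch of how write-intensive blocks might be detected with small saturating counters follows: writes bump a per-block counter, blocks above a threshold become candidates for the SRAM lines, and periodic decay lets cooled-down blocks drift back to STT-RAM. The threshold and counter width are assumptions, not the paper's hardware.

```python
WRITE_THRESHOLD = 4     # assumed threshold for "write-intensive"
COUNTER_MAX = 7         # assumed 3-bit saturating counters

write_counters = {}     # block address -> saturating write counter

def on_write(block):
    write_counters[block] = min(COUNTER_MAX, write_counters.get(block, 0) + 1)

def preferred_array(block):
    """Place frequently written blocks in SRAM, the rest in STT-RAM."""
    return "SRAM" if write_counters.get(block, 0) >= WRITE_THRESHOLD else "STT-RAM"

def decay():
    """Periodic decay so blocks that cooled down can leave the SRAM lines."""
    for block in list(write_counters):
        write_counters[block] //= 2

for _ in range(5):
    on_write(0x40)
print(preferred_array(0x40), preferred_array(0x80))   # SRAM STT-RAM
```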

101 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper studies the integration of STT-RAM in a 3D multi-core environment and proposes solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology.
Abstract: Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories for future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories have become a major obstacle to their adoption. Previous works have proposed various circuit and architectural level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks for effectively hiding the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long-latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router requesting accesses to idle banks. Evaluations on a 3D architecture, consisting of 64 cores and 64 STT-RAM cache banks, show that our proposed approach provides 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and 6% latency reduction compared to a recently proposed write buffering mechanism.

Patent
31 Oct 2011
TL;DR: In this paper, the authors present a processor with a plurality of cores and a cache memory coupled to the cores and including a pluralityof partitions, which can dynamically vary a size of the cache memory based on a memory boundedness of a workload executed on at least one of the cores.
Abstract: In one embodiment, the present invention is directed to a processor having a plurality of cores and a cache memory coupled to the cores and including a plurality of partitions. The processor can further include a logic to dynamically vary a size of the cache memory based on a memory boundedness of a workload executed on at least one of the cores. Other embodiments are described and claimed.
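
The sketch below illustrates the general shape of such a policy under assumed thresholds: a memory-boundedness metric (standing in for stall-cycle or miss-rate counters) drives a hysteresis loop that enables or disables cache partitions. The partition count and thresholds are illustrative, not the patent's.

```python
TOTAL_PARTITIONS = 8
LOW, HIGH = 0.2, 0.6      # assumed hysteresis thresholds on boundedness (0..1)

def partitions_to_enable(memory_boundedness, currently_enabled):
    if memory_boundedness > HIGH:
        return min(TOTAL_PARTITIONS, currently_enabled + 1)   # grow the cache
    if memory_boundedness < LOW:
        return max(1, currently_enabled - 1)                  # shrink the cache
    return currently_enabled                                  # hold steady

enabled = 4
for b in [0.1, 0.1, 0.7, 0.8, 0.3]:
    enabled = partitions_to_enable(b, enabled)
    print(b, "->", enabled, "partitions")
```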

Proceedings ArticleDOI
11 Apr 2011
TL;DR: A refined persistence analysis method is presented which fixes the potential underestimation problem in the original persistence analysis and a framework to combine access pattern analysis and abstract interpretation for accurate data cache analysis is proposed.
Abstract: Caches are widely used in modern computer systems to bridge the increasing gap between processor speed and memory access time. On the other hand, the presence of caches, especially data caches, complicates static worst-case execution time (WCET) analysis. Access pattern analysis (e.g., cache miss equations) is applicable only to a specific class of programs, where all array accesses must have predictable access patterns. Abstract interpretation-based methods (must/persistence analysis) determine possible cache conflicts based on coarse-grained memory access information from address analysis, which usually leads to significantly pessimistic estimation. In this paper, we first present a refined persistence analysis method which fixes the potential underestimation problem in the original persistence analysis. Based on our new persistence analysis, we propose a framework to combine access pattern analysis and abstract interpretation for accurate data cache analysis. We capture the dynamic behavior of a memory access by computing its temporal scope (the loop iterations where a given memory block is accessed for a given data reference) during address analysis. Temporal scopes as well as loop hierarchy structure (the static scopes) are integrated and utilized to achieve a more precise abstract cache state modeling. Experimental results show that our proposed analysis obtains up to a 74% reduction in the WCET estimates compared to existing data cache analyses.

Patent
30 Dec 2011
TL;DR: In this article, the authors present a method, computer program product, and computing system for receiving an indication that a virtual machine is going to be migrated from a first operating environment to a second operating environment.
Abstract: A method, computer program product, and computing system for receiving an indication that a virtual machine is going to be migrated from a first operating environment to a second operating environment. The mode of operation of a cache system associated with the virtual machine is downgraded. Content included within a memory device currently associated with the cache system is copied to a memory device to be associated with the cache system. The memory device currently associated with the cache system is detached from the virtual machine. The virtual machine is migrated from the first operating environment to the second operating environment.

Proceedings ArticleDOI
29 Nov 2011
TL;DR: This paper introduces a new method of bounding pre-emption costs, called the ECB-Union approach, which complements an existing UCB- Union approach and combines the two into a simple composite approach that dominates both.
Abstract: Without the use of cache the increasing gap between processor and memory speeds in modern embedded microprocessors would have resulted in memory access times becoming an unacceptable bottleneck. In such systems, cache related pre-emption delays can be a significant proportion of task execution times. To obtain tight bounds on the response times of tasks in pre-emptively scheduled systems, it is necessary to integrate worst-case execution time analysis and schedulability analysis via the use of an appropriate model of pre-emption costs. In this paper, we introduce a new method of bounding pre-emption costs, called the ECB-Union approach. The ECB-Union approach complements an existing UCB-Union approach. We combine the two into a simple composite approach that dominates both. These approaches are integrated into response time analysis for fixed priority pre-emptively scheduled systems. Further, we extend this analysis to systems where tasks can access resources in mutual exclusion, in the process resolving omissions in existing models of pre-emption delays. A case study and empirical evaluation demonstrate the effectiveness of the ECB-Union and combined approaches for a wide range of different cache configurations including cache utilization, cache set size, reuse, and block reload times.
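
To make the flavor of the bound concrete, the toy below unions the evicting cache blocks (ECBs) of the pre-empting task with those of tasks that may pre-empt it, intersects that union with each affected task's useful cache blocks (UCBs), and charges the worst overlap times the block reload time. This is an illustrative reading of the idea, not the paper's exact formulation.

```python
BLOCK_RELOAD_TIME = 10   # assumed cost (e.g. cycles) to reload one cache block

def ecb_union_cost(preempting_and_higher_ecbs, affected_ucbs):
    """preempting_and_higher_ecbs: ECB sets of the pre-empter and of tasks that
    can pre-empt it (nested pre-emption). affected_ucbs: UCB sets of affected tasks."""
    union_ecbs = set().union(*preempting_and_higher_ecbs)
    worst_overlap = max(len(union_ecbs & ucbs) for ucbs in affected_ucbs)
    return BLOCK_RELOAD_TIME * worst_overlap

cost = ecb_union_cost(
    preempting_and_higher_ecbs=[{1, 2, 3}, {3, 4}],
    affected_ucbs=[{2, 3, 9}, {7, 8}],
)
print(cost)   # 2 overlapping blocks * 10 = 20
```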

Patent
03 Jun 2011
TL;DR: In this article, a solid state drive may be used as a log structured cache, may employ multi-level metadata management, may use read and write gating, and may accelerate access to other storage media.
Abstract: Examples of described systems utilize a cache media in one or more computing devices that may accelerate access to other storage media. A solid state drive may be used as the local cache media. In some embodiments, the solid state drive may be used as a log structured cache, may employ multi-level metadata management, and may use read and write gating.

Proceedings ArticleDOI
09 Mar 2011
TL;DR: A novel approach is presented that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing, and is able to improve the performance of some applications up to a factor of 12x and shed light on the obstacles that prevent their performance from scaling to many cores.
Abstract: In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach --- a 5x slowdown on average relative to native execution --- is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications up to a factor of 12x. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
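
A toy model of line-granularity sharing analysis is sketched below: per cache line, record which threads touched which byte offsets; overlapping bytes across threads suggest true sharing, disjoint bytes suggest false sharing. This only illustrates the classification idea, not the memory-shadowing tool itself.

```python
from collections import defaultdict

LINE_SIZE = 64
lines = defaultdict(lambda: defaultdict(set))   # line -> thread -> byte offsets

def record_access(thread_id, addr, size):
    """Record each touched byte under its cache line and accessing thread."""
    for b in range(addr, addr + size):
        lines[b // LINE_SIZE][thread_id].add(b % LINE_SIZE)

def classify(line):
    threads = lines[line]
    if len(threads) < 2:
        return "private"
    overlap = set.intersection(*threads.values())
    return "true sharing" if overlap else "false sharing"

record_access(thread_id=1, addr=0,  size=8)   # bytes 0..7 of line 0
record_access(thread_id=2, addr=32, size=8)   # bytes 32..39 of line 0
print(classify(0))                            # false sharing
```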

Proceedings ArticleDOI
01 Oct 2011
TL;DR: This paper proposes to integrate SRAM with STT-RAM to construct a novel hybrid cache architecture for CMPs and proposes dedicated microarchitectural mechanisms to make the hybrid cache robust to workloads with different write patterns.
Abstract: Modern high performance Chip Multiprocessor (CMP) systems rely on a large on-chip cache hierarchy. As technology scales down, the leakage power of present SRAM-based caches gradually dominates the on-chip power consumption, which can severely jeopardize system performance. The emerging nonvolatile Spin Transfer Torque RAM (STT-RAM) is a promising candidate for large on-chip caches because of its ultra-low leakage power. However, write operations on STT-RAM suffer from considerably higher energy as well as longer latency compared with SRAM, which makes STT-RAM problematic for write-intensive workloads. In this paper, we propose to integrate SRAM with STT-RAM to construct a novel hybrid cache architecture for CMPs. We also propose dedicated microarchitectural mechanisms to make the hybrid cache robust to workloads with different write patterns. Extensive simulation results demonstrate that the proposed hybrid scheme is adaptive to variations in workloads. Overall power consumption is reduced by 37.1% and performance is improved by 23.6% on average compared with an SRAM-based static NUCA under the same area configuration.

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

Patent
02 Nov 2011
TL;DR: In this paper, a multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to requests of a different respective type and/or granularity.
Abstract: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to I/O requests of a different respective type and/or granularity. A cache device manager may allocate cache storage space to each of the cache levels. Each cache level maintains respective cache metadata that associates I/O request data with a respective cache address. The cache levels monitor I/O requests within a storage stack, apply selection criteria to identify cacheable I/O requests, and service cacheable I/O requests using the cache storage device.

Proceedings ArticleDOI
05 Jun 2011
TL;DR: This paper presents a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks and can achieve 29.29% energy saving on average.
Abstract: Multicore architectures, especially chip multi-processors, have been widely acknowledged as a successful design paradigm. Existing approaches primarily target application-driven partitioning of the shared cache to alleviate inter-core cache interference so that both performance and energy efficiency are improved. Dynamic cache reconfiguration is a promising technique in reducing energy consumption of the cache subsystem for uniprocessor systems. In this paper, we present a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks. Our static profiling based algorithm is designed to judiciously find beneficial cache configurations (of private caches) for each task as well as partition factors (of the shared cache) for each core so that the energy consumption is minimized while task deadline is satisfied. Experimental results using real benchmarks demonstrate that our approach can achieve 29.29% energy saving on average compared to systems employing only cache partitioning.

Proceedings ArticleDOI
12 Feb 2011
TL;DR: This work proposes a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control and demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
Abstract: The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less effective in the face of two problems: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves the performance of a wide range of workloads when all or a subset of cores are occupied.

Patent
30 Dec 2011
TL;DR: In this article, a method, computer program product, and computing system for copying a cache system from a first machine to a second machine, wherein the cache system includes cache content and a content directory, thus generating a duplicate cache system on the second machine.
Abstract: A method, computer program product, and computing system for copying a cache system from a first machine to a second machine, wherein the cache system includes cache content and a content directory, thus generating a duplicate cache system on the second machine. The duplicate cache system includes duplicate cache content and a duplicate content directory. A plurality of data requests concerning a plurality of data actions to be taken on a data array associated with the first machine are received on the first machine. The plurality of data requests are stored on a tracking queue included within the data array associated with the first machine.

Proceedings ArticleDOI
13 Sep 2011
TL;DR: A low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity is presented.
Abstract: We present a low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity. The method is implemented on real hardware, with no modifications to the application or operating system. We accomplish this by co-running a Pirate application that "steals" cache space with the Target application. By adjusting how much space the Pirate steals during the Target's execution, and using hardware performance counters to record the Target's performance, we can accurately and efficiently capture performance data for the Target application as a function of its available shared cache. At the same time we use performance counters to monitor the Pirate to ensure that it is successfully stealing the desired amount of cache. To evaluate this approach, we show that 1) the cache available to the Target behaves as expected, 2) the Pirate steals the desired amount of cache, and 3) the Pirate does not bias the Target's performance. As a result, we are able to accurately measure the Target's performance while stealing up to an average of 6.8MB of the 8MB of cache on our Nehalem-based test system with an average measurement overhead of only 5.5%.
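
The stealing loop at the heart of the Pirate can be sketched as follows: a co-runner keeps a buffer of the desired size resident by touching one byte per cache line in a tight loop, while the Target runs and is measured separately. Buffer size and stride are assumptions, and the real method additionally verifies the steal with hardware performance counters.

```python
import threading

LINE = 64   # assumed cache line size in bytes

def pirate(steal_bytes, stop_event):
    """Keep a buffer of steal_bytes resident in the shared cache until stopped."""
    buf = bytearray(steal_bytes)
    sink = 0
    while not stop_event.is_set():
        # touch one byte per cache line to keep the whole buffer resident
        for i in range(0, steal_bytes, LINE):
            sink ^= buf[i]
    return sink

stop = threading.Event()
t = threading.Thread(target=pirate, args=(4 * 1024 * 1024, stop))
t.start()
# ... run and measure the Target application here ...
stop.set()
t.join()
```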

Patent
15 Jul 2011
TL;DR: In this article, a pipelined processor is described comprising a cache memory system and fetching instructions for execution from a portion of said cache memory system, with an instruction commencing processing before a digital signature of the cache line that contained the instruction is verified against a reference signature of that cache line; the verification is done at the point of decoding, dispatching, or committing execution of the instruction, the reference signature is stored in encrypted form in the processor's memory, and the key for decrypting the reference signature is stored in a secure storage location.
Abstract: A pipelined processor comprising a cache memory system, fetching instructions for execution from a portion of said cache memory system, an instruction commencing processing before a digital signature of the cache line that contained the instruction is verified against a reference signature of the cache line, the verification being done at the point of decoding, dispatching, or committing execution of the instruction, the reference signature being stored in an encrypted form in the processor's memory, and the key for decrypting the said reference signature being stored in a secure storage location. Instruction processing proceeds when the two signatures exactly match; further instruction processing is suspended, or processing is modified, on a mismatch of the two signatures.
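
A minimal sketch of the verification step, with HMAC-SHA-256 and an identity "decrypt" standing in for whatever primitives the patent envisions: the fetched line's signature is recomputed and compared against the decrypted reference, and processing proceeds only on an exact match.

```python
import hmac, hashlib

SIGNING_KEY = b"key-held-in-secure-storage"   # assumption: fetched from secure storage

def line_signature(cache_line: bytes) -> bytes:
    return hmac.new(SIGNING_KEY, cache_line, hashlib.sha256).digest()

def verify_line(cache_line: bytes, encrypted_reference: bytes, decrypt) -> bool:
    """Proceed only if the recomputed signature exactly matches the decrypted
    reference; on a mismatch the pipeline would suspend or modify processing."""
    return hmac.compare_digest(decrypt(encrypted_reference), line_signature(cache_line))

# Toy demo with an identity "decrypt" standing in for the real cipher:
line = b"\x90" * 64
ref = line_signature(line)                                   # made at signing time
print(verify_line(line, ref, decrypt=lambda x: x))           # True
print(verify_line(b"\xcc" * 64, ref, decrypt=lambda x: x))   # False
```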

Patent
30 Sep 2011
TL;DR: In this paper, a change in workload characteristics detected at one tier of a multi-tier cache is communicated to another tier of the multi-tiered cache, and at least one cache element is dynamically resizable.
Abstract: A change in workload characteristics detected at one tier of a multi-tiered cache is communicated to another tier of the multi-tiered cache. Multiple caching elements exist at different tiers, and at least one tier includes a cache element that is dynamically resizable. The communicated change in workload characteristics causes the receiving tier to adjust at least one aspect of cache performance in the multi-tiered cache. In one aspect, at least one dynamically resizable element in the multi-tiered cache is resized responsive to the change in workload characteristics.

Patent
25 Jul 2011
TL;DR: In this paper, the main memory of the controller is not blocked by a complete address mapping table covering the entire memory device, instead such table is stored in the memory device itself, and only selected portions of address mapping information are buffered in a read cache (311) and a write cache (312), enabling an address mapping entry being evictable from the read cache without the need to update the related flash memory page storing such entry in the flash memory device.
Abstract: The present idea provides high read and write performance from/to a solid state memory device. The main memory (31) of the controller (1) is not blocked by a complete address mapping table covering the entire memory device (2). Instead, such a table is stored in the memory device (2) itself, and only selected portions of address mapping information are buffered in the main memory (31), in a read cache (311) and a write cache (312). Separating the read cache (311) from the write cache (312) enables an address mapping entry to be evicted from the read cache (311) without the need to update the related flash memory page storing that entry in the flash memory device (2). By this design, the read cache (311) may advantageously be stored in DRAM even without power-down protection, while the write cache (312) may preferably be implemented in nonvolatile or other fail-safe memory. This leads to a reduction in the overall provisioning of nonvolatile or fail-safe memory and to improved scalability and performance.
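
A toy model of the read/write split is sketched below: read-cache entries mirror mappings already persisted on flash and can simply be dropped on eviction, while write-cache entries must be flushed. Capacities, names, and the flush hook are illustrative assumptions.

```python
from collections import OrderedDict

class MappingBuffer:
    def __init__(self, read_capacity, write_capacity, flush_fn):
        self.read_cache = OrderedDict()    # LBA -> flash address (clean copies)
        self.write_cache = OrderedDict()   # LBA -> flash address (dirty entries)
        self.read_capacity = read_capacity
        self.write_capacity = write_capacity
        self.flush = flush_fn              # persists one dirty mapping entry

    def lookup(self, lba):
        if lba in self.write_cache:
            return self.write_cache[lba]
        return self.read_cache.get(lba)

    def cache_read_mapping(self, lba, flash_addr):
        self.read_cache[lba] = flash_addr
        if len(self.read_cache) > self.read_capacity:
            self.read_cache.popitem(last=False)        # evict: no writeback needed

    def record_new_mapping(self, lba, flash_addr):
        self.write_cache[lba] = flash_addr
        if len(self.write_cache) > self.write_capacity:
            old_lba, old_addr = self.write_cache.popitem(last=False)
            self.flush(old_lba, old_addr)              # evict: must be persisted

buf = MappingBuffer(2, 2, flush_fn=lambda l, a: print("flush", l, hex(a)))
buf.cache_read_mapping(10, 0xA0)
buf.record_new_mapping(11, 0xB0)
print(buf.lookup(10), buf.lookup(11))
```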

Book
23 May 2011
TL;DR: The book attempts a synthesis of recent cache research that has focused on innovations for multi-core processors and is an excellent starting point for early-stage graduate students, researchers, and practitioners who wish to understand the landscape of recent cache research.
Abstract: A key determinant of overall system performance and power dissipation is the cache hierarchy since access to off-chip memory consumes many more cycles and energy than on-chip accesses. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip memory access by improving the efficiency of the on-chip cache. Future multi-core processors will have many large cache banks connected by a network and shared by many cores. Hence, many important problems must be solved: cache resources must be allocated across many cores, data must be placed in cache banks that are near the accessing core, and the most important data must be identified for retention. Finally, difficulties in scaling existing technologies require adapting to and exploiting new technology constraints. The book attempts a synthesis of recent cache research that has focused on innovations for multi-core processors. It is an excellent starting point for early-stage graduate students, researchers, and practitioners who wish to understand the landscape of recent cache research. The book is suitable as a reference for advanced computer architecture classes as well as for experienced researchers and VLSI engineers. Table of Contents: Basic Elements of Large Cache Design / Organizing Data in CMP Last Level Caches / Policies Impacting Cache Hit Rates / Interconnection Networks within Large Caches / Technology / Concluding Remarks

Patent
20 May 2011
TL;DR: In this paper, a cache memory system that uses multi-bit error correcting code (ECC) with a low storage and complexity overhead is presented, without dramatically increasing transition latency to and from an idle power state due to loss of state.
Abstract: A cache memory system is provided that uses multi-bit Error Correcting Code (ECC) with a low storage and complexity overhead. The cache memory system can be operated at very low idle power, without dramatically increasing transition latency to and from an idle power state due to loss of state.

Patent
Ali Mashtizadeh, Irfan Ahmad
13 Jul 2011
TL;DR: In this paper, a technique for managing memory within a virtualized system that includes a memory compression cache is described. But this technique is limited to the first-in-touch-out (FITO) list.
Abstract: Techniques are disclosed for managing memory within a virtualized system that includes a memory compression cache. Generally, the virtualized system may include a hypervisor configured to use a compression cache to temporarily store memory pages that have been compressed to conserve memory space. A “first-in touch-out” (FITO) list may be used to manage the size of the compression cache by monitoring the compressed memory pages in the compression cache. Each element in the FITO list corresponds to a compressed page in the compression cache. Each element in the FITO list records a time at which the corresponding compressed page was stored in the compression cache (i.e. an age). A size of the compression cache may be adjusted based on the ages of the pages in the compression cache.
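
A hedged sketch of a FITO list under an assumed aging policy: pages enter with a timestamp when compressed, leave when touched, and the age of the oldest remaining entry drives a grow/shrink suggestion for the compression cache. The threshold is illustrative, not the patent's.

```python
import time
from collections import OrderedDict

class FitoList:
    def __init__(self):
        self.entries = OrderedDict()       # page_id -> time it was compressed

    def page_compressed(self, page_id):
        self.entries[page_id] = time.monotonic()

    def page_touched(self, page_id):
        self.entries.pop(page_id, None)    # touched out of the compression cache

    def suggest_resize(self, old_age_s=60.0):
        """Return -1 to shrink, +1 to grow, 0 to hold, based on entry ages."""
        if not self.entries:
            return 0
        oldest = next(iter(self.entries.values()))
        if time.monotonic() - oldest > old_age_s:
            return -1      # oldest compressed pages are never being reused
        return +1          # pages are reclaimed while young; more cache may help

fito = FitoList()
fito.page_compressed("page-1")
fito.page_compressed("page-2")
fito.page_touched("page-1")
print(fito.suggest_resize())
```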