
Showing papers on "Cache algorithms" published in 2011


Proceedings ArticleDOI
03 Dec 2011
TL;DR: Die-stacking technology enables multiple layers of DRAM to be integrated with multicore processors; because the capacity of stacked DRAM is insufficient to serve as all of main memory, a promising use is as a large cache.
Abstract: Die-stacking technology enables multiple layers of DRAM to be integrated with multicore processors. A promising use of stacked DRAM is as a cache, since its capacity is insufficient to be all of main memory (for all but some embedded systems). However, a 1GB DRAM cache with 64-byte blocks requires 96MB of tag storage. Placing these tags on-chip is impractical (larger than on-chip L3s) while putting them in DRAM is slow (two full DRAM accesses for tag and data). Larger blocks and sub-blocking are possible, but less robust due to fragmentation. This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations. First, we make hits faster than just storing tags in stacked DRAM by scheduling the tag and data accesses as a compound access so the data access is always a row buffer hit. Second, we make misses faster with a MissMap that eschews stacked-DRAM access on all misses. Like extreme sub-blocking, our implementation of the MissMap stores a vector of block-valid bits for each "page" in the DRAM cache. Unlike conventional sub-blocking, the MissMap (a) points to many more pages than can be stored in the DRAM cache (making the effects of fragmentation rare) and (b) does not point to the "way" that holds a block (but defers to the off-chip tags). For the evaluated large-footprint commercial workloads, the proposed cache organization delivers 92.9% of the performance benefit of an ideal 1GB DRAM cache with an impractical 96MB on-chip SRAM tag array.
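To make the MissMap mechanism concrete, the sketch below models it as a bounded table keyed by page address that holds one valid bit per 64-byte block: a clear bit means the block is definitely not in the DRAM cache, so the stacked-DRAM access can be skipped. This is an illustrative model under assumed parameters (page size, capacity, eviction), not the paper's hardware implementation.

```python
# Hypothetical sketch of a MissMap-style filter for a die-stacked DRAM cache.
# Names and parameters (PAGE_SIZE, BLOCK_SIZE, max_entries) are illustrative.

PAGE_SIZE = 4096      # bytes tracked per MissMap entry
BLOCK_SIZE = 64       # cache block size in bytes
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE

class MissMap:
    """Tracks, per page, which blocks are currently present in the DRAM cache."""
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}           # page address -> list of block-valid bits

    def _page_and_index(self, addr):
        page = addr // PAGE_SIZE
        index = (addr % PAGE_SIZE) // BLOCK_SIZE
        return page, index

    def probably_present(self, addr):
        """False means the block is definitely not cached: skip the DRAM-cache access."""
        page, index = self._page_and_index(addr)
        bits = self.entries.get(page)
        return bits is not None and bits[index]

    def record_fill(self, addr):
        page, index = self._page_and_index(addr)
        if page not in self.entries:
            if len(self.entries) >= self.max_entries:
                self.entries.popitem()          # simplistic eviction of some page
            self.entries[page] = [False] * BLOCKS_PER_PAGE
        self.entries[page][index] = True

    def record_evict(self, addr):
        page, index = self._page_and_index(addr)
        if page in self.entries:
            self.entries[page][index] = False
```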

270 citations


Proceedings ArticleDOI
03 Dec 2011
TL;DR: This paper proposes a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature, and finds that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals.
Abstract: The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, however, is how to best predict the re-reference pattern of an incoming cache line. This paper shows that cache performance can be improved by correlating the re-reference behavior of a cache line with a unique signature. We investigate the use of memory region, program counter, and instruction sequence history based signatures. We also propose a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature. Overall, we find that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals. On average, SHiP improves sequential and multiprogrammed application performance by roughly 10% and 12% over LRU replacement, respectively. Compared to recent replacement policy proposals such as Seg-LRU and SDBP, SHiP nearly doubles the performance gains while requiring less hardware overhead.
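As a rough illustration of the SHiP idea, the sketch below keeps a table of saturating counters indexed by a program-counter signature, trains it on hits and on evictions of never-reused lines, and uses it to pick an RRIP insertion position. Table size, counter width, and the PC-based signature are illustrative assumptions.

```python
# A hedged sketch of a SHiP-style signature table layered on RRIP insertion.

class ShipPredictor:
    def __init__(self, table_size=16384, counter_max=7):
        self.table_size = table_size
        self.counter_max = counter_max
        self.shct = [counter_max // 2] * table_size   # saturating counters

    def _index(self, pc):
        return hash(pc) % self.table_size             # stand-in for a hardware hash

    def on_hit(self, pc):
        i = self._index(pc)
        self.shct[i] = min(self.shct[i] + 1, self.counter_max)

    def on_evict_without_reuse(self, pc):
        i = self._index(pc)
        self.shct[i] = max(self.shct[i] - 1, 0)

    def insertion_rrpv(self, pc, max_rrpv=3):
        """Lines whose signature shows no reuse are inserted with a distant
        re-reference prediction so they are evicted quickly."""
        return max_rrpv if self.shct[self._index(pc)] == 0 else max_rrpv - 1
```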

235 citations


Proceedings ArticleDOI
03 Dec 2011
TL;DR: The experiments show that, on average, the proposed multi-retention-level STT-RAM cache reduces total energy by 30 ∼ 70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
Abstract: Spin-transfer torque random access memory (STT-RAM) has received increasing attention because of its attractive features: good scalability, zero standby power, non-volatility and radiation hardness. The use of STT-RAM technology in last-level on-chip caches has been proposed because it minimizes cache leakage power as technology scales down. Furthermore, the cell area of STT-RAM is only 1/9 ~ 1/3 that of SRAM. This allows for a much larger cache with the same die footprint, improving overall system performance by reducing cache misses. However, deploying STT-RAM technology in L1 caches is challenging because of the long and power-consuming write operations. In this paper, we propose both L1 and lower-level cache designs that use STT-RAM. In particular, our designs use STT-RAM cells with various data retention times and write performances, made possible by different magnetic tunneling junction (MTJ) designs. For the fast STT-RAM bits with reduced data retention time, a counter-controlled dynamic refresh scheme is proposed to maintain data validity. Our dynamic scheme saves more than 80% of the refresh energy compared to the simple refresh scheme proposed in previous works. An L1 cache built with ultra-low-retention STT-RAM coupled with our proposed dynamic refresh scheme can achieve a 9.2% performance improvement and save up to 30% of the total energy compared to one that uses traditional SRAM. For lower-level caches with relatively large capacity, we propose a data migration scheme that moves data between portions of the cache with different retention characteristics so as to maximize the performance and power benefits. Our experiments show that, on average, our proposed multi-retention-level STT-RAM cache reduces total energy by 30 ~ 70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
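The counter-controlled refresh idea can be sketched as one countdown counter per low-retention line: a write restarts the countdown, invalid lines are never refreshed, and only lines about to expire are refreshed. The sketch below is a hedged simplification with assumed parameters, not the paper's circuit.

```python
# Minimal sketch of counter-controlled refresh for low-retention STT-RAM lines,
# assuming one countdown counter per cache line.

class RefreshController:
    def __init__(self, num_lines, retention_ticks):
        self.retention_ticks = retention_ticks
        self.valid = [False] * num_lines
        self.counters = [0] * num_lines

    def on_write(self, line):
        # A write restores the cell state, so the countdown restarts.
        self.valid[line] = True
        self.counters[line] = self.retention_ticks

    def on_invalidate(self, line):
        self.valid[line] = False

    def tick(self):
        """Called once per refresh epoch; returns lines that must be refreshed now."""
        to_refresh = []
        for line, is_valid in enumerate(self.valid):
            if not is_valid:
                continue                  # invalid lines cost no refresh energy
            self.counters[line] -= 1
            if self.counters[line] <= 1:  # about to lose data: refresh and restart
                to_refresh.append(line)
                self.counters[line] = self.retention_ticks
        return to_refresh
```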

234 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This work presents Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions.
Abstract: Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support a few partitions, and reduce cache associativity, hurting performance. Hence, these techniques can only be applied to CMPs with 2-4 cores, but fail to scale to tens of cores. We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g., 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller. We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% on average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.
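A heavily simplified sketch of the core Vantage mechanism, controlling replacement rather than placement, is shown below: every insertion is charged to a partition and victims are preferentially drawn from partitions that exceed their target size. The real design operates on a managed region with apertures derived from analytical models; everything here is an illustrative assumption.

```python
# A hedged, highly simplified sketch of "control replacement, not placement"
# partition sizing. Targets, eviction choice, and data structures are illustrative.

import random

class VantageLikeCache:
    def __init__(self, capacity, targets):
        self.capacity = capacity
        self.targets = targets                    # partition id -> target line count
        self.owner = {}                           # address -> partition id
        self.occupancy = {p: 0 for p in targets}

    def _victim(self):
        # Evict from the non-empty partition that most exceeds its target.
        candidates = [p for p in self.targets if self.occupancy[p] > 0]
        worst = max(candidates, key=lambda p: self.occupancy[p] - self.targets[p])
        lines = [addr for addr, p in self.owner.items() if p == worst]
        return random.choice(lines)

    def insert(self, addr, partition):
        if addr in self.owner:
            return
        if len(self.owner) >= self.capacity:
            evicted_partition = self.owner.pop(self._victim())
            self.occupancy[evicted_partition] -= 1
        self.owner[addr] = partition
        self.occupancy[partition] += 1
```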

229 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper proposes a novel cache architecture that uses variable-strength error-correcting codes (VS-ECC), which significantly reduces power and energy, avoids significant reductions in cache capacity, incurs little area overhead, and avoids large increases in latency and bandwidth.
Abstract: Voltage scaling is one of the most effective mechanisms to improve microprocessors' energy efficiency. However, processors cannot operate reliably below a minimum voltage, Vccmin, since hardware structures may fail. Cell failures in large memory arrays (e.g., caches) typically determine Vccmin for the whole processor. We observe that most cache lines exhibit zero or one failures at low voltages. However, a few lines, especially in large caches, exhibit multi-bit failures and increase Vccmin. Previous solutions either significantly reduce cache capacity to enable uniform error correction across all lines, or significantly increase latency and bandwidth overheads when amortizing the cost of error-correcting codes (ECC) over large lines. In this paper, we propose a novel cache architecture that uses variable-strength error-correcting codes (VS-ECC). In the common case, lines with zero or one failures use a simple and fast ECC. A small number of lines with multi-bit failures use a strong multi-bit ECC that requires some additional area and latency. We present a novel dynamic cache characterization mechanism to determine which lines will exhibit multi-bit failures. In particular, we use multi-bit correction to protect a fraction of the cache after switching to low voltage, while dynamically testing the remaining lines for multi-bit failures. Compared to prior multi-bit-correcting proposals, VS-ECC significantly reduces power and energy, avoids significant reductions in cache capacity, incurs little area overhead, and avoids large increases in latency and bandwidth.
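The variable-strength assignment can be illustrated as a classification pass over per-line failure counts from low-voltage characterization: most lines get a fast SECDED code, a small pool gets a stronger multi-bit code, and anything beyond the pool is disabled. Thresholds and pool size below are assumptions, not the paper's parameters.

```python
# Illustrative sketch of variable-strength ECC assignment (VS-ECC-like).

def assign_ecc(failing_bits_per_line, strong_ecc_pool=64):
    plan = {}
    strong_used = 0
    for line, failures in enumerate(failing_bits_per_line):
        if failures <= 1:
            plan[line] = "SECDED"              # common case: fast, low overhead
        elif strong_used < strong_ecc_pool:
            plan[line] = "MULTI_BIT_ECC"       # rare case: slower, stronger code
            strong_used += 1
        else:
            plan[line] = "DISABLED"            # no protection budget left
    return plan

# Example: three healthy lines, one weak line, one very weak line.
print(assign_ecc([0, 1, 0, 3, 5], strong_ecc_pool=1))
```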

167 citations


Proceedings ArticleDOI
27 Jun 2011
TL;DR: It is shown that stealing crypto keys in a virtualized cloud may be a real threat by evaluating a cache-based side-channel attack against an encryption process and proposing an approach that leverages dynamic cache coloring: when an application is doing security-sensitive operations, the VMM is notified to swap the associated data to a safe and isolated cache line.
Abstract: Multi-tenant cloud, which features utility-like computing resources to tenants in a “pay-as-you-go” style, has been commercially popular for years. As one of the sole purposes of such a cloud is maximizing resource usage to increase its revenue, it usually uses virtualization to consolidate VMs from different and even mutually-malicious tenants atop a powerful physical machine. This, however, also enables a malicious tenant to steal security-critical information such as crypto keys from victims, due to the shared physical resources such as caches. In this paper, we show that stealing crypto keys in a virtualized cloud may be a real threat by evaluating a cache-based side-channel attack against an encryption process. To mitigate such attacks while not notably degrading performance, we propose an approach that leverages dynamic cache coloring: when an application is doing security-sensitive operations, the VMM is notified to swap the associated data to a safe and isolated cache line. This approach may eliminate cache-based side channels for security-critical operations, yet ensures efficient resource sharing during normal operations. We demonstrate the applicability by illustrating a preliminary implementation based on Xen and its performance overhead.
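The dynamic cache coloring idea can be sketched in terms of page colors, the physical-frame bits that select cache sets: during security-sensitive operations, the VMM gives sensitive pages only frames of a reserved color, so their cache sets are never shared with other tenants. The allocator below is a hedged sketch with illustrative parameters, not the Xen-based prototype.

```python
# Illustrative page-color bookkeeping for dynamic cache coloring; a real VMM
# would remap guest-physical pages to host frames of the reserved color.

COLOR_BITS = 5                      # e.g., 32 colors for a colorable LLC

def color_of(frame_number):
    return frame_number & ((1 << COLOR_BITS) - 1)

class ColorAllocator:
    def __init__(self, free_frames, reserved_color):
        self.reserved_color = reserved_color
        self.free = {c: [] for c in range(1 << COLOR_BITS)}
        for frame in free_frames:
            self.free[color_of(frame)].append(frame)

    def frame_for_sensitive_page(self):
        """During security-sensitive operations, hand out only frames of the
        isolated color so the victim's lines never share sets with other VMs."""
        return self.free[self.reserved_color].pop()

    def frame_for_normal_page(self):
        for color, frames in self.free.items():
            if color != self.reserved_color and frames:
                return frames.pop()
        raise MemoryError("no free frames outside the reserved color")
```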

164 citations


Patent
01 Nov 2011
TL;DR: Systems and methods for cache-defeat detection, and for caching content addressed by identifiers intended to defeat caching, are disclosed; they can detect a data request to a content source whose received content is stored as cache elements in a local cache on the mobile device, determine from an identifier of the data request that the content source uses a cache-defeating mechanism, and retrieve content from the cache elements in the local cache to respond to the request.
Abstract: Systems and methods for cache defeat detection are disclosed. Moreover, systems and methods for caching of content addressed by identifiers intended to defeat cache are further disclosed. In one aspect, embodiments of the present disclosure include a method, which may be implemented on a system, of resource management in a wireless network by caching content on a mobile device. The method can include detecting a data request to a content source for which content received is stored as cache elements in a local cache on the mobile device, determining, from an identifier of the data request, that a cache defeating mechanism is used by the content source, and/or retrieving content from the cache elements in the local cache to respond to the data request.
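One way to picture cache-defeat handling is identifier normalization: if repeated requests to a source differ only in a parameter whose value changes on every request, that parameter is treated as cache-busting and dropped when forming the cache key. The sketch below is illustrative; the parameter names and detection policy are assumptions, not the patent's claims.

```python
# Illustrative normalization of a cache-defeating identifier: requests that
# differ only in a rotating token (e.g., a timestamp) map to one cache entry.

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def normalized_cache_key(url, volatile_params):
    parts = urlparse(url)
    stable = [(k, v) for k, v in parse_qsl(parts.query) if k not in volatile_params]
    return urlunparse(parts._replace(query=urlencode(sorted(stable))))

key1 = normalized_cache_key("http://example.com/feed?id=7&ts=111", {"ts"})
key2 = normalized_cache_key("http://example.com/feed?id=7&ts=999", {"ts"})
assert key1 == key2   # both requests can be served from the same cache element
```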

161 citations


Proceedings ArticleDOI
03 Dec 2011
TL;DR: This paper characterizes the performance of state-of-the-art LLC management policies in the presence and absence of hardware prefetching, and proposes Prefetch-Aware Cache Management (PACMan), which dynamically estimates and mitigates the degree of prefetch-induced cache interference.
Abstract: Hardware prefetching and last-level cache (LLC) management are two independent mechanisms to mitigate the growing latency to memory. However, the interaction between LLC management and hardware prefetching has received very little attention. This paper characterizes the performance of state-of-the-art LLC management policies in the presence and absence of hardware prefetching. Although prefetching improves performance by fetching useful data in advance, it can interact with LLC management policies to introduce application performance variability. This variability stems from the fact that current replacement policies treat prefetch and demand requests identically. In order to provide better and more predictable performance, we propose Prefetch-Aware Cache Management (PACMan). PACMan dynamically estimates and mitigates the degree of prefetch-induced cache interference by modifying the cache insertion and hit promotion policies to treat demand and prefetch requests differently. Across a variety of emerging workloads, we show that PACMan eliminates the performance variability in state-of-the-art replacement policies under the influence of prefetching. In fact, PACMan improves performance consistently across multimedia, games, server, and SPEC CPU2006 workloads by an average of 21.9% over the baseline LRU policy. For multiprogrammed workloads, on a 4-core CMP, PACMan improves performance by 21.5% on average.
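The core PACMan change, treating demand and prefetch requests differently at insertion and on hits, can be sketched in a few lines on top of an RRIP-style policy. The specific choices below (distant insertion for prefetches, no promotion on prefetch hits) are illustrative assumptions rather than the evaluated policy.

```python
# A hedged sketch of prefetch-aware insertion/promotion on an RRIP-style cache.

MAX_RRPV = 3

def insertion_rrpv(is_prefetch):
    # Prefetched lines are predicted to be re-referenced in the distant future.
    return MAX_RRPV if is_prefetch else MAX_RRPV - 1

def rrpv_after_hit(current_rrpv, hit_by_prefetch):
    # Demand hits promote the line; prefetch hits leave its prediction unchanged
    # so prefetch traffic cannot keep lines pinned near the MRU position.
    return current_rrpv if hit_by_prefetch else 0
```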

112 citations


Proceedings ArticleDOI
Zoltan Majo1, Thomas R. Gross1
04 Jun 2011
TL;DR: This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.
Abstract: Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance up to 32%, and 7% on average, over the default setup in current Linux implementations.
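A rough sketch of the N-MASS intuition follows: start from the maximum-local placement, then move the most cache-hungry processes off over-subscribed nodes, trading some locality for less contention. The pressure metric, threshold, and data layout are assumptions for illustration only.

```python
# Hedged sketch: locality-first placement with a contention correction step.

def schedule(processes, nodes, pressure_limit):
    """processes: list of (pid, home_node, cache_pressure); nodes: list of node ids."""
    placement = {pid: home for pid, home, _ in processes}          # maximum-local
    load = {n: 0.0 for n in nodes}
    for _, home, pressure in processes:
        load[home] += pressure
    for pid, home, pressure in sorted(processes, key=lambda p: -p[2]):
        if load[home] > pressure_limit:                            # node over-subscribed
            target = min(nodes, key=lambda n: load[n])
            if target != home:
                placement[pid] = target                            # sacrifice locality
                load[home] -= pressure
                load[target] += pressure
    return placement
```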

110 citations


Journal ArticleDOI
TL;DR: A novel price-demand model designed for a cloud cache and a dynamic pricing scheme for queries executed in the cloud cache are proposed; the pricing solution employs a novel method that estimates the correlations of the cache services in a time-efficient manner.
Abstract: Cloud applications that offer data management services are emerging. Such clouds support caching of data in order to provide quality query services. The users can query the cloud data, paying the price for the infrastructure they use. Cloud management necessitates an economy that manages the service of multiple users in an efficient, but also resource-economic, way that allows for cloud profit. Naturally, the maximization of cloud profit given some guarantees for user satisfaction presumes an appropriate price-demand model that enables optimal pricing of query services. The model should be plausible in that it reflects the correlation of cache structures involved in the queries. Optimal pricing is achieved based on a dynamic pricing scheme that adapts to time changes. This paper proposes a novel price-demand model designed for a cloud cache and a dynamic pricing scheme for queries executed in the cloud cache. The pricing solution employs a novel method that estimates the correlations of the cache services in a time-efficient manner. The experimental study shows the efficiency of the solution.

107 citations


Patent
12 Aug 2011
TL;DR: In this paper, a storage request module detects an input/output (I/O) request for a storage device cached by solid-state storage media of a cache, and a direct mapping module references a single mapping structure to determine that the cache comprises data of the I/O request.
Abstract: An apparatus, system, and method are disclosed for caching data. A storage request module detects an input/output (“I/O”) request for a storage device cached by solid-state storage media of a cache. A direct mapping module references a single mapping structure to determine that the cache comprises data of the I/O request. The single mapping structure maps each logical block address of the storage device directly to a logical block address of the cache. The single mapping structure maintains a fully associative relationship between logical block addresses of the storage device and physical storage addresses on the solid-state storage media. A cache fulfillment module satisfies the I/O request using the cache in response to the direct mapping module determining that the cache comprises at least one data block of the I/O request.
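The single mapping structure can be pictured as one map from backing-store logical block addresses to cache logical block addresses, which is what makes the cache fully associative. The sketch below is an illustrative software analogue; the eviction policy and names are assumptions, not the patent's mechanism.

```python
# Hedged sketch of one fully associative LBA-to-LBA mapping structure.

from collections import OrderedDict

class SingleMappingStructure:
    def __init__(self, cache_blocks):
        self.map = OrderedDict()                # storage LBA -> cache LBA
        self.free = list(range(cache_blocks))   # unused cache LBAs

    def lookup(self, storage_lba):
        cache_lba = self.map.get(storage_lba)
        if cache_lba is not None:
            self.map.move_to_end(storage_lba)   # LRU-style bookkeeping
        return cache_lba                        # None means the cache misses

    def insert(self, storage_lba):
        if storage_lba in self.map:
            return self.map[storage_lba]
        if not self.free:
            _, reclaimed = self.map.popitem(last=False)   # evict least recent
            self.free.append(reclaimed)
        cache_lba = self.free.pop()
        self.map[storage_lba] = cache_lba
        return cache_lba
```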

Proceedings ArticleDOI
26 Oct 2011
TL;DR: It is shown how a small, fast popularity-based front-end cache can ensure load balancing for an important class of cloud computing services and it is proved an O(n log n) lower-bound on the necessary cache size and shown that this size depends only on the total number of back-end nodes n.
Abstract: Load balancing requests across a cluster of back-end servers is critical for avoiding performance bottlenecks and meeting service-level objectives (SLOs) in large-scale cloud computing services. This paper shows how a small, fast popularity-based front-end cache can ensure load balancing for an important class of such services; furthermore, we prove an O(n log n) lower-bound on the necessary cache size and show that this size depends only on the total number of back-end nodes n, not the number of items stored in the system. We validate our analysis through simulation and empirical results running a key-value storage system on an 85-node cluster.
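The mechanism is easy to sketch: a small front-end cache holds only the hottest keys, sized by a rule of thumb that grows with n log n in the number of back-end nodes, so skewed popularity cannot concentrate load on any one back end. The constant factor and LRU policy below are assumptions for illustration.

```python
# Hedged sketch of a small popularity-based front-end cache; the capacity rule
# mirrors the O(n log n) result but the constant c is an assumption.

import math
from collections import OrderedDict

def frontend_cache_capacity(num_backends, c=4):
    return max(1, int(c * num_backends * math.log(num_backends)))

class FrontEndCache:
    def __init__(self, num_backends):
        self.capacity = frontend_cache_capacity(num_backends)
        self.items = OrderedDict()                     # LRU over hot keys

    def get(self, key, fetch_from_backend):
        if key in self.items:
            self.items.move_to_end(key)
            return self.items[key]
        value = fetch_from_backend(key)                # back-end serves the miss
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return value
```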

Proceedings Article
27 Jul 2011
TL;DR: This paper presents three kinds of caches to store relevant document-level information: a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs in the training parallel corpus; and a topic cache, which stores the target-side topic words related to the source side of the test document.
Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly relevant data of a reasonable size. We present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e., source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related to the source side of the test document. In particular, three new features are designed to explore the various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 in BLEU score over Moses. Moreover, detailed analysis and discussion are presented to give new insights into document-level translation.
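The dynamic cache can be sketched as a bounded store of phrase pairs harvested from the best hypotheses of earlier sentences in the document, plus a simple feature that rewards reuse. The scoring and capacity below are illustrative assumptions, not the paper's feature definitions.

```python
# Hedged sketch of the dynamic-cache idea for document-level translation.

from collections import OrderedDict

class DynamicPhraseCache:
    def __init__(self, capacity=5000):
        self.capacity = capacity
        self.pairs = OrderedDict()          # (source phrase, target phrase) -> count

    def update_from_hypothesis(self, phrase_pairs):
        """Add phrase pairs from the best translation of the previous sentence."""
        for pair in phrase_pairs:
            self.pairs[pair] = self.pairs.get(pair, 0) + 1
            self.pairs.move_to_end(pair)
            if len(self.pairs) > self.capacity:
                self.pairs.popitem(last=False)

    def cache_feature(self, phrase_pairs):
        """Fraction of a hypothesis's phrase pairs that also appear in the cache."""
        if not phrase_pairs:
            return 0.0
        hits = sum(1 for pair in phrase_pairs if pair in self.pairs)
        return hits / len(phrase_pairs)
```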

Proceedings ArticleDOI
01 Aug 2011
TL;DR: A low-overhead, fully-hardware technique is utilized to detect write-intensive data blocks of the working set and place them into SRAM lines, while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation.
Abstract: In this paper, we propose a run-time strategy for managing writes onto the last-level cache in chip multiprocessors where STT-RAM memory is used as the baseline technology. To this end, we assume that each cache set is decomposed into a limited number of SRAM lines and a large number of STT-RAM lines. SRAM lines are the target of frequently-written data, and rarely-written or read-only data are pushed into STT-RAM. As a novel contribution, a low-overhead, fully-hardware technique is utilized to detect write-intensive data blocks of the working set and place them into SRAM lines, while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation. Therefore, the achieved cache architecture has large capacity and consumes near-zero leakage energy in the STT-RAM array, while low dynamic write energy, acceptable write latency, and long lifetime are guaranteed via the SRAM array. Results of full-system simulation for a quad-core CMP running the PARSEC-2 benchmark suite confirm an average 49-times improvement in cache lifetime and more than 50% reduction in cache power consumption when compared to baseline configurations.
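The detection mechanism can be approximated with per-block saturating counters that rise on writes and fall on reads; blocks above a threshold are steered to the SRAM ways and the rest to STT-RAM. Counter width and threshold below are assumptions, not the paper's hardware parameters.

```python
# Hedged sketch of write-intensity detection for a hybrid SRAM/STT-RAM set.

COUNTER_MAX = 7
WRITE_INTENSIVE_THRESHOLD = 4

class HybridSetPlacer:
    def __init__(self):
        self.write_counters = {}            # block tag -> saturating counter

    def on_write(self, tag):
        count = self.write_counters.get(tag, 0)
        self.write_counters[tag] = min(count + 1, COUNTER_MAX)

    def on_read(self, tag):
        count = self.write_counters.get(tag, 0)
        self.write_counters[tag] = max(count - 1, 0)

    def preferred_way_type(self, tag):
        """Write-hot blocks go to SRAM ways; cold or read-mostly blocks to STT-RAM."""
        if self.write_counters.get(tag, 0) >= WRITE_INTENSIVE_THRESHOLD:
            return "SRAM"
        return "STT-RAM"
```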

Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper studies the integration of STT-RAM in a 3D multi-core environment and proposes solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology.
Abstract: Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories in future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle in their adoption. Previous works have proposed various circuit- and architectural-level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks to effectively hide the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long-latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router requesting accesses to idle banks. Evaluations on a 3D architecture, consisting of 64 cores and 64 STT-RAM cache banks, show that our proposed approach provides 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and 6% latency reduction compared to a recently proposed write buffering mechanism.
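The network-level idea can be sketched as simple bank bookkeeping: estimate when each bank's long write will finish, and let requests destined for idle banks overtake those headed to busy ones. The latency constants and ranking rule below are illustrative assumptions.

```python
# Hedged sketch of bank busy-time estimation and idle-bank prioritization.

WRITE_LATENCY = 20        # cycles a bank stays busy after accepting a write
READ_LATENCY = 5

class BankScheduler:
    def __init__(self, num_banks):
        self.busy_until = [0] * num_banks      # per-bank estimated free time

    def accept(self, bank, is_write, now):
        latency = WRITE_LATENCY if is_write else READ_LATENCY
        self.busy_until[bank] = max(self.busy_until[bank], now) + latency

    def pick_next(self, pending, now):
        """pending: list of (bank, is_write) requests waiting at a router.
        Requests to banks that are already idle are served first."""
        if not pending:
            return None
        return min(pending, key=lambda req: max(self.busy_until[req[0]] - now, 0))
```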

Patent
31 Oct 2011
TL;DR: In this paper, the authors present a processor with a plurality of cores and a cache memory coupled to the cores and including a pluralityof partitions, which can dynamically vary a size of the cache memory based on a memory boundedness of a workload executed on at least one of the cores.
Abstract: In one embodiment, the present invention is directed to a processor having a plurality of cores and a cache memory coupled to the cores and including a plurality of partitions. The processor can further include a logic to dynamically vary a size of the cache memory based on a memory boundedness of a workload executed on at least one of the cores. Other embodiments are described and claimed.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: A refined persistence analysis method is presented that fixes the potential underestimation problem in the original persistence analysis, and a framework is proposed that combines access pattern analysis and abstract interpretation for accurate data cache analysis.
Abstract: Caches are widely used in modern computer systems to bridge the increasing gap between processor speed and memory access time. On the other hand, the presence of caches, especially data caches, complicates static worst-case execution time (WCET) analysis. Access pattern analysis (e.g., cache miss equations) is applicable to only a specific class of programs, where all array accesses must have predictable access patterns. Abstract interpretation-based methods (must/persistence analysis) determine possible cache conflicts based on coarse-grained memory access information from address analysis, which usually leads to significantly pessimistic estimation. In this paper, we first present a refined persistence analysis method which fixes the potential underestimation problem in the original persistence analysis. Based on our new persistence analysis, we propose a framework to combine access pattern analysis and abstract interpretation for accurate data cache analysis. We capture the dynamic behavior of a memory access by computing its temporal scope (the loop iterations where a given memory block is accessed for a given data reference) during address analysis. Temporal scopes as well as the loop hierarchy structure (the static scopes) are integrated and utilized to achieve more precise abstract cache state modeling. Experimental results show that our proposed analysis obtains up to 74% reduction in the WCET estimates compared to existing data cache analysis.

Patent
30 Dec 2011
TL;DR: In this article, the authors present a method, computer program product, and computing system for receiving an indication that a virtual machine is going to be migrated from a first operating environment to a second operating environment.
Abstract: A method, computer program product, and computing system for receiving an indication that a virtual machine is going to be migrated from a first operating environment to a second operating environment. The mode of operation of a cache system associated with the virtual machine is downgraded. Content included within a memory device currently associated with the cache system is copied to a memory device to be associated with the cache system. The memory device currently associated with the cache system is detached from the virtual machine. The virtual machine is migrated from the first operating environment to the second operating environment.

Proceedings ArticleDOI
29 Nov 2011
TL;DR: This paper introduces a new method of bounding pre-emption costs, called the ECB-Union approach, which complements an existing UCB-Union approach; the two are combined into a simple composite approach that dominates both.
Abstract: Without the use of cache, the increasing gap between processor and memory speeds in modern embedded microprocessors would have resulted in memory access times becoming an unacceptable bottleneck. In such systems, cache related pre-emption delays can be a significant proportion of task execution times. To obtain tight bounds on the response times of tasks in pre-emptively scheduled systems, it is necessary to integrate worst-case execution time analysis and schedulability analysis via the use of an appropriate model of pre-emption costs. In this paper, we introduce a new method of bounding pre-emption costs, called the ECB-Union approach. The ECB-Union approach complements an existing UCB-Union approach. We combine the two into a simple composite approach that dominates both. These approaches are integrated into response time analysis for fixed priority pre-emptively scheduled systems. Further, we extend this analysis to systems where tasks can access resources in mutual exclusion, in the process resolving omissions in existing models of pre-emption delays. A case study and empirical evaluation demonstrate the effectiveness of the ECB-Union and combined approaches for a wide range of different cache configurations including cache utilization, cache set size, reuse, and block reload times.
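As a hedged sketch of the ECB-Union intuition: the pre-emption cost charged for a pre-empting task accounts for that task itself being pre-empted, so the evicting cache blocks are taken as the union of the ECBs of the pre-emptor and all higher-priority tasks, intersected with the useful cache blocks (UCBs) of an affected task. The function below is a simplification of the published analysis, with an assumed per-block reload time.

```python
# Hedged sketch of an ECB-Union-style per-pre-emption cost bound.

def ecb_union_cost(affected_ucbs, preemptor_and_higher_ecbs, block_reload_time):
    """affected_ucbs: UCB sets (of cache-set indices) for tasks that may be pre-empted;
    preemptor_and_higher_ecbs: ECB sets for the pre-empting task and every task
    of higher priority; block_reload_time: cost to reload one evicted block."""
    evicting = set().union(*preemptor_and_higher_ecbs) if preemptor_and_higher_ecbs else set()
    worst = max((len(ucb & evicting) for ucb in affected_ucbs), default=0)
    return worst * block_reload_time

# Example with hypothetical cache-set footprints.
print(ecb_union_cost([{1, 2, 3}, {2, 4}], [{2, 3}, {4, 5}], block_reload_time=8))
```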

Patent
03 Jun 2011
TL;DR: In this article, a solid state drive may be used as a log structured cache, may employ multi-level metadata management, may use read and write gating, and may accelerate access to other storage media.
Abstract: Examples of described systems utilize a cache media in one or more computing devices that may accelerate access to other storage media. A solid state drive may be used as the local cache media. In some embodiments, the solid state drive may be used as a log structured cache, may employ multi-level metadata management, may use read and write gating.

Proceedings ArticleDOI
09 Mar 2011
TL;DR: A novel approach is presented that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing, and is able to improve the performance of some applications up to a factor of 12x and shed light on the obstacles that prevent their performance from scaling to many cores.
Abstract: In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach --- a 5x slowdown on average relative to native execution --- is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications up to a factor of 12x. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
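The shadow-memory bookkeeping behind such a tool can be pictured as a per-cache-line record of which threads wrote which bytes: multiple writers touching overlapping bytes indicate true sharing, disjoint bytes indicate false sharing. The sketch below shows only this data-structure idea, not the dynamic-instrumentation machinery.

```python
# Hedged sketch of per-line sharing classification from write traces.

LINE_SIZE = 64

class SharingTracker:
    def __init__(self):
        self.lines = {}        # line address -> {thread id: set of byte offsets written}

    def record_write(self, thread_id, addr, size=1):
        line = addr // LINE_SIZE
        offset = addr % LINE_SIZE
        touched = self.lines.setdefault(line, {}).setdefault(thread_id, set())
        touched.update(range(offset, offset + size))

    def classify(self, line):
        writers = list(self.lines.get(line, {}).values())
        if len(writers) < 2:
            return "private"
        overlapping = any(a & b for i, a in enumerate(writers) for b in writers[i + 1:])
        return "true sharing" if overlapping else "false sharing"
```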

Proceedings ArticleDOI
01 Oct 2011
TL;DR: This paper proposes to integrate SRAM with STT-RAM to construct a novel hybrid cache architecture for CMPs and proposes dedicated microarchitectural mechanisms to make the hybrid cache robust to workloads with different write patterns.
Abstract: Modern high performance Chip Multiprocessor (CMP) systems rely on a large on-chip cache hierarchy. As technology scales down, the leakage power of present SRAM-based caches gradually dominates the on-chip power consumption, which can severely jeopardize system performance. The emerging nonvolatile Spin Transfer Torque RAM (STT-RAM) is a promising candidate for large on-chip caches because of its ultra-low leakage power. However, write operations on STT-RAM suffer from considerably higher energy as well as longer latency compared with SRAM, which makes STT-RAM problematic for write-intensive workloads. In this paper, we propose to integrate SRAM with STT-RAM to construct a novel hybrid cache architecture for CMPs. We also propose dedicated microarchitectural mechanisms to make the hybrid cache robust to workloads with different write patterns. Extensive simulation results demonstrate that the proposed hybrid scheme is adaptive to variations in workloads. Overall power consumption is reduced by 37.1% and performance is improved by 23.6% on average compared with an SRAM-based static NUCA under the same area configuration.

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

Patent
02 Nov 2011
TL;DR: In this paper, a multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to requests of a different respective type and/or granularity.
Abstract: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to I/O requests of a different respective type and/or granularity. A cache device manager may allocate cache storage space to each of the cache levels. Each cache level maintains respective cache metadata that associates I/O request data with respective cache addresses. The cache levels monitor I/O requests within a storage stack, apply selection criteria to identify cacheable I/O requests, and service cacheable I/O requests using the cache storage device.

Proceedings ArticleDOI
05 Jun 2011
TL;DR: This paper presents a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks and can achieve 29.29% energy saving on average.
Abstract: Multicore architectures, especially chip multi-processors, have been widely acknowledged as a successful design paradigm. Existing approaches primarily target application-driven partitioning of the shared cache to alleviate inter-core cache interference so that both performance and energy efficiency are improved. Dynamic cache reconfiguration is a promising technique in reducing energy consumption of the cache subsystem for uniprocessor systems. In this paper, we present a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks. Our static profiling based algorithm is designed to judiciously find beneficial cache configurations (of private caches) for each task as well as partition factors (of the shared cache) for each core so that the energy consumption is minimized while task deadline is satisfied. Experimental results using real benchmarks demonstrate that our approach can achieve 29.29% energy saving on average compared to systems employing only cache partitioning.

Proceedings ArticleDOI
12 Feb 2011
TL;DR: This work proposes a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control and demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
Abstract: The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less effective for two problems: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.

Patent
30 Dec 2011
TL;DR: In this article, a method, computer program product, and computing system for copying a cache system from a first machine to a second machine, wherein the cache system includes cache content and a content directory, thus generating a duplicate cache system on the second machine.
Abstract: A method, computer program product, and computing system for copying a cache system from a first machine to a second machine, wherein the cache system includes cache content and a content directory, thus generating a duplicate cache system on the second machine. The duplicate cache system includes duplicate cache content and a duplicate content directory. A plurality of data requests concerning a plurality of data actions to be taken on a data array associated with the first machine are received on the first machine. The plurality of data requests are stored on a tracking queue included within the data array associated with the first machine.

Proceedings ArticleDOI
28 Jun 2011
TL;DR: An analysis that examines tradeoffs in terms of storage, bandwidth, and freshness of data is presented and measures the performance of the Caché approach with respect to privacy and mobile content availability using real-world mobility traces.
Abstract: We present the design, implementation, and evaluation of Caché, a system that offers location privacy for certain classes of location-based applications. The core idea in Caché is to periodically pre-fetch potentially useful location-enhanced content well in advance. Applications then retrieve content from a local cache on the mobile device when it is needed. This approach allows an end-user to make use of location-enhanced content while only revealing to third-party content providers a large geographic region rather than a precise location. In this paper, we present an analysis that examines tradeoffs in terms of storage, bandwidth, and freshness of data. We then discuss the design and implementation of an Android service embodying these ideas. Finally, we provide two evaluations of Caché. One measures the performance of our approach with respect to privacy and mobile content availability using real-world mobility traces. The other focuses on our experiences using Caché to enhance user privacy in three open source Android applications.
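The pre-fetching idea can be sketched by quantizing locations into coarse tiles, fetching a whole tile's content from the provider ahead of time, and answering precise queries from the local cache. The tile size and interfaces below are assumptions for illustration, not the Android implementation.

```python
# Hedged sketch of region-granular prefetching for location privacy.

REGION_KM = 10.0

def region_of(lat, lon, size_km=REGION_KM):
    # Roughly size_km x size_km tiles; adequate only as an illustration.
    deg = size_km / 111.0
    return (int(lat / deg), int(lon / deg))

class LocationCache:
    def __init__(self, fetch_region_content):
        self.fetch_region_content = fetch_region_content   # coarse provider query
        self.store = {}                                     # region -> list of items

    def prefetch(self, lat, lon):
        region = region_of(lat, lon)
        if region not in self.store:
            self.store[region] = self.fetch_region_content(region)

    def query(self, lat, lon, predicate):
        """Answer locally; the precise location never leaves the device."""
        items = self.store.get(region_of(lat, lon), [])
        return [item for item in items if predicate(item)]
```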

Proceedings ArticleDOI
13 Sep 2011
TL;DR: A low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity is presented.
Abstract: We present a low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity. The method is implemented on real hardware, with no modifications to the application or operating system. We accomplish this by co-running a Pirate application that "steals" cache space with the Target application. By adjusting how much space the Pirate steals during the Target's execution, and using hardware performance counters to record the Target's performance, we can accurately and efficiently capture performance data for the Target application as a function of its available shared cache. At the same time we use performance counters to monitor the Pirate to ensure that it is successfully stealing the desired amount of cache. To evaluate this approach, we show that 1) the cache available to the Target behaves as expected, 2) the Pirate steals the desired amount of cache, and 3) the Pirate does not bias the Target's performance. As a result, we are able to accurately measure the Target's performance while stealing up to an average of 6.8MB of the 8MB of cache on our Nehalem-based test system with an average measurement overhead of only 5.5%.
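A user-level caricature of the Pirate is a thread that repeatedly walks an array whose size sets how much shared cache it occupies; sweeping that size while sampling the Target's performance counters yields CPI as a function of remaining cache. The sketch below is illustrative only, not the paper's tuned implementation.

```python
# Hedged sketch of a cache "Pirate": keep a footprint of a chosen size resident
# in the shared cache by touching one byte per cache line, over and over.

LINE_SIZE = 64

def pirate(steal_bytes, iterations=1000):
    footprint = bytearray(steal_bytes)
    offsets = range(0, steal_bytes, LINE_SIZE)
    sink = 0
    for _ in range(iterations):
        for offset in offsets:          # touch every line to keep it resident
            sink += footprint[offset]
    return sink                         # returned so the work is not optimized away

# Sweeping steal_bytes (e.g., 1MB..7MB on an 8MB LLC) while the Target runs
# would trace out the Target's performance versus available cache.
```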

Patent
15 Jul 2011
TL;DR: In this article, a pipelined processor comprising a cache memory system, fetching instructions for execution from a portion of said cache memory systems, an instruction commencing processing before a digital signature of the cache line that contained the instruction is verified against a reference signature of cache line, the verification being done at the point of decoding, dispatching, or committing execution of the instruction, the reference signature being stored in an encrypted form in the processor's memory, and the key for decrypting the said reference signature was stored in a secure storage location.
Abstract: A pipelined processor comprising a cache memory system, fetching instructions for execution from a portion of said cache memory system, an instruction commencing processing before a digital signature of the cache line that contained the instruction is verified against a reference signature of the cache line, the verification being done at the point of decoding, dispatching, or committing execution of the instruction, the reference signature being stored in an encrypted form in the processor's memory, and the key for decrypting the said reference signature being stored in a secure storage location. The instruction processing proceeds when the two signatures exactly match and, where further instruction processing is suspended or processing modified on a mismatch of the two said signatures.