Showing papers on "Cache invalidation published in 2011"


Proceedings ArticleDOI
03 Dec 2011
TL;DR: This paper proposes a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature, and finds that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals.
Abstract: The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental challenge, however, is how to best predict the re-reference pattern of an incoming cache line. This paper shows that cache performance can be improved by correlating the re-reference behavior of a cache line with a unique signature. We investigate the use of memory region, program counter, and instruction sequence history based signatures. We also propose a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines belonging to each signature. Overall, we find that SHiP offers substantial improvements over the baseline LRU replacement and state-of-the-art replacement policy proposals. On average, SHiP improves sequential and multiprogrammed application performance by roughly 10% and 12% over LRU replacement, respectively. Compared to recent replacement policy proposals such as Seg-LRU and SDBP, SHiP nearly doubles the performance gains while requiring less hardware overhead.
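
To make the mechanism concrete, here is a minimal C++ sketch of the signature-based prediction idea, assuming an SRRIP-style baseline with 2-bit re-reference prediction values (RRPVs) and a table of saturating counters indexed by a hashed signature; the table size, counter width, and update rules below are illustrative assumptions rather than the paper's exact configuration. Each cache line would additionally carry its signature and a reuse bit so the predictor can be updated on hits and evictions.

```cpp
#include <array>
#include <cstdint>

// Illustrative sizes; the paper explores several signature types and table sizes.
constexpr uint32_t kShctEntries = 16384;   // Signature History Counter Table (SHCT)
constexpr uint8_t  kCounterMax = 7;        // 3-bit saturating counters
constexpr uint8_t  kDistantRRPV = 3;       // 2-bit RRPV: 3 = distant re-reference
constexpr uint8_t  kIntermediateRRPV = 2;

struct ShipPredictor {
  std::array<uint8_t, kShctEntries> shct{};  // learned per-signature reuse behavior

  static uint32_t index(uint64_t signature) {
    return static_cast<uint32_t>(signature % kShctEntries);
  }

  // Cache hit: lines with this signature do get re-referenced.
  void on_hit(uint64_t signature) {
    uint8_t& c = shct[index(signature)];
    if (c < kCounterMax) ++c;
  }

  // Eviction of a line that was never reused: this signature tends to bring in dead lines.
  void on_eviction(uint64_t signature, bool was_reused) {
    if (was_reused) return;
    uint8_t& c = shct[index(signature)];
    if (c > 0) --c;
  }

  // Insertion: predict a distant re-reference for signatures with no observed reuse.
  uint8_t insertion_rrpv(uint64_t signature) const {
    return shct[index(signature)] == 0 ? kDistantRRPV : kIntermediateRRPV;
  }
};
```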

235 citations


Proceedings ArticleDOI
03 Dec 2011
TL;DR: The experiments show that, on average, the proposed multi-retention-level STT-RAM cache reduces total energy by 30 ∼ 70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
Abstract: Spin-transfer torque random access memory (STT-RAM) has received increasing attention because of its attractive features: good scalability, zero standby power, non-volatility and radiation hardness. The use of STT-RAM technology in last-level on-chip caches has been proposed because it minimizes cache leakage power as technology scales down. Furthermore, the cell area of STT-RAM is only 1/9 ~ 1/3 that of SRAM. This allows for a much larger cache with the same die footprint, improving overall system performance by reducing cache misses. However, deploying STT-RAM technology in L1 caches is challenging because of the long and power-consuming write operations. In this paper, we propose both L1 and lower-level cache designs that use STT-RAM. In particular, our designs use STT-RAM cells with various data retention times and write performances, made possible by different magnetic tunneling junction (MTJ) designs. For the fast STT-RAM bits with reduced data retention time, a counter-controlled dynamic refresh scheme is proposed to maintain data validity. Our dynamic scheme saves more than 80% of refresh energy compared to the simple refresh scheme proposed in previous works. An L1 cache built with ultra-low-retention STT-RAM coupled with our proposed dynamic refresh scheme can achieve a 9.2% performance improvement and save up to 30% of the total energy when compared to one that uses traditional SRAM. For lower-level caches with relatively large capacity, we propose a data migration scheme that moves data between portions of the cache with different retention characteristics so as to maximize the performance and power benefits. Our experiments show that, on average, our proposed multi-retention-level STT-RAM cache reduces total energy by 30 ~ 70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
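
A rough sketch of what a counter-controlled refresh for low-retention STT-RAM lines can look like; the counter width, tick period, and the reset-on-write rule are assumptions for illustration, and the paper's scheme additionally decides when refreshing is worthwhile at all.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: each low-retention STT-RAM line carries a small counter advanced by a
// coarse global tick. When the counter reaches the retention limit, the line is
// re-written (refreshed) before its contents can decay.
struct Line {
  bool valid = false;
  bool dirty = false;
  uint8_t retention_counter = 0;
};

class RefreshController {
 public:
  RefreshController(std::size_t num_lines, uint8_t retention_ticks)
      : lines_(num_lines), retention_ticks_(retention_ticks) {}

  // A write restores full retention, so the counter starts over (assumption).
  void on_write(std::size_t i) {
    lines_[i].valid = true;
    lines_[i].dirty = true;
    lines_[i].retention_counter = 0;
  }

  // Called once per global refresh tick (much shorter than the retention time).
  void tick() {
    for (Line& line : lines_) {
      if (!line.valid) continue;
      if (++line.retention_counter >= retention_ticks_) {
        refresh(line);                // read the block and write it back in place
        line.retention_counter = 0;
      }
    }
  }

 private:
  void refresh(Line& /*line*/) { /* issue an internal read followed by a write */ }

  std::vector<Line> lines_;
  uint8_t retention_ticks_;
};
```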

234 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This work presents Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions.
Abstract: Cache partitioning has a wide range of uses in CMPs, from guaranteeing quality of service and controlled sharing to security-related techniques. However, existing cache partitioning schemes (such as way-partitioning) are limited to coarse-grain allocations, can only support few partitions, and reduce cache associativity, hurting performance. Hence, these techniques can only be applied to CMPs with 2-4 cores, but fail to scale to tens of cores. We present Vantage, a novel cache partitioning technique that overcomes the limitations of existing schemes: caches can have tens of partitions with sizes specified at cache line granularity, while maintaining high associativity and strong isolation among partitions. Vantage leverages cache arrays with good hashing and associativity, which enable soft-pinning a large portion of cache lines. It enforces capacity allocations by controlling the replacement process. Unlike prior schemes, Vantage provides strict isolation guarantees by partitioning most (e.g. 90%) of the cache instead of all of it. Vantage is derived from analytical models, which allow us to provide strong guarantees and bounds on associativity and sizing independent of the number of partitions and their behaviors. It is simple to implement, requiring around 1.5% state overhead and simple changes to the cache controller. We evaluate Vantage using extensive simulations. On a 32-core system, using 350 multiprogrammed workloads and one partition per core, partitioning the last-level cache with conventional techniques degrades throughput for 71% of the workloads versus an unpartitioned cache (by 7% average, 25% maximum degradation), even when using 64-way caches. In contrast, Vantage improves throughput for 98% of the workloads, by 8% on average (up to 20%), using a 4-way cache.
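
Vantage's full machinery (a mostly managed region, per-partition apertures derived from analytical models, probabilistic demotions) does not fit in a short snippet, but the core idea of enforcing partition sizes through replacement decisions rather than strict way assignment can be illustrated roughly as follows. The bookkeeping and the victim-selection rule here are simplifications, not the actual Vantage algorithm.

```cpp
#include <cstddef>
#include <vector>

// Rough illustration only: each partition has a target and an actual size, and
// replacement preferentially evicts from partitions exceeding their targets.
// Vantage itself does this probabilistically, via per-partition apertures over a
// mostly managed region; this sketch keeps just the bookkeeping.
struct Partition {
  std::size_t target_lines = 0;
  std::size_t actual_lines = 0;
  long over() const {
    return static_cast<long>(actual_lines) - static_cast<long>(target_lines);
  }
};

class PartitionedCache {
 public:
  explicit PartitionedCache(std::size_t num_partitions) : parts_(num_partitions) {}

  void set_target(std::size_t p, std::size_t lines) { parts_[p].target_lines = lines; }

  // On a miss for partition p, evict from whichever candidate partition (the
  // partitions owning lines in the looked-up set) is most over its target.
  std::size_t choose_victim_partition(const std::vector<std::size_t>& candidates,
                                      std::size_t p) const {
    std::size_t victim = p;
    long worst = parts_[p].over();
    for (std::size_t c : candidates) {
      if (parts_[c].over() > worst) { worst = parts_[c].over(); victim = c; }
    }
    return victim;
  }

  void on_insert(std::size_t p) { ++parts_[p].actual_lines; }
  void on_evict(std::size_t p)  { --parts_[p].actual_lines; }

 private:
  std::vector<Partition> parts_;
};
```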

229 citations


Patent
30 Nov 2011
TL;DR: A distributed caching system for storing and serving information modeled as a graph that includes nodes and edges that define associations or relationships between nodes that the edges connect in the graph is described in this article.
Abstract: A distributed caching system for storing and serving information modeled as a graph that includes nodes and edges that define associations or relationships between nodes that the edges connect in the graph.

167 citations


Proceedings ArticleDOI
27 Jun 2011
TL;DR: It is shown that stealing crypto keys in a virtualized cloud may be a real threat by evaluating a cache-based side-channel attack against an encryption process and proposing an approach that leverages dynamic cache coloring: when an application is doing security-sensitive operations, the VMM is notified to swap the associated data to a safe and isolated cache line.
Abstract: The multi-tenant cloud, which provides utility-like computing resources to tenants in a “pay-as-you-go” style, has been commercially popular for years. As one of the main purposes of such a cloud is maximizing resource usage to increase revenue, it usually uses virtualization to consolidate VMs from different and even mutually malicious tenants atop a powerful physical machine. This, however, also enables a malicious tenant to steal security-critical information such as crypto keys from victims, due to shared physical resources such as caches. In this paper, we show that stealing crypto keys in a virtualized cloud may be a real threat by evaluating a cache-based side-channel attack against an encryption process. To mitigate such attacks without notably degrading performance, we propose an approach that leverages dynamic cache coloring: when an application is performing security-sensitive operations, the VMM is notified to swap the associated data to a safe and isolated cache line. This approach can eliminate cache-based side channels for security-critical operations, yet ensures efficient resource sharing during normal operations. We demonstrate the applicability of the approach with a preliminary implementation based on Xen and an evaluation of its performance overhead.

164 citations


Patent
01 Nov 2011
TL;DR: A cache-defeat detection system, and methods for caching content addressed by identifiers intended to defeat caching, are disclosed; these can detect a data request to a content source whose received content is stored as cache elements in a local cache on the mobile device, determine from an identifier of the data request that a cache-defeating mechanism is used by the content source, and/or retrieve content from the cache elements in the local cache to respond to the request.
Abstract: Systems and methods for cache defeat detection are disclosed. Moreover, systems and methods for caching of content addressed by identifiers intended to defeat cache are further disclosed. In one aspect, embodiments of the present disclosure include a method, which may be implemented on a system, of resource management in a wireless network by caching content on a mobile device. The method can include detecting a data request to a content source for which content received is stored as cache elements in a local cache on the mobile device, determining, from an identifier of the data request, that a cache defeating mechanism is used by the content source, and/or retrieving content from the cache elements in the local cache to respond to the data request.
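
The patent text above does not spell out an algorithm, but a common way to cache content behind cache-defeating identifiers is to normalize the request identifier by dropping parameters that change on every request before using it as a cache key. The sketch below is hypothetical: the rule that long, purely numeric query values (such as timestamps or nonces) are treated as cache busters is an assumption for illustration, not the patent's method.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical rule: query parameters whose values are long and purely numeric
// (e.g., timestamps or random nonces) are treated as cache-defeating and dropped,
// so that repeated requests for the same content map to the same cache key.
static bool looks_like_cache_buster(const std::string& value) {
  if (value.size() < 8) return false;
  for (char c : value)
    if (!std::isdigit(static_cast<unsigned char>(c))) return false;
  return true;
}

std::string normalized_cache_key(const std::string& url) {
  std::size_t qpos = url.find('?');
  if (qpos == std::string::npos) return url;

  std::string key = url.substr(0, qpos);
  std::string query = url.substr(qpos + 1);
  std::string kept;

  std::size_t start = 0;
  while (start <= query.size()) {
    std::size_t end = query.find('&', start);
    if (end == std::string::npos) end = query.size();
    std::string param = query.substr(start, end - start);
    std::size_t eq = param.find('=');
    std::string value = (eq == std::string::npos) ? "" : param.substr(eq + 1);
    if (!param.empty() && !looks_like_cache_buster(value)) {
      kept += kept.empty() ? "?" : "&";
      kept += param;
    }
    start = end + 1;
  }
  return key + kept;
}
```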

161 citations


Patent
12 Aug 2011
TL;DR: In this paper, a storage request module detects an input/output (I/O) request for a storage device cached by solid-state storage media of a cache, and a direct mapping module references a single mapping structure to determine that the cache comprises data of the I/O request.
Abstract: An apparatus, system, and method are disclosed for caching data. A storage request module detects an input/output (“I/O”) request for a storage device cached by solid-state storage media of a cache. A direct mapping module references a single mapping structure to determine that the cache comprises data of the I/O request. The single mapping structure maps each logical block address of the storage device directly to a logical block address of the cache. The single mapping structure maintains a fully associative relationship between logical block addresses of the storage device and physical storage addresses on the solid-state storage media. A cache fulfillment module satisfies the I/O request using the cache in response to the direct mapping module determining that the cache comprises at least one data block of the I/O request.
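
At a high level, the "single mapping structure" can be pictured as one map from storage logical block addresses to cache addresses that is consulted on every I/O; because any storage block may reside at any cache location, the relationship is fully associative. A minimal in-memory sketch under that reading (the patent's structure is, of course, more elaborate and persistent):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// One mapping structure: storage LBA -> cache address. Because any storage block
// may be placed at any cache block, the relationship is fully associative.
class DirectMap {
 public:
  std::optional<uint64_t> lookup(uint64_t storage_lba) const {
    auto it = map_.find(storage_lba);
    if (it == map_.end()) return std::nullopt;  // not cached
    return it->second;                          // address on the solid-state cache
  }

  void insert(uint64_t storage_lba, uint64_t cache_address) { map_[storage_lba] = cache_address; }
  void invalidate(uint64_t storage_lba) { map_.erase(storage_lba); }

 private:
  std::unordered_map<uint64_t, uint64_t> map_;
};

// The fulfillment module satisfies a request when the cache holds at least one of
// its blocks; a fuller implementation would mix cache reads and backing-store reads.
bool cache_has_any_block(const DirectMap& map, uint64_t first_lba, uint64_t nblocks) {
  for (uint64_t lba = first_lba; lba < first_lba + nblocks; ++lba)
    if (map.lookup(lba)) return true;
  return false;
}
```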

106 citations


Proceedings ArticleDOI
01 Aug 2011
TL;DR: A low-overhead, fully hardware technique is used to detect write-intensive data blocks of the working set and place them into SRAM lines, while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation.
Abstract: In this paper, we propose a run-time strategy for managing writes to the last-level cache in chip multiprocessors where STT-RAM memory is used as the baseline technology. To this end, we assume that each cache set is decomposed into a limited number of SRAM lines and a large number of STT-RAM lines. SRAM lines are the target of frequently written data, while rarely written or read-only blocks are pushed into STT-RAM. As a novel contribution, a low-overhead, fully hardware technique is utilized to detect write-intensive data blocks of the working set and place them into SRAM lines, while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation. The resulting cache architecture therefore has large capacity and consumes near-zero leakage energy thanks to the STT-RAM array, while acceptable dynamic write energy, write latency, and long lifetime are guaranteed via the SRAM array. Results of full-system simulation for a quad-core CMP running the PARSEC-2 benchmark suite confirm an average 49x improvement in cache lifetime and a more than 50% reduction in cache power consumption when compared to baseline configurations.
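
A bare-bones sketch of the kind of per-block write-intensity tracking described above, where a small saturating counter decides which blocks deserve the few SRAM ways of a set; the counter width, threshold, and periodic decay are illustrative assumptions, not the paper's exact mechanism.

```cpp
#include <cstdint>

// Each set has a few SRAM ways and many STT-RAM ways. A small saturating write
// counter per block identifies write-intensive blocks, which are migrated into
// SRAM ways; cold or read-mostly blocks stay in (or return to) STT-RAM.
constexpr uint8_t kWriteCounterMax   = 15;  // 4-bit saturating counter (illustrative)
constexpr uint8_t kHotWriteThreshold = 8;   // promote to SRAM above this (illustrative)

struct BlockState {
  bool in_sram = false;
  uint8_t write_count = 0;
};

// Called on every write to the block.
// Returns true if the block should be migrated from an STT-RAM way to an SRAM way.
bool on_write(BlockState& b) {
  if (b.write_count < kWriteCounterMax) ++b.write_count;
  return !b.in_sram && b.write_count >= kHotWriteThreshold;
}

// Periodic decay keeps the counters reflecting recent behavior, so blocks whose
// write burst has ended eventually become candidates to move back to STT-RAM.
void decay(BlockState& b) { b.write_count >>= 1; }
```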

101 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: This paper researches the integration of STT-RAM in a 3D multi-core environment and proposes solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STt-RAM technology.
Abstract: Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories for future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle in their adoption. Previous works have proposed various circuit and architectural level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks for effectively hiding the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router requesting accesses to idle banks. Evaluations on a 3D architecture, consisting of 64 cores and 64 STT-RAM cache banks, show that our proposed approach provides 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and 6% latency reduction compared to a recently proposed write buffering mechanism.
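
A simplified sketch of the scheduling idea: estimate when each cache bank becomes idle, and let requests destined for idle banks bypass requests headed to write-busy banks. The data structures, latencies, and the oldest-first fallback are illustrative assumptions, not the paper's router design.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Request {
  uint64_t addr;
  bool is_write;
  int bank;
};

// Per-bank estimate of when the bank finishes its current (long) STT-RAM write.
class BankBusyTable {
 public:
  explicit BankBusyTable(int banks) : busy_until_(banks, 0) {}

  bool idle(int bank, uint64_t now) const { return busy_until_[bank] <= now; }

  void issue(const Request& r, uint64_t now, uint64_t read_latency, uint64_t write_latency) {
    busy_until_[r.bank] = now + (r.is_write ? write_latency : read_latency);
  }

 private:
  std::vector<uint64_t> busy_until_;
};

// Router-side arbitration sketch: prefer the oldest queued request whose
// destination bank is idle, so accesses to idle banks are not stuck behind
// requests waiting on a write-busy bank. Returns an index into the queue.
int pick_request(const std::deque<Request>& queue, const BankBusyTable& banks, uint64_t now) {
  for (std::size_t i = 0; i < queue.size(); ++i)
    if (banks.idle(queue[i].bank, now)) return static_cast<int>(i);
  return queue.empty() ? -1 : 0;  // nothing idle: fall back to oldest-first
}
```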

100 citations


Proceedings ArticleDOI
11 Apr 2011
TL;DR: A refined persistence analysis method is presented which fixes the potential underestimation problem in the original persistence analysis and a framework to combine access pattern analysis and abstract interpretation for accurate data cache analysis is proposed.
Abstract: Caches are widely used in modern computer systems to bridge the increasing gap between processor speed and memory access time. On the other hand, the presence of caches, especially data caches, complicates static worst-case execution time (WCET) analysis. Access pattern analysis (e.g., cache miss equations) is applicable to only a specific class of programs, where all array accesses must have predictable access patterns. Abstract interpretation-based methods (must/persistence analysis) determine possible cache conflicts based on coarse-grained memory access information from address analysis, which usually leads to significantly pessimistic estimates. In this paper, we first present a refined persistence analysis method which fixes the potential underestimation problem in the original persistence analysis. Based on our new persistence analysis, we propose a framework to combine access pattern analysis and abstract interpretation for accurate data cache analysis. We capture the dynamic behavior of a memory access by computing its temporal scope (the loop iterations where a given memory block is accessed for a given data reference) during address analysis. Temporal scopes as well as the loop hierarchy structure (the static scopes) are integrated and utilized to achieve more precise abstract cache state modeling. Experimental results show that our proposed analysis obtains up to a 74% reduction in the WCET estimates compared to existing data cache analysis.
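
To make "temporal scope" concrete: for a unit-stride reference A[i] executed over iterations 0..n-1, each memory block holding a group of consecutive elements is accessed only during a small, computable window of iterations. A sketch of that computation under simplifying assumptions (line-aligned array, element size dividing the line size):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct TemporalScope {
  uint64_t block;            // memory block (cache line) number
  uint64_t first_iteration;  // first loop iteration that touches the block
  uint64_t last_iteration;   // last loop iteration that touches the block
};

// For the reference A[i], i = 0..n-1, with unit stride, each memory block of
// line_size bytes is live only while i sweeps through the elements it holds.
// Assumes base_addr is line-aligned and elem_size divides line_size.
std::vector<TemporalScope> temporal_scopes(uint64_t base_addr, uint64_t elem_size,
                                           uint64_t n, uint64_t line_size) {
  std::vector<TemporalScope> scopes;
  const uint64_t elems_per_line = line_size / elem_size;
  for (uint64_t i = 0; i < n; i += elems_per_line) {
    uint64_t block = (base_addr + i * elem_size) / line_size;
    uint64_t last = std::min(i + elems_per_line - 1, n - 1);
    scopes.push_back({block, i, last});
  }
  return scopes;
}
```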

94 citations


Patent
30 Dec 2011
TL;DR: In this article, the authors present a method, computer program product, and computing system for receiving an indication that a virtual machine is going to be migrated from a first operating environment to a second operating environment.
Abstract: A method, computer program product, and computing system for receiving an indication that a virtual machine is going to be migrated from a first operating environment to a second operating environment. The mode of operation of a cache system associated with the virtual machine is downgraded. Content included within a memory device currently associated with the cache system is copied to a memory device to be associated with the cache system. The memory device currently associated with the cache system is detached from the virtual machine. The virtual machine is migrated from the first operating environment to the second operating environment.

Proceedings ArticleDOI
29 Nov 2011
TL;DR: This paper introduces a new method of bounding pre-emption costs, called the ECB-Union approach, which complements an existing UCB-Union approach and combines the two into a simple composite approach that dominates both.
Abstract: Without the use of cache, the increasing gap between processor and memory speeds in modern embedded microprocessors would have resulted in memory access times becoming an unacceptable bottleneck. In such systems, cache related pre-emption delays can be a significant proportion of task execution times. To obtain tight bounds on the response times of tasks in pre-emptively scheduled systems, it is necessary to integrate worst-case execution time analysis and schedulability analysis via the use of an appropriate model of pre-emption costs. In this paper, we introduce a new method of bounding pre-emption costs, called the ECB-Union approach. The ECB-Union approach complements an existing UCB-Union approach. We combine the two into a simple composite approach that dominates both. These approaches are integrated into response time analysis for fixed priority pre-emptively scheduled systems. Further, we extend this analysis to systems where tasks can access resources in mutual exclusion, in the process resolving omissions in existing models of pre-emption delays. A case study and empirical evaluation demonstrate the effectiveness of the ECB-Union and combined approaches for a wide range of different cache configurations including cache utilization, cache set size, reuse, and block reload times.
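
The quantity underlying these bounds is the set of cache blocks a preempted task still needs (its useful cache blocks, UCBs) that preempting tasks may evict (their evicted cache blocks, ECBs), each costing one block reload. The snippet below is only a single-preemption illustration of that intersection idea, not the paper's ECB-Union or composite formulations, which bound the cost over all preemptions within the response-time analysis.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Cost of one preemption of a task with useful cache blocks `ucb` by preempting
// tasks whose evicted cache blocks are given in `ecbs_of_preempters`: every UCB
// that some preempter may evict must be reloaded, at block_reload_time each.
uint64_t single_preemption_crpd(const std::set<uint32_t>& ucb,
                                const std::vector<std::set<uint32_t>>& ecbs_of_preempters,
                                uint64_t block_reload_time) {
  std::set<uint32_t> ecb_union;
  for (const auto& ecb : ecbs_of_preempters)
    ecb_union.insert(ecb.begin(), ecb.end());

  std::vector<uint32_t> evicted_useful;
  std::set_intersection(ucb.begin(), ucb.end(),
                        ecb_union.begin(), ecb_union.end(),
                        std::back_inserter(evicted_useful));
  return evicted_useful.size() * block_reload_time;
}
```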

Patent
03 Jun 2011
TL;DR: In this article, a solid state drive may be used as a log structured cache, may employ multi-level metadata management, may use read and write gating, and may accelerate access to other storage media.
Abstract: Examples of described systems utilize a cache media in one or more computing devices that may accelerate access to other storage media. A solid state drive may be used as the local cache media. In some embodiments, the solid state drive may be used as a log structured cache, may employ multi-level metadata management, may use read and write gating.

Proceedings ArticleDOI
09 Mar 2011
TL;DR: A novel approach is presented that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing, and is able to improve the performance of some applications up to a factor of 12x and shed light on the obstacles that prevent their performance from scaling to many cores.
Abstract: In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach --- a 5x slowdown on average relative to native execution --- is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications up to a factor of 12x. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
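
The tool itself relies on memory shadowing and dynamic binary instrumentation, but the core bookkeeping can be pictured as per-cache-line metadata recording which bytes each thread wrote: distinct writers touching overlapping bytes indicate true sharing, while distinct writers touching disjoint bytes of the same line indicate false sharing. A toy sketch of that classification (64-byte lines assumed; accesses that cross a line boundary are ignored for simplicity):

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr int kLineSize = 64;

// Shadow state per cache line: which bytes each thread has written. Two or more
// writers on a line mean sharing; whether any byte was written by more than one
// thread distinguishes true sharing from false sharing.
struct LineShadow {
  std::unordered_map<int, std::bitset<kLineSize>> bytes_written_by_thread;
};

enum class Sharing { kPrivate, kTrue, kFalse };

class SharingDetector {
 public:
  void on_write(int thread_id, uint64_t addr, int size) {
    LineShadow& line = shadow_[addr / kLineSize];
    auto& mask = line.bytes_written_by_thread[thread_id];
    for (int i = 0; i < size; ++i) {
      int offset = static_cast<int>(addr % kLineSize) + i;
      if (offset < kLineSize) mask.set(offset);  // line-crossing bytes ignored here
    }
  }

  Sharing classify(uint64_t addr) const {
    auto it = shadow_.find(addr / kLineSize);
    if (it == shadow_.end() || it->second.bytes_written_by_thread.size() < 2)
      return Sharing::kPrivate;
    // A byte written by two or more threads means true sharing; otherwise the
    // threads conflict only through the line itself, i.e. false sharing.
    std::bitset<kLineSize> seen, seen_twice;
    for (const auto& entry : it->second.bytes_written_by_thread) {
      seen_twice |= (seen & entry.second);
      seen |= entry.second;
    }
    return seen_twice.any() ? Sharing::kTrue : Sharing::kFalse;
  }

 private:
  std::unordered_map<uint64_t, LineShadow> shadow_;
};
```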

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses are evenly balanced across the processors---i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.

Patent
02 Nov 2011
TL;DR: In this paper, a multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to requests of a different respective type and/or granularity.
Abstract: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to I/O requests of a different respective type and/or granularity. A cache device manager may allocate cache storage space to each of the cache levels. Each cache level maintains respective cache metadata that associates I/O request data with a respective cache address. The cache levels monitor I/O requests within a storage stack, apply selection criteria to identify cacheable I/O requests, and service cacheable I/O requests using the cache storage device.

Proceedings ArticleDOI
05 Jun 2011
TL;DR: This paper presents a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks and can achieve 29.29% energy saving on average.
Abstract: Multicore architectures, especially chip multi-processors, have been widely acknowledged as a successful design paradigm. Existing approaches primarily target application-driven partitioning of the shared cache to alleviate inter-core cache interference so that both performance and energy efficiency are improved. Dynamic cache reconfiguration is a promising technique in reducing energy consumption of the cache subsystem for uniprocessor systems. In this paper, we present a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks. Our static profiling based algorithm is designed to judiciously find beneficial cache configurations (of private caches) for each task as well as partition factors (of the shared cache) for each core so that the energy consumption is minimized while task deadline is satisfied. Experimental results using real benchmarks demonstrate that our approach can achieve 29.29% energy saving on average compared to systems employing only cache partitioning.
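
Conceptually, the static profiling phase reduces to a search: for each task, pick a private-cache configuration and a shared-cache partition factor that minimize profiled energy while the profiled execution time still meets the deadline. The sketch below shows that selection for a single task from a table of profiled points; the data structure is an assumption, and the cross-core constraint that partition factors must sum to the available ways is omitted.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Profiled numbers for one task under one combination of private-cache
// configuration and shared-cache partition factor (both come from the static
// profiling phase; this struct is an assumption for illustration).
struct ProfilePoint {
  int l1_config_id;      // encodes private cache size/associativity/line size
  int partition_factor;  // shared-cache ways given to this task's core
  double exec_time_ms;   // profiled worst-case execution time
  double energy_mj;      // profiled energy
};

// Pick the lowest-energy combination that still meets the task's deadline.
// Returns an index into `points`, or -1 if no profiled configuration is schedulable.
int pick_configuration(const std::vector<ProfilePoint>& points, double deadline_ms) {
  int best = -1;
  double best_energy = std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i < points.size(); ++i) {
    if (points[i].exec_time_ms > deadline_ms) continue;  // deadline would be missed
    if (points[i].energy_mj < best_energy) {
      best_energy = points[i].energy_mj;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```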

Patent
30 Dec 2011
TL;DR: In this article, a method, computer program product, and computing system for copying a cache system from a first machine to a second machine, wherein the cache system includes cache content and a content directory, thus generating a duplicate cache system on the second machine.
Abstract: A method, computer program product, and computing system for copying a cache system from a first machine to a second machine, wherein the cache system includes cache content and a content directory, thus generating a duplicate cache system on the second machine. The duplicate cache system includes duplicate cache content and a duplicate content directory. A plurality of data requests concerning a plurality of data actions to be taken on a data array associated with the first machine are received on the first machine. The plurality of data requests are stored on a tracking queue included within the data array associated with the first machine.

Proceedings ArticleDOI
13 Sep 2011
TL;DR: A low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity is presented.
Abstract: We present a low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of available shared cache capacity. The method is implemented on real hardware, with no modifications to the application or operating system. We accomplish this by co-running a Pirate application that "steals" cache space with the Target application. By adjusting how much space the Pirate steals during the Target's execution, and using hardware performance counters to record the Target's performance, we can accurately and efficiently capture performance data for the Target application as a function of its available shared cache. At the same time we use performance counters to monitor the Pirate to ensure that it is successfully stealing the desired amount of cache. To evaluate this approach, we show that 1) the cache available to the Target behaves as expected, 2) the Pirate steals the desired amount of cache, and 3) the Pirate does not bias the Target's performance. As a result, we are able to accurately measure the Target's performance while stealing up to an average of 6.8MB of the 8MB of cache on our Nehalem-based test system with an average measurement overhead of only 5.5%.
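
The Pirate itself is conceptually simple: a co-running thread that repeatedly walks a working set of adjustable size so that roughly that much of the shared cache is unavailable to the Target, whose performance is sampled with hardware counters. A minimal sketch of such a cache-stealing loop (the buffer size and stride are illustrative; the real tool also watches the Pirate's own counters to verify it keeps its working set cached):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t kLineSize = 64;

// Walk `steal_bytes` of memory over and over. As long as this loop rarely
// misses (verified with performance counters in the real tool), roughly
// steal_bytes of the shared last-level cache is unavailable to the Target.
void pirate_loop(const std::atomic<std::size_t>& steal_bytes, const std::atomic<bool>& stop) {
  std::vector<char> buffer(64 * 1024 * 1024);  // larger than any steal size used
  volatile char sink = 0;
  while (!stop.load(std::memory_order_relaxed)) {
    std::size_t limit = steal_bytes.load(std::memory_order_relaxed);
    if (limit > buffer.size()) limit = buffer.size();
    for (std::size_t i = 0; i < limit; i += kLineSize)
      sink = static_cast<char>(sink + buffer[i]);  // touch one byte per cache line
  }
}
```

The measurement framework would then sweep steal_bytes from zero toward the cache size while recording the Target's CPI and off-chip bandwidth from hardware counters.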

Patent
30 Sep 2011
TL;DR: In this paper, a change in workload characteristics detected at one tier of a multi-tier cache is communicated to another tier of the multi-tiered cache, and at least one cache element is dynamically resizable.
Abstract: A change in workload characteristics detected at one tier of a multi-tiered cache is communicated to another tier of the multi-tiered cache. Multiple caching elements exist at different tiers, and at least one tier includes a cache element that is dynamically resizable. The communicated change in workload characteristics causes the receiving tier to adjust at least one aspect of cache performance in the multi-tiered cache. In one aspect, at least one dynamically resizable element in the multi-tiered cache is resized responsive to the change in workload characteristics.

Journal ArticleDOI
TL;DR: A survey of state-of-the-art techniques to bound the CRPD, based on, but not limited to, UCBs, together with an alternative definition of UCBs that improves the CRPD bounds substantially.

Book
23 May 2011
TL;DR: The book attempts a synthesis of recent cache research that has focused on innovations for multi-core processors and is an excellent starting point for early-stage graduate students, researchers, and practitioners who wish to understand the landscape of recent cache research.
Abstract: A key determinant of overall system performance and power dissipation is the cache hierarchy since access to off-chip memory consumes many more cycles and energy than on-chip accesses. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip memory access by improving the efficiency of the on-chip cache. Future multi-core processors will have many large cache banks connected by a network and shared by many cores. Hence, many important problems must be solved: cache resources must be allocated across many cores, data must be placed in cache banks that are near the accessing core, and the most important data must be identified for retention. Finally, difficulties in scaling existing technologies require adapting to and exploiting new technology constraints. The book attempts a synthesis of recent cache research that has focused on innovations for multi-core processors. It is an excellent starting point for early-stage graduate students, researchers, and practitioners who wish to understand the landscape of recent cache research. The book is suitable as a reference for advanced computer architecture classes as well as for experienced researchers and VLSI engineers. Table of Contents: Basic Elements of Large Cache Design / Organizing Data in CMP Last Level Caches / Policies Impacting Cache Hit Rates / Interconnection Networks within Large Caches / Technology / Concluding Remarks

Patent
20 May 2011
TL;DR: In this paper, a cache memory system that uses multi-bit error correcting code (ECC) with a low storage and complexity overhead is presented, without dramatically increasing transition latency to and from an idle power state due to loss of state.
Abstract: A cache memory system is provided that uses multi-bit Error Correcting Code (ECC) with a low storage and complexity overhead. The cache memory system can be operated at very low idle power, without dramatically increasing transition latency to and from an idle power state due to loss of state.

Patent
Ali Mashtizadeh, Irfan Ahmad
13 Jul 2011
TL;DR: Techniques are disclosed for managing memory within a virtualized system that includes a memory compression cache; a "first-in touch-out" (FITO) list is used to manage the size of the compression cache.
Abstract: Techniques are disclosed for managing memory within a virtualized system that includes a memory compression cache. Generally, the virtualized system may include a hypervisor configured to use a compression cache to temporarily store memory pages that have been compressed to conserve memory space. A “first-in touch-out” (FITO) list may be used to manage the size of the compression cache by monitoring the compressed memory pages in the compression cache. Each element in the FITO list corresponds to a compressed page in the compression cache. Each element in the FITO list records a time at which the corresponding compressed page was stored in the compression cache (i.e. an age). A size of the compression cache may be adjusted based on the ages of the pages in the compression cache.

Proceedings ArticleDOI
12 Feb 2011
TL;DR: ULCC (User Level Cache Control), a software runtime library that enables programmers to explicitly manage and optimize last level cache usage by allocating proper cache space for different data sets of different threads, is implemented at the user level based on a page-coloring technique for last-level cache usage management.
Abstract: Scientific applications face serious performance challenges on multicore processors, one of which is caused by access contention in last level shared caches from multiple running threads. The contention increases the number of long latency memory accesses, and consequently increases application execution times. Optimizing shared cache performance is critical to significantly reducing the execution times of multi-threaded programs on multicores. However, there are two unique problems to be solved before implementing cache optimization techniques on multicores at the user level. First, the cache space available to each running thread in a last level cache is difficult to predict due to access contention in the shared space, which makes cache conscious algorithms for single cores ineffective on multicores. Second, at the user level, programmers are not able to allocate cache space at will to running threads in the shared cache, thus data sets with strong locality may not be allocated sufficient cache space, and cache pollution can easily happen. To address these two critical issues, we have designed ULCC (User Level Cache Control), a software runtime library that enables programmers to explicitly manage and optimize last level cache usage by allocating proper cache space for different data sets of different threads. We have implemented ULCC at the user level based on a page-coloring technique for last level cache usage management. By means of multiple case studies on an Intel multicore processor, we show that with ULCC, scientific applications can achieve significant performance improvements by fully exploiting the benefits of cache optimization algorithms and by partitioning the cache space accordingly to protect frequently reused data sets and to avoid cache pollution. Our experiments with various applications show that ULCC can improve application performance by nearly 40%.
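
Page coloring works because, for a physically indexed last-level cache, the physical page number determines which group of cache sets (the page's "color") its lines can occupy. A sketch of the color computation such a library builds on, using parameters for a hypothetical 8 MB, 16-way LLC with 64-byte lines and 4 KB pages (not figures from the paper):

```cpp
#include <cstdint>

// Hypothetical platform parameters (not taken from the paper).
constexpr uint64_t kCacheBytes = 8ull * 1024 * 1024;  // shared last-level cache size
constexpr uint64_t kLineBytes  = 64;
constexpr uint64_t kWays       = 16;
constexpr uint64_t kPageBytes  = 4096;

constexpr uint64_t kSets        = kCacheBytes / (kLineBytes * kWays);  // 8192 sets
constexpr uint64_t kSetsPerPage = kPageBytes / kLineBytes;             // 64 sets per page
constexpr uint64_t kNumColors   = kSets / kSetsPerPage;                // 128 colors

// The color of a physical page: which 1/kNumColors slice of the cache sets any
// line in this page can map to. A user-level library allocates pages of chosen
// colors so that a data set gets its own slice of the shared cache.
constexpr uint64_t page_color(uint64_t phys_addr) {
  return (phys_addr / kPageBytes) % kNumColors;
}
```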

Proceedings ArticleDOI
01 Aug 2011
TL;DR: An adaptive hybrid cache is proposed to dynamically remap SPM blocks from high-demand cache sets to low-demand cache sets, achieving 19%, 25%, 18% and 18% energy-runtime-product reductions over four previous representative techniques on a wide range of benchmarks.
Abstract: By reconfiguring part of the cache as software-managed scratchpad memory (SPM), hybrid caches manage to handle both unknown and predictable memory access patterns. However, existing hybrid caches provide a flexible partitioning of cache and SPM without considering adaptation to the run-time cache behavior. Previous cache set balancing techniques are either energy-inefficient or require serial tag and data array access. In this paper, an adaptive hybrid cache is proposed to dynamically remap SPM blocks from high-demand cache sets to low-demand cache sets. This achieves 19%, 25%, 18% and 18% energy-runtime-product reductions over four previous representative techniques on a wide range of benchmarks.

Proceedings ArticleDOI
12 Feb 2011
TL;DR: A new PC-centric cache organization, NUcache, is proposed, which logically partitions the associative ways of a cache set into MainWays and DeliWays, and is shown to be more effective than other well-known cache-partitioning algorithms.
Abstract: The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC — Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.
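
A sketch of the MainWays/DeliWays insertion rule; the PC-selection machinery based on Next-Use cost-benefit analysis is abstracted here into a precomputed set of selected delinquent PCs, and the way counts are illustrative.

```cpp
#include <cstdint>
#include <unordered_set>

constexpr int kMainWays = 8;  // illustrative split of a 16-way set
constexpr int kDeliWays = 8;

struct InsertionDecision {
  int first_way;  // first way index the incoming line may occupy
  int num_ways;   // how many contiguous ways it may occupy
};

// All lines may use the MainWays; only lines fetched by a selected delinquent PC
// may additionally be placed in, and retained by, the DeliWays.
InsertionDecision place_line(uint64_t miss_pc,
                             const std::unordered_set<uint64_t>& selected_delinquent_pcs) {
  if (selected_delinquent_pcs.count(miss_pc) != 0)
    return {0, kMainWays + kDeliWays};  // may occupy any way, including the DeliWays
  return {0, kMainWays};                // restricted to the MainWays
}
```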

Journal ArticleDOI
TL;DR: The main contribution of this work lies in the possibility of applying state-of-the-art memory test algorithms to embedded cache memories without introducing any hardware or performance overheads and guaranteeing the detection of typical faults arising in nanometer CMOS technologies.
Abstract: Embedded microprocessor cache memories suffer from limited observability and controllability creating problems during in-system tests. This paper presents a procedure to transform traditional march tests into software-based self-test programs for set-associative cache memories with LRU replacement. Among all the different cache blocks in a microprocessor, testing instruction caches represents a major challenge due to limitations in two areas: 1) test patterns which must be composed of valid instruction opcodes and 2) test result observability: the results can only be observed through the results of executed instructions. For these reasons, the proposed methodology will concentrate on the implementation of test programs for instruction caches. The main contribution of this work lies in the possibility of applying state-of-the-art memory test algorithms to embedded cache memories without introducing any hardware or performance overheads and guaranteeing the detection of typical faults arising in nanometer CMOS technologies.
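
For reference, one classic march test of the kind being transformed is March C-: ⇕(w0); ⇑(r0,w1); ⇑(r1,w0); ⇓(r0,w1); ⇓(r1,w0); ⇕(r0), where ⇑/⇓ denote ascending/descending address order and ⇕ either order. The sketch below runs that sequence over a plain array; the paper's contribution is re-expressing such elements as valid instruction sequences whose execution results expose cache faults, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// March C- over a word-addressable array: six march elements; any read that
// returns an unexpected value indicates a fault.
bool march_c_minus(std::vector<uint32_t>& mem) {
  const uint32_t zero = 0x00000000u, one = 0xFFFFFFFFu;
  const std::size_t n = mem.size();
  bool ok = true;

  for (std::size_t i = 0; i < n; ++i) mem[i] = zero;                                      // (any)  w0
  for (std::size_t i = 0; i < n; ++i) { ok = ok && (mem[i] == zero); mem[i] = one;  }     // (up)   r0,w1
  for (std::size_t i = 0; i < n; ++i) { ok = ok && (mem[i] == one);  mem[i] = zero; }     // (up)   r1,w0
  for (std::size_t i = n; i-- > 0; )  { ok = ok && (mem[i] == zero); mem[i] = one;  }     // (down) r0,w1
  for (std::size_t i = n; i-- > 0; )  { ok = ok && (mem[i] == one);  mem[i] = zero; }     // (down) r1,w0
  for (std::size_t i = 0; i < n; ++i) ok = ok && (mem[i] == zero);                        // (any)  r0
  return ok;
}
```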

Patent
30 Dec 2011
TL;DR: A set of one or more cache performance parameters is determined from a monitored cache performance indicator, and the cache is dynamically resized to an optimal cache size based on a comparison of those parameters to their energy-efficient targets to reduce power consumption.
Abstract: Embodiments of systems, apparatuses, and methods for energy-efficient operation of a device are described. In some embodiments, a cache performance indicator of a cache is monitored, and a set of one or more cache performance parameters based on the cache performance indicator is determined. The cache is dynamically resized to an optimal cache size based on a comparison of the cache performance parameters to their energy-efficient targets to reduce power consumption.

Patent
24 Feb 2011
TL;DR: In this paper, a method of controlling the exclusivity mode of a level-two cache is presented, which includes generating level two cache exclusivity control information at a processor in response to an exclusive mode indicator.
Abstract: A method of controlling the exclusivity mode of a level-two cache includes generating level-two cache exclusivity control information at a processor in response to an exclusivity mode indicator, and utilizing the level-two cache exclusivity control information to configure the exclusivity mode of the level-two cache.