
Showing papers on "Cache coloring" published in 2012


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Abstract: This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high-performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average of 25% fewer L1 data cache misses, which results in a harmonic-mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
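
The throttling idea above lends itself to a brief illustration. The Python sketch below is a deliberately simplified model of the scheduling decision, not the published hardware design: the score bookkeeping, the score_cutoff parameter, and the function name are assumptions made for illustration only.

```python
# Hedged sketch of CCWS-style throttling: each wavefront carries a
# "lost intra-wavefront locality" score (raised whenever it re-requests data
# recently evicted from the L1). When the summed score signals heavy
# contention, wavefronts with the least lost locality are de-scheduled first,
# shrinking the set of threads competing for L1 capacity.
def active_wavefronts(lost_locality, all_wavefronts, score_cutoff):
    """lost_locality: {wavefront_id: score}; returns the wavefronts allowed to issue."""
    ranked = sorted(all_wavefronts, key=lambda wf: lost_locality.get(wf, 0))
    active = list(ranked)
    while len(active) > 1 and sum(lost_locality.get(wf, 0) for wf in active) > score_cutoff:
        active.pop(0)   # throttle the wavefront that has lost the least locality
    return active

# Example: with a cutoff of 14, only the two highest-scoring wavefronts keep issuing.
print(active_wavefronts({0: 1, 1: 6, 2: 8, 3: 0}, [0, 1, 2, 3], score_cutoff=14))
```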

408 citations


Proceedings ArticleDOI
19 Sep 2012
TL;DR: There is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.
Abstract: Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

348 citations


Proceedings Article
08 Aug 2012
TL;DR: STEALTHMEM is presented, a system-level protection mechanism against cache-based side channel attacks in the cloud and a novel idea and prototype for isolating cache lines while fully utilizing memory by exploiting architectural properties of set-associative caches.
Abstract: Cloud services are rapidly gaining adoption due to the promises of cost efficiency, availability, and on-demand scaling. To achieve these promises, cloud providers share physical resources to support multi-tenancy of cloud platforms. However, the possibility of sharing the same hardware with potential attackers makes users reluctant to offload sensitive data into the cloud. Worse yet, researchers have demonstrated side channel attacks via shared memory caches that break full encryption keys of AES, DES, and RSA. We present STEALTHMEM, a system-level protection mechanism against cache-based side channel attacks in the cloud. STEALTHMEM manages a set of locked cache lines per core, which are never evicted from the cache, and efficiently multiplexes them so that each VM can load its own sensitive data into the locked cache lines. Thus, any VM can hide memory access patterns on confidential data from other VMs. Unlike existing state-of-the-art mitigation methods, STEALTHMEM works with existing commodity hardware and does not require profound changes to application software. We also present a novel idea and prototype for isolating cache lines while fully utilizing memory by exploiting architectural properties of set-associative caches. STEALTHMEM imposes a 5.9% performance overhead on the SPEC 2006 CPU benchmark, and between 2% and 5% overhead on secured AES, DES and Blowfish, requiring only between 3 and 34 lines of code changes from the original implementations.
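
Since the mechanism exploits architectural properties of set-associative caches, a brief page-coloring illustration may help. The Python sketch below shows how physical pages map onto groups of cache sets ("colors"); it illustrates the general technique only, not the STEALTHMEM implementation, and every cache parameter is an assumed value.

```python
# Illustrative page-coloring arithmetic (assumed parameters, not STEALTHMEM's).
PAGE_SIZE  = 4096          # bytes per page
LINE_SIZE  = 64            # bytes per cache line
CACHE_SIZE = 2 * 1024**2   # 2 MiB shared last-level cache (assumed)
WAYS       = 16            # associativity (assumed)

NUM_SETS   = CACHE_SIZE // (LINE_SIZE * WAYS)        # 2048 sets
NUM_COLORS = (NUM_SETS * LINE_SIZE) // PAGE_SIZE     # 32 page colors

def page_color(physical_addr: int) -> int:
    """Pages with the same color contend for the same group of cache sets."""
    return (physical_addr // PAGE_SIZE) % NUM_COLORS

# A hypervisor could reserve one color per core and back each VM's sensitive
# ("stealth") pages only with frames of that reserved color, so memory
# belonging to other VMs can never evict those cache lines.
print(NUM_COLORS, page_color(0x12345000))
```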

336 citations


Journal ArticleDOI
TL;DR: On-chip hardware coherence can scale gracefully as the number of cores increases, with bounded, modest costs, by combining known techniques such as shared caches that track cached copies, explicit eviction notifications, and hierarchical design.
Abstract: Today's multicore chips commonly implement shared memory with cache coherence as low-level support for operating systems and application software. Technology trends continue to enable the scaling of the number of (processor) cores per chip. Because conventional wisdom says that coherence does not scale well to many cores, some prognosticators predict the end of coherence. This paper seeks to refute this conventional wisdom by showing one way to scale on-chip cache coherence with bounded, modest costs by combining known techniques such as shared caches augmented to track cached copies, explicit cache eviction notifications, and hierarchical design. Based on this scalable proof-of-concept design, we predict that on-chip coherence, and the programming convenience and compatibility it provides, are here to stay.

298 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: Results demonstrate that caching VoD in access routers offers a highly favorable bandwidth/memory tradeoff, but that the other types of content would likely be more efficiently handled in very large capacity storage devices in the core.
Abstract: For a realistic traffic mix, we evaluate the hit rates attained in a two-layer cache hierarchy designed to reduce Internet bandwidth requirements. The model identifies four main types of content (web, file sharing, user-generated content, and video on demand), distinguished in terms of their traffic shares, their population and object sizes, and their popularity distributions. Results demonstrate that caching VoD in access routers offers a highly favorable bandwidth/memory tradeoff, but that the other types of content would likely be more efficiently handled in very large capacity storage devices in the core. Evaluations are based on a simple approximation for LRU cache performance that proves highly accurate in relevant configurations.
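
The evaluation rests on a simple approximation for LRU cache performance. One widely used approximation of this kind is the so-called Che approximation; the Python sketch below shows the idea under an independent-reference assumption (the paper's exact model may differ, and the example's numbers are purely illustrative).

```python
# Che-style approximation: an object with request probability q_i is found in an
# LRU cache with probability 1 - exp(-q_i * T), where the characteristic time T
# solves sum_i (1 - exp(-q_i * T)) = C, with C the cache capacity in objects.
import math

def lru_hit_rate(popularities, capacity, iters=60):
    """Approximate the overall LRU hit rate by bisection on the characteristic time."""
    total = sum(popularities)
    q = [p / total for p in popularities]            # normalize to probabilities

    def expected_cached(t):
        return sum(1.0 - math.exp(-qi * t) for qi in q)

    lo, hi = 0.0, 1.0
    while expected_cached(hi) < capacity:            # bracket the root
        hi *= 2.0
    for _ in range(iters):                           # bisection for T
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if expected_cached(mid) < capacity else (lo, mid)
    t_c = (lo + hi) / 2.0
    return sum(qi * (1.0 - math.exp(-qi * t_c)) for qi in q)

# Example: Zipf(0.8) popularity over 10,000 objects, cache holding 500 of them.
pops = [1.0 / (rank ** 0.8) for rank in range(1, 10001)]
print(f"approx. LRU hit rate: {lru_hit_rate(pops, 500):.3f}")
```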

267 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst, and proposes a simple and highly effective Memory Access Predictor.
Abstract: This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, has organized DRAM caches similarly to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster, without the need to wait for cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill, as well as an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides an 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.
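
To make the "tag and data in a single burst" point concrete, here is a small Python sketch of an alloyed tag-and-data (TAD) lookup in a direct-mapped DRAM cache. The layout, field sizes, and function name are illustrative assumptions, not the paper's exact DRAM mapping.

```python
# Hedged sketch: tag and data for each line are stored adjacently ("alloyed"),
# so one burst returns both and the usual tag-then-data serialization disappears.
LINE_BYTES = 64
TAG_BYTES  = 8                       # assumed per-line tag/metadata size
TAD_BYTES  = TAG_BYTES + LINE_BYTES  # one burst

def lookup(dram_cache: bytearray, addr: int, num_sets: int):
    """Return the cached 64-byte line for addr, or None on a detected miss."""
    line_addr = addr // LINE_BYTES
    set_index = line_addr % num_sets
    off = set_index * TAD_BYTES
    burst = dram_cache[off:off + TAD_BYTES]          # single burst read
    stored_tag = int.from_bytes(burst[:TAG_BYTES], "little")
    # A real design also needs a valid bit alongside the tag; omitted here.
    return bytes(burst[TAG_BYTES:]) if stored_tag == line_addr // num_sets else None
```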

259 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: The implementation in the IBM zEnterprise EC12 (zEC12) microprocessor generation, focusing on how transactional memory can be embedded into the existing cache design and multiprocessor shared-memory infrastructure, is described.
Abstract: We present the introduction of transactional memory into the next-generation IBM System z CPU. We first describe the instruction-set architecture features, including requirements for enterprise-class software RAS. We then describe the implementation in the IBM zEnterprise EC12 (zEC12) microprocessor generation, focusing on how transactional memory can be embedded into the existing cache design and multiprocessor shared-memory infrastructure. We explain the practical reasons behind our choices. The zEC12 system has been available since September 2012.

244 citations


Journal ArticleDOI
26 Jan 2012
TL;DR: A flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks, and can provide strong security guarantees for the AES and Blowfish encryption algorithms.
Abstract: We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents other co-executing threads from evicting reserved lines. Unreserved lines remain available for dynamic sharing among threads. NoMo requires only simple modifications to the cache replacement logic, making it straightforward to adopt. It requires no software support, enabling it to automatically protect pre-existing binaries. NoMo results in performance degradation of about 1% on average. We demonstrate that NoMo can provide strong security guarantees for the AES and Blowfish encryption algorithms.
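
The replacement-logic change can be pictured with a small victim-selection routine. The Python sketch below is a hedged approximation of the idea (reserve a number of ways per active thread and never evict lines inside another thread's reservation); the data layout and parameter names are assumptions, not the paper's hardware.

```python
# Hedged sketch of NoMo-style victim selection within one cache set.
# A line belonging to another active thread is protected while that thread
# holds no more lines in the set than its reservation allows.
def pick_victim(set_lines, requester, reserved_ways, active_threads):
    """set_lines: list of {'owner': thread id or None, 'lru': age}; returns a way index."""
    def owned(tid):
        return sum(1 for line in set_lines if line['owner'] == tid)

    candidates = []
    for i, line in enumerate(set_lines):
        owner = line['owner']
        if owner is None or owner == requester:
            candidates.append(i)                     # invalid line, or our own line
        elif owner in active_threads and owned(owner) <= reserved_ways:
            continue                                 # inside the owner's reservation
        else:
            candidates.append(i)                     # excess line or inactive owner
    pool = candidates if candidates else range(len(set_lines))
    return max(pool, key=lambda i: set_lines[i]['lru'])   # evict the oldest allowed line
```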

225 citations


Proceedings ArticleDOI
10 Apr 2012
TL;DR: FlashTier is a system architecture built upon a solid-state cache (SSC), a flash device with an interface designed for caching; its design addresses three limitations of using traditional SSDs for caching and can recover from the crash of a 100GB cache in only 2.4 seconds.
Abstract: The availability of high-speed solid-state storage has introduced a new tier into the storage hierarchy. Low-latency and high-IOPS solid-state drives (SSDs) cache data in front of high-capacity disks. However, most existing SSDs are designed to be a drop-in disk replacement, and hence are mismatched for use as a cache. This paper describes FlashTier, a system architecture built upon solid-state cache (SSC), a flash device with an interface designed for caching. Management software at the operating system block layer directs caching. The FlashTier design addresses three limitations of using traditional SSDs for caching. First, FlashTier provides a unified logical address space to reduce the cost of cache block management within both the OS and the SSD. Second, FlashTier provides cache consistency guarantees allowing the cached data to be used following a crash. Finally, FlashTier leverages cache behavior to silently evict data blocks during garbage collection to improve performance of the SSC. We have implemented an SSC simulator and a cache manager in Linux. In trace-based experiments, we show that FlashTier reduces address translation space by 60% and silent eviction improves performance by up to 167%. Furthermore, FlashTier can recover from the crash of a 100GB cache in only 2.4 seconds.

194 citations


Patent
23 Jan 2012
TL;DR: In this paper, an apparatus, system, and method are disclosed for destaging data cached in a non-volatile solid-state storage device (NVS) to a backing store under the control of a cache controller.
Abstract: An apparatus, system, and method are disclosed for destaging cached data. A cache controller (116) detects one or more write requests to store data in a backing store (118). The cache controller (116) sends the write requests to a storage controller (104) for a nonvolatile solid-state storage device (102). The storage controller (104) receives the write requests and caches the data in the storage device (102) by appending the data to a log (940) of the storage device (102). The log (940) includes a sequential, log-based structure preserved in the storage device (102). The cache controller (116) receives at least a portion of the data from the storage controller (104) in an order favoring operation of the storage device (102) and destages the data to the backing store (118) in that order, which is selected so that operation of the storage device (102) is more efficient in response to destaging.

193 citations


Patent
31 Jul 2012
TL;DR: In this paper, a network cache intercepts data requested by a client from a remote server interconnected with the cache through one or more wide area network (WAN) links (e.g., for Wide Area File Services, or “WAFS”).
Abstract: According to one or more embodiments of the present invention, a network cache intercepts data requested by a client from a remote server interconnected with the cache through one or more wide area network (WAN) links (e.g., for Wide Area File Services, or “WAFS”). The network cache stores the data and sends the data to the client. The cache may then intercept a first write request for the data from the client to the remote server, and determine one or more portions of the data in the write request that changed from the data stored at the cache (e.g., according to one or more hashes created based on the data). The network cache then sends a second write request for only the changed portions of the data to the remote server.
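
The claim's "determine the changed portions according to hashes" step can be sketched in a few lines. The Python below illustrates block-level change detection in general, not the patented method; the block size and function names are assumptions.

```python
# Hedged sketch: hash fixed-size blocks of the cached copy and of the client's
# write, then forward to the remote server only the blocks whose hashes differ.
import hashlib

BLOCK = 4096  # bytes per block (assumed)

def block_hashes(data: bytes):
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(cached: bytes, incoming: bytes):
    """Return (offset, data) pairs of the incoming write that differ from the cache."""
    old = block_hashes(cached)
    changed = []
    for idx, digest in enumerate(block_hashes(incoming)):
        if idx >= len(old) or digest != old[idx]:
            changed.append((idx * BLOCK, incoming[idx * BLOCK:(idx + 1) * BLOCK]))
    return changed

# Example: only the second 4 KB block changed, so only it would be sent over the WAN.
old = b"A" * 10000
new = old[:4096] + b"B" * 4096 + old[8192:]
print([off for off, _ in changed_blocks(old, new)])   # -> [4096]
```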

Patent
13 Mar 2012
TL;DR: In this paper, a cache module at a client computer controls a cache portion on a storage device, and the cache module communicates with other cache modules at other clients to form a cache community.
Abstract: A cache module (26) at a client computer (12) controls a cache portion (28) on a storage device (24). The cache module communicates with other cache modules at other clients to form a cache community (15). The cache modules store World Wide Web or other content in the cache portions for retrieval in response to requests (32) for content from browsers (30) in the cache community. When the requested content is not available in the cache community, the requested content may be retrieved from an origin server (19) using the Internet.

Patent
30 Apr 2012
TL;DR: In this article, the authors use a cache controller of an integrated circuit to control a cache including cached data content and associated cache metadata, such that a bus operation initiated by the cache controller to target the cached data contents also targets the associated metadata.
Abstract: A technique includes using a cache controller of an integrated circuit to control a cache including cached data content and associated cache metadata. The technique includes storing the metadata and the cached data content off of the integrated circuit and organizing the storage of the metadata relative to the cached data content such that a bus operation initiated by the cache controller to target the cached data content also targets the associated metadata.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A new way to use dynamic reuse distances to further improve cache management policies is proposed: a replacement policy that prevents replacing a cache line until a certain number of accesses, called the Protecting Distance (PD), have been made to its cache set.
Abstract: Cache management policies such as replacement, bypass, or shared cache partitioning have been relying on data reuse behavior to predict the future. This paper proposes a new way to use dynamic reuse distances to further improve such policies. A new replacement policy is proposed which prevents replacing a cache line until a certain number of accesses to its cache set, called a Protecting Distance (PD). The policy protects a cache line long enough for it to be reused, but not beyond that to avoid cache pollution. This can be combined with a bypass mechanism that also relies on dynamic reuse analysis to bypass lines with less expected reuse. A miss fetch is bypassed if there are no unprotected lines. A hit rate model based on dynamic reuse history is proposed and the PD that maximizes the hit rate is dynamically computed. The PD is recomputed periodically to track a program's memory access behavior and phases. Next, a new multi-core cache partitioning policy is proposed using the concept of protection. It manages lifetimes of lines from different cores (threads) in such a way that the overall hit rate is maximized. The average per-thread lifetime is reduced by decreasing the thread's PD. The single-core PD-based replacement policy with bypass achieves an average speedup of 4.2% over the DIP policy, while the average speedups over DIP are 1.5% for dynamic RRIP (DRRIP) and 1.6% for sampling dead-block prediction (SDP). The 16-core PD-based partitioning policy improves the average weighted IPC by 5.2%, throughput by 6.4% and fairness by 9.9% over thread-aware DRRIP (TA-DRRIP). The required hardware is evaluated and the overhead is shown to be manageable.
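
The core replacement rule (protect each line for PD accesses to its set, replace only unprotected lines, and bypass the fetch when every line is still protected) can be sketched as follows. This Python model is a simplification for illustration: the real design recomputes PD dynamically from reuse-distance histograms, whereas here PD is a fixed assumed constant.

```python
# Hedged sketch of Protecting-Distance-based replacement for one cache set.
PD = 64  # protecting distance; dynamically recomputed in the actual proposal

class Line:
    def __init__(self, tag):
        self.tag = tag
        self.rpd = PD              # remaining protecting distance

def access(cache_set, tag, max_ways=16):
    """Simulate one access to the set; returns True on a hit."""
    for line in cache_set:         # every access to the set ages all lines
        line.rpd = max(0, line.rpd - 1)
    for line in cache_set:
        if line.tag == tag:        # hit: re-protect the line
            line.rpd = PD
            return True
    unprotected = [line for line in cache_set if line.rpd == 0]
    if unprotected:                # miss: replace an unprotected line
        victim = unprotected[0]
        victim.tag, victim.rpd = tag, PD
    elif len(cache_set) < max_ways:
        cache_set.append(Line(tag))
    # else: all lines are still protected, so the fetched line bypasses the cache
    return False
```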

Proceedings ArticleDOI
25 Mar 2012
TL;DR: CacheShield can effectively improve cache performance under normal circumstances, and more importantly, shield CCN routers from cache pollution attacks, and is effective for both CCN and today's cache servers.
Abstract: With the advent of content-centric networking (CCN) where contents can be cached on each CCN router, cache robustness will soon emerge as a serious concern for CCN deployment. Previous studies on cache pollution attacks only focus on a single cache server. The question of how caching will behave over a general caching network such as CCN under cache pollution attacks has never been answered. In this paper, we propose a novel scheme called CacheShield for enhancing cache robustness. CacheShield is simple, easy-to-deploy, and applicable to any popular cache replacement policy. CacheShield can effectively improve cache performance under normal circumstances, and more importantly, shield CCN routers from cache pollution attacks. Extensive simulations including trace-driven simulations demonstrate that CacheShield is effective for both CCN and today's cache servers. We also study the impact of cache pollution attacks on CCN and reveal several new observations on how different attack scenarios can affect cache hit ratios unexpectedly.

Proceedings ArticleDOI
30 Jul 2012
TL;DR: TapeCache, a first attempt to employ DWMs as last-level caches in general-purpose computing platforms, is proposed, together with a novel circuit-architecture co-design consisting of a multi-port DWM macro-cell optimized for read operations, considering the asymmetry in applications' read/write characteristics.
Abstract: Domain Wall Memory (DWM) is a recently developed spin-based memory technology in which several bits of data are densely packed into the domains of a ferromagnetic wire. DWM has shown great promise in enabling non-volatile memory with unprecedented density and high energy efficiency. In this work, we propose TapeCache, a first attempt to employ DWMs as last-level caches in general purpose computing platforms. DWMs enable much higher density compared to SRAM, DRAM, and other spin-based memory technologies such as STT-MRAM. However, they also pose unique challenges such as serial access to the bits stored in a DWM cell, leading to variable access latencies. We propose a novel circuit-architecture co-design for TapeCache, consisting of (i) a multi-port DWM macro-cell optimized for read operations considering the asymmetry in applications' read/write characteristics, and (ii) a new cache organization and suitable management policies that mitigate the performance penalty arising from serial access to bits in a macro-cell. Over a wide range of SPEC 2006 benchmarks, TapeCache achieves 7.8X improvement in area, an average energy improvement of 7.3X, and an average performance improvement of 1.2% compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, TapeCache obtains 2.3X improvement in area and 1.4X average energy savings with virtually identical performance.

Patent
Serge Shats, Steven Ted Sanford
10 Aug 2012
TL;DR: A method of caching data is performed by a respective computer having one or more processors, non-volatile secondary storage, and non-volatile cache memory; the method includes identifying write requests to write data to the non-volatile cache memory.
Abstract: A method of caching data is performed by a respective computer having one or more processors storing one or more storage management programs for execution by the one or more processors, non-volatile secondary storage and non-volatile cache memory. The method includes receiving from the non-volatile cache memory information identifying an amount of available storage in the non-volatile cache memory, and identifying a size of the management units in the non-volatile cache memory. The method further includes identifying write requests to write data to the non-volatile cache memory, sequentially writing to the non-volatile cache memory the write data for the identified write requests, to sequentially arranged locations in an address space of the non-volatile cache memory, and storing in memory metadata that maps the addresses or storage offsets of the write data to respective locations in the address space of the non-volatile cache memory.

Patent
Jonathan M. Haswell
31 May 2012
TL;DR: In this article, the authors present a cache management approach for maintaining a cache comprising a hash table including rows of data items in the cache, wherein each row in the hash table is associated with a hash value representing a logical block address (LBA) of each data item in that row.
Abstract: According to an embodiment of the invention, cache management comprises maintaining a cache comprising a hash table including rows of data items in the cache, wherein each row in the hash table is associated with a hash value representing a logical block address (LBA) of each data item in that row. Searching for a target data item in the cache includes calculating a hash value representing a LBA of the target data item, and using the hash value to index into a counting Bloom filter that indicates that the target data item is either not in the cache, indicating a cache miss, or that the target data item may be in the cache. If a cache miss is not indicated, using the hash value to select a row in the hash table, and indicating a cache miss if the target data item is not found in the selected row.
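
The lookup flow (hash the LBA, consult a counting Bloom filter, and only then search the selected hash-table row) can be illustrated briefly. The Python sketch below is an assumption-laden illustration of the general technique, not the patented design; the filter sizing, hash functions, and row structure are invented for the example.

```python
# Hedged sketch: a counting Bloom filter answers "definitely not cached" cheaply,
# and only a "maybe" falls through to searching the hash-table row.
import hashlib

class CountingBloom:
    def __init__(self, size=1 << 16, hashes=3):
        self.counts = [0] * size
        self.size, self.hashes = size, hashes

    def _slots(self, key: int):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.size

    def add(self, key):            # called when a data item enters the cache
        for s in self._slots(key):
            self.counts[s] += 1

    def remove(self, key):         # counting variant supports deletion on eviction
        for s in self._slots(key):
            self.counts[s] -= 1

    def maybe_contains(self, key):
        return all(self.counts[s] > 0 for s in self._slots(key))

def lookup(lba, bloom, rows, num_rows):
    """Return the cached item for the LBA, or None on a cache miss."""
    if not bloom.maybe_contains(lba):
        return None                              # certain miss: no table access needed
    return rows[hash(lba) % num_rows].get(lba)   # search the selected row

# Usage: insert one item, then look it up.
rows = [dict() for _ in range(4)]
bloom = CountingBloom()
bloom.add(0xABCDE)
rows[hash(0xABCDE) % 4][0xABCDE] = b"cached data"
print(lookup(0xABCDE, bloom, rows, 4))
```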

Journal ArticleDOI
TL;DR: This paper develops a timing analysis method for concurrent software running on multi-cores with a shared instruction cache that progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache.
Abstract: Memory accesses form an important source of timing unpredictability. Timing analysis of real-time embedded software thus requires bounding the time for memory accesses. Multiprocessing, a popular approach for performance enhancement, opens up the opportunity for concurrent execution. However, due to contention for any shared memory by different processing cores, memory access behavior becomes more unpredictable, and hence harder to analyze. In this paper, we develop a timing analysis method for concurrent software running on multi-cores with a shared instruction cache. Communication across tasks is by message passing. Our method progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache. Possible conflicts arising from overlapping task lifetimes are accounted for in the hit-miss classification of accesses to the shared cache, to provide safe execution time bounds. We show that our method produces lower worst-case response time (WCRT) estimates than existing shared-cache analysis on a real-world embedded application. Furthermore, we also exploit instruction cache locking to improve WCRT. By locking some beneficial memory blocks into the L1 cache, the WCET of the tasks and L2 cache conflicts are reduced, resulting in better WCRT. Experiments demonstrate that significant WCRT reduction is achieved through cache locking.

Patent
24 Jan 2012
TL;DR: In this article, an apparatus, system, and method for managing a cache is described, and a cache management module manages the at least one cache unit based on the cache management information exchanged with the one or more cache clients.
Abstract: An apparatus, system, and method are disclosed for managing a cache. A cache interface module provides access to a plurality of virtual storage units of a solid-state storage device over a cache interface. At least one of the virtual storage units comprises a cache unit. A cache command module exchanges cache management information for the at least one cache unit with one or more cache clients over the cache interface. A cache management module manages the at least one cache unit based on the cache management information exchanged with the one or more cache clients.

Proceedings ArticleDOI
14 Feb 2012
TL;DR: A dynamic scheme is presented that further divides the cache space into read and write caches and manages the three spaces according to the workload characteristics for optimal performance; it improves the performance of hybrid storage solutions up to the off-line optimal performance of a fixed partitioning scheme.
Abstract: Hybrid storage solutions use NAND flash memory based Solid State Drives (SSDs) as non-volatile cache and traditional Hard Disk Drives (HDDs) as lower level storage. Unlike a typical cache, internally, the flash memory cache is divided into cache space and overprovisioned space, used for garbage collection. We show that balancing the two spaces appropriately helps improve the performance of hybrid storage systems. We show that contrary to expectations, the cache need not be filled with data to the fullest, but may be better served by reserving space for garbage collection. For this balancing act, we present a dynamic scheme that further divides the cache space into read and write caches and manages the three spaces according to the workload characteristics for optimal performance. Experimental results show that our dynamic scheme improves performance of hybrid storage solutions up to the off-line optimal performance of a fixed partitioning scheme. Furthermore, as our scheme makes efficient use of the flash memory cache, it reduces the number of erase operations thereby extending the lifetime of SSDs.

Proceedings ArticleDOI
25 Feb 2012
TL;DR: A TLP-aware cache management policy for CPU-GPU heterogeneous architectures is proposed, and a core-sampling mechanism to detect how caching affects the performance of a GPGPU application is introduced.
Abstract: Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performance. Lately, many researchers have proposed several shared cache management mechanisms, including dynamic cache partitioning and promotion-based cache management, but no cache management work has been done on CPU-GPU heterogeneous architectures. Sharing the LLC between CPUs and GPUs brings new challenges due to the different characteristics of CPU and GPGPU applications. Unlike most memory-intensive CPU benchmarks that hide memory latency with caching, many GPGPU applications hide memory latency by combining thread-level parallelism (TLP) and caching. In this paper, we propose a TLP-aware cache management policy for CPU-GPU heterogeneous architectures. We introduce a core-sampling mechanism to detect how caching affects the performance of a GPGPU application. Inspired by previous cache management schemes, Utility-based Cache Partitioning (UCP) and Re-Reference Interval Prediction (RRIP), we propose two new mechanisms: TAP-UCP and TAP-RRIP. TAP-UCP improves performance by 5% over UCP and 11% over LRU on 152 heterogeneous workloads, and TAP-RRIP improves performance by 9% over RRIP and 12% over LRU.

Journal ArticleDOI
TL;DR: This paper introduces a new method of bounding pre-emption costs, called the ECB-Union approach, which complements the existing UCB-Union approach; both approaches are improved upon via the introduction of Multiset variants which reduce the amount of pessimism in the analysis.
Abstract: Without the use of caches the increasing gap between processor and memory speeds in modern embedded microprocessors would have resulted in memory access times becoming an unacceptable bottleneck. In such systems, cache related pre-emption delays can be a significant proportion of task execution times. To obtain tight bounds on the response times of tasks in pre-emptively scheduled systems, it is necessary to integrate worst-case execution time analysis and schedulability analysis via the use of an appropriate model of pre-emption costs. In this paper, we introduce a new method of bounding pre-emption costs, called the ECB-Union approach. The ECB-Union approach complements an existing UCB-Union approach. We improve upon both of these approaches via the introduction of Multiset variants which reduce the amount of pessimism in the analysis. Further, we combine these Multiset approaches into a simple composite approach that dominates both. These approaches to bounding pre-emption costs are integrated into response time analysis for fixed priority pre-emptively scheduled systems. Further, we extend this analysis to systems where tasks can access resources in mutual exclusion, in the process resolving omissions in existing models of pre-emption delays. A case study and empirical evaluation demonstrate the effectiveness of the ECB-Union, Multiset and combined approaches for a wide range of different cache configurations including cache utilization, cache set size, reuse, and block reload times.
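
For readers unfamiliar with this style of analysis, these approaches plug a cache-related pre-emption delay term into the standard fixed-priority response-time recurrence. A generic form (a hedged sketch, not the paper's exact ECB-Union formulation) is

    R_i^{n+1} = C_i + B_i + \sum_{j \in hp(i)} \lceil R_i^{n} / T_j \rceil (C_j + \gamma_{i,j})

where C_i is the worst-case execution time of task i, B_i its blocking time, hp(i) the set of higher-priority tasks, T_j their periods, and \gamma_{i,j} the cache-related pre-emption delay, which the UCB- and ECB-based approaches bound using the pre-empted tasks' useful cache blocks (UCBs) and the pre-empting tasks' evicting cache blocks (ECBs).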

Proceedings ArticleDOI
19 Sep 2012
TL;DR: Off-chip main memory has long been a bottleneck for system performance, and with increasing memory pressure due to multiple on-chip cores, effective cache utilization is important.
Abstract: Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) cache pollution, i.e., blocks with low reuse evicting blocks with high reuse from the cache, and 2) cache thrashing, i.e., blocks with high reuse evicting each other from the cache.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: Compared to a hard-partitioned design, the proposed unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33% and broadens the scope of applications that can be efficiently executed on GPUs.
Abstract: Modern throughput processors such as GPUs employ thousands of threads to drive high-bandwidth, long-latency memory systems. These threads require substantial on-chip storage for registers, cache, and scratchpad memory. Existing designs hard-partition this local storage, fixing the capacities of these structures at design time. We evaluate modern GPU workloads and find that they have widely varying capacity needs across these different functions. Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. The tuning that this flexibility enables improves both performance and energy consumption, and broadens the scope of applications that can be efficiently executed on GPUs. Compared to a hard-partitioned design, we show that unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33%.

Patent
26 Sep 2012
TL;DR: In this paper, a bit array is employed to store recency information in a memory element that is configured to store metadata for data objects stored in a separate cache memory element, which includes bit offset information for each of the keys denoting different slots in the bit array.
Abstract: Caching systems and methods for managing a cache are disclosed. One method includes determining whether a cache eviction condition is satisfied. In response to determining that the cache eviction condition is satisfied, at least one Bloom filter registering keys denoting objects in the cache is referenced to identify a particular object in the cache to evict. Further, the identified object is evicted from the cache. In accordance with an alternative scheme, a bit array is employed to store recency information in a memory element that is configured to store metadata for data objects stored in a separate cache memory element. This separate cache memory element stores keys denoting the data objects in the cache and further includes bit offset information for each of the keys denoting different slots in the bit array to enable access to the recency information.

Book ChapterDOI
07 Jul 2012
TL;DR: A novel method for automatically deriving upper bounds on the amount of information about the input that an adversary can extract from a program by observing the CPU's cache behavior, based on a technique for efficient counting of concretizations of abstract cache states.
Abstract: The latency gap between caches and main memory has been successfully exploited for recovering sensitive input to programs, such as cryptographic keys from implementation of AES and RSA. So far, there are no practical general-purpose countermeasures against this threat. In this paper we propose a novel method for automatically deriving upper bounds on the amount of information about the input that an adversary can extract from a program by observing the CPU's cache behavior. At the heart of our approach is a novel technique for efficient counting of concretizations of abstract cache states that enables us to connect state-of-the-art techniques for static cache analysis and quantitative information-flow. We implement our counting procedure on top of the AbsInt TimingExplorer, one of the most advanced engines for static cache analysis. We use our tool to perform a case study where we derive upper bounds on the cache leakage of a 128-bit AES executable on an ARM processor. We also analyze this implementation with a commonly suggested (but until now heuristic) countermeasure applied, obtaining a formal account of the corresponding increase in security.
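
The link between counting and leakage bounds follows the standard maximum-leakage argument from quantitative information flow: if the adversary can distinguish at most n different cache observations, it can learn at most \log_2 n bits about the input. Counting the concretizations of the abstract cache states reached by static analysis over-approximates n and therefore yields a sound upper bound (this is the generic argument; the paper's precise definitions may differ).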

Patent
21 Dec 2012
TL;DR: In this article, a caching priority designator is assigned to an address that addresses information stored in a memory system, and the information is stored in the cacheline of a first level of cache memory in the memory system.
Abstract: A method of managing cache memory includes assigning a caching priority designator to an address that addresses information stored in a memory system. The information is stored in a cacheline of a first level of cache memory in the memory system. The cacheline is evicted from the first level of cache memory. A second level in the memory system to which to write back the information is determined based at least in part on the caching priority designator. The information is written back to the second level.

Proceedings ArticleDOI
03 Mar 2012
TL;DR: This study presents a detailed analysis of the interactions between intelligent scheduling and smart cache replacement policies and proposes Cache Replacement and Utility-aware Scheduling (CRUISE), a hardware/software co-designed approach for shared cache management.
Abstract: When several applications are co-scheduled to run on a system with multiple shared LLCs, there is opportunity to improve system performance. This opportunity can be exploited by the hardware, software, or a combination of both hardware and software. The software, i.e., an operating system or hypervisor, can improve system performance by co-scheduling jobs on LLCs to minimize shared cache contention. The hardware can improve system throughput through better replacement policies by allocating more cache resources to applications that benefit from the cache and less to those applications that do not. This study presents a detailed analysis of the interactions between intelligent scheduling and smart cache replacement policies. We find that smart cache replacement reduces the burden on software to provide intelligent scheduling decisions. However, under smart cache replacement, there is still room to improve performance from better application co-scheduling. We find that co-scheduling decisions are a function of the underlying LLC replacement policy. We propose Cache Replacement and Utility-aware Scheduling (CRUISE), a hardware/software co-designed approach for shared cache management. For 4-core and 8-core CMPs, we find that CRUISE approaches the performance of an ideal job co-scheduling policy under different LLC replacement policies.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: Two innovations that exploit the bursty nature of memory requests to streamline the DRAM cache are presented, including a low-cost Hit-Miss Predictor (HMP) that virtually eliminates the hardware overhead of the previously proposed multi-megabyte Miss Map structure and a Self-Balancing Dispatch mechanism that dynamically sends some requests to the off-chip memory even though the request may have hit in the die-stackedDRAM cache.
Abstract: Die-stacking technology allows conventional DRAM to be integrated with processors. While numerous opportunities to make use of such stacked DRAM exist, one promising way is to use it as a large cache. Although previous studies show that DRAM caches can deliver performance benefits, there remain inefficiencies as well as significant hardware costs for auxiliary structures. This paper presents two innovations that exploit the bursty nature of memory requests to streamline the DRAM cache. The first is a low-cost Hit-Miss Predictor (HMP) that virtually eliminates the hardware overhead of the previously proposed multi-megabyte Miss Map structure. The second is a Self-Balancing Dispatch (SBD) mechanism that dynamically sends some requests to the off-chip memory even though the request may have hit in the die-stacked DRAM cache. This makes effective use of otherwise idle off-chip bandwidth when the DRAM cache is servicing a burst of cache hits. These techniques, however, are hampered by dirty (modified) data in the DRAM cache. To ensure correctness in the presence of dirty data in the cache, the HMP must verify that a block predicted as a miss is not actually present, otherwise the dirty block must be provided. This verification process can add latency, especially when DRAM cache banks are busy. In a similar vein, SBD cannot redirect requests to off-chip memory when a dirty copy of the block exists in the DRAM cache. To relax these constraints, we introduce a hybrid write policy for the cache that simultaneously supports write-through and write-back policies for different pages. Only a limited number of pages are permitted to operate in a write-back mode at one time, thereby bounding the amount of dirty data in the DRAM cache. By keeping the majority of the DRAM cache clean, most HMP predictions do not need to be verified, and the self balancing dispatch has more opportunities to redistribute requests (i.e., only requests to the limited number of dirty pages must go to the DRAM cache to maintain correctness). Our proposed techniques improve performance compared to the Miss Map-based DRAM cache approach while simultaneously eliminating the costly Miss Map structure.