
Showing papers on "Cache algorithms published in 2012"


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Abstract: This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% of total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average of 25% fewer L1 data cache misses, which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
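The throttling idea lends itself to a small sketch: track, per wavefront, how much intra-wavefront locality is being lost to contention, and shrink the set of wavefronts allowed to issue when the aggregate score grows. The class name, score increments, and the halving heuristic below are illustrative assumptions rather than the paper's exact mechanism.

```python
# Minimal sketch of a cache-conscious throttling heuristic in the spirit of CCWS.
# Names, thresholds, and the scoring rule are illustrative assumptions.

class WavefrontThrottle:
    def __init__(self, num_wavefronts, score_cap=100, cutoff=50):
        self.scores = [0] * num_wavefronts   # per-wavefront lost-locality score
        self.score_cap = score_cap
        self.cutoff = cutoff

    def on_l1_miss(self, wf_id, was_recently_evicted_by_other_wf):
        # A miss to a line this wavefront itself used recently, but which another
        # wavefront evicted, signals intra-wavefront locality lost to contention.
        if was_recently_evicted_by_other_wf:
            self.scores[wf_id] = min(self.score_cap, self.scores[wf_id] + 10)

    def decay(self):
        self.scores = [max(0, s - 1) for s in self.scores]

    def issuable_wavefronts(self):
        # When aggregate lost locality is high, restrict issue to the wavefronts
        # with the most locality at stake; otherwise allow all to issue.
        total = sum(self.scores)
        if total < self.cutoff:
            return list(range(len(self.scores)))
        ranked = sorted(range(len(self.scores)), key=lambda i: -self.scores[i])
        keep = max(1, len(ranked) // 2)      # throttle to half as an example
        return ranked[:keep]
```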

408 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: This paper proposes a content caching scheme, WAVE, in which the number of chunks to be cached is adjusted based on the popularity of the content, which achieves higher cache hit ratio and fewer frequent cache replacements than other on-demand caching strategies.
Abstract: In content-oriented networking, content files are typically cached in network nodes, and hence how to cache content files is crucial for the efficient content delivery and cache storage utilization. In this paper, we propose a content caching scheme, WAVE, in which the number of chunks to be cached is adjusted based on the popularity of the content. In WAVE, an upstream node recommends the number of chunks to be cached at its downstream node, which is exponentially increased as the request count increases. Simulation results reveal that the average hop count of content delivery of WAVE is lower than other schemes, and the inter-ISP traffic volume of WAVE is the second lowest (CDN is the lowest). Also, WAVE achieves higher cache hit ratio and fewer frequent cache replacements than other on-demand caching strategies.
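A minimal sketch of the chunk-count recommendation described above: the suggested number of chunks to cache downstream grows exponentially with the content's request count. The base value and growth factor are assumptions for illustration; the paper's exact rule may differ.

```python
# Illustrative sketch of WAVE-style chunk-caching recommendation: the number of
# chunks an upstream node suggests caching at its downstream node grows
# exponentially with the content's request count.

def recommended_chunks(request_count, base=1, factor=2, total_chunks=None):
    """Number of chunks to cache at the downstream node for this content."""
    if request_count <= 0:
        return 0
    n = base * (factor ** (request_count - 1))
    return min(n, total_chunks) if total_chunks else n

# Example: successive requests for a 64-chunk file
for k in range(1, 8):
    print(k, recommended_chunks(k, total_chunks=64))   # 1, 2, 4, 8, 16, 32, 64
```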

349 citations


Proceedings ArticleDOI
19 Sep 2012
TL;DR: There is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.
Abstract: Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

348 citations


Proceedings Article
08 Aug 2012
TL;DR: STEALTHMEM is presented, a system-level protection mechanism against cache-based side channel attacks in the cloud and a novel idea and prototype for isolating cache lines while fully utilizing memory by exploiting architectural properties of set-associative caches.
Abstract: Cloud services are rapidly gaining adoption due to the promises of cost efficiency, availability, and on-demand scaling. To achieve these promises, cloud providers share physical resources to support multi-tenancy of cloud platforms. However, the possibility of sharing the same hardware with potential attackers makes users reluctant to offload sensitive data into the cloud. Worse yet, researchers have demonstrated side channel attacks via shared memory caches to break full encryption keys of AES, DES, and RSA. We present STEALTHMEM, a system-level protection mechanism against cache-based side channel attacks in the cloud. STEALTHMEM manages a set of locked cache lines per core, which are never evicted from the cache, and efficiently multiplexes them so that each VM can load its own sensitive data into the locked cache lines. Thus, any VM can hide memory access patterns on confidential data from other VMs. Unlike existing state-of-the-art mitigation methods, STEALTHMEM works with existing commodity hardware and does not require profound changes to application software. We also present a novel idea and prototype for isolating cache lines while fully utilizing memory by exploiting architectural properties of set-associative caches. STEALTHMEM imposes 5.9% of performance overhead on the SPEC 2006 CPU benchmark, and between 2% and 5% overhead on secured AES, DES and Blowfish, requiring only between 3 and 34 lines of code changes from the original implementations.
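The set-associativity property being exploited can be made concrete with a page-coloring sketch: a physical page maps onto a fixed group of cache sets, so reserving one "color" per core and keeping other pages of that color unmapped keeps the core's sensitive page resident. The cache geometry below is an assumed example, not the configuration evaluated in the paper.

```python
# Sketch of the page-coloring arithmetic that set-associative-cache isolation
# schemes like STEALTHMEM rely on. Geometry values are assumptions.

PAGE_SIZE = 4096          # bytes
LINE_SIZE = 64            # bytes per cache line
NUM_SETS  = 8192          # e.g. an 8 MB, 16-way cache: 8 MB / (64 B * 16)

def cache_set(phys_addr):
    return (phys_addr // LINE_SIZE) % NUM_SETS

def page_color(phys_addr):
    # All lines of a page fall into a contiguous group of sets; pages sharing a
    # color contend for the same sets, so reserving a color per core lets its
    # "stealth" page stay resident as long as no other page of that color is mapped.
    sets_per_page = PAGE_SIZE // LINE_SIZE
    return cache_set(phys_addr) // sets_per_page

print(page_color(0x12345000), page_color(0x12346000))
```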

336 citations


Journal ArticleDOI
TL;DR: On-chip hardware coherence can scale gracefully as the number of cores increases, and the programming convenience and compatibility it provides are here to stay.
Abstract: Today's multicore chips commonly implement shared memory with cache coherence as low-level support for operating systems and application software. Technology trends continue to enable the scaling of the number of (processor) cores per chip. Because conventional wisdom says that coherence does not scale well to many cores, some prognosticators predict the end of coherence. This paper seeks to refute this conventional wisdom by showing one way to scale on-chip cache coherence with bounded, modest costs by combining known techniques such as shared caches augmented to track cached copies, explicit cache eviction notifications, and hierarchical design. Based on this scalable proof-of-concept design, we predict that on-chip coherence, and the programming convenience and compatibility it provides, are here to stay.

298 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: Results demonstrate that caching VoD in access routers offers a highly favorable bandwidth/memory tradeoff, but that the other types of content would likely be more efficiently handled in very large capacity storage devices in the core.
Abstract: For a realistic traffic mix, we evaluate the hit rates attained in a two-layer cache hierarchy designed to reduce Internet bandwidth requirements. The model identifies four main types of content (web, file sharing, user-generated content and video on demand), distinguished in terms of their traffic shares, their population and object sizes, and their popularity distributions. Results demonstrate that caching VoD in access routers offers a highly favorable bandwidth/memory tradeoff, but that the other types of content would likely be more efficiently handled in very large capacity storage devices in the core. Evaluations are based on a simple approximation for LRU cache performance that proves highly accurate in relevant configurations.

267 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst, and proposes a simple and highly effective Memory Access Predictor.
Abstract: This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, has organized DRAM caches similar to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill and an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.
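A hedged sketch of the kind of predictor the abstract describes: a small table of saturating counters, indexed by a hash of the requesting instruction's PC, guesses hit or miss so that a predicted miss can be sent to main memory without waiting for the DRAM-cache probe. The table organization and update rule here are assumptions; 256 three-bit counters happen to total 96 bytes, consistent with the quoted overhead, but the paper's exact structure may differ.

```python
# Hedged sketch of a PC-indexed hit/miss predictor in the spirit of the paper's
# Memory Access Predictor. Table size and update rule are assumptions.

class MemoryAccessPredictor:
    def __init__(self, entries=256, bits=3):
        self.max_val = (1 << bits) - 1
        self.table = [self.max_val // 2] * entries   # start weakly toward "hit"

    def _index(self, pc):
        return hash(pc) % len(self.table)

    def predict_hit(self, pc):
        # Predicted miss -> the request can go to main memory in parallel
        # instead of waiting for the DRAM-cache tag check.
        return self.table[self._index(pc)] > self.max_val // 2

    def update(self, pc, was_hit):
        i = self._index(pc)
        if was_hit:
            self.table[i] = min(self.max_val, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```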

259 citations


Proceedings ArticleDOI
04 Sep 2012
TL;DR: The approximation is particularly useful in evaluating the performance of current proposals for an information centric network where other approaches fail due to the very large populations of cacheable objects to be taken into account and to their complex popularity law.
Abstract: In a 2002 paper, Che and co-authors proposed a simple approach for estimating the hit rates of a cache operating the least recently used (LRU) replacement policy. The approximation proves remarkably accurate and is applicable to quite general distributions of object popularity. This paper provides a mathematical explanation for the success of the approximation, notably in configurations where the intuitive arguments of Che et al. clearly do not apply. The approximation is particularly useful in evaluating the performance of current proposals for an information centric network where other approaches fail due to the very large populations of cacheable objects to be taken into account and to their complex popularity law, resulting from the mix of different content types and the filtering effect induced by the lower layers in a cache hierarchy.
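For concreteness, the approximation being explained can be stated and computed in a few lines: for an LRU cache holding C objects under the independent reference model, solve for the characteristic time T such that the expected number of distinct objects requested within T equals C, and take the hit probability of object i as 1 - exp(-lambda_i T). The Zipf popularity law in the example is an assumption used only to generate request rates.

```python
# Sketch of the Che approximation for an LRU cache of capacity C under IRM:
# find T solving sum_i (1 - exp(-lambda_i * T)) = C, then h_i = 1 - exp(-lambda_i * T).

import math

def che_hit_rates(lambdas, C):
    def filled(T):
        return sum(1.0 - math.exp(-lam * T) for lam in lambdas)
    lo, hi = 0.0, 1.0
    while filled(hi) < C:                 # bracket the root
        hi *= 2.0
    for _ in range(100):                  # bisection on the characteristic time T
        mid = (lo + hi) / 2.0
        if filled(mid) < C:
            lo = mid
        else:
            hi = mid
    T = (lo + hi) / 2.0
    return [1.0 - math.exp(-lam * T) for lam in lambdas], T

# Example: 10,000 objects with Zipf(0.8) popularity, cache of 100 objects
N, alpha, C = 10_000, 0.8, 100
weights = [1.0 / (i + 1) ** alpha for i in range(N)]
total = sum(weights)
lambdas = [w / total for w in weights]    # normalized request rates
hits, T = che_hit_rates(lambdas, C)
print("characteristic time:", round(T, 2), "overall hit ratio:",
      round(sum(l * h for l, h in zip(lambdas, hits)), 3))
```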

259 citations


Journal ArticleDOI
26 Jan 2012
TL;DR: A flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks, and can provide strong security guarantees for the AES and Blowfish encryption algorithms.
Abstract: We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents other co-executing threads from evicting reserved lines. Unreserved lines remain available for dynamic sharing among threads. NoMo requires only simple modifications to the cache replacement logic, making it straightforward to adopt. It requires no software support, enabling it to automatically protect pre-existing binaries. NoMo results in performance degradation of about 1% on average. We demonstrate that NoMo can provide strong security guarantees for the AES and Blowfish encryption algorithms.
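The replacement-logic change can be sketched in a few lines. NoMo reserves ways per thread dynamically; the sketch below uses a static way assignment for brevity, which keeps the essential invariant that a thread never evicts another thread's reserved lines while unreserved ways stay shared. The data layout and LRU bookkeeping are illustrative assumptions.

```python
def choose_victim(set_lines, requester, num_threads=2, reserved_per_thread=1):
    """set_lines: per-way dicts such as {'lru': age}; higher 'lru' means older.
    Ways [t*k, (t+1)*k) are treated as reserved for thread t; the rest are shared."""
    k = reserved_per_thread
    eligible = []
    for way in range(len(set_lines)):
        owner = way // k if way < num_threads * k else None   # None = shared way
        if owner is None or owner == requester:
            eligible.append(way)
    # Evict the oldest line among the requester's own reserved ways and the
    # shared ways; other threads' reserved ways are never candidates.
    return max(eligible, key=lambda w: set_lines[w]['lru'])

# Example: 8-way set, thread 1 asks for a victim; way 0 (thread 0's reserve)
# is excluded even though it holds the oldest line.
lines = [{'lru': a} for a in (9, 3, 7, 1, 5, 2, 8, 4)]
print(choose_victim(lines, requester=1))   # -> 6 (age 8), not 0 (age 9)
```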

225 citations


Posted Content
TL;DR: Che et al., as discussed by the authors, proposed a simple approach for estimating the hit rates of a cache operating the least recently used (LRU) replacement policy, which proves remarkably accurate and is applicable to quite general distributions of object popularity.
Abstract: In a 2002 paper, Che and co-authors proposed a simple approach for estimating the hit rates of a cache operating the least recently used (LRU) replacement policy. The approximation proves remarkably accurate and is applicable to quite general distributions of object popularity. This paper provides a mathematical explanation for the success of the approximation, notably in configurations where the intuitive arguments of Che et al. clearly do not apply. The approximation is particularly useful in evaluating the performance of current proposals for an information centric network where other approaches fail due to the very large populations of cacheable objects to be taken into account and to their complex popularity law, resulting from the mix of different content types and the filtering effect induced by the lower layers in a cache hierarchy.

225 citations


Proceedings ArticleDOI
10 Apr 2012
TL;DR: This paper describes FlashTier, a system architecture built upon a solid-state cache (SSC), a flash device with an interface designed for caching; the FlashTier design addresses three limitations of using traditional SSDs for caching and can recover from the crash of a 100GB cache in only 2.4 seconds.
Abstract: The availability of high-speed solid-state storage has introduced a new tier into the storage hierarchy. Low-latency and high-IOPS solid-state drives (SSDs) cache data in front of high-capacity disks. However, most existing SSDs are designed to be a drop-in disk replacement, and hence are mismatched for use as a cache. This paper describes FlashTier, a system architecture built upon solid-state cache (SSC), a flash device with an interface designed for caching. Management software at the operating system block layer directs caching. The FlashTier design addresses three limitations of using traditional SSDs for caching. First, FlashTier provides a unified logical address space to reduce the cost of cache block management within both the OS and the SSD. Second, FlashTier provides cache consistency guarantees allowing the cached data to be used following a crash. Finally, FlashTier leverages cache behavior to silently evict data blocks during garbage collection to improve performance of the SSC. We have implemented an SSC simulator and a cache manager in Linux. In trace-based experiments, we show that FlashTier reduces address translation space by 60% and silent eviction improves performance by up to 167%. Furthermore, FlashTier can recover from the crash of a 100GB cache in only 2.4 seconds.
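The silent-eviction behavior can be illustrated with a short sketch: during garbage collection, valid pages that hold clean cached data are simply dropped rather than copied to a new flash block, because the backing disk still has the data and a later read can refetch it. The data structures and callback below are assumptions for illustration, not FlashTier's interface.

```python
# Sketch of silent eviction during flash garbage collection: clean cached blocks
# are not relocated, only dirty cached data must be preserved on flash.

def garbage_collect(erase_block, mapping, relocate):
    """erase_block: list of pages, each {'lba': int, 'valid': bool, 'dirty': bool}."""
    for page in erase_block:
        if not page['valid']:
            continue                       # already invalidated, nothing to do
        if page['dirty']:
            relocate(page)                 # must be copied to a fresh block
        else:
            # Silent eviction: drop the clean cached copy and forget the mapping;
            # a future read simply misses and refetches from the backing store.
            mapping.pop(page['lba'], None)
    # the whole block can now be erased and reused
```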

Patent
23 Jan 2012
TL;DR: In this paper, an apparatus, system, and method are disclosed for destaging cached data in a nonvolatile solid-state storage device (NVS) with a cache controller.
Abstract: An apparatus, system, and method are disclosed for destaging cached data. A cache controller (116) detects one or more write requests to store data in a backing store (118). The cache controller (116) sends the write requests to a storage controller (104) for a nonvolatile solid-state storage device (102). The storage controller (104) receives the write requests and caches the data in the storage device (102) by appending the data to a log (940) of the storage device (102). The log (940) includes a sequential, log-based structure preserved in the storage device (102). The cache controller (116) receives at least a portion of the data from the storage controller (104) in an order favoring operation of the storage device (102) and destages the data to the backing store (118) in that order, which is selected so that operation of the storage device (102) is more efficient in response to destaging.

Patent
31 Jul 2012
TL;DR: In this paper, a network cache intercepts data requested by a client from a remote server interconnected with the cache through one or more wide area network (WAN) links (e.g., for Wide Area File Services, or “WAFS”).
Abstract: According to one or more embodiments of the present invention, a network cache intercepts data requested by a client from a remote server interconnected with the cache through one or more wide area network (WAN) links (e.g., for Wide Area File Services, or “WAFS”). The network cache stores the data and sends the data to the client. The cache may then intercept a first write request for the data from the client to the remote server, and determine one or more portions of the data in the write request that changed from the data stored at the cache (e.g., according to one or more hashes created based on the data). The network cache then sends a second write request for only the changed portions of the data to the remote server.
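A sketch of the changed-portion detection described in the claim: the cache splits the cached copy and the new write into fixed-size chunks, hashes each, and forwards only the chunks whose hashes differ. The chunk size and hash function are assumptions, not anything specified by the patent.

```python
# Sketch: hash-based detection of changed portions so only deltas cross the WAN.

import hashlib

CHUNK = 4096

def chunk_hashes(data):
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def changed_chunks(cached_data, new_data):
    old, new = chunk_hashes(cached_data), chunk_hashes(new_data)
    changed = []
    for idx in range(len(new)):
        if idx >= len(old) or old[idx] != new[idx]:
            offset = idx * CHUNK
            changed.append((offset, new_data[offset:offset + CHUNK]))
    return changed     # only these (offset, bytes) pairs go to the remote server
```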

Patent
13 Mar 2012
TL;DR: In this paper, a cache module at a client computer controls a cache portion on a storage device, and the cache module communicates with other cache modules at other clients to form a cache community.
Abstract: A cache module (26) at a client computer (12) controls a cache portion (28) on a storage device (24). The cache module communicates with other cache modules at other clients to form a cache community (15). The cache modules store World Wide Web or other content in the cache portions for retrieval in response to requests (32) for content from browsers (30) in the cache community. When the requested content is not available in the cache community, the requested content may be retrieved from an origin server (19) using the Internet.

Patent
30 Apr 2012
TL;DR: In this article, the authors use a cache controller of an integrated circuit to control a cache including cached data content and associated cache metadata, such that a bus operation initiated by the cache controller to target the cached data contents also targets the associated metadata.
Abstract: A technique includes using a cache controller of an integrated circuit to control a cache including cached data content and associated cache metadata. The technique includes storing the metadata and the cached data content off of the integrated circuit and organizing the storage of the metadata relative to the cached data content such that a bus operation initiated by the cache controller to target the cached data content also targets the associated metadata.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A new way to use dynamic reuse distances to further improve cache management policies is proposed: a replacement policy that prevents replacing a cache line until a certain number of accesses to its cache set, called the Protecting Distance (PD), have occurred.
Abstract: Cache management policies such as replacement, bypass, or shared cache partitioning have been relying on data reuse behavior to predict the future. This paper proposes a new way to use dynamic reuse distances to further improve such policies. A new replacement policy is proposed which prevents replacing a cache line until a certain number of accesses to its cache set, called a Protecting Distance (PD). The policy protects a cache line long enough for it to be reused, but not beyond that to avoid cache pollution. This can be combined with a bypass mechanism that also relies on dynamic reuse analysis to bypass lines with less expected reuse. A miss fetch is bypassed if there are no unprotected lines. A hit rate model based on dynamic reuse history is proposed and the PD that maximizes the hit rate is dynamically computed. The PD is recomputed periodically to track a program's memory access behavior and phases. Next, a new multi-core cache partitioning policy is proposed using the concept of protection. It manages lifetimes of lines from different cores (threads) in such a way that the overall hit rate is maximized. The average per-thread lifetime is reduced by decreasing the thread's PD. The single-core PD-based replacement policy with bypass achieves an average speedup of 4.2% over the DIP policy, while the average speedups over DIP are 1.5% for dynamic RRIP (DRRIP) and 1.6% for sampling dead-block prediction (SDP). The 16-core PD-based partitioning policy improves the average weighted IPC by 5.2%, throughput by 6.4% and fairness by 9.9% over thread-aware DRRIP (TA-DRRIP). The required hardware is evaluated and the overhead is shown to be manageable.
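A minimal sketch of the protection mechanism for a single cache set, assuming a fixed Protecting Distance: every access to the set ages all lines, a line is protected until PD set accesses have elapsed since it was inserted or reused, and a miss is bypassed when no unprotected line exists. In the paper the PD is recomputed periodically from reuse-distance history; here it is a constant purely for illustration.

```python
# PD-based replacement for one cache set, with a fixed Protecting Distance.

PD = 16

class PDSet:
    def __init__(self, ways):
        self.lines = [None] * ways        # each entry: {'tag': ..., 'rpd': n}

    def access(self, tag):
        # Age every valid line by one set access.
        for line in self.lines:
            if line is not None and line['rpd'] > 0:
                line['rpd'] -= 1
        for line in self.lines:
            if line is not None and line['tag'] == tag:
                line['rpd'] = PD          # hit: re-protect the line
                return 'hit'
        # Miss: fill an empty or unprotected way, otherwise bypass the fill.
        for i, line in enumerate(self.lines):
            if line is None or line['rpd'] == 0:
                self.lines[i] = {'tag': tag, 'rpd': PD}
                return 'miss-fill'
        return 'miss-bypass'
```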

Proceedings ArticleDOI
01 Dec 2012
TL;DR: Numerical results suggest that, under optimal cache schemes, the maximum QoE measurement, i.e., mean opinion score (MOS), is a concave function of the allowable storage size, which can provide high expected QoE with low complexity, shedding light on the design of HTTP ABR streaming services over wireless networks.
Abstract: In this paper, we investigate the problem of how to cache a set of media files with optimal streaming rates, under HTTP adaptive bit rate streaming over wireless networks. The design objective is to achieve the optimal expected QoE under a limited storage budget, which is measured by the logarithmic relation between the required bit rate and the actual streaming bit rate. We formulate the content cache management of streaming files as a constrained optimization problem. The Lagrange multiplier method is employed, and we obtain the numerical solution of the optimal streaming files. In particular, we characterize the properties of the solution, and find there is a fundamental phase change in the optimal solution as the number of cached files grows. Moreover, the simulation results indicate that with the increase of cache size, more copies of different bit rates should be cached for a better QoE. Our comprehensive investigation reveals insightful guidelines for providing HTTP ABR streaming services over wireless networks.
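One plausible reading of the formulation, written out as a hedged sketch: maximize expected QoE, taken as a popularity-weighted logarithm of the cached bit rate relative to the required rate, under a storage budget, and apply a Lagrange multiplier. The notation (popularity p_f, required rate B_f, cached rate b_f, duration T_f, budget S) and the linear storage model are assumptions, not the paper's exact problem.

```latex
% Hedged sketch: maximize popularity-weighted log-rate QoE under a storage budget.
\max_{\{b_f\}} \; \sum_{f} p_f \,\log\!\frac{b_f}{B_f}
\quad \text{s.t.} \quad \sum_{f} T_f\, b_f \le S,
\qquad
\mathcal{L} = \sum_{f} p_f \log\frac{b_f}{B_f} - \mu\Big(\sum_{f} T_f b_f - S\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial b_f} = 0 \;\Rightarrow\; b_f = \frac{p_f}{\mu\, T_f}.
```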

Proceedings ArticleDOI
25 Mar 2012
TL;DR: CacheShield can effectively improve cache performance under normal circumstances, and more importantly, shield CCN routers from cache pollution attacks, and is effective for both CCN and today's cache servers.
Abstract: With the advent of content-centric networking (CCN) where contents can be cached on each CCN router, cache robustness will soon emerge as a serious concern for CCN deployment. Previous studies on cache pollution attacks only focus on a single cache server. The question of how caching will behave over a general caching network such as CCN under cache pollution attacks has never been answered. In this paper, we propose a novel scheme called CacheShield for enhancing cache robustness. CacheShield is simple, easy-to-deploy, and applicable to any popular cache replacement policy. CacheShield can effectively improve cache performance under normal circumstances, and more importantly, shield CCN routers from cache pollution attacks. Extensive simulations including trace-driven simulations demonstrate that CacheShield is effective for both CCN and today's cache servers. We also study the impact of cache pollution attacks on CCN and reveal several new observations on how different attack scenarios can affect cache hit ratios unexpectedly.
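The shielding idea can be sketched as an admission filter in front of any replacement policy: on a miss the router first remembers only the content name and a request counter, and admits the content into the cache only once it has been requested often enough, which keeps one-shot attack traffic and unpopular content out. The logistic admission probability and its parameters below are assumptions for illustration; the paper's shielding function may differ.

```python
# Sketch of a shielding front end that decides whether a fetched content chunk
# should be admitted into the cache at all.

import math
import random

class Shield:
    def __init__(self, tau=20, sigma=5):
        self.counts = {}                  # content name -> requests seen so far
        self.tau, self.sigma = tau, sigma

    def should_cache(self, name):
        t = self.counts.get(name, 0) + 1
        self.counts[name] = t
        # Logistic admission probability: rarely-requested names are almost
        # never cached; frequently-requested names are almost always cached.
        p_admit = 1.0 / (1.0 + math.exp((self.tau - t) / self.sigma))
        if random.random() < p_admit:
            self.counts.pop(name, None)   # admitted: stop tracking the name
            return True
        return False                      # serve the chunk, but do not cache it
```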

Proceedings ArticleDOI
30 Jul 2012
TL;DR: TapeCache is proposed, a first attempt to employ DWMs as last-level caches in general purpose computing platforms and proposes a novel circuit-architecture co-design for TapeCache, consisting of a multi-port DWM macro-cell optimized for read operations considering the asymmetry in applications' read/write characteristics.
Abstract: Domain Wall Memory (DWM) is a recently developed spin-based memory technology in which several bits of data are densely packed into the domains of a ferromagnetic wire. DWM has shown great promise in enabling non-volatile memory with unprecedented density and high energy efficiency. In this work, we propose TapeCache, a first attempt to employ DWMs as last-level caches in general purpose computing platforms. DWMs enable much higher density compared to SRAM, DRAM, and other spin-based memory technologies such as STT-MRAM. However, they also pose unique challenges such as serial access to the bits stored in a DWM cell, leading to variable access latencies. We propose a novel circuit-architecture co-design for TapeCache, consisting of (i) a multi-port DWM macro-cell optimized for read operations considering the asymmetry in applications' read/write characteristics, and (ii) a new cache organization and suitable management policies that mitigate the performance penalty arising from serial access to bits in a macro-cell. Over a wide range of SPEC 2006 benchmarks, TapeCache achieves 7.8X improvement in area, an average energy improvement of 7.3X, and an average performance improvement of 1.2% compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, TapeCache obtains 2.3X improvement in area and 1.4X average energy savings with virtually identical performance.

Patent
Serge Shats1, Steven Ted Sanford1
10 Aug 2012
TL;DR: In this paper, a method of caching data is performed by a respective computer having one or more processors, non-volatile secondary storage, and non-volatile cache memory; the method includes identifying write requests to write data to the non-volatile cache memory.
Abstract: A method of caching data is performed by a respective computer having one or more processors storing one or more storage management programs for execution by the one or more processors, non-volatile secondary storage and non-volatile cache memory. The method includes receiving from the non-volatile cache memory information identifying an amount of available storage in the non-volatile cache memory, and identifying a size of the management units in the non-volatile cache memory. The method further includes identifying write requests to write data to the non-volatile cache memory, sequentially writing to the non-volatile cache memory the write data for the identified write requests, to sequentially arranged locations in an address space of the non-volatile cache memory, and storing in memory metadata that maps the addresses or storage offsets of the write data to respective locations in the address space of the non-volatile cache memory.

Patent
Jonathan M. Haswell1
31 May 2012
TL;DR: In this article, the authors present a cache management approach for maintaining a cache comprising a hash table including rows of data items in the cache, wherein each row in the hash table is associated with a hash value representing a logical block address (LBA) of each data item in that row.
Abstract: According to an embodiment of the invention, cache management comprises maintaining a cache comprising a hash table including rows of data items in the cache, wherein each row in the hash table is associated with a hash value representing a logical block address (LBA) of each data item in that row. Searching for a target data item in the cache includes calculating a hash value representing a LBA of the target data item, and using the hash value to index into a counting Bloom filter that indicates that the target data item is either not in the cache, indicating a cache miss, or that the target data item may be in the cache. If a cache miss is not indicated, using the hash value to select a row in the hash table, and indicating a cache miss if the target data item is not found in the selected row.
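A sketch of the lookup path the claim describes: a counting Bloom filter keyed by a hash of the LBA can prove a miss without touching the hash table, and only a "maybe present" answer triggers a search of the selected row; counters rather than single bits let entries be removed on eviction. Table sizes and the two hash functions are illustrative assumptions.

```python
# Sketch: counting Bloom filter in front of an LBA-keyed hash table.

class LBACache:
    def __init__(self, buckets=1024, filter_size=8192):
        self.rows = [[] for _ in range(buckets)]       # each row: list of (lba, data)
        self.cbf = [0] * filter_size                   # counting Bloom filter

    def _row(self, lba):
        return hash(lba) % len(self.rows)

    def _cbf_slots(self, lba):
        # two hash functions for the Bloom filter (illustrative choice)
        return [hash((lba, seed)) % len(self.cbf) for seed in (1, 2)]

    def lookup(self, lba):
        if any(self.cbf[s] == 0 for s in self._cbf_slots(lba)):
            return None                                # definite miss: skip the table
        for stored_lba, data in self.rows[self._row(lba)]:
            if stored_lba == lba:
                return data                            # hit
        return None                                    # false positive: miss after all

    def insert(self, lba, data):
        self.rows[self._row(lba)].append((lba, data))
        for s in self._cbf_slots(lba):
            self.cbf[s] += 1

    def evict(self, lba):
        row = self.rows[self._row(lba)]
        self.rows[self._row(lba)] = [(l, d) for l, d in row if l != lba]
        for s in self._cbf_slots(lba):
            self.cbf[s] = max(0, self.cbf[s] - 1)      # counters allow deletion
```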

Proceedings ArticleDOI
29 Oct 2012
TL;DR: This work develops popularity-driven caching schemes that dynamically place replicas in the caches on the en-route path in a coordinated fashion and outperform the widely used schemes in terms of the inter-ISP traffic and the average number of access hops.
Abstract: The built-in caching capability of future Named Data Networking (NDN) promises to enable effective content distribution at a global scale without requiring special infrastructure. The aim of this work is to design efficient caching schemes in NDN to achieve better performance at both the network layer and the application layer. With the specific objective of minimizing the inter-ISP (Internet Service Provider) traffic and average access latency, we first formulate the optimization problems for different objectives and then solve them to obtain the optimal replica placement. Then we develop popularity-driven caching schemes which dynamically place the replicas in the caches on the en-route path in a coordinated fashion. Simulation results show that the performance of our caching algorithms is much closer to the optimum and outperforms the widely used schemes in terms of the inter-ISP traffic and the average number of access hops. Finally, we thoroughly evaluate the impact of several important design issues such as network topology, cache size, access pattern and content popularity on the caching performance and demonstrate that the proposed schemes are effective, stable, scalable and incur reasonably light overhead.

Journal ArticleDOI
TL;DR: This paper develops a timing analysis method for concurrent software running on multi-cores with a shared instruction cache that progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache.
Abstract: Memory accesses form an important source of timing unpredictability. Timing analysis of real-time embedded software thus requires bounding the time for memory accesses. Multiprocessing, a popular approach for performance enhancement, opens up the opportunity for concurrent execution. However due to contention for any shared memory by different processing cores, memory access behavior becomes more unpredictable, and hence harder to analyze. In this paper, we develop a timing analysis method for concurrent software running on multi-cores with a shared instruction cache. Communication across tasks is by message passing. Our method progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache. Possible conflicts arising from overlapping task lifetimes are accounted for in the hit-miss classification of accesses to the shared cache, to provide safe execution time bounds. We show that our method produces lower worst-case response time (WCRT) estimates than existing shared-cache analysis on a real-world embedded application. Furthermore, we also exploit instruction cache locking to improve WCRT. By locking some beneficial memory blocks into L1 cache, the WCET of the tasks and L2 cache conflicts are reduced, resulting in better WCRT. Experiments demonstrate that significant WCRT reduction is achieved through cache locking.

Patent
24 Jan 2012
TL;DR: In this article, an apparatus, system, and method for managing a cache is described, and a cache management module manages the at least one cache unit based on the cache management information exchanged with the one or more cache clients.
Abstract: An apparatus, system, and method are disclosed for managing a cache. A cache interface module provides access to a plurality of virtual storage units of a solid-state storage device over a cache interface. At least one of the virtual storage units comprises a cache unit. A cache command module exchanges cache management information for the at least one cache unit with one or more cache clients over the cache interface. A cache management module manages the at least one cache unit based on the cache management information exchanged with the one or more cache clients.

Proceedings ArticleDOI
14 Feb 2012
TL;DR: A dynamic scheme is presented that further divides the cache space into read and write caches and manages the three spaces according to the workload characteristics for optimal performance, improving the performance of hybrid storage solutions up to the off-line optimal performance of a fixed partitioning scheme.
Abstract: Hybrid storage solutions use NAND flash memory based Solid State Drives (SSDs) as non-volatile cache and traditional Hard Disk Drives (HDDs) as lower level storage. Unlike a typical cache, internally, the flash memory cache is divided into cache space and overprovisioned space, used for garbage collection. We show that balancing the two spaces appropriately helps improve the performance of hybrid storage systems. We show that contrary to expectations, the cache need not be filled with data to the fullest, but may be better served by reserving space for garbage collection. For this balancing act, we present a dynamic scheme that further divides the cache space into read and write caches and manages the three spaces according to the workload characteristics for optimal performance. Experimental results show that our dynamic scheme improves performance of hybrid storage solutions up to the off-line optimal performance of a fixed partitioning scheme. Furthermore, as our scheme makes efficient use of the flash memory cache, it reduces the number of erase operations thereby extending the lifetime of SSDs.

Proceedings ArticleDOI
25 Feb 2012
TL;DR: A TLP-aware cache management policy for CPU-GPU heterogeneous architectures is proposed, and a core-sampling mechanism to detect how caching affects the performance of a GPGPU application is introduced.
Abstract: Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performance. Lately, many researchers have proposed several shared cache management mechanisms, including dynamic cache partitioning and promotion-based cache management, but no cache management work has been done on CPU-GPU heterogeneous architectures. Sharing the LLC between CPUs and GPUs brings new challenges due to the different characteristics of CPU and GPGPU applications. Unlike most memory-intensive CPU benchmarks that hide memory latency with caching, many GPGPU applications hide memory latency by combining thread-level parallelism (TLP) and caching. In this paper, we propose a TLP-aware cache management policy for CPU-GPU heterogeneous architectures. We introduce a core-sampling mechanism to detect how caching affects the performance of a GPGPU application. Inspired by previous cache management schemes, Utility-based Cache Partitioning (UCP) and Re-Reference Interval Prediction (RRIP), we propose two new mechanisms: TAP-UCP and TAP-RRIP. TAP-UCP improves performance by 5% over UCP and 11% over LRU on 152 heterogeneous workloads, and TAP-RRIP improves performance by 9% over RRIP and 12% over LRU.

Journal ArticleDOI
TL;DR: This paper introduces a new method of bounding pre-emption costs, called the ECB-Union approach, which complements the existing UCB-Union approach, and improves upon both of these approaches via the introduction of Multiset variants which reduce the amount of pessimism in the analysis.
Abstract: Without the use of caches the increasing gap between processor and memory speeds in modern embedded microprocessors would have resulted in memory access times becoming an unacceptable bottleneck. In such systems, cache related pre-emption delays can be a significant proportion of task execution times. To obtain tight bounds on the response times of tasks in pre-emptively scheduled systems, it is necessary to integrate worst-case execution time analysis and schedulability analysis via the use of an appropriate model of pre-emption costs. In this paper, we introduce a new method of bounding pre-emption costs, called the ECB-Union approach. The ECB-Union approach complements an existing UCB-Union approach. We improve upon both of these approaches via the introduction of Multiset variants which reduce the amount of pessimism in the analysis. Further, we combine these Multiset approaches into a simple composite approach that dominates both. These approaches to bounding pre-emption costs are integrated into response time analysis for fixed priority pre-emptively scheduled systems. Further, we extend this analysis to systems where tasks can access resources in mutual exclusion, in the process resolving omissions in existing models of pre-emption delays. A case study and empirical evaluation demonstrate the effectiveness of the ECB-Union, Multiset and combined approaches for a wide range of different cache configurations including cache utilization, cache set size, reuse, and block reload times.
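As background for the bounds being discussed, the standard single-preemption building block can be written down; this is a hedged sketch of the notation, with BRT the block reload time, UCB_i the useful cache blocks of the preempted task, and ECB_j the evicting cache blocks of the preempting task. The UCB-Union and ECB-Union approaches extend this bound to nested preemptions by taking unions over, respectively, the UCBs of all tasks the preempter can affect and the ECBs of all tasks that can run during the preemption; the paper's exact index sets should be consulted.

```latex
% Cache-related pre-emption delay for a single pre-emption of task i by task j
% (a standard bound; notation as described in the lead-in, details per the paper):
\gamma_{i,j} \;\le\; \mathrm{BRT} \cdot \bigl|\, \mathrm{UCB}_i \,\cap\, \mathrm{ECB}_j \,\bigr|
```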

Proceedings ArticleDOI
19 Sep 2012
TL;DR: Off-chip main memory has long been a bottleneck for system performance, and with increasing memory pressure due to multiple on-chip cores, effective cache utilization is important.
Abstract: Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) cache pollution, i.e., blocks with low reuse evicting blocks with high reuse from the cache, and 2) cache thrashing, i.e., blocks with high reuse evicting each other from the cache.

Proceedings ArticleDOI
11 Dec 2012
TL;DR: This paper introduces a Time-To-Live based policy, that assigns a timer to each content stored in the cache and redraws the timer each time the content is requested (at each hit/miss).
Abstract: Many researchers have been working on the performance analysis of caching in Information-Centric Networks (ICNs) under various replacement policies like Least Recently Used (LRU), FIFO or Random (RND). However, no exact results are provided, and many approximate models do not scale even for the simple network of two caches connected in tandem. In this paper, we introduce a Time-To-Live based policy (TTL) that assigns a timer to each content stored in the cache and redraws the timer each time the content is requested (at each hit/miss). We show that our TTL policy is more general than LRU, FIFO or RND, since it is able to mimic their behavior under an appropriate choice of its parameters. Moreover, the analysis of networks of TTL-based caches appears simpler not only under the Independent Reference Model (IRM, on which many existing results rely) but also with the Renewal Model for requests. In particular, we determine exact formulas for the performance metrics of interest for a linear network and a tree network with one root cache and N leaf caches. For more general networks, we propose an approximate solution with relative errors smaller than 10^-3 and 10^-2 for exponentially distributed and constant TTLs, respectively.
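The policy itself is simple enough to sketch: each cached content carries a timer that is redrawn on every request for it (hit or miss), and the content is discarded once the timer expires; the paper shows that suitable timer choices let this mimic LRU, FIFO, or RND. The exponential timer distribution below is only an example.

```python
# Minimal sketch of a TTL-based cache with timer redraw on every request.

import random
import time

class TTLCache:
    def __init__(self, draw_ttl=lambda: random.expovariate(1.0)):
        self.expiry = {}                   # content id -> absolute expiry time
        self.draw_ttl = draw_ttl

    def request(self, content_id):
        now = time.monotonic()
        # purge anything whose timer has run out
        self.expiry = {c: t for c, t in self.expiry.items() if t > now}
        hit = content_id in self.expiry
        # on hit or miss, (re)draw the timer for this content
        self.expiry[content_id] = now + self.draw_ttl()
        return hit
```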

Proceedings ArticleDOI
01 Dec 2012
TL;DR: Compared to a hard-partitioned design, the proposed unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33% and broadens the scope of applications that can be efficiently executed on GPUs.
Abstract: Modern throughput processors such as GPUs employ thousands of threads to drive high-bandwidth, long-latency memory systems. These threads require substantial on-chip storage for registers, cache, and scratchpad memory. Existing designs hard-partition this local storage, fixing the capacities of these structures at design time. We evaluate modern GPU workloads and find that they have widely varying capacity needs across these different functions. Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. The tuning that this flexibility enables improves both performance and energy consumption, and broadens the scope of applications that can be efficiently executed on GPUs. Compared to a hard-partitioned design, we show that unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33%.