Showing papers on "Cache invalidation published in 2015"

PDF

Open Access

Proceedings Article•DOI•

Cache template attacks: automating attacks on inclusive last-level caches

[...]

Daniel Gruss¹, Raphael Spreitzer¹, Stefan Mangard¹•Institutions (1)

12 Aug 2015

TL;DR: An automated attack on the T-table-based AES implementation of OpenSSL that is as efficient as state-of-the-art manual cache attacks and can reduce the entropy per character from log2(26) = 4.7 to 1.4 bits on Linux systems is performed.

...read moreread less

Abstract: Recent work on cache attacks has shown that CPU caches represent a powerful source of information leakage. However, existing attacks require manual identification of vulnerabilities, i.e., data accesses or instruction execution depending on secret information. In this paper, we present Cache Template Attacks. This generic attack technique allows us to profile and exploit cache-based information leakage of any program automatically, without prior knowledge of specific software versions or even specific system information. Cache Template Attacks can be executed online on a remote system without any prior offline computations or measurements. Cache Template Attacks consist of two phases. In the profiling phase, we determine dependencies between the processing of secret information, e.g., specific key inputs or private keys of cryptographic primitives, and specific cache accesses. In the exploitation phase, we derive the secret values based on observed cache accesses. We illustrate the power of the presented approach in several attacks, but also in a useful application for developers. Among the presented attacks is the application of Cache Template Attacks to infer keystrokes and--even more severe--the identification of specific keys on Linux and Windows user interfaces. More specifically, for lowercase only passwords, we can reduce the entropy per character from log2(26) = 4.7 to 1.4 bits on Linux systems. Furthermore, we perform an automated attack on the T-table-based AES implementation of OpenSSL that is as efficient as state-of-the-art manual cache attacks.

...read moreread less

387 citations

Book Chapter•DOI•

Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters

[...]

Clémentine Maurice¹, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, Aurélien Francillon¹ - Show less +1 more•Institutions (1)

Institut Eurécom¹

02 Nov 2015

TL;DR: An automatic and generic method for reverse engineering Intel's last-level cache complex addressing, consequently rendering the class of cache attacks highly practical and giving a more precise description of the complex addressing function than previous work.

...read moreread less

Abstract: Cache attacks, which exploit differences in timing to perform covert or side channels, are now well understood. Recent works leverage the last level cache to perform cache attacks across cores. This cache is split in slices, with one slice per core. While predicting the slices used by an address is simple in older processors, recent processors are using an undocumented technique called complex addressing. This renders some attacks more difficult and makes other attacks impossible, because of the loss of precision in the prediction of cache collisions. In this paper, we build an automatic and generic method for reverse engineering Intel's last-level cache complex addressing, consequently rendering the class of cache attacks highly practical. Our method relies on CPU hardware performance counters to determine the cache slice an address is mapped to. We show that our method gives a more precise description of the complex addressing function than previous work. We validated our method by reversing the complex addressing functions on a diverse set of Intel processors. This set encompasses Sandy Bridge, Ivy Bridge and Haswell micro-architectures, with different number of cores, for mobile and server ranges of processors. We show the correctness of our function by building a covert channel. Finally, we discuss how other attacks benefit from knowing the complex addressing of a cache, such as sandboxed rowhammer.

...read moreread less

161 citations

Journal Article•DOI•

CacheAudit: A Tool for the Static Analysis of Cache Side Channels

[...]

Goran Doychev¹, Boris Köpf¹, Laurent Mauborgne, Jan Reineke²•Institutions (2)

IMDEA¹, Saarland University²

09 Jun 2015-ACM Transactions on Information and System Security

TL;DR: CacheAudit as mentioned in this paper analyzes cache side channels by observing cache states, traces of hits and misses, and execution times, and derives formal, quantitative security guarantees for a comprehensive set of side-channel adversaries.

...read moreread less

Abstract: We present CacheAudit, a versatile framework for the automatic, static analysis of cache side channels. CacheAudit takes as input a program binary and a cache configuration and derives formal, quantitative security guarantees for a comprehensive set of side-channel adversaries, namely, those based on observing cache states, traces of hits and misses, and execution times. Our technical contributions include novel abstractions to efficiently compute precise overapproximations of the possible side-channel observations for each of these adversaries. These approximations then yield upper bounds on the amount of information that is revealed.In case studies, we apply CacheAudit to binary executables of algorithms for sorting and encryption, including the AES implementation from the PolarSSL library, and the reference implementations of the finalists of the eSTREAM stream cipher competition. The results we obtain exhibit the influence of cache size, line size, associativity, replacement policy, and coding style on the security of the executables and include the first formal proofs of security for implementations with countermeasures such as preloading and data-independent memory access patterns.

...read moreread less

159 citations

Proceedings Article•DOI•

Improved approximation of storage-rate tradeoff for caching via new outer bounds

[...]

Avik Sengupta¹, Ravi Tandon¹, T. Charles Clancy¹•Institutions (1)

Virginia Tech¹

14 Jun 2015

TL;DR: A new information theoretic lower bound on the fundamental cache storage vs. transmission rate tradeoff is developed, which strictly improves upon the best known existing bounds.

...read moreread less

Abstract: Caching is a viable solution for alleviating the severe capacity crunch in modern content centric wireless networks. Parts of popular files are pre-stored in users' cache memories such that at times of heavy demand, users can be served locally from their cache content thereby reducing the peak network load. In this work, we consider a central server assisted caching network where files are jointly delivered to users through multicast transmissions. For such a network, we develop a new information theoretic lower bound on the fundamental cache storage vs. transmission rate tradeoff, which strictly improves upon the best known existing bounds. The new bounds are used to establish the approximate storage vs. rate tradeoff of centralized caching to within a constant multiplicative factor of 8.

...read moreread less

131 citations

Proceedings Article•DOI•

Coordinated static and dynamic cache bypassing for GPUs

[...]

Xiaolong Xie¹, Yun Liang¹, Yu Wang², Guangyu Sun¹, Tao Wang¹ - Show less +1 more•Institutions (2)

Peking University¹, Tsinghua University²

09 Mar 2015

TL;DR: In this paper, a coordinated static and dynamic cache bypassing technique is proposed to improve application performance by identifying the global loads that indicate strong preferences for caching or bypassing through profiling.

...read moreread less

Abstract: The massive parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs only employ scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise due to excessive thread contention for cache resource. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this paper, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile-time, we identify the global loads that indicate strong preferences for caching or bypassing through profiling. For the rest global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In CUDA programming model, the threads are divided into work units called thread blocks. Our dynamic bypassing technique modulates the ratio of thread blocks that cache or bypass at run-time. We choose to modulate at thread block level in order to avoid the memory divergence problems. Our approach combines compile-time analysis that determines the cache or bypass preferences for global loads with run-time management that adjusts the ratio of thread blocks that cache or bypass. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average I.32X) performance speedup for a variety of GPU applications.

...read moreread less

129 citations

Proceedings Article•DOI•

Locality-Driven Dynamic GPU Cache Bypassing

[...]

Chao Li¹, Shuaiwen Leon Song², Hongwen Dai¹, Albert Sidelnik³, Siva Kumar Sastry Hari³, Huiyang Zhou¹ - Show less +2 more•Institutions (3)

North Carolina State University¹, Pacific Northwest National Laboratory², Nvidia³

08 Jun 2015

TL;DR: This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.

...read moreread less

Abstract: This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.

...read moreread less

109 citations

Proceedings Article•DOI•

Doppelgänger: a cache for approximate computing

[...]

Joshua San Miguel¹, Jorge Albericio¹, Andreas Moshovos¹, Natalie Enright Jerger¹•Institutions (1)

University of Toronto¹

05 Dec 2015

TL;DR: The Doppelganger cache associates the tags of multiple similar blocks with a single data array entry to reduce the amount of data stored and achieves reductions in LLC area, dynamic energy and leakage energy without harming performance nor incurring high application error.

...read moreread less

Abstract: Modern processors contain large last level caches (LLCs) that consume substantial energy and area yet are imperative for high performance. Cache designs have improved dramatically by considering reference locality. Data values are also a source of optimization. Compression and deduplication exploit data values to use cache storage more efficiently resulting in smaller caches without sacrificing performance. In multi-megabyte LLCs, many identical or similar values may be cached across multiple blocks simultaneously. This redundancy effectively wastes cache capacity. We observe that a large fraction of cache values exhibit approximate similarity. More specifically, values across cache blocks are not identical but are similar. Coupled with approximate computing which observes that some applications can tolerate error or inexactness, we leverage approximate similarity to design a novel LLC architecture: the Doppelganger cache. The Doppelganger cache associates the tags of multiple similar blocks with a single data array entry to reduce the amount of data stored. Our design achieves 1.55×, 2.55× and 1.41 × reductions in LLC area, dynamic energy and leakage energy without harming performance nor incurring high application error.

...read moreread less

107 citations

Proceedings Article•DOI•

BEAR: techniques for mitigating bandwidth bloat in gigascale DRAM caches

[...]

Chiachen Chou¹, Aamer Jaleel², Moinuddin K. Qureshi¹•Institutions (2)

Georgia Institute of Technology¹, Nvidia²

13 Jun 2015

TL;DR: Bandwidth Efficient ARchitecture (BEAR) for DRAM caches integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes, and reduces the bandwidth consumption of DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%.

...read moreread less

Abstract: Die stacking memory technology can enable gigascale DRAM caches that can operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for other secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed for such secondary operations to be negligible, and have almost all the bandwidth be available for transfer of useful data from the DRAM cache to the processor. We evaluate a 1GB DRAM cache, architected as Alloy Cache, and show that even the most bandwidth-efficient proposal for DRAM cache consumes 3.8x bandwidth compared to an idealized DRAM cache that does not consume any bandwidth for secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.

...read moreread less

86 citations

Proceedings Article•DOI•

Priority-based cache allocation in throughput processors

[...]

Dong Li¹, Minsoo Rhu¹, Daniel R. Johnson², Mike O'Connor¹, Mattan Erez¹, Doug Burger³, Donald S. Fussell¹, Stephen W. Redder¹ - Show less +4 more•Institutions (3)

University of Texas at Austin¹, Nvidia², Microsoft³

09 Mar 2015

TL;DR: A priority-based cache allocation (PCAL) that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower priority threads to execute without contending for the cache is proposed.

...read moreread less

Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can Increase contention for various system resources, however, that may result In suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL) that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower priority threads to execute without contending for the cache. By tuning thread-level parallelism while both optimizing caching efficiency as well as other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by an average 17% with maximum 51%.

...read moreread less

84 citations

Proceedings Article•DOI•

A fully associative, tagless DRAM cache

[...]

Yong-Jun Lee¹, Jong-Won Kim¹, Hakbeom Jang¹, Hyung Gyun Yang², Jangwoo Kim², Jinkyu Jeong¹, Jae W. Lee¹ - Show less +3 more•Institutions (2)

Sungkyunkwan University¹, Pohang University of Science and Technology²

13 Jun 2015

TL;DR: By completely eliminating data structures for cache tag management, from either on-die SRAM or inpackage DRAM, the proposed DRAM cache achieves best scalability and hit latency, while maintaining high hit rate of a fully associative cache.

...read moreread less

Abstract: This paper introduces a tagless cache architecture for large in-package DRAM caches. The conventional die-stacked DRAM cache has both a TLB and a cache tag array, which are responsible for virtual-to-physical and physical-to-cache address translation, respectively. We propose to align the granularity of caching with OS page size and take a unified approach to address translation and cache tag management. To this end, we introduce cache-map TLB (cTLB), which stores virtual-to-cache, instead of virtual-to-physical, address mappings. At a TLB miss, the TLB miss handler allocates the requested block into the cache if it is not cached yet, and updates both the page table and cTLB with the virtual-to-cache address mapping. Assuming the availability of large in-package DRAM caches, this ensures that an access to the memory region within the TLB reach always hits in the cache with low hit latency since a TLB access immediately returns the exact location of the requested block in the cache, hence saving a tag-checking operation. The remaining cache space is used as victim cache for memory pages that are recently evicted from cTLB. By completely eliminating data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves best scalability and hit latency, while maintaining high hit rate of a fully associative cache. Our evaluation with 3D Through-Silicon Via (TSV)-based in-package DRAM demonstrates that the proposed cache improves the IPC and energy efficiency by 30.9% and 39.5%, respectively, compared to the baseline with no DRAM cache. These numbers translate to 4.3% and 23.8% improvements over an impractical SRAM-tag cache requiring megabytes of on-die SRAM storage, due to low hit latency and zero energy waste for cache tags.

...read moreread less

83 citations

Patent•

Systems and methods for a file-level cache

[...]

Vikram Joshi, Yang Luan, Michael F. Brown, Hrishikesh A. Vidwans¹•Institutions (1)

SanDisk¹

31 Mar 2015

TL;DR: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to requests of a different respective type and/or granularity.

...read moreread less

Abstract: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to I/O requests of a different respective type and/or granularity. The multi-level cache may comprise a file-level cache that is configured to cache I/O request data at a file-level of granularity. A file-level cache policy may comprise file selection criteria to distinguish cacheable files from non-cacheable files. The file-level cache may monitor I/O requests within a storage stage, and may service I/O requests from a cache device.

...read moreread less

Proceedings Article•DOI•

Exploiting compressed block size as an indicator of future reuse

[...]

Gennady Pekhimenko¹, Tyler J. Huberty¹, Rui Cai¹, Onur Mutlu¹, Phillip B. Gibbons², Michael Kozuch², Todd C. Mowry¹ - Show less +3 more•Institutions (2)

Carnegie Mellon University¹, Intel²

09 Mar 2015

TL;DR: A set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression and a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator.

...read moreread less

Abstract: We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both the size and the expected future reuse. Second, we show that, in some cases, compressed block size can be used as an efficient indicator of the future reuse of a cache block. We use this idea to build a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator. We compare CAMP (and its global variant G-CAMP) to prior on-chip cache management policies (both size-oblivious and size-aware) and find that our mechanisms are more effective in using compressed block size as an extra dimension in cache management decisions. Our results show that the proposed management policies (i) decrease off-chip bandwidth consumption (by 8.7% in single-core), (ii) decrease memory subsystem energy consumption (by 7.2% in single-core) for memory intensive workloads compared to the best prior mechanism, and (iii) improve performance (by 4.9%/9.0%/10.2% on average in single-/two-/four-cor e workload evaluations and up to 20.1%) CAMP is effective for a variety of compression algorithms and different cache designs with local and global replacement strategies.

...read moreread less

Proceedings Article•DOI•

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

[...]

Rachata Ausavarungnirun¹, Saugata Ghose¹, Onur Kayiran², Onur Kayiran³, Gabriel H. Loh², Chita R. Das³, Mahmut Kandemir³, Onur Mutlu¹ - Show less +4 more•Institutions (3)

Carnegie Mellon University¹, Advanced Micro Devices², Pennsylvania State University³

18 Oct 2015

TL;DR: This work proposes a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing, and finds that it delivers an average speedup and higher energy efficiency over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.

...read moreread less

Abstract: In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous memory divergence behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their request miss (low cache utility). Second, a warp retains the same divergence behavior for long periods of execution. Third, due to high memory level parallelism, requests going to the shared cache can incur queuing delays as large as hundreds of cycles, exacerbating the effects of memory divergence. We propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC uses warp divergence characterization to guide three components: (1) a cache bypassing mechanism that exploits the latency tolerance of low cache utility warps to both alleviate queuing delay and increase the hit rate for high cache utility warps, (2) a cache insertion policy that prevents data from highcache utility warps from being prematurely evicted, and (3) a memory controller that prioritizes the few requests received from high cache utility warps to minimize stall time. We compare MeDiC to four cache management techniques, and find that it delivers an average speedup of 21.8%, and 20.1% higher energy efficiency, over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.

...read moreread less

Proceedings Article•DOI•

Talus: A simple way to remove cliffs in cache performance

[...]

Nathan Beckmann¹, Daniel Sanchez¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Feb 2015

TL;DR: Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications, and ensures that as an application is given more cache space, its miss rate decreases in a convex fashion.

...read moreread less

Abstract: Caches often suffer from performance cliffs: minor changes in program behavior or available cache space cause large changes in miss rate. Cliffs hurt performance and complicate cache management. We present Talus,1 a simple scheme that removes these cliffs. Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications. By controlling the sizes of these partitions, Talus ensures that as an application is given more cache space, its miss rate decreases in a convex fashion. We prove that Talus removes performance cliffs, and evaluate it through extensive simulation. Talus adds negligible overheads, improves single-application performance, simplifies partitioning algorithms, and makes cache partitioning more effective and fair.

...read moreread less

Posted Content•

Coded Caching with Heterogenous Cache Sizes

[...]

Sinong Wang, Wenxin Li, Xiaohua Tian, Hui Liu

05 Apr 2015-arXiv: Information Theory

TL;DR: This work investigates the coded caching scheme under the heterogenous cache sizes and proposes a new approach called "smart caching" that addresses the challenge of heterogeneity in the size of the caches.

...read moreread less

Abstract: We investigate the coded caching scheme under the heterogenous cache sizes.

...read moreread less

Proceedings Article•DOI•

Transient and Steady-state Regime of a Family of List-based Cache Replacement Algorithms

[...]

Nicolas Gast¹, Benny Van Houdt²•Institutions (2)

French Institute for Research in Computer Science and Automation¹, University of Antwerp²

15 Jun 2015

TL;DR: This paper disproves the well-known conjecture that the CLIMB algorithm is the optimal finite-memory replacement algorithm under the IRM model and provides guidelines on how to select a replacement algorithm within the family considered such that a good trade-off is achieved between the cache reactivity and its steady-state hit probability.

...read moreread less

Abstract: In this paper we study the performance of a family of cache replacement algorithms. The cache is decomposed into lists. Items enter the cache via the first list. An item enters the cache via the first list and jumps to the next list whenever a hit on it occurs. The classical policies FIFO, RANDOM, CLIMB and its hybrids are obtained as special cases. We present explicit expressions for the cache content distribution and miss probability under the IRM model. We develop an algorithm with a time complexity that is polynomial in the cache size and linear in the number of items to compute the exact miss probability. We introduce lower and upper bounds on the latter that can be computed in a time that is linear in the cache size times the number of items. We further introduce a mean field model to approximate the transient behavior of the miss probability and prove that this model becomes exact as the cache size and number of items tends to infinity. We show that the set of ODEs associated to the mean field model has a unique fixed point that can be used to approximate the miss probability in case the exact computation becomes too time consuming. Using this approximation, we provide guidelines on how to select a replacement algorithm within the family considered such that a good trade-off is achieved between the cache reactivity and its steady-state hit probability. We simulate these cache replacement algorithms on traces of real data and show that they can outperform LRU. Finally, we also disprove the well-known conjecture that the CLIMB algorithm is the optimal finite-memory replacement algorithm under the IRM model.

...read moreread less

Proceedings Article•DOI•

Adaptive GPU cache bypassing

[...]

Yingying Tian¹, Sooraj Puthoor², Joseph L. Greathouse², Bradford M. Beckmann², Daniel A. Jimenez¹ - Show less +1 more•Institutions (2)

Texas A&M University¹, Advanced Micro Devices²

07 Feb 2015

TL;DR: A GPU cache management technique that adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted, resulting in better performance for programs that do not use programmer-managed scratchpad memories.

...read moreread less

Abstract: Modern graphics processing units (GPUs) include hardware- controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general purpose GPU (GPGPU) comput- ing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches be- tween many threads. Caching these streaming data struc- tures needlessly burns power while evicting data that may otherwise fit into the cache. We propose a GPU cache management technique to im- prove the efficiency of small GPU caches while further re- ducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoid- ing needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16KB L1 data cache, dynamic bypassing achieves sim- ilar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%. The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.

...read moreread less

Journal Article•DOI•

An ANFIS-based cache replacement method for mitigating cache pollution attacks in Named Data Networking

[...]

Amin Karami¹, Manel Guerrero-Zapata¹•Institutions (1)

Polytechnic University of Catalonia¹

07 Apr 2015-Computer Networks

TL;DR: A new cache replacement method based on Adaptive Neuro-Fuzzy Inference System (ANFIS) is presented to mitigate the cache pollution attacks in NDN and mitigates them efficiently without very much computational cost as compared to the most common policies.

...read moreread less

Journal Article•DOI•

An Efficient Compiler Framework for Cache Bypassing on GPUs

[...]

Yun Liang¹, Xiaolong Xie¹, Guangyu Sun¹, Deming Chen²•Institutions (2)

Peking University¹, University of Illinois at Urbana–Champaign²

21 Apr 2015-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.

...read moreread less

Abstract: Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs only employ scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up the opportunities for optimizing the cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we present techniques to explore the unified cache and shared memory design space. We integrate our techniques into an automatic compiler framework that leverages parallel thread execution instruction set architecture to enable cache bypassing for GPUs. Experiments evaluation on NVIDIA GTX680 using a variety of applications demonstrates that compared to cache-all and bypass-all solutions, our techniques improve the performance from 4.6% to 13.1% for 16 KB L1 cache.

...read moreread less

Proceedings Article•DOI•

Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture

[...]

Daniel Molka, Daniel Hackenberg, Robert Schöne, Wolfgang E. Nagel

01 Sep 2015

TL;DR: This work has developed sophisticated benchmarks that allow for in-depth investigations with full memory location and coherence state control of the Intel Has well-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers.

...read moreread less

Abstract: A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persevering need for cache coherence. To achieve this, the memory subsystem steadily gains complexity that has evolved to levels beyond comprehension of most application performance analysts. The Intel Has well-EP architecture is such an example. It includes considerable advancements regarding memory hierarchy, on-chip communication, and cache coherence mechanisms compared to the previous generation. We have developed sophisticated benchmarks that allow us to perform in-depth investigations with full memory location and coherence state control. Using these benchmarks we investigate performance data and architectural properties of the Has well-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers. This allows us to further the understanding of such complex designs by documenting implementation details the are either not publicly available at all, or only indirectly documented through patents.

...read moreread less

Patent•

Redundant, fault-tolerant, distributed remote procedure call cache in a storage system

[...]

John Hayes, Robert Lee, Peter Vajgel, Joshua Robinson

27 Apr 2015

TL;DR: In this article, a method of operating a remote procedure call cache in a storage cluster is provided, which includes mirroring the Remote Procedure Call cache of the first storage node in a mirrored Remote Procedure call cache of a second storage node.

...read moreread less

Abstract: A method of operating a remote procedure call cache in a storage cluster is provided. The method includes receiving a remote procedure call at a first storage node having solid-state memory and writing information, relating to the remote procedure call, to a remote procedure call cache of the first storage node. The method includes mirroring the remote procedure call cache of the first storage node in a mirrored remote procedure call cache of a second storage node. A plurality of storage nodes and a storage cluster are also provided.

...read moreread less

Journal Article•DOI•

Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks

[...]

Vivek Seshadri¹, Samihan Yedkar¹, Hongyi Xin¹, Onur Mutlu¹, Phillip B. Gibbons², Michael Kozuch², Todd C. Mowry¹ - Show less +3 more•Institutions (2)

Carnegie Mellon University¹, Intel²

09 Jan 2015-ACM Transactions on Architecture and Code Optimization

TL;DR: Evaluations show that the final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution.

...read moreread less

Abstract: Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution.First, we observe that over 95p of useful prefetches in a wide variety of applications are not reused after the first demand hit (in secondary caches). Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor to predict if a prefetch is accurate or inaccurate. Only predicted-accurate prefetches are inserted into the cache with a high priority.Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49p, and 6p on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.

...read moreread less

Patent•

Application aware cache coherency

[...]

Assaf Natanzon, Brian Lake

27 Mar 2015

TL;DR: In this article, a distributed processing system includes a first site and a second site, each containing at least one device having cache storage, nonvolatile storage, where data in the cache storage of the first site is no longer accessed by the process, the data being read into the cache of the second site in response to the process accessing data in non-volatile memory of the one site prior to being moved to the other site.

...read moreread less

Abstract: A distributed processing system includes a first site and a second site, each containing at least one device having cache storage, nonvolatile storage, where, in response to moving a process running on the processor of the first site to the processor running on the second site, data in the cache storage of the first site is no longer accessed by the process, the data being read into the cache of the storage of the first site in response to the process accessing data in the non-volatile memory of the first site prior to being moved to the second site. A process running on the processor of the first site moving to the processor running on the second site and corresponding cache slots may be detected by parsing the VMFS containing virtual machine disks used by the process.

...read moreread less

Proceedings Article•DOI•

Optimal Cache Partition-Sharing

[...]

Jacob Brock¹, Chencheng Ye¹, Chen Ding¹, Yechen Li², Xiaolin Wang², Yingwei Luo² - Show less +2 more•Institutions (2)

University of Rochester¹, Peking University²

01 Sep 2015

TL;DR: The theory shows that theproblem of partition-sharing is reducible to the problem of partitioning, and the technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness.

...read moreread less

Abstract: When a cache is shared by multiple cores, its space may be allocated either by sharing, partitioning, or both. We call the last case partition-sharing. This paper studies partition-sharing as a general solution, and presents a theory an technique for optimizing partition-sharing. We present a theory and a technique to optimize partition sharing. The theory shows that the problem of partition-sharing is reducible to the problem of partitioning. The technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness. Finally, the paper evaluates the effect of optimal cache sharing and compares it with conventional solutions for thousands of 4-program co-run groups, with nearly 180 million different ways to share the cache by each co-run group. Optimal partition-sharing is on average 26% better than free-for-all sharing, and 98% better than equal partitioning. We also demonstrate the trade-off between optimal partitioning and fair partitioning.

...read moreread less

Proceedings Article•DOI•

Fundamental limits of caching in D2D networks with secure delivery

[...]

Zohaib Hassan Awan¹, Aydin Sezgin¹•Institutions (1)

Ruhr University Bochum¹

08 Jun 2015

TL;DR: This work establishes a coding scheme that not only conforms to the demands of all users but also delivers the contents securely and illustrates that for large number of files and users, the loss incurred due to the imposed secrecy constraints is insignificant.

...read moreread less

Abstract: We study the problem of secure transmission over a caching D2D network In this model, end users can prefetch a part of popular contents in their local cache Users make arbitrary requests from the library of available files and interact with each other to deliver requested contents from the local cache to jointly satisfy their demands The transmission between the users is wiretapped by an external eavesdropper from whom the communication needs to be kept secret For this model, by exploiting the flexibility offered by the local cache storage, we establish a coding scheme that not only conforms to the demands of all users but also delivers the contents securely In comparison to the insecure caching schemes, the coding scheme that we develop in this work illustrates that for large number of files and users, the loss incurred due to the imposed secrecy constraints is insignificant We illustrate our result with the help of some examples

...read moreread less

Patent•

Cache management in RDMA distributed key/value stores based on atomic operations

[...]

Michel H. T. Hack¹, Yufei Ren¹, Yandong Wang¹, Li Zhang¹•Institutions (1)

IBM¹

16 Oct 2015

TL;DR: In this paper, a cache management system performs cache management in a Remote Direct Memory Access (RDMA) key value data store, and determines a popularity of the data item based on a frequency at which the data location is accessed by at least one client.

...read moreread less

Abstract: A cache management system performs cache management in a Remote Direct Memory Access (RDMA) key value data store. The cache management system receives a request from at least one client configured to access a data item stored in a data location of a remote server, and determines a popularity of the data item based on a frequency at which the data location is accessed by the at least one client. The system is further configured to determine a lease period of the data item based on the frequency and assigning the lease period to the data location.

...read moreread less

Proceedings Article•DOI•

Measurement and Analysis of Mobile Web Cache Performance

[...]

Yun Ma¹, Xuanzhe Liu¹, Shuhui Zhang¹, Ruirui Xiang¹, Yunxin Liu², Tao Xie³ - Show less +2 more•Institutions (3)

Peking University¹, Microsoft², University of Illinois at Urbana–Champaign³

18 May 2015

TL;DR: This paper builds a new cache analysis model and study the upper bound of how high percentage of resources could potentially be cached and how effective the caching works in practice, and identifies two major problems -- Redundant Transfer and Miscached Resource, which lead to unsatisfactory cache performance.

...read moreread less

Abstract: The Web browser is a killer app on mobile devices such as smartphones. However, the user experience of mobile Web browsing is undesirable because of the slow resource loading. To improve the performance of Web resource loading, caching has been adopted as a key mechanism. However, the existing passive measurement studies cannot comprehensively characterize the performance of mobile Web caching. For example, most of these studies mainly focus on client-side implementations but not server-side configurations, suffer from biased user behaviors, and fail to study "miscached" resources. To address these issues, in this paper, we present a proactive approach for a comprehensive measurement study on mobile Web cache performance. The key idea of our approach is to proactively crawl resources from hundreds of websites periodically with a fine-grained time interval. Thus, we are able to uncover the resource update history and cache configurations at the server side, and analyze the cache performance in various time granularities. Based on our collected data, we build a new cache analysis model and study the upper bound of how high percentage of resources could potentially be cached and how effective the caching works in practice. We report detailed analysis results of different websites and various types of Web resources, and identify the problems caused by unsatisfactory cache performance. In particular, we identify two major problems -- Redundant Transfer and Miscached Resource, which lead to unsatisfactory cache performance. We investigate three main root causes: Same Content, Heuristic Expiration, and Conservative Expiration Time, and discuss what mobile Web developers can do to mitigate those problems.

...read moreread less

Proceedings Article•DOI•

High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches

[...]

Aamer Jaleel¹, Joseph Nuzman¹, Adrian C. Moga¹, Simon C. Steely¹, Joel Emer² - Show less +1 more•Institutions (2)

Intel¹, Massachusetts Institute of Technology²

09 Mar 2015

TL;DR: This paper investigates increasing the size of smaller private caches in the hierarchy as opposed to increasing the shared LLC to improve average cache access latency for workloads whose working set fits into the larger private cache while retaining the benefits of a shared LLC.

...read moreread less

Abstract: Increasing transistor density enables adding more on-die cache real-estate However, devoting more space to the shared last-level-cache (LLC) causes the memory latency bottleneck to move from memory access latency to shared cache access latency. As such, applications whose working set is larger than the smaller caches spend a large fraction of their execution time on shared cache access latency. To address this problem, this paper investigates increasing the size of smaller private caches in the hierarchy as opposed to increasing the shared LLC. Doing so improves average cache access latency for workloads whose working set fits into the larger private cache while retaining the benefits of a shared LLC. The consequence of increasing the size of private caches is to relax inclusion and build exclusive hierarchies. Thus, for the same total caching capacity, an exclusive cache hierarchy provides better cache access latency. We observe that server workloads benefit tremendously from an exclusive hierarchy with large private caches. This is primarily because large private caches accommodate the large code working-sets of server workloads. For a 16-core CMP, an exclusive cache hierarchy improves server workload performance by 5–12% as compared to an equal capacity inclusive cache hierarchy. The paper also presents directions for further research to maximize performance of exclusive cache hierarchies.

...read moreread less

Proceedings Article•DOI•

On the Analysis of Caches with Pending Interest Tables

[...]

Mostafa Dehghan¹, Bo Jiang¹, Ali Dabirmoghaddam², Don Towsley¹•Institutions (2)

University of Massachusetts Amherst¹, University of California, Santa Cruz²

30 Sep 2015

TL;DR: This paper considers a TTL-based cache with a Pending Interest Table, and analyze the cache hit probability, mean response time perceived by the users, and size of the PIT, among other metrics of interest, and applies the model to analyze traditional caching policies LRU, FIFO, and RANDOM.

...read moreread less

Abstract: Collapsed forwarding has been used in cache systems to reduce the load on servers by aggregating requests for the same content. This technique has made its way into design proposals for the future Internet architecture through a data structure called Pending Interest Table (PIT). A PIT keeps track of interest packets that are received at a cache-router until they are responded to. PITs are considered useful for a variety of reasons e.g., communicating without the knowledge of source and destination, reducing bandwidth usage, better security, etc. Due to the high access frequency to the PIT, it is essential to understand its behavior, and the effect it has on cache performance. In this paper, we consider a TTL-based cache with a Pending Interest Table, and analyze the cache hit probability, mean response time perceived by the users, and size of the PIT, among other metrics of interest. In our analysis, we account for the time it takes for the cache to download a file from the server defined as a random variable. We apply our model to analyze traditional caching policies LRU, FIFO, and RANDOM, and verify the accuracy of our model through numerical simulations.

...read moreread less

Posted Content•

Critical Database Size for Effective Caching

[...]

N Ajaykrishnan¹, Navya S. Prem¹, Vinod M. Prabhakaran², Rahul Vaze²•Institutions (2)

National Institute of Technology, Karnataka¹, Tata Institute of Fundamental Research²

12 Jan 2015-arXiv: Information Theory

TL;DR: It is shown that the effectiveness of caching become small when the number of files becomes comparable to the square of the numberof users, and the question of to what extent caching is effective in reducing the server load is addressed.

...read moreread less

Abstract: Replicating or caching popular content in memories distributed across the network is a technique to reduce peak network loads. Conventionally, the performance gain of caching was thought to result from making part of the requested data available closer to end users. Recently, it has been shown that by using a carefully designed technique to store the contents in the cache and coding across data streams a much more significant gain can be achieved in reducing the network load. Inner and outer bounds on the network load v/s cache memory tradeoff were obtained in (Maddah-Ali and Niesen, 2012). We give an improved outer bound on the network load v/s cache memory tradeoff. We address the question of to what extent caching is effective in reducing the server load when the number of files becomes large as compared to the number of users. We show that the effectiveness of caching become small when the number of files becomes comparable to the square of the number of users.

...read moreread less

Collapse