
Showing papers on "Cache pollution" published in 2015


Proceedings ArticleDOI
12 Aug 2015
TL;DR: Cache Template Attacks automatically profile and exploit cache-based information leakage; an automated attack on the T-table-based AES implementation of OpenSSL is as efficient as state-of-the-art manual cache attacks, and keystroke profiling reduces the entropy per character of lowercase-only passwords from log2(26) = 4.7 to 1.4 bits on Linux systems.
Abstract: Recent work on cache attacks has shown that CPU caches represent a powerful source of information leakage. However, existing attacks require manual identification of vulnerabilities, i.e., data accesses or instruction execution depending on secret information. In this paper, we present Cache Template Attacks. This generic attack technique allows us to profile and exploit cache-based information leakage of any program automatically, without prior knowledge of specific software versions or even specific system information. Cache Template Attacks can be executed online on a remote system without any prior offline computations or measurements. Cache Template Attacks consist of two phases. In the profiling phase, we determine dependencies between the processing of secret information, e.g., specific key inputs or private keys of cryptographic primitives, and specific cache accesses. In the exploitation phase, we derive the secret values based on observed cache accesses. We illustrate the power of the presented approach in several attacks, but also in a useful application for developers. Among the presented attacks is the application of Cache Template Attacks to infer keystrokes and--even more severe--the identification of specific keys on Linux and Windows user interfaces. More specifically, for lowercase only passwords, we can reduce the entropy per character from log2(26) = 4.7 to 1.4 bits on Linux systems. Furthermore, we perform an automated attack on the T-table-based AES implementation of OpenSSL that is as efficient as state-of-the-art manual cache attacks.
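The entropy figures above translate directly into a smaller password search space. A minimal back-of-the-envelope sketch of that arithmetic (the password length is a hypothetical parameter, not taken from the paper):

```python
import math

# Figures from the abstract: for lowercase-only passwords the attack reduces
# the entropy per character from log2(26) ~= 4.7 bits to 1.4 bits on Linux.
full_bits_per_char = math.log2(26)   # ~4.70 bits without the side channel
leaked_bits_per_char = 1.4           # residual entropy reported in the paper

length = 8                           # hypothetical password length
print(f"search space without the attack: 2^{full_bits_per_char * length:.1f} (= 26^{length})")
print(f"search space after profiling:    2^{leaked_bits_per_char * length:.1f} "
      f"(~{2 ** (leaked_bits_per_char * length):.0f} guesses)")
```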

387 citations


Book ChapterDOI
02 Nov 2015
TL;DR: An automatic and generic method for reverse engineering Intel's last-level cache complex addressing, consequently rendering the class of cache attacks highly practical and giving a more precise description of the complex addressing function than previous work.
Abstract: Cache attacks, which exploit differences in timing to perform covert or side channels, are now well understood. Recent works leverage the last level cache to perform cache attacks across cores. This cache is split in slices, with one slice per core. While predicting the slices used by an address is simple in older processors, recent processors are using an undocumented technique called complex addressing. This renders some attacks more difficult and makes other attacks impossible, because of the loss of precision in the prediction of cache collisions. In this paper, we build an automatic and generic method for reverse engineering Intel's last-level cache complex addressing, consequently rendering the class of cache attacks highly practical. Our method relies on CPU hardware performance counters to determine the cache slice an address is mapped to. We show that our method gives a more precise description of the complex addressing function than previous work. We validated our method by reversing the complex addressing functions on a diverse set of Intel processors. This set encompasses Sandy Bridge, Ivy Bridge and Haswell micro-architectures, with different number of cores, for mobile and server ranges of processors. We show the correctness of our function by building a covert channel. Finally, we discuss how other attacks benefit from knowing the complex addressing of a cache, such as sandboxed rowhammer.
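The recovered addressing functions have a simple general shape: each slice-index bit is the XOR (parity) of a subset of physical-address bits. A minimal sketch of that shape with placeholder bit masks (the masks below are illustrative assumptions, not the functions reverse engineered in the paper):

```python
def parity(x: int) -> int:
    # XOR-reduction of the set bits of x.
    p = 0
    while x:
        p ^= x & 1
        x >>= 1
    return p

# Each output bit of the slice index is the parity of a subset of physical
# address bits. These masks are placeholders for illustration only.
SLICE_BIT_MASKS = [0x0F0F0F0000, 0x3355AA0000]   # 2 index bits -> 4 slices

def slice_of(phys_addr: int) -> int:
    return sum(parity(phys_addr & mask) << i
               for i, mask in enumerate(SLICE_BIT_MASKS))
```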

161 citations


Journal ArticleDOI
TL;DR: CacheAudit as mentioned in this paper analyzes cache side channels by observing cache states, traces of hits and misses, and execution times, and derives formal, quantitative security guarantees for a comprehensive set of side-channel adversaries.
Abstract: We present CacheAudit, a versatile framework for the automatic, static analysis of cache side channels. CacheAudit takes as input a program binary and a cache configuration and derives formal, quantitative security guarantees for a comprehensive set of side-channel adversaries, namely, those based on observing cache states, traces of hits and misses, and execution times. Our technical contributions include novel abstractions to efficiently compute precise overapproximations of the possible side-channel observations for each of these adversaries. These approximations then yield upper bounds on the amount of information that is revealed.In case studies, we apply CacheAudit to binary executables of algorithms for sorting and encryption, including the AES implementation from the PolarSSL library, and the reference implementations of the finalists of the eSTREAM stream cipher competition. The results we obtain exhibit the influence of cache size, line size, associativity, replacement policy, and coding style on the security of the executables and include the first formal proofs of security for implementations with countermeasures such as preloading and data-independent memory access patterns.
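The quantitative guarantees reduce to counting distinguishable observations: if the abstraction proves that an adversary can tell apart at most N cache states, hit/miss traces, or timings across all secrets, then at most log2(N) bits can be revealed. A trivial sketch of that bound:

```python
import math

def max_leakage_bits(num_distinguishable_observations: int) -> float:
    # Upper bound on the information revealed to the adversary, in bits.
    return math.log2(num_distinguishable_observations)

# e.g. a static analysis proving at most 16 reachable hit/miss trace classes
print(max_leakage_bits(16))   # -> 4.0 bits revealed at most
```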

159 citations


Proceedings ArticleDOI
14 Jun 2015
TL;DR: A new information theoretic lower bound on the fundamental cache storage vs. transmission rate tradeoff is developed, which strictly improves upon the best known existing bounds.
Abstract: Caching is a viable solution for alleviating the severe capacity crunch in modern content centric wireless networks. Parts of popular files are pre-stored in users' cache memories such that at times of heavy demand, users can be served locally from their cache content thereby reducing the peak network load. In this work, we consider a central server assisted caching network where files are jointly delivered to users through multicast transmissions. For such a network, we develop a new information theoretic lower bound on the fundamental cache storage vs. transmission rate tradeoff, which strictly improves upon the best known existing bounds. The new bounds are used to establish the approximate storage vs. rate tradeoff of centralized caching to within a constant multiplicative factor of 8.

131 citations


Proceedings ArticleDOI
Xiaolong Xie, Yun Liang, Yu Wang, Guangyu Sun, Tao Wang
09 Mar 2015
TL;DR: In this paper, a coordinated static and dynamic cache bypassing technique is proposed to improve application performance by identifying the global loads that indicate strong preferences for caching or bypassing through profiling.
Abstract: The massively parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs only employ scratchpad memory as on-chip memory. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise due to excessive thread contention for cache resources. Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this paper, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile-time, we identify the global loads that indicate strong preferences for caching or bypassing through profiling. For the remaining global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. In the CUDA programming model, the threads are divided into work units called thread blocks. Our dynamic bypassing technique modulates the ratio of thread blocks that cache or bypass at run-time. We choose to modulate at the thread block level in order to avoid memory divergence problems. Our approach combines compile-time analysis that determines the cache or bypass preferences for global loads with run-time management that adjusts the ratio of thread blocks that cache or bypass. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average 1.32X) performance speedup for a variety of GPU applications.
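A rough sketch of the decision logic described above, with tag names and the thread-block selection rule assumed for illustration: loads are tagged at compile time as always-cache, always-bypass, or dynamic, and for dynamic loads only a run-time-tuned fraction of thread blocks uses the cache:

```python
# Hypothetical sketch of the coordinated scheme (names and selection rule are
# assumptions): static tags come from profiling, the ratio is tuned at run-time.
CACHE, BYPASS, DYNAMIC = "cache", "bypass", "dynamic"

def should_cache(load_tag: str, thread_block_id: int,
                 num_thread_blocks: int, cache_ratio: float) -> bool:
    if load_tag == CACHE:
        return True
    if load_tag == BYPASS:
        return False
    # Dynamic load: modulate at thread-block granularity so every thread in a
    # block makes the same decision, avoiding memory divergence within warps.
    return thread_block_id < cache_ratio * num_thread_blocks

# Example: with cache_ratio = 0.25, only the first quarter of thread blocks
# use the cache for this load; the run-time system adjusts the ratio.
print(should_cache(DYNAMIC, thread_block_id=10, num_thread_blocks=64, cache_ratio=0.25))
```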

129 citations


Proceedings ArticleDOI
08 Jun 2015
TL;DR: This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
Abstract: This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.
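A toy software model of the insertion-filtering idea, under assumed sizes and thresholds (this is not the paper's hardware design): tags of recently seen blocks are tracked in a decoupled structure, and data is only allocated in the L1 once a block has been re-referenced within a short reuse window:

```python
from collections import OrderedDict

class LocalityFilteredL1:
    """Toy model: a decoupled tag store remembers recently seen blocks, and
    data is allocated only for blocks re-referenced within a short reuse
    window, keeping low-reuse streams out of the L1 data store."""

    def __init__(self, data_entries=64, tag_entries=256, reuse_window=128):
        self.data = OrderedDict()      # blocks whose data is actually stored
        self.tags = OrderedDict()      # recently seen block -> last access time
        self.data_entries = data_entries
        self.tag_entries = tag_entries
        self.reuse_window = reuse_window
        self.time = 0

    def access(self, block):
        self.time += 1
        hit = block in self.data
        if hit:
            self.data.move_to_end(block)               # LRU update
        else:
            last_seen = self.tags.get(block)
            if last_seen is not None and self.time - last_seen <= self.reuse_window:
                if len(self.data) >= self.data_entries:
                    self.data.popitem(last=False)      # evict LRU data entry
                self.data[block] = True                # low-reuse blocks never get here
        self.tags[block] = self.time
        self.tags.move_to_end(block)
        if len(self.tags) > self.tag_entries:
            self.tags.popitem(last=False)
        return hit
```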

109 citations


Proceedings ArticleDOI
05 Dec 2015
TL;DR: The Doppelganger cache associates the tags of multiple similar blocks with a single data array entry to reduce the amount of data stored and achieves reductions in LLC area, dynamic energy and leakage energy without harming performance or incurring high application error.
Abstract: Modern processors contain large last level caches (LLCs) that consume substantial energy and area yet are imperative for high performance. Cache designs have improved dramatically by considering reference locality. Data values are also a source of optimization. Compression and deduplication exploit data values to use cache storage more efficiently, resulting in smaller caches without sacrificing performance. In multi-megabyte LLCs, many identical or similar values may be cached across multiple blocks simultaneously. This redundancy effectively wastes cache capacity. We observe that a large fraction of cache values exhibit approximate similarity. More specifically, values across cache blocks are not identical but are similar. Coupled with approximate computing, which observes that some applications can tolerate error or inexactness, we leverage approximate similarity to design a novel LLC architecture: the Doppelganger cache. The Doppelganger cache associates the tags of multiple similar blocks with a single data array entry to reduce the amount of data stored. Our design achieves 1.55×, 2.55× and 1.41× reductions in LLC area, dynamic energy and leakage energy without harming performance or incurring high application error.
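A minimal functional sketch of the tag-to-data decoupling described above; the similarity test used here (coarse value quantization) is a stand-in assumption, not the paper's actual similarity mechanism:

```python
class DoppelgangerLLC:
    """Toy sketch: many tags can point at one data entry when their values are
    approximately similar. Coarse quantization stands in for similarity."""

    def __init__(self, quantum=16):
        self.quantum = quantum
        self.tag_array = {}     # block address -> key of shared data entry
        self.data_array = {}    # key -> representative (approximate) value

    def _approx_key(self, value: int) -> int:
        return value // self.quantum        # similar values map to the same key

    def insert(self, addr: int, value: int):
        key = self._approx_key(value)
        self.data_array.setdefault(key, value)   # one entry serves similar blocks
        self.tag_array[addr] = key

    def read(self, addr: int) -> int:
        return self.data_array[self.tag_array[addr]]   # approximate, not exact

llc = DoppelgangerLLC()
llc.insert(0x100, 1000)
llc.insert(0x200, 1007)          # similar value: shares the 0x100 data entry
print(llc.read(0x200))           # -> 1000 (approximately correct)
```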

107 citations


Proceedings ArticleDOI
13 Jun 2015
TL;DR: Bandwidth Efficient ARchitecture (BEAR) for DRAM caches integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes, and reduces the bandwidth consumption of DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%.
Abstract: Die stacking memory technology can enable gigascale DRAM caches that can operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for other secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed for such secondary operations to be negligible, and have almost all the bandwidth be available for transfer of useful data from the DRAM cache to the processor. We evaluate a 1GB DRAM cache, architected as Alloy Cache, and show that even the most bandwidth-efficient proposal for DRAM cache consumes 3.8x bandwidth compared to an idealized DRAM cache that does not consume any bandwidth for secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.

86 citations


Proceedings ArticleDOI
09 Mar 2015
TL;DR: A priority-based cache allocation (PCAL) that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower priority threads to execute without contending for the cache is proposed.
Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL) that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower priority threads to execute without contending for the cache. By tuning thread-level parallelism while optimizing both caching efficiency and other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by an average of 17% and a maximum of 51%.
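A simplified sketch of the allocation split, with structure assumed for illustration: low-priority threads still issue requests and receive data, but only high-priority threads allocate cache lines on a miss:

```python
class SimpleCache:
    """Minimal FIFO cache used only to illustrate the allocation policy."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = []

    def fill(self, block):
        if block in self.lines:
            return
        if len(self.lines) >= self.capacity:
            self.lines.pop(0)               # evict oldest line
        self.lines.append(block)

def on_memory_request(cache, block, high_priority: bool) -> bool:
    """PCAL-style handling (sketch): only high-priority threads allocate on a
    miss; low-priority threads still get their data but leave the cache alone."""
    hit = block in cache.lines
    if not hit and high_priority:
        cache.fill(block)
    return hit
```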

84 citations


Proceedings ArticleDOI
13 Jun 2015
TL;DR: By completely eliminating data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves the best scalability and hit latency, while maintaining the high hit rate of a fully associative cache.
Abstract: This paper introduces a tagless cache architecture for large in-package DRAM caches. The conventional die-stacked DRAM cache has both a TLB and a cache tag array, which are responsible for virtual-to-physical and physical-to-cache address translation, respectively. We propose to align the granularity of caching with the OS page size and take a unified approach to address translation and cache tag management. To this end, we introduce cache-map TLB (cTLB), which stores virtual-to-cache, instead of virtual-to-physical, address mappings. At a TLB miss, the TLB miss handler allocates the requested block into the cache if it is not cached yet, and updates both the page table and cTLB with the virtual-to-cache address mapping. Assuming the availability of large in-package DRAM caches, this ensures that an access to the memory region within the TLB reach always hits in the cache with low hit latency since a TLB access immediately returns the exact location of the requested block in the cache, hence saving a tag-checking operation. The remaining cache space is used as a victim cache for memory pages that are recently evicted from cTLB. By completely eliminating data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves the best scalability and hit latency, while maintaining the high hit rate of a fully associative cache. Our evaluation with 3D Through-Silicon Via (TSV)-based in-package DRAM demonstrates that the proposed cache improves the IPC and energy efficiency by 30.9% and 39.5%, respectively, compared to the baseline with no DRAM cache. These numbers translate to 4.3% and 23.8% improvements over an impractical SRAM-tag cache requiring megabytes of on-die SRAM storage, due to low hit latency and zero energy waste for cache tags.
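A rough functional sketch of the cTLB idea described above (structures, sizes, and the eviction policy are simplified assumptions): a TLB hit returns the DRAM-cache frame directly, so no tag check is needed, and allocation happens in the TLB-miss handler:

```python
class TaglessDramCache:
    """Functional sketch: the TLB maps virtual pages straight to DRAM-cache
    page frames; a TLB miss allocates the page and installs the mapping."""

    def __init__(self, num_cache_pages=4):
        self.free = list(range(num_cache_pages))  # free DRAM-cache page frames
        self.page_table = {}                      # vpage -> frame (virtual-to-cache)
        self.ctlb = {}                            # small, fast copy of hot mappings

    def access(self, vpage):
        if vpage in self.ctlb:                    # TLB hit == guaranteed cache hit
            return self.ctlb[vpage]
        frame = self.page_table.get(vpage)        # TLB miss: walk the page table
        if frame is None:                         # page not cached yet
            frame = self.free.pop() if self.free else self._evict()
            self.page_table[vpage] = frame        # (data fill from memory not shown)
        self.ctlb[vpage] = frame                  # cTLB capacity limit not modeled
        return frame

    def _evict(self):
        victim, frame = next(iter(self.page_table.items()))   # arbitrary victim
        del self.page_table[victim]
        self.ctlb.pop(victim, None)
        return frame
```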

83 citations


Patent
31 Mar 2015
TL;DR: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to requests of a different respective type and/or granularity.
Abstract: A multi-level cache comprises a plurality of cache levels, each configured to cache I/O request data pertaining to I/O requests of a different respective type and/or granularity. The multi-level cache may comprise a file-level cache that is configured to cache I/O request data at a file-level of granularity. A file-level cache policy may comprise file selection criteria to distinguish cacheable files from non-cacheable files. The file-level cache may monitor I/O requests within a storage stack, and may service I/O requests from a cache device.

Proceedings ArticleDOI
09 Mar 2015
TL;DR: A set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression and a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator.
Abstract: We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both the size and the expected future reuse. Second, we show that, in some cases, compressed block size can be used as an efficient indicator of the future reuse of a cache block. We use this idea to build a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator. We compare CAMP (and its global variant G-CAMP) to prior on-chip cache management policies (both size-oblivious and size-aware) and find that our mechanisms are more effective in using compressed block size as an extra dimension in cache management decisions. Our results show that the proposed management policies (i) decrease off-chip bandwidth consumption (by 8.7% in single-core), (ii) decrease memory subsystem energy consumption (by 7.2% in single-core) for memory intensive workloads compared to the best prior mechanism, and (iii) improve performance (by 4.9%/9.0%/10.2% on average in single-/two-/four-core workload evaluations and up to 20.1%). CAMP is effective for a variety of compression algorithms and different cache designs with local and global replacement strategies.
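A compact sketch of the two policies, with made-up data structures: Minimal-Value Eviction scores each block by expected reuse relative to its compressed size, and Size-based Insertion uses compressed size as a reuse hint:

```python
def mve_victim(blocks):
    """Minimal-Value Eviction (sketch): evict the block whose expected future
    reuse is smallest relative to the space its compressed form occupies.
    `blocks` is a list of dicts with 'expected_reuse' and 'compressed_size'."""
    return min(blocks, key=lambda b: b["expected_reuse"] / b["compressed_size"])

def sip_insert_priority(compressed_size, high_reuse_sizes):
    """Size-based Insertion Policy (sketch): compressed sizes observed to
    correlate with high reuse for this workload get high insertion priority."""
    return "high" if compressed_size in high_reuse_sizes else "low"

victim = mve_victim([
    {"addr": 0xA0, "expected_reuse": 0.9, "compressed_size": 16},
    {"addr": 0xB0, "expected_reuse": 0.2, "compressed_size": 64},   # least value
])
print(hex(victim["addr"]), sip_insert_priority(16, high_reuse_sizes={8, 16}))
```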

Proceedings ArticleDOI
26 Aug 2015
TL;DR: In this paper, the authors present a tool that recovers the slice selection algorithm for Intel processors by using cache access information to derive equations that allow the reconstruction of the applied slice selection algorithm.
Abstract: Dividing last level caches into slices is a popular method to prevent memory accesses from becoming a bottleneck on modern multicore processors. In order to assess and understand the benefits of cache slicing in detail, precise knowledge of implementation details such as the slice selection algorithm is of high importance. However, slice selection methods are mostly unstudied, and processor manufacturers choose not to publish their designs, nor their design rationale. In this paper, we present a tool that allows recovering the slice selection algorithm for Intel processors. The tool uses cache access information to derive equations that allow the reconstruction of the applied slice selection algorithm. Thereby, the tool enables further exploration of the behavior of modern caches. The tool is successfully applied to a range of Intel CPUs with different numbers of slices and architectures. Results show that slice selection algorithms have become more complex over time by involving an increasing number of bits of the physical address. We also demonstrate that among the most recent processors, the slice selection algorithm depends on the number of CPU cores rather than the processor model.

Proceedings ArticleDOI
18 Oct 2015
TL;DR: This work proposes a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing, and finds that it delivers an average speedup of 21.8% and 20.1% higher energy efficiency over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.
Abstract: In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous memory divergence behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their requests miss (low cache utility). Second, a warp retains the same divergence behavior for long periods of execution. Third, due to high memory level parallelism, requests going to the shared cache can incur queuing delays as large as hundreds of cycles, exacerbating the effects of memory divergence. We propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC uses warp divergence characterization to guide three components: (1) a cache bypassing mechanism that exploits the latency tolerance of low cache utility warps to both alleviate queuing delay and increase the hit rate for high cache utility warps, (2) a cache insertion policy that prevents data from high cache utility warps from being prematurely evicted, and (3) a memory controller that prioritizes the few requests received from high cache utility warps to minimize stall time. We compare MeDiC to four cache management techniques, and find that it delivers an average speedup of 21.8%, and 20.1% higher energy efficiency, over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.
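An illustrative sketch of how the warp characterization drives the three components (structure and thresholds assumed): warps whose shared-cache requests mostly hit are treated as high cache utility, and each request is routed accordingly:

```python
def warp_type(hits: int, accesses: int, threshold: float = 0.5) -> str:
    """Classify a warp by its shared-cache hit ratio (threshold is assumed)."""
    return "high-utility" if accesses and hits / accesses >= threshold else "low-utility"

def route_request(wtype: str) -> dict:
    """Sketch of how the three MeDiC components use the classification."""
    if wtype == "low-utility":
        # Latency-tolerant warps bypass the shared cache to cut queuing delay.
        return {"bypass_cache": True, "insert_priority": None, "memctrl_priority": "normal"}
    # High-utility warps are cached with high priority and served first.
    return {"bypass_cache": False, "insert_priority": "high", "memctrl_priority": "high"}

print(route_request(warp_type(hits=45, accesses=50)))
```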

Proceedings ArticleDOI
01 Feb 2015
TL;DR: Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications, and ensures that as an application is given more cache space, its miss rate decreases in a convex fashion.
Abstract: Caches often suffer from performance cliffs: minor changes in program behavior or available cache space cause large changes in miss rate. Cliffs hurt performance and complicate cache management. We present Talus, a simple scheme that removes these cliffs. Talus works by dividing a single application's access stream into two partitions, unlike prior work that partitions among competing applications. By controlling the sizes of these partitions, Talus ensures that as an application is given more cache space, its miss rate decreases in a convex fashion. We prove that Talus removes performance cliffs, and evaluate it through extensive simulation. Talus adds negligible overheads, improves single-application performance, simplifies partitioning algorithms, and makes cache partitioning more effective and fair.
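The convexity guarantee comes down to interpolating between two points of the application's miss-rate curve by splitting its access stream. A simplified sketch of that arithmetic, following the high-level idea in the abstract (the partition-sizing rule here is a simplification, and the example miss curve is hypothetical):

```python
def interpolated_miss_rate(miss_curve, s1, s2, s):
    """Talus-style convexification (simplified sketch): split the access stream
    so a fraction alpha of it sees a partition scaled to alpha*s1 and the rest
    sees (1-alpha)*s2; the cache of size s = alpha*s1 + (1-alpha)*s2 then
    behaves like the straight line between the two points of the miss curve."""
    alpha = (s2 - s) / (s2 - s1)
    return alpha * miss_curve(s1) + (1 - alpha) * miss_curve(s2)

# Example with a hypothetical miss curve that has a cliff between 4 and 8 MB.
cliffy = {2: 0.40, 4: 0.38, 8: 0.05, 16: 0.04}
print(interpolated_miss_rate(cliffy.get, 4, 8, 6))   # -> 0.215, on the convex hull
```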

Proceedings ArticleDOI
07 Feb 2015
TL;DR: A GPU cache management technique that adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted, resulting in better performance for programs that do not use programmer-managed scratchpad memories.
Abstract: Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general purpose GPU (GPGPU) computing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches between many threads. Caching these streaming data structures needlessly burns power while evicting data that may otherwise fit into the cache. We propose a GPU cache management technique to improve the efficiency of small GPU caches while further reducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoiding needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16KB L1 data cache, dynamic bypassing achieves similar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%. The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.

Journal ArticleDOI
TL;DR: A new cache replacement method based on the Adaptive Neuro-Fuzzy Inference System (ANFIS) is presented to mitigate cache pollution attacks in NDN; it mitigates these attacks efficiently without much computational cost compared to the most common policies.

Proceedings ArticleDOI
20 Aug 2015
TL;DR: This paper proposes a software-based mechanism, Blurred Persistence, to blur the volatility-persistence boundary, so as to reduce the overhead in transaction support and improve cache efficiency.
Abstract: Persistent memory provides data persistence at main memory level and enables memory-level storage systems. To ensure consistency of the storage systems, memory writes need to be transactional and are carefully moved across the boundary between the volatile CPU cache and the persistent memory. Unfortunately, the CPU cache is hardware-controlled, and it incurs high overhead for programs to track and move data blocks from being volatile to persistent. In this paper, we propose a software-based mechanism, Blurred Persistence, to blur the volatility-persistence boundary, so as to reduce the overhead in transaction support. Blurred Persistence consists of two techniques. First, Execution in Log executes a transaction in the log to eliminate duplicated data copies for execution. It allows the persistence of volatile uncommitted data, which can be detected by reorganizing the log structure. Second, Volatile Checkpoint with Bulk Persistence allows the committed data to aggressively stay volatile by leveraging the data durability in the log, as long as the commit order across threads is kept. By doing so, it reduces the frequency of forced persistence and improves cache efficiency. Evaluations show that our mechanism improves system performance by 56.3% to 143.7% for a variety of workloads.

Journal ArticleDOI
TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
Abstract: Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs only employ scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up the opportunities for optimizing the cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we present techniques to explore the unified cache and shared memory design space. We integrate our techniques into an automatic compiler framework that leverages the parallel thread execution instruction set architecture to enable cache bypassing for GPUs. Experimental evaluation on an NVIDIA GTX680 using a variety of applications demonstrates that compared to cache-all and bypass-all solutions, our techniques improve the performance from 4.6% to 13.1% for a 16 KB L1 cache.

Journal ArticleDOI
TL;DR: Preliminary results show how adapting guard-bands to application characteristics can help the system save a significant amount of cache leakage energy while still generating acceptable quality results.
Abstract: While the memory subsystem is already a major contributor to the energy consumption of embedded systems, the guard-banding required for masking the effects of ever-increasing manufacturing variations in memories imposes even more energy overhead. In this letter, we explore how partially-forgetful memories can be used, by exploiting the intrinsic tolerance of a vast class of applications to some level of error, to relax this guard-banding in memories. We discuss the challenges to be addressed and introduce relaxed cache as an exemplar to address these challenges for partially-forgetful SRAM caches. Preliminary results show how adapting guard-bands to application characteristics can help the system save a significant amount of cache leakage energy (up to 74%) while still generating acceptable quality results.

Journal ArticleDOI
TL;DR: The paper describes the new memory-buffer chip, called Centaur, providing up to 128 MB of eDRAM (embedded dynamic random-access memory) buffer cache per processor, along with an improved DRAM (dynamic random-access memory) scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency.
Abstract: In this paper, we describe the IBM POWER8™ cache, interconnect, memory, and input/output subsystems, collectively referred to as the “nest.” This paper focuses on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems, up to large 16-processor-socket, 192-core enterprise rack servers. A key aspect of the design has been increasing the end-to-end data and coherence bandwidth of the system, now featuring more than twice the bandwidth of the POWER7® processor. The paper describes the new memory-buffer chip, called Centaur, providing up to 128 MB of eDRAM (embedded dynamic random-access memory) buffer cache per processor, along with an improved DRAM (dynamic random-access memory) scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements and the transition to directly integrated PCIe® (PCI Express®) support, as well as additions to the cache subsystem to support higher levels of virtualization and scalability including snoop filtering and cache sharing.

Proceedings ArticleDOI
01 Sep 2015
TL;DR: This work has developed sophisticated benchmarks that allow for in-depth investigations with full memory location and coherence state control of the Intel Haswell-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers.
Abstract: A major challenge in the design of contemporary microprocessors is the increasing number of cores in conjunction with the persevering need for cache coherence. To achieve this, the memory subsystem steadily gains complexity that has evolved to levels beyond comprehension of most application performance analysts. The Intel Haswell-EP architecture is such an example. It includes considerable advancements regarding memory hierarchy, on-chip communication, and cache coherence mechanisms compared to the previous generation. We have developed sophisticated benchmarks that allow us to perform in-depth investigations with full memory location and coherence state control. Using these benchmarks we investigate performance data and architectural properties of the Haswell-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers. This allows us to further the understanding of such complex designs by documenting implementation details that are either not publicly available at all, or only indirectly documented through patents.

Journal ArticleDOI
TL;DR: Evaluations show that the final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution.
Abstract: Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution. First, we observe that over 95% of useful prefetches in a wide variety of applications are not reused after the first demand hit (in secondary caches). Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor to predict if a prefetch is accurate or inaccurate. Only predicted-accurate prefetches are inserted into the cache with a high priority. Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49%, and 6% on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.
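A minimal sketch of the two mechanisms on a per-set LRU stack (the cache model and the Block class below are simplified assumptions): a prefetched block is demoted to the lowest priority on its first demand hit, and prefetches predicted inaccurate are inserted at the lowest priority:

```python
class Block:
    def __init__(self, addr, prefetched=False):
        self.addr = addr
        self.prefetched = prefetched

def on_demand_hit(block, lru_stack):
    """Demote a prefetched block to the lowest priority on its first demand hit,
    since the vast majority of useful prefetches are not reused afterwards."""
    if block.prefetched:
        lru_stack.remove(block)
        lru_stack.insert(0, block)      # index 0 = next eviction candidate
        block.prefetched = False

def on_prefetch_fill(block, lru_stack, predicted_accurate: bool):
    """Insert predicted-accurate prefetches with high priority only; prefetches
    predicted inaccurate enter at the lowest priority to limit pollution."""
    block.prefetched = True
    if predicted_accurate:
        lru_stack.append(block)         # MRU (high-priority) position
    else:
        lru_stack.insert(0, block)      # LRU (low-priority) position
```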

Patent
27 Mar 2015
TL;DR: A distributed processing system includes a first site and a second site, each containing at least one device having cache storage and nonvolatile storage, where data in the cache storage of the first site is no longer accessed by a process after the process moves, the data having been read into the cache storage of the first site in response to the process accessing data in the non-volatile memory of the first site prior to being moved to the second site.
Abstract: A distributed processing system includes a first site and a second site, each containing at least one device having cache storage and nonvolatile storage, where, in response to moving a process running on the processor of the first site to the processor running on the second site, data in the cache storage of the first site is no longer accessed by the process, the data having been read into the cache storage of the first site in response to the process accessing data in the non-volatile memory of the first site prior to being moved to the second site. The movement of a process running on the processor of the first site to the processor running on the second site, and the corresponding cache slots, may be detected by parsing the VMFS containing virtual machine disks used by the process.

Proceedings ArticleDOI
01 Sep 2015
TL;DR: The theory shows that the problem of partition-sharing is reducible to the problem of partitioning, and the technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness.
Abstract: When a cache is shared by multiple cores, its space may be allocated either by sharing, partitioning, or both. We call the last case partition-sharing. This paper studies partition-sharing as a general solution, and presents a theory and a technique for optimizing partition-sharing. The theory shows that the problem of partition-sharing is reducible to the problem of partitioning. The technique uses dynamic programming to optimize partitioning for overall miss ratio, and for two different kinds of fairness. Finally, the paper evaluates the effect of optimal cache sharing and compares it with conventional solutions for thousands of 4-program co-run groups, with nearly 180 million different ways to share the cache by each co-run group. Optimal partition-sharing is on average 26% better than free-for-all sharing, and 98% better than equal partitioning. We also demonstrate the trade-off between optimal partitioning and fair partitioning.
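The partitioning building block that the reduction targets is the classic dynamic program over per-program miss-ratio curves. A generic sketch (not the paper's exact formulation; `misses[p][w]` is assumed to give program p's misses when allocated w ways):

```python
def optimal_partition(misses, total_ways):
    """Way-partitioning dynamic program (generic sketch): misses[p][w] is the
    miss count of program p when given w ways. Returns the minimal total
    misses and the corresponding per-program way allocation."""
    INF = float("inf")
    best = {0: (0.0, [])}                  # ways used so far -> (misses, allocation)
    for per_program_curve in misses:
        new_best = {}
        for used, (cost, alloc) in best.items():
            for w in range(total_ways - used + 1):
                candidate = (cost + per_program_curve[w], alloc + [w])
                if candidate[0] < new_best.get(used + w, (INF, None))[0]:
                    new_best[used + w] = candidate
        best = new_best
    return min(best.values())

# Example: two programs, 4 ways, miss counts indexed by allocated ways 0..4.
print(optimal_partition([[90, 50, 30, 25, 24], [80, 70, 65, 20, 18]], total_ways=4))
```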

Patent
10 Mar 2015
TL;DR: In this article, a method for controlling a cache having a volatile memory and a non-volatile memory during a power up sequence is provided, which includes receiving, at a controller configured to control the cache and a storage device associated with the cache, a signal indicating whether the non-vatile memory includes dirty data copied from the volatile memory to the nonvivo memory during power down sequence, the dirty data including data that has not been stored in the storage device.
Abstract: In some embodiments, a method for controlling a cache having a volatile memory and a non-volatile memory during a power up sequence is provided. The method includes receiving, at a controller configured to control the cache and a storage device associated with the cache, a signal indicating whether the non-volatile memory includes dirty data copied from the volatile memory to the non-volatile memory during a power down sequence, the dirty data including data that has not been stored in the storage device. In response to the received signal, the dirty data is restored from the non-volatile memory to the volatile memory, and flushed from the volatile memory to the storage device.

Proceedings ArticleDOI
13 Jun 2015
TL;DR: The Load Slice Core extends the efficient in-order, stall-on-use core with a second in- order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline, thus providing an alternative direction for future many-core processors.
Abstract: Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.

Proceedings ArticleDOI
08 Jun 2015
TL;DR: This work establishes a coding scheme that not only conforms to the demands of all users but also delivers the contents securely, and illustrates that for a large number of files and users, the loss incurred due to the imposed secrecy constraints is insignificant.
Abstract: We study the problem of secure transmission over a caching D2D network. In this model, end users can prefetch a part of popular contents in their local cache. Users make arbitrary requests from the library of available files and interact with each other to deliver requested contents from the local cache to jointly satisfy their demands. The transmission between the users is wiretapped by an external eavesdropper from whom the communication needs to be kept secret. For this model, by exploiting the flexibility offered by the local cache storage, we establish a coding scheme that not only conforms to the demands of all users but also delivers the contents securely. In comparison to insecure caching schemes, the coding scheme that we develop in this work illustrates that for a large number of files and users, the loss incurred due to the imposed secrecy constraints is insignificant. We illustrate our result with the help of some examples.

Proceedings ArticleDOI
09 Mar 2015
TL;DR: CiDRA, a cache-inspired DRAM resilience architecture, is proposed; it substantially reduces the area and latency overheads of correcting bit errors at random locations caused by faulty cells.
Abstract: Although aggressive technology scaling has allowed manufacturers to integrate gigabits of cells into a cost-sensitive main memory DRAM device, these cells have become more defect-prone. With increased cell failure rates, conventional solutions such as populating spare DRAM rows and relying on error-correcting codes (ECCs) have shown limited success due to high area overhead, the latency penalties of data coding, and interference between ECC within a device (in-DRAM ECC) and other ECC across devices (rank-level ECC). In this paper, we propose CiDRA, a cache-inspired DRAM resilience architecture, which substantially reduces the area and latency overheads of correcting bit errors at random locations due to these faulty cells. We put a small SRAM cache within a DRAM device to replace accesses to the addresses including the faulty cells with ones that correspond to the cache data array. This CiDRA cache is paired with a Bloom filter to minimize the energy overhead of accessing the cache tags for every DRAM access and is also partitioned into small pieces, each being associated with the I/O pads for better area efficiency. Both the cache and DRAM banks are accessed in parallel while the banks are much slower. Consequently, the cache and filter are not in the critical path for normal DRAM accesses and incur no latency overhead. We also enhance the traditional in-DRAM ECC with error position bits and the appropriate error detecting capability while preventing interference with the traditional rank-level ECC scheme. By combining this enhanced in-DRAM ECC with the cache and Bloom filter, CiDRA becomes more area efficient because the in-DRAM ECC corrects most bit errors that are sporadic while the cache deals with the remaining relatively few pathological cases.
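A toy sketch of the cache-plus-filter front end described above (hash choices, sizes, and the overall structure are placeholder assumptions): the Bloom filter screens every access so that only probable members pay for a lookup in the small repair cache holding data for addresses with known-faulty cells:

```python
import hashlib

class BloomFilter:
    """Small Bloom filter (sizes and hash count are placeholder choices)."""
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all((self.array[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(key))

class RepairFrontEnd:
    """Sketch: accesses to addresses with faulty cells are redirected to a
    small repair cache; the filter avoids probing its tags on every access."""
    def __init__(self):
        self.filter = BloomFilter()
        self.repair_cache = {}             # faulty address -> corrected data

    def mark_faulty(self, addr, data):
        self.filter.add(addr)
        self.repair_cache[addr] = data

    def read(self, addr, dram_read):
        if addr in self.filter and addr in self.repair_cache:
            return self.repair_cache[addr]     # serviced from the small cache
        return dram_read(addr)                 # normal DRAM access path
```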

Proceedings ArticleDOI
04 Oct 2015
TL;DR: An approximation-aware MLC STT-RAM based cache architecture is proposed, which is partially-protected to restrict the reliability overhead and in turn leverages variable resilience characteristics of different applications for adaptively curtailing the protection overhead under a given error tolerance level to improve the energy-efficiency of the cache.
Abstract: Current manycore processors exhibit large on-chip last-level caches that may reach sizes of 32MB -- 128MB and incur high power/energy consumption. The emerging Multi-Level Cells (MLC) STT-RAM memory technology improves the capacity and energy efficiency issues of large-sized memory banks. However, MLC STT-RAM incurs non-negligible protection overhead to ensure reliable operations when compared to the Single-Level Cells (SLC) STT-RAM. In this paper, we propose an approximation-aware MLC STT-RAM cache architecture, which is partially-protected to restrict the reliability overhead and in turn leverages variable resilience characteristics of different applications for adaptively curtailing the protection overhead under a given error tolerance level. It thereby improves the energy-efficiency of the cache while meeting the reliability requirements. Our cache architecture is equipped with a latency-aware hardware module for double-error correction. To achieve high energy efficiency, approximation-aware read and write policies are proposed that perform approximate storage management while tolerating some errors bounded within the user-provided tolerance level. The architecture also facilitates runtime control on the quality of applications' results. We perform a case study on the next-generation advanced video encoding that exhibit memory-intensive functional blocks with varying resilience properties and support for parallelism. Experimental results demonstrate that our approximation-aware MLC STT-RAM based cache architecture can improve the energy efficiency compared to state-of-the-art fully-protected caches (7%--19%, on average), while incurring minimal quality penalties in the output (-0.219% to -0.426%, on average). Furthermore, our architecture supports complete error protection coverage for all cache data when processing non-resilient application. The hardware overhead to implement our approximation-aware management negligibly affects the energy efficiency (0.15%--1.3% of overhead) and the access latency (only 0.02%--1.56% of overhead).