
Showing papers on "Cache algorithms published in 2014"


Proceedings Article
20 Aug 2014
TL;DR: This paper presents FLUSH+RELOAD, a cache side-channel attack technique that exploits a weakness in the Intel X86 processors to monitor access to memory lines in shared pages and recovers 96.7% of the bits of the secret key by observing a single signature or decryption round.
Abstract: Sharing memory pages between non-trusting processes is a common method of reducing the memory footprint of multi-tenanted systems. In this paper we demonstrate that, due to a weakness in the Intel X86 processors, page sharing exposes processes to information leaks. We present FLUSH+RELOAD, a cache side-channel attack technique that exploits this weakness to monitor access to memory lines in shared pages. Unlike previous cache side-channel attacks, FLUSH+RELOAD targets the Last-Level Cache (i.e., L3 on processors with three cache levels). Consequently, the attack program and the victim do not need to share the execution core. We demonstrate the efficacy of the FLUSH+RELOAD attack by using it to extract the private encryption keys from a victim program running GnuPG 1.4.13. We tested the attack both between two unrelated processes in a single operating system and between processes running in separate virtual machines. On average, the attack is able to recover 96.7% of the bits of the secret key by observing a single signature or decryption round.

1,001 citations
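The FLUSH+RELOAD measurement itself relies on native x86 instructions (clflush, rdtsc), but the inference step reduces to a threshold test on reload latencies. A minimal sketch of that step only; the threshold and timings below are illustrative assumptions, not values from the paper:

```python
# Sketch of the FLUSH+RELOAD decision step: a reload that completes faster
# than a calibrated threshold means the line was cached, i.e. the victim
# touched it since the last flush. All numbers are illustrative.
HIT_THRESHOLD_CYCLES = 100   # hypothetical, machine-specific calibration

def classify_probes(reload_times_cycles):
    """Return True for probe intervals in which the victim accessed the line."""
    return [t < HIT_THRESHOLD_CYCLES for t in reload_times_cycles]

# Fast reloads (~60 cycles) indicate cache hits caused by the victim; slow
# ones (~250 cycles) mean the flushed line stayed out of the cache.
print(classify_probes([62, 248, 55, 251, 260]))   # [True, False, True, False, False]
```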


Proceedings ArticleDOI
10 Jun 2014
TL;DR: In this article, the authors studied the optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity, where the cache content placement is optimized based on the demand history.
Abstract: Optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity is studied. The sBS has a large cache memory and provides content-level selective offloading by delivering high data rate contents to users in its coverage area. The goal of the sBS content controller (CC) is to store the most popular contents in the sBS cache memory such that the maximum amount of data can be fetched directly from the sBS, not relying on the limited backhaul resources during peak traffic periods. If the popularity profile is known in advance, the problem reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known by the CC, and it can only observe the instantaneous demand for the cached content. Hence, the cache content placement is optimised based on the demand history. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while exploiting the limited cache capacity in the best way possible. Three algorithms are studied for this cache content placement problem, leading to different exploitation-exploration trade-offs. We provide extensive numerical simulations in order to study the time-evolution of these algorithms, and the impact of the system parameters, such as the number of files, the number of users, the cache size, and the skewness of the popularity profile, on the performance. It is shown that the proposed algorithms quickly learn the popularity profile for a wide range of system parameters.

322 citations
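The paper's three algorithms are not reproduced here; as a rough illustration of the exploration-exploitation trade-off it describes, a minimal epsilon-greedy sketch with hypothetical file counts, cache size, and a synthetic skewed demand process:

```python
import random
from collections import Counter

def refresh_cache(demand_counts, num_files, cache_size, epsilon=0.1):
    """Choose the files to cache for the next interval: mostly exploit the
    empirical popularity observed so far, occasionally explore other files."""
    ranked = [f for f, _ in demand_counts.most_common()]
    ranked += [f for f in range(num_files) if f not in demand_counts]
    cache = set()
    for _ in range(cache_size):
        if random.random() < epsilon:
            cache.add(random.randrange(num_files))                # explore
        else:
            cache.add(next(f for f in ranked if f not in cache))  # exploit
    return cache

# The controller only observes demand for files it has cached.
num_files, cache_size = 100, 10
demands = Counter()
cache = refresh_cache(demands, num_files, cache_size)
for t in range(1, 1001):
    req = min(int(random.paretovariate(1.2)), num_files) - 1      # skewed popularity
    if req in cache:
        demands[req] += 1
    if t % 100 == 0:                                              # periodic cache refresh
        cache = refresh_cache(demands, num_files, cache_size)
print("cached files:", sorted(cache))
```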


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A novel random fill cache architecture is proposed that replaces demand fetch with random cache fill within a configurable neighborhood window and shows that it provides information-theoretic security against reuse based attacks.
Abstract: Correctly functioning caches have been shown to leak critical secrets like encryption keys, through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side channel attacks: contention based attacks and reuse based attacks. Previous secure cache designs target only contention based attacks, and we show that they cannot defend against reuse based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse based attacks.

217 citations
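A behavioural sketch of the random-fill idea, assuming a toy fully associative cache tracked as a set of line addresses: the demanded line is always served, but the line actually filled is drawn from a configurable neighborhood window around it.

```python
import random

class RandomFillCache:
    """Toy random-fill cache: a demand miss does not fill the missing line;
    instead a random line within [addr - w, addr + w] is brought in."""
    def __init__(self, capacity, window):
        self.capacity, self.window = capacity, window
        self.lines = set()

    def access(self, addr):
        hit = addr in self.lines
        if not hit:
            fill = addr + random.randint(-self.window, self.window)
            self.lines.add(fill)
            if len(self.lines) > self.capacity:
                self.lines.remove(random.choice(tuple(self.lines)))  # random eviction
        return hit   # the demanded data is returned to the CPU either way

cache = RandomFillCache(capacity=64, window=8)
hits = sum(cache.access(random.randrange(256)) for _ in range(10_000))
print(f"hit rate: {hits / 10_000:.2f}")
```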


Proceedings ArticleDOI
13 Dec 2014
TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities and employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads.
Abstract: Recent research advocates large die-stacked DRAM caches in many core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.

162 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper proposes the memory request prioritization buffer (MRPB), a hardware structure that improves caching efficiency of massively parallel workloads by applying two prioritization methods (request reordering and cache bypassing) to memory requests before they access a cache.
Abstract: Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to provide smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves caching efficiency of massively parallel workloads by applying two prioritization methods—request reordering and cache bypassing—to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.

158 citations
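A high-level sketch of the two prioritization methods named above, request reordering and cache bypassing, using a hypothetical queue-per-cache-set model rather than the paper's hardware design:

```python
from collections import defaultdict, deque

class MRPBSketch:
    """Toy memory-request prioritization buffer: requests are queued per
    cache set so that requests to the same set drain together (reordering);
    when a queue overflows, the request bypasses the cache instead."""
    def __init__(self, num_sets, queue_depth):
        self.num_sets, self.queue_depth = num_sets, queue_depth
        self.queues = defaultdict(deque)

    def insert(self, addr):
        q = self.queues[addr % self.num_sets]
        if len(q) >= self.queue_depth:
            return ("bypass", addr)      # go straight to the next memory level
        q.append(addr)
        return ("queued", addr)

    def drain(self):
        """Release queued requests set by set, a more cache-friendly order."""
        for set_id in sorted(self.queues):
            while self.queues[set_id]:
                yield self.queues[set_id].popleft()

buf = MRPBSketch(num_sets=4, queue_depth=2)
print([buf.insert(a) for a in (0, 4, 8, 1, 5, 2)])
print(list(buf.drain()))
```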


Proceedings ArticleDOI
10 Jun 2014
TL;DR: It is proved that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request.
Abstract: We consider a basic content distribution scenario consisting of a single origin server connected through a shared bottleneck link to a number of users each equipped with a cache of finite memory. The users issue a sequence of content requests from a set of popular files, and the goal is to operate the caches as well as the server such that these requests are satisfied with the minimum number of bits sent over the shared link. Assuming a basic Markov model for renewing the set of popular files, we characterize approximately the optimal long-term average rate of the shared link. We further prove that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request. To support these theoretical results, we propose an online coded caching scheme termed coded least-recently sent (LRS) and simulate it for a demand time series derived from the dataset made available by Netflix for the Netflix Prize. For this time series, we show that the proposed coded LRS algorithm significantly outperforms the popular least-recently used (LRU) caching algorithm.

155 citations
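The coded LRS scheme itself is not reproduced here; for reference, a compact version of the LRU baseline it is compared against, built on Python's OrderedDict:

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache: a hit moves the item to the MRU end; a miss
    inserts it and evicts from the LRU end when over capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def request(self, key):
        if key in self.items:
            self.items.move_to_end(key)
            return True                        # hit
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)     # evict least recently used
        return False                           # miss

cache = LRUCache(capacity=3)
print([cache.request(k) for k in "abcabdda"])
# [False, False, False, True, True, False, True, True]
```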


Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache and show that caching only the most popular files can be highly suboptimal.
Abstract: We consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache. Knowing the popularity distribution of the files, the goal is to optimally populate the caches, such as to minimize the expected load of the shared link. For a single cache, it is well known that storing the most popular files is optimal in this setting. However, we show here that this is no longer the case for multiple caches. Indeed, caching only the most popular files can be highly suboptimal. Instead, a fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link. We propose such a coded caching scheme and prove that it is close to optimal.

145 citations
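A worked toy instance of the coded idea, not the paper's general scheme: two users and two files, each split into halves, with each user caching one half of every file; a single XOR then serves both demands over the shared link.

```python
# Assumed toy setup: files A and B, users 1 and 2, caches of one file each.
# Placement: user 1 caches (A1, B1); user 2 caches (A2, B2).
# Demands:   user 1 wants A;         user 2 wants B.
# Delivery:  the server broadcasts A2 XOR B1, one half-file instead of two.
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

A, B = b"AAAAaaaa", b"BBBBbbbb"
A1, A2, B1, B2 = A[:4], A[4:], B[:4], B[4:]

broadcast = xor(A2, B1)                 # the only coded transmission

user1_gets_A = A1 + xor(broadcast, B1)  # user 1 cancels B1 from its cache
user2_gets_B = xor(broadcast, A2) + B2  # user 2 cancels A2 from its cache
assert user1_gets_A == A and user2_gets_B == B
print("both demands served with a single half-file on the shared link")
```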


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A specialized cache management policy for GPGPUs is proposed that is coordinated with warp throttling to dynamically control the active number of warps and a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources.
Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.

142 citations
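The paper's runtime detection and predictor are not modeled here; a toy sketch of the coordination idea, with made-up thresholds and statistics, showing bypass driven by reuse distance and warp throttling driven by contention signals:

```python
def should_bypass(predicted_reuse_distance, cache_capacity_lines):
    """Bypass the cache for lines whose predicted reuse distance exceeds
    what the cache can plausibly retain (protects it from contention)."""
    return predicted_reuse_distance > cache_capacity_lines

def adjust_active_warps(active, max_warps, miss_rate, stall_rate,
                        miss_hi=0.6, stall_hi=0.4):
    """Toy warp-throttling rule: shrink the active-warp count when misses or
    resource stalls indicate contention, otherwise grow it back."""
    if miss_rate > miss_hi or stall_rate > stall_hi:
        return max(1, active - 1)
    return min(max_warps, active + 1)

# Illustrative run over made-up per-epoch statistics.
active = 24
for miss_rate, stall_rate in [(0.7, 0.5), (0.65, 0.3), (0.4, 0.2), (0.3, 0.1)]:
    active = adjust_active_warps(active, 48, miss_rate, stall_rate)
    print("active warps:", active)
print("bypass long-reuse line:", should_bypass(512, cache_capacity_lines=256))
```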


Patent
14 Jul 2014
TL;DR: In this article, cache optimization techniques are employed to organize resources within caches such that the most requested content (e.g., the most popular content) is more readily available, and the resources propagate through a cache server hierarchy associated with the service provider.
Abstract: Resource management techniques, such as cache optimization, are employed to organize resources within caches such that the most requested content (e.g., the most popular content) is more readily available. A service provider utilizes content expiration data as indicative of resource popularity. As resources are requested, the resources propagate through a cache server hierarchy associated with the service provider. More frequently requested resources are maintained at edge cache servers based on shorter expiration data that is reset with each repeated request. Less frequently requested resources are maintained at higher levels of a cache server hierarchy based on longer expiration data associated with cache servers higher on the hierarchy.

136 citations
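A behavioural sketch of the expiration-based placement described above, assuming a two-level hierarchy in which the edge uses a short TTL that is reset on every repeated request and the parent tier uses a longer one (the names and TTL values are illustrative):

```python
import time

class TTLCacheLevel:
    """One cache tier; each entry expires ttl_seconds after its last refresh."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.expiry = {}                      # resource -> expiration timestamp

    def get(self, resource, now):
        if self.expiry.get(resource, 0) > now:
            self.expiry[resource] = now + self.ttl   # reset on repeated request
            return True
        return False

    def put(self, resource, now):
        self.expiry[resource] = now + self.ttl

edge, parent = TTLCacheLevel(ttl_seconds=60), TTLCacheLevel(ttl_seconds=3600)

def fetch(resource):
    now = time.time()
    if edge.get(resource, now):
        return "edge hit"                     # popular content stays at the edge
    if parent.get(resource, now):
        edge.put(resource, now)
        return "parent hit"
    parent.put(resource, now)                 # fetched from the origin server
    edge.put(resource, now)
    return "origin fetch"

print(fetch("video.mp4"), fetch("video.mp4"))   # origin fetch, then edge hit
```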


Proceedings ArticleDOI
24 Aug 2014
TL;DR: A memory management framework called COLORIS provides support for both static and dynamic cache partitioning using page coloring; it monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates from exceeding application-specific ranges.
Abstract: Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, requiring predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates exceeding application-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS to color memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.

133 citations
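Page coloring itself is an OS-level mechanism; the index arithmetic behind it fits in a few lines. A sketch with hypothetical cache geometry (the numbers are illustrative, not COLORIS's):

```python
# Hypothetical cache geometry, used only for illustration.
PAGE_SIZE  = 4096           # bytes
LINE_SIZE  = 64             # bytes
CACHE_SIZE = 2 * 1024**2    # 2 MB shared last-level cache
WAYS       = 16

sets_total    = CACHE_SIZE // (LINE_SIZE * WAYS)   # 2048 sets
sets_per_page = PAGE_SIZE // LINE_SIZE             # 64 sets touched by one page
num_colors    = sets_total // sets_per_page        # 32 page colors

def page_color(physical_frame_number):
    """Pages of different colors map to disjoint groups of cache sets."""
    return physical_frame_number % num_colors

def colors_for(app_id, num_apps, colors=num_colors):
    """Static partitioning: each application gets a contiguous slice of colors."""
    per_app = colors // num_apps
    return set(range(app_id * per_app, (app_id + 1) * per_app))

print(num_colors, page_color(0x12345), colors_for(1, num_apps=4))
```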


Journal ArticleDOI
TL;DR: A two-timescale joint optimization of power and cache control is proposed to support real-time video streaming, and the proposed solution is shown to be asymptotically optimal for high SNR and small timeslot duration.
Abstract: We propose a cache-enabled opportunistic cooperative MIMO (CoMP) framework for wireless video streaming. By caching a portion of the video files at the relays (RS) using a novel MDS-coded random cache scheme, the base station (BS) and RSs opportunistically employ CoMP to achieve spatial multiplexing gain without expensive payload backhaul. We study a two timescale joint optimization of power and cache control to support real-time video streaming. The cache control is to create more CoMP opportunities and is adaptive to the long-term popularity of the video files. The power control is to guarantee the QoS requirements and is adaptive to the channel state information (CSI), the cache state at the RS and the queue state information (QSI) at the users. The joint problem is decomposed into an inner power control problem and an outer cache control problem. We first derive a closed-form power control policy from an approximated Bellman equation. Based on this, we transform the outer problem into a convex stochastic optimization problem and propose a stochastic subgradient algorithm to solve it. Finally, the proposed solution is shown to be asymptotically optimal for high SNR and small timeslot duration. Its superior performance over various baselines is verified by simulations.

Journal ArticleDOI
TL;DR: The aim of this survey is to enable engineers and researchers to get insights into the techniques for improving cache power efficiency and motivate them to invent novel solutions for enabling low-power operation of caches.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: This work extends reuse distance to GPUs by modelling the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, including conditional and non-uniform latencies, cache associativity, miss-status holding-registers, and warp divergence.
Abstract: As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
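The GPU-specific extensions (warps, MSHRs, divergence) need far more machinery, but the underlying stack/reuse distance quantity is simple to state. A minimal sequential sketch:

```python
from collections import OrderedDict

def stack_distances(address_trace):
    """For each access, the number of distinct addresses touched since the
    previous access to the same address (None for cold accesses). A fully
    associative LRU cache with capacity C hits exactly when distance < C."""
    lru, out = OrderedDict(), []
    for addr in address_trace:
        if addr in lru:
            out.append(list(reversed(lru)).index(addr))
            lru.move_to_end(addr)
        else:
            out.append(None)
            lru[addr] = None
    return out

print(stack_distances(["a", "b", "c", "a", "b", "b"]))   # [None, None, None, 2, 2, 0]
```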

Journal ArticleDOI
TL;DR: A Time-To-Live (TTL) based caching model is introduced that assigns a timer to each content stored in the cache and redraws it every time the content is requested (at each hit/miss).
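A minimal sketch of such a TTL cache; the exponential timer distribution is an illustrative assumption, not part of the model above:

```python
import random

class TTLCache:
    """Each content gets a timer when requested; the timer is redrawn at every
    request (hit or miss) and the content is evicted once the timer expires."""
    def __init__(self, mean_ttl):
        self.mean_ttl = mean_ttl
        self.expiry = {}                      # content -> expiration time

    def request(self, content, now):
        hit = self.expiry.get(content, float("-inf")) > now
        # Redraw the timer at each hit/miss (illustrative exponential TTL).
        self.expiry[content] = now + random.expovariate(1.0 / self.mean_ttl)
        return hit

cache = TTLCache(mean_ttl=5.0)
print([cache.request("x", t) for t in (0.0, 1.0, 30.0)])   # most likely [False, True, False]
```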

Proceedings ArticleDOI
24 Aug 2014
TL;DR: It is established that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance, and a simple technique is proposed to throttle the number of sets prefetched; the resulting tag cache can satisfy over 60% of DRAM cache tag accesses on average.
Abstract: 3D-stacking technology has enabled the option of embedding a large DRAM onto the processor. Prior works have proposed to use this as a DRAM cache. Because of its large size (a DRAM cache can be in the order of hundreds of megabytes), the total size of the tags associated with it can also be quite large (in the order of tens of megabytes). The large size of the tags has created a problem. Should we maintain the tags in the DRAM and pay the cost of a costly tag access in the critical path? Or should we maintain the tags in the faster SRAM by paying the area cost of a large SRAM for this purpose? Prior works have primarily chosen the former and proposed a variety of techniques for reducing the cost of a DRAM tag access. In this paper, we first establish (with the help of a study) that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance. Motivated by this study, we ask if it is possible to maintain tags in SRAM without incurring high area overhead. Our key idea is simple. We propose to cache the tags in a small SRAM tag cache - we show that there is enough spatial and temporal locality amongst tag accesses to merit this idea. We propose the ATCache which is a small SRAM tag cache. Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In order to avoid the high miss latency and cache pollution caused by excessive prefetching, we use a simple technique to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.

Journal ArticleDOI
14 Jun 2014
TL;DR: This paper presents, for the first time, a detailed design-space exploration of caches that utilize statistical compression and shows that more aggressive approaches like Huffman coding, which have been neglected in the past due to the high processing overhead for (de)compression, are suitable techniques for caches and memory.
Abstract: Low utilization of on-chip cache capacity limits performance and wastes energy because of the long latency, limited bandwidth, and energy consumption associated with off-chip memory accesses. Value replication is an important source of low capacity utilization. While prior cache compression techniques manage to code frequent values densely, they trade off a high compression ratio for low decompression latency, thus missing opportunities to utilize capacity more effectively. This paper presents, for the first time, a detailed design-space exploration of caches that utilize statistical compression. We show that more aggressive approaches like Huffman coding, which have been neglected in the past due to the high processing overhead for (de)compression, are suitable techniques for caches and memory. Based on our key observation that value locality varies little over time and across applications, we first demonstrate that the overhead of statistics acquisition for code generation is low because new encodings are needed rarely, making it possible to off-load it to software routines. We then show that the high compression ratio obtained by Huffman-coding makes it possible to utilize the performance benefits of 4X larger last-level caches with about 50% lower power consumption than such larger caches.
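A compact sketch of the statistical-compression idea: build a Huffman code from observed word-value frequencies and use it to encode cache-line contents. Doing code generation in software matches the paper's off-loading argument; the sample values below are made up.

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a prefix code (value -> bit string) from value frequencies."""
    heap = [(freq, i, [value]) for i, (value, freq) in enumerate(frequencies.items())]
    codes = {value: "" for value in frequencies}
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for v in group1: codes[v] = "0" + codes[v]
        for v in group2: codes[v] = "1" + codes[v]
        heapq.heappush(heap, (f1 + f2, next_id, group1 + group2))
        next_id += 1
    return codes

# 32-bit words sampled from cache lines; heavily replicated values (e.g. 0)
# receive the shortest codewords.
samples = [0x0, 0x0, 0x0, 0x0, 0xFFFFFFFF, 0xFFFFFFFF, 0x1, 0xDEADBEEF]
codes = huffman_code(Counter(samples))
compressed_bits = sum(len(codes[v]) for v in samples)
print(codes, f"{compressed_bits} bits vs {32 * len(samples)} uncompressed")
```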

Posted Content
TL;DR: In this article, the authors derived an order-optimal scheme which judiciously shares cache memory among files with different popularities, and derived new information-theoretic lower bounds, which use a sliding-window entropy inequality.
Abstract: To address the exponentially rising demand for wireless content, use of caching is emerging as a potential solution. It has been recently established that joint design of content delivery and storage (coded caching) can significantly improve performance over conventional caching. Coded caching is well suited to emerging heterogeneous wireless architectures which consist of a dense deployment of local-coverage wireless access points (APs) with high data rates, along with sparsely-distributed, large-coverage macro-cell base stations (BS). This enables design of coded caching-and-delivery schemes that equip APs with storage, and place content in them in a way that creates coded-multicast opportunities for combining with macro-cell broadcast to satisfy users even with different demands. Such coded-caching schemes have been shown to be order-optimal with respect to the BS transmission rate, for a system with single-level content, i.e., one where all content is uniformly popular. In this work, we consider a system with non-uniform popularity content which is divided into multiple levels, based on varying degrees of popularity. The main contribution of this work is the derivation of an order-optimal scheme which judiciously shares cache memory among files with different popularities. To show order-optimality we derive new information-theoretic lower bounds, which use a sliding-window entropy inequality, effectively creating a non-cutset bound. We also extend the ideas to when users can access multiple caches along with the broadcast. Finally we consider two extreme cases of user distribution across caches for the multi-level popularity model: a single user per cache (single-user setup) versus a large number of users per cache (multi-user setup), and demonstrate a dichotomy in the order-optimal strategies for these two extreme cases.

Proceedings ArticleDOI
11 Aug 2014
TL;DR: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory, formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented.
Abstract: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory. WIN provides content-level selective offloading by delivering high data rate contents stored in its cache memory to the users through a broadband connection. The goal of the WIN central controller (CC) is to store the most popular content in the cache memory of the WIN such that the maximum amount of data can be fetched directly from the cache rather than being downloaded from the core network. If the popularity profile of the available set of contents is known in advance, the optimization of the cache content reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known, and only the instantaneous demands for those contents stored in the cache can be observed. Hence, the cache content placement is optimised based on the demand history, and on the cost associated to placing each content in the cache. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while at the same time exploiting the limited cache capacity in the best way possible. This problem is formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented. The performance of the algorithm is measured in terms of regret, which is proven to be logarithmic and sub-linear uniformly over time for a specific and a general case, respectively.
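The switching-cost handling and regret analysis are not reproduced here; a generic UCB1 sketch of the bandit view, treating each content as an arm and the cache slots as parallel plays, with synthetic demands:

```python
import math
import random

def ucb_select(num_contents, cache_size, plays, rewards, t):
    """Cache the contents with the highest UCB index: empirical popularity
    plus an exploration bonus that shrinks as a content is observed more."""
    def index(c):
        if plays[c] == 0:
            return float("inf")              # force initial exploration
        return rewards[c] / plays[c] + math.sqrt(2 * math.log(t) / plays[c])
    return sorted(range(num_contents), key=index, reverse=True)[:cache_size]

N, K = 50, 5
plays, rewards = [0] * N, [0.0] * N
for t in range(1, 2001):
    cache = ucb_select(N, K, plays, rewards, t)
    demand = min(int(random.paretovariate(1.1)), N) - 1    # unknown skewed profile
    for c in cache:                                        # observe cached contents only
        plays[c] += 1
        rewards[c] += 1.0 if c == demand else 0.0
print("learned top contents:", ucb_select(N, K, plays, rewards, 2001))
```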

Proceedings ArticleDOI
19 Mar 2014
TL;DR: By adaptively controlling the rate at which the client downloads video segments from the cache, the approach can reduce bitrate oscillations, prevent sudden rate changes, provide traffic savings, and improve the quality of experience of clients.
Abstract: Video streaming is a major source of Internet traffic today and usage continues to grow at a rapid rate. To cope with this new and massive source of traffic, ISPs use methods such as caching to reduce the amount of traffic traversing their networks and serve customers better. However, the presence of a standard cache server in the video transfer path may result in bitrate oscillations and sudden rate changes for Dynamic Adaptive Streaming over HTTP (DASH) clients. In this paper, we investigate the interactions between a client and a cache that result in these problems, and propose an approach to solve it. By adaptively controlling the rate at which the client downloads video segments from the cache, we can ensure that clients will get smooth video. We verify our results using simulation and show that compared to a standard cache our approach (1) can reduce bitrate oscillations, (2) prevents sudden rate changes, and compared to a no-cache scenario (3) provides traffic savings, and (4) improves the quality of experience of clients.

Proceedings ArticleDOI
08 Jul 2014
TL;DR: This paper solves the cache deployment optimization (CaDeOp) problem of determining how much server, energy, and bandwidth resources to provision in each cache AS, i.e., each AS chosen for cache deployment, as a mixed integer program (MIP).
Abstract: Content delivery networks (CDNs) deploy globally distributed systems of caches in a large number of autonomous systems (ASes). It is important for a CDN operator to satisfy the performance requirements of end users, while minimizing the cache deployment cost. In this paper, we study the cache deployment optimization (CaDeOp) problem of determining how much server, energy, and bandwidth resources to provision in each cache AS, i.e., each AS chosen for cache deployment. The CaDeOp objective is to minimize the total cost incurred by the CDN, subject to meeting the end-user performance requirements. We formulate the CaDeOp problem as a mixed integer program (MIP) and solve it for realistic AS-level topologies, traffic demands, and non-linear energy and bandwidth costs. We also evaluate the sensitivity of the results to our parametric assumptions. When the end-user performance requirements become more stringent, the CDN footprint rapidly expands, requiring cache deployments in additional ASes and geographical regions. Also, the CDN cost increases several times, with the cost balance shifting toward bandwidth and energy costs. On the other hand, the traffic distribution among the cache ASes stays relatively even, with the top 20% of the cache ASes serving around 30% of the overall traffic.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper presents a new SSD prototype called DuraSSD equipped with tantalum capacitors, and it is the first time that a flash memory SSD with durable cache has been used to achieve an order of magnitude improvement in transaction throughput without compromising the atomicity and durability.
Abstract: In order to meet the stringent requirements of low latency as well as high throughput, web service providers with large data centers have been replacing magnetic disk drives with flash memory solid-state drives (SSDs). They commonly use relational and NoSQL database engines to manage OLTP workloads in the warehouse-scale computing environments. These modern database engines rely heavily on redundant writes and frequent cache flushes to guarantee the atomicity and durability of transactional updates. This has become a serious bottleneck of performance in both relational and NoSQL database engines. This paper presents a new SSD prototype called DuraSSD equipped with tantalum capacitors. The tantalum capacitors make the device cache inside DuraSSD durable, and additional firmware features of DuraSSD take advantage of the durable cache to support the atomicity and durability of page writes. It is the first time that a flash memory SSD with durable cache has been used to achieve an order of magnitude improvement in transaction throughput without compromising the atomicity and durability. Considering that the simple capacitors increase the total cost of an SSD no more than one percent, DuraSSD clearly provides a cost-effective means for transactional support. DuraSSD is also expected to alleviate the problem of high tail latency by minimizing write stalls.

Proceedings ArticleDOI
01 Feb 2014
TL;DR: A Read-Write Partitioning (RWP) policy is proposed that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests.
Abstract: Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.
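A simplified sketch of the partitioning idea: lines are tracked as clean or dirty, and the eviction victim comes from whichever partition exceeds its target share (the paper's dynamic target adjustment and the RRP predictor are not modeled; the target fraction is an illustrative constant):

```python
from collections import OrderedDict

class ReadWritePartitionedCache:
    """Toy read-write partitioned cache with LRU order inside each partition."""
    def __init__(self, capacity, dirty_target_fraction=0.25):
        self.capacity = capacity
        self.dirty_target = int(capacity * dirty_target_fraction)
        self.clean, self.dirty = OrderedDict(), OrderedDict()

    def access(self, addr, is_write):
        for part in (self.clean, self.dirty):
            part.pop(addr, None)                          # remove old copy, if any
        (self.dirty if is_write else self.clean)[addr] = None   # insert at MRU
        if len(self.clean) + len(self.dirty) > self.capacity:
            victim_part = (self.dirty if len(self.dirty) > self.dirty_target
                           else self.clean)
            victim_part.popitem(last=False)               # evict that partition's LRU line

cache = ReadWritePartitionedCache(capacity=8)
for i in range(12):
    cache.access(i, is_write=(i % 3 == 0))
print(len(cache.clean), "clean lines,", len(cache.dirty), "dirty lines")
```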

Patent
30 Apr 2014
TL;DR: In this article, the authors present a method, a system and a server of removing a distributed caching object from a cache server by comparing an active period of a located cache server with an expiration period associated with an object, thus saving the other cache servers from wasting resources to perform removal operations.
Abstract: The present disclosure discloses a method, a system and a server of removing a distributed caching object. In one embodiment, the method receives a removal request, where the removal request includes an identifier of an object. The method may further apply consistent Hashing to the identifier of the object to obtain a Hash result value of the identifier, locates a corresponding cache server based on the Hash result value and renders the corresponding cache server to be a present cache server. In some embodiments, the method determines whether the present cache server is in an active status and has an active period greater than an expiration period associated with the object. Additionally, in response to determining that the present cache server is in an active status and has an active period greater than the expiration period associated with the object, the method removes the object from the present cache server. By comparing an active period of a located cache server with an expiration period associated with an object, the exemplary embodiments precisely locate a cache server that includes the object to be removed and perform a removal operation, thus saving the other cache servers from wasting resources to perform removal operations and hence improving the overall performance of the distributed cache system.
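A minimal consistent-hashing sketch of the locate-then-remove flow described above; virtual nodes and the active-period check are omitted, and the server names are made up:

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Map object identifiers to cache servers placed on a hash ring."""
    def __init__(self, servers):
        self.ring = sorted((h(s), s) for s in servers)
        self.points = [p for p, _ in self.ring]

    def locate(self, object_id):
        """The responsible server is the first one clockwise from the key."""
        i = bisect.bisect(self.points, h(object_id)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
caches = {s: set() for s in ("cache-1", "cache-2", "cache-3")}
caches[ring.locate("obj-42")].add("obj-42")       # the object lives on one server

def remove_object(object_id):
    """Send the removal only to the server consistent hashing points at."""
    server = ring.locate(object_id)
    caches[server].discard(object_id)             # other servers stay untouched
    return server

print("removed obj-42 from", remove_object("obj-42"))
```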

Journal ArticleDOI
14 Jun 2014
TL;DR: STAG is proposed, a high density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM), which inherently offers unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire.
Abstract: General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge that the bits must be sequentially accessed by performing "shift" operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).

Proceedings ArticleDOI
27 Oct 2014
TL;DR: This work investigates the principal causes of load imbalance, including data co-location, non-ideal hashing scenarios, and hot-spot temporal effects, and employs trace-driven analytics to study the benefits and limitations of current load-balancing methods.
Abstract: Modern Web services rely extensively upon a tier of in-memory caches to reduce request latencies and alleviate load on backend servers. Within a given cache, items are typically partitioned across cache servers via consistent hashing, with the goal of balancing the number of items maintained by each cache server. Effects of consistent hashing vary by associated hashing function and partitioning ratio. Most real-world workloads are also skewed, with some items significantly more popular than others. Inefficiency in addressing both issues can create an imbalance in cache-server loads. We analyze the degree of observed load imbalance, focusing on read-only traffic against Facebook's graph cache tier in TAO. We investigate the principal causes of load imbalance, including data co-location, non-ideal hashing scenarios, and hot-spot temporal effects. We also employ trace-driven analytics to study the benefits and limitations of current load-balancing methods, suggesting areas for future research.

Proceedings ArticleDOI
13 Dec 2014
TL;DR: The Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance, is proposed using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account.
Abstract: Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC does this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.

Proceedings ArticleDOI
02 Dec 2014
TL;DR: This work bridges this disconnect with a trace-driven study using 196M video requests from over 16M users on a country-wide topology with 80K routers to evaluate which ICN caching strategies work well on video workloads and how ICN helps improve video-centric quality of experience (QoE).
Abstract: Even though a key driver for Information-Centric Networking (ICN) has been the rise in Internet video traffic, there has been surprisingly little work on analyzing the interplay between ICN and video: which ICN caching strategies work well on video workloads and how ICN helps improve video-centric quality of experience (QoE). In this work, we bridge this disconnect with a trace-driven study using 196M video requests from over 16M users on a country-wide topology with 80K routers. We evaluate a broad space of content replacement (e.g., LRU, LFU, FIFO) and content placement (e.g., leave a copy everywhere, probabilistic) strategies over a range of cache sizes. We highlight four key findings: (1) the best placement and replacement strategies depend on the cache size and vary across improvement metrics; that said, LFU+probabilistic caching [37] is a close-to-optimal strategy overall; (2) video workloads show considerable caching-related benefits (e.g., ~10% traffic reduction) only with very large cache sizes (≥ 100GB); (3) the improvement in video QoE is low (≤ 12%) if the content provider already has a substantial geographical presence; and (4) caches in the middle and the edge of the network, requests from highly populated regions and without content servers, and requests for popular content contribute most to the overall ICN-induced improvements in video QoE.
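A sketch of the close-to-optimal combination highlighted in finding (1): LFU replacement with probabilistic placement (cache an incoming object only with probability p); the capacity, p, and the synthetic request stream are illustrative:

```python
import random
from collections import Counter

class LFUProbCache:
    """LFU replacement plus probabilistic placement: on a miss the object is
    inserted only with probability p; eviction removes the least-frequently
    requested cached object."""
    def __init__(self, capacity, p=0.3):
        self.capacity, self.p = capacity, p
        self.freq = Counter()
        self.cached = set()

    def request(self, obj):
        self.freq[obj] += 1
        if obj in self.cached:
            return True
        if random.random() < self.p:                       # probabilistic placement
            if len(self.cached) >= self.capacity:
                victim = min(self.cached, key=self.freq.__getitem__)
                self.cached.remove(victim)                 # LFU eviction
            self.cached.add(obj)
        return False

cache = LFUProbCache(capacity=100, p=0.3)
requests = [min(int(random.paretovariate(1.1)), 5000) for _ in range(50_000)]
hits = sum(cache.request(r) for r in requests)
print(f"hit ratio: {hits / len(requests):.2f}")
```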

Proceedings ArticleDOI
10 Jun 2014
TL;DR: This work proposes cache deduplication that effectively increases last-level cache capacity and detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses.
Abstract: Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication that effectively increases last-level cache capacity. Rather than exploit specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
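A sketch of the deduplication idea: block addresses map to a fingerprint of their contents, and identical blocks share one stored copy (the paper's hardware structures and hash-collision handling are not modeled):

```python
import hashlib

class DedupCache:
    """Toy deduplicated cache: the tag side maps addresses to a content
    fingerprint; the data side stores a single copy per distinct content."""
    def __init__(self):
        self.addr_to_fp = {}      # address -> fingerprint of the block's data
        self.fp_to_data = {}      # fingerprint -> the single stored copy

    def insert(self, addr, data: bytes):
        fp = hashlib.sha256(data).digest()
        self.fp_to_data.setdefault(fp, data)    # stored once per distinct content
        self.addr_to_fp[addr] = fp

    def read(self, addr):
        return self.fp_to_data[self.addr_to_fp[addr]]

cache = DedupCache()
for addr in (0x1000, 0x2000, 0x3000):
    cache.insert(addr, b"\x00" * 64)            # zero-filled blocks are common
cache.insert(0x4000, b"unique block".ljust(64, b"\x00"))
print(len(cache.addr_to_fp), "addresses map to", len(cache.fp_to_data), "stored blocks")
```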

Patent
17 Mar 2014
TL;DR: In this article, a hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM, which is used as the main storage, providing lowest cost per unit of storage memory.
Abstract: A hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM. The rotational drives are used as the main storage, providing lowest cost per unit of storage memory. Flash memory is used as a higher-level cache for rotational drives. Methods for managing multiple levels of cache for this storage system are provided, having a very fast Level 1 cache which consists of volatile memory (SRAM or SDRAM), and a non-volatile Level 2 cache using an array of flash devices. It describes a method of distributing the data across the rotational drives to make caching more efficient. It also describes efficient techniques for flushing data from L1 cache and L2 cache to the rotational drives, taking advantage of concurrent flash device operations, concurrent rotational drive operations, and maximizing sequential access types in the rotational drives rather than random accesses which are relatively slower. Methods provided here may be extended for systems that have more than two cache levels.

Proceedings ArticleDOI
02 Jun 2014
TL;DR: A Hierarchical Adaptive Replacement Cache (H-ARC) policy is proposed that considers all four factors of a page's status: dirty, clean, recency, and frequency, and it is very challenging to design a policy that can also increase the cache hit ratio for better system performance.
Abstract: With the rapid development of new types of nonvolatile memory (NVM), one of these technologies may replace DRAM as the main memory in the near future. Some drawbacks of DRAM, such as data loss due to power failure or a system crash, can be remedied by NVM's non-volatile nature. In the meantime, solid state drives (SSDs) are becoming widely deployed as storage devices for faster random access speed compared with traditional hard disk drives (HDDs). For applications demanding higher reliability and better performance, using NVM as the main memory and SSDs as storage devices becomes a promising architecture. Although SSDs have better performance than HDDs, SSDs cannot support in-place updates (i.e., an erase operation has to be performed before a page can be updated) and suffer from a low endurance problem in that each unit will wear out after a certain number of erase operations. In an NVM based main memory, any updated pages called dirty pages can be kept longer without the urgent need to be flushed to SSDs. This difference opens an opportunity to design new cache policies that help extend the lifespan of SSDs by wisely choosing cache eviction victims to decrease storage write traffic. However, it is very challenging to design a policy that can also increase the cache hit ratio for better system performance. Most existing DRAM-based cache policies have mainly concentrated on the recency or frequency status of a page. On the other hand, most existing NVM-based cache policies have mainly focused on the dirty or clean status of a page. In this paper, by extending the concept of the Adaptive Replacement Cache (ARC), we propose a Hierarchical Adaptive Replacement Cache (H-ARC) policy that considers all four factors of a page's status: dirty, clean, recency, and frequency. Specifically, at the higher level, H-ARC adaptively splits the whole cache space into a dirty-page cache and a clean-page cache. At the lower level, inside the dirty-page cache and the clean-page cache, H-ARC splits them into a recency-page cache and a frequency-page cache separately. During the page eviction process, all parts of the cache will be balanced toward their desired sizes.