
Showing papers on "Smart Cache published in 2014"


Journal ArticleDOI
TL;DR: This paper proposes a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes, and argues that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content into memories at the end users. Conventionally, these memories are used to deliver requested content in part from a locally cached copy rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e., the memory available at each individual user). In this paper, we introduce and exploit a second, global, caching gain not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative memory available at all users), even though there is no cooperation among the users. To evaluate and isolate these two gains, we introduce an information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, we propose a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes. In particular, the improvement can be on the order of the number of users in the network. In addition, we argue that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
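
For orientation, the peak rate achieved by the proposed scheme for K users, N files, and a per-user cache of M files (at the points M = tN/K, t an integer) factors the two gains explicitly; the expression below is the standard form of this result, reproduced here as a reference rather than a derivation.

```latex
R(M) \;=\; \underbrace{K\Bigl(1-\tfrac{M}{N}\Bigr)}_{\text{rate with local caching gain only}}
\;\cdot\; \underbrace{\frac{1}{1+\tfrac{KM}{N}}}_{\text{global (coded) caching gain}}
```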

1,857 citations


Proceedings ArticleDOI
10 Jun 2014
TL;DR: In this article, the authors studied the optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity, where the cache content placement is optimized based on the demand history.
Abstract: Optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity is studied. The sBS has a large cache memory and provides content-level selective offloading by delivering high data rate contents to users in its coverage area. The goal of the sBS content controller (CC) is to store the most popular contents in the sBS cache memory such that the maximum amount of data can be fetched directly from the sBS, not relying on the limited backhaul resources during peak traffic periods. If the popularity profile is known in advance, the problem reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known by the CC, and it can only observe the instantaneous demand for the cached content. Hence, the cache content placement is optimised based on the demand history. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while exploiting the limited cache capacity in the best way possible. Three algorithms are studied for this cache content placement problem, leading to different exploitation-exploration trade-offs. We provide extensive numerical simulations in order to study the time-evolution of these algorithms, and the impact of the system parameters, such as the number of files, the number of users, the cache size, and the skewness of the popularity profile, on the performance. It is shown that the proposed algorithms quickly learn the popularity profile for a wide range of system parameters.
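
As a rough illustration of the exploitation-exploration idea described above (not the paper's specific algorithms), the sketch below refreshes a cache of C files at regular intervals using an epsilon-greedy rule over the demand counts observed for cached files only; file names, the popularity model, and all parameters are hypothetical.

```python
import random
from collections import Counter

def epsilon_greedy_refresh(catalog, cache_size, demand_counts, epsilon=0.1):
    """Pick the next cache contents: mostly the empirically most popular
    files (exploitation), with a few random files mixed in (exploration)."""
    ranked = sorted(catalog, key=lambda f: demand_counts[f], reverse=True)
    n_explore = sum(random.random() < epsilon for _ in range(cache_size))
    exploit = ranked[:cache_size - n_explore]
    rest = [f for f in catalog if f not in exploit]
    explore = random.sample(rest, min(n_explore, len(rest)))
    return set(exploit) | set(explore)

# Toy usage: skewed demands over a catalog of 50 files, cache of 5 files.
catalog = [f"file{i}" for i in range(50)]
counts = Counter()
cache = epsilon_greedy_refresh(catalog, 5, counts)
for t in range(10_000):
    req = catalog[min(int(random.paretovariate(1.2)) - 1, 49)]
    if req in cache:
        counts[req] += 1          # only demands for cached content are observable
    if t % 500 == 499:            # periodic cache refresh
        cache = epsilon_greedy_refresh(catalog, 5, counts)
```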

322 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A novel random fill cache architecture is proposed that replaces demand fetch with random cache fill within a configurable neighborhood window and shows that it provides information-theoretic security against reuse based attacks.
Abstract: Correctly functioning caches have been shown to leak critical secrets like encryption keys, through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side channel attacks: contention based attacks and reuse based attacks. Previous secure cache designs target only contention based attacks, and we show that they cannot defend against reuse based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse based attacks.
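
A minimal sketch of the random-fill idea as described in the abstract: on a demand miss the requested line is returned to the core but not cached; instead, a line chosen at random from a configurable neighborhood window around the request is filled. The window bounds and the fully-associative, random-replacement structure here are illustrative, not the paper's hardware design.

```python
import random

class RandomFillCache:
    """Toy fully-associative cache that replaces demand fetch with
    random fill within a window [addr - a, addr + b] (in line units)."""
    def __init__(self, n_lines, window_a=4, window_b=4):
        self.n_lines = n_lines
        self.lines = set()          # resident line addresses
        self.window_a = window_a
        self.window_b = window_b

    def access(self, line_addr):
        hit = line_addr in self.lines
        if not hit:
            # Demand data goes to the core without being cached. Instead,
            # fill a random neighbor to decorrelate cache contents from
            # the demand address stream.
            fill = max(0, line_addr + random.randint(-self.window_a, self.window_b))
            if fill not in self.lines:
                if len(self.lines) >= self.n_lines:
                    self.lines.remove(random.choice(tuple(self.lines)))
                self.lines.add(fill)
        return hit
```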

217 citations


Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors propose a unified methodology to analyse the performance of caches (both isolated and interconnected), by extending and generalizing a decoupling technique originally known as Che's approximation, which provides very accurate results at low computational cost.
Abstract: We propose a unified methodology to analyse the performance of caches (both isolated and interconnected), by extending and generalizing a decoupling technique originally known as Che's approximation, which provides very accurate results at low computational cost. We consider several caching policies, taking into account the effects of temporal locality. In the case of interconnected caches, our approach allows us to do better than the Poisson approximation commonly adopted in prior work. Our results, validated against simulations and trace-driven experiments, provide interesting insights into the performance of caching systems.
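
For concreteness, the basic single-cache form of Che's approximation that this work extends can be sketched as follows: solve for the characteristic time T_C of an LRU cache of capacity C under independent Poisson demands, then read off per-item hit probabilities. This is the textbook version, not the paper's generalized treatment of other policies and interconnected caches.

```python
import math

def che_approximation(lambdas, capacity, tol=1e-9):
    """Per-item hit probabilities for an LRU cache of `capacity` items
    under IRM Poisson arrivals with rates `lambdas`: find T_C such that
    sum_i (1 - exp(-lambda_i * T_C)) = capacity, then
    h_i = 1 - exp(-lambda_i * T_C)."""
    assert capacity < len(lambdas), "cache must be smaller than the catalog"

    def expected_occupancy(t):
        return sum(1.0 - math.exp(-lam * t) for lam in lambdas)

    lo, hi = 0.0, 1.0
    while expected_occupancy(hi) < capacity:   # bracket the root
        hi *= 2.0
    while hi - lo > tol * hi:                  # bisection on T_C
        mid = 0.5 * (lo + hi)
        if expected_occupancy(mid) < capacity:
            lo = mid
        else:
            hi = mid
    t_c = 0.5 * (lo + hi)
    return [1.0 - math.exp(-lam * t_c) for lam in lambdas]

# Example: Zipf(0.8) popularity over 1000 items, cache of 100 items.
N, alpha, C = 1000, 0.8, 100
weights = [1.0 / (i + 1) ** alpha for i in range(N)]
total = sum(weights)
hit_probs = che_approximation([w / total for w in weights], C)
overall_hit_rate = sum(w / total * h for w, h in zip(weights, hit_probs))
```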

187 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities and employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads.
Abstract: Recent research advocates large die-stacked DRAM caches in many core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.

162 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper proposes the memory request prioritization buffer (MRPB), a hardware structure that improves caching efficiency of massively parallel workloads by applying two prioritization methods (request reordering and cache bypassing) to memory requests before they access a cache.
Abstract: Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to provide smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves caching efficiency of massively parallel workloads by applying two prioritization methods—request reordering and cache bypassing—to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.
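
The two mechanisms named in this abstract can be mimicked in a few lines of software for intuition; the per-warp queueing granularity, the drain order, and the reuse-based bypass threshold below are illustrative assumptions, not the MRPB hardware design.

```python
from collections import deque, defaultdict

class RequestPrioritizationBuffer:
    """Toy model: buffer memory requests into per-warp FIFOs and drain one
    warp's requests back-to-back, so a warp's working set is reused before
    another warp's requests can evict it; bypass the cache for requests to
    lines that have shown no reuse so far."""
    def __init__(self, bypass_threshold=1):
        self.queues = defaultdict(deque)      # warp id -> pending line addresses
        self.reuse_count = defaultdict(int)   # line addr -> observed touches
        self.bypass_threshold = bypass_threshold

    def enqueue(self, warp_id, line_addr):
        self.queues[warp_id].append(line_addr)

    def drain(self):
        """Yield (line_addr, bypass) pairs in a cache-friendlier order."""
        for warp_id in sorted(self.queues, key=lambda w: len(self.queues[w])):
            q = self.queues[warp_id]
            while q:
                addr = q.popleft()
                bypass = self.reuse_count[addr] < self.bypass_threshold
                self.reuse_count[addr] += 1
                yield addr, bypass
```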

158 citations


Proceedings ArticleDOI
10 Jun 2014
TL;DR: It is proved that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request.
Abstract: We consider a basic content distribution scenario consisting of a single origin server connected through a shared bottleneck link to a number of users each equipped with a cache of finite memory. The users issue a sequence of content requests from a set of popular files, and the goal is to operate the caches as well as the server such that these requests are satisfied with the minimum number of bits sent over the shared link. Assuming a basic Markov model for renewing the set of popular files, we characterize approximately the optimal long-term average rate of the shared link. We further prove that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request. To support these theoretical results, we propose an online coded caching scheme termed coded least-recently sent (LRS) and simulate it for a demand time series derived from the dataset made available by Netflix for the Netflix Prize. For this time series, we show that the proposed coded LRS algorithm significantly outperforms the popular least-recently used (LRU) caching algorithm.

155 citations


Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache and show that caching only the most popular files can be highly suboptimal.
Abstract: We consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache. Knowing the popularity distribution of the files, the goal is to optimally populate the caches, such as to minimize the expected load of the shared link. For a single cache, it is well known that storing the most popular files is optimal in this setting. However, we show here that this is no longer the case for multiple caches. Indeed, caching only the most popular files can be highly suboptimal. Instead, a fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link. We propose such a coded caching scheme and prove that it is close to optimal.
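
The standard two-user instance makes the contrast concrete; the worked example below follows the usual presentation of the scheme (N = 2 files, K = 2 users, cache size M = 1 file each).

```latex
% Placement: split each file into halves, A = (A_1, A_2), B = (B_1, B_2);
% user 1 caches (A_1, B_1), user 2 caches (A_2, B_2)  (M = 1 file each).
% Delivery: if user 1 requests A and user 2 requests B, the server broadcasts
% the single coded message
%        X = A_2 \oplus B_1   \quad (\text{rate } 1/2) ;
% user 1 cancels its cached B_1 to recover A_2 (it already holds A_1), and
% user 2 cancels its cached A_2 to recover B_1 (it already holds B_2).
% Caching only the most popular file instead needs worst-case rate 1.
```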

145 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A specialized cache management policy for GPGPUs is proposed, in which a reuse-distance-based bypass policy is coordinated with warp throttling to dynamically control the number of active warps, together with a simple predictor that dynamically estimates the optimal number of active warps able to take full advantage of the cache space and on-chip resources.
Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch between the throughput-oriented execution model and the cache hierarchy design, which limits system performance and energy efficiency. The massive number of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by a bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the number of active warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.

142 citations


Proceedings ArticleDOI
11 Aug 2014
TL;DR: In this paper, a hierarchical content delivery network with two layers of caches is considered, and a new caching scheme that combines two basic approaches is proposed to provide coded multicasting opportunities within each layer and across multiple layers.
Abstract: Caching of popular content during off-peak hours is a strategy to reduce network loads during peak hours. Recent work has shown significant benefits of designing such caching strategies not only to deliver part of the content locally, but also to provide coded multicasting opportunities even among users with different demands. Exploiting both of these gains was shown to be approximately optimal for caching systems with a single layer of caches. Motivated by practical scenarios, we consider, in this paper, a hierarchical content delivery network with two layers of caches. We propose a new caching scheme that combines two basic approaches. The first approach provides coded multicasting opportunities within each layer; the second approach provides coded multicasting opportunities across multiple layers. By striking the right balance between these two approaches, we show that the proposed scheme achieves the optimal communication rates to within a constant multiplicative and additive gap. We further show that there is no tension between the rates in each of the two layers up to the aforementioned gap. Thus, both layers can simultaneously operate at approximately the minimum rate.

118 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: This work extends reuse distance to GPUs by modelling the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, including conditional and non-uniform latencies, cache associativity, miss-status holding-registers, and warp divergence.
Abstract: As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
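
For readers unfamiliar with the underlying metric, the sequential reuse (stack) distance that this model extends can be computed as below; the GPU-specific extensions listed in the abstract (warps, MSHRs, divergence, associativity) are deliberately left out.

```python
def reuse_distances(trace):
    """For each access, return the number of distinct addresses touched
    since the previous access to the same address (inf on first use).
    A fully-associative LRU cache of size S hits iff the distance < S."""
    last_seen = {}                # address -> index of its previous access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            window = trace[last_seen[addr] + 1:i]
            distances.append(len(set(window)))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances

# Example trace and its distances.
print(reuse_distances(["a", "b", "c", "a", "b", "b"]))  # [inf, inf, inf, 2, 2, 0]
```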

Journal ArticleDOI
TL;DR: A Time-To-Live (TTL) based caching model, that assigns a timer to each content stored in the cache and redraws it every time the content is requested (at each hit/miss), is introduced.
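
A minimal simulation of the TTL model sketched in this TL;DR, assuming Poisson arrivals and exponential timers purely for illustration: each content's timer is redrawn at every request to it (hit or miss), and the content is considered evicted once its timer expires.

```python
import random

def simulate_ttl_cache(requests, ttl_draw):
    """requests: iterable of (time, content) pairs in increasing time order.
    ttl_draw(content): returns a fresh timer value for that content.
    The timer is redrawn at every request to the content (hit or miss)."""
    expiry = {}                  # content -> time at which it leaves the cache
    hits = misses = 0
    for t, c in requests:
        if c in expiry and expiry[c] > t:
            hits += 1
        else:
            misses += 1
        expiry[c] = t + ttl_draw(c)   # reset the timer on every request
    return hits, misses

# Toy usage: Poisson arrivals over 10 contents, exponential TTLs with mean 5.
reqs, t = [], 0.0
for _ in range(2000):
    t += random.expovariate(1.0)
    reqs.append((t, random.randint(0, 9)))
print(simulate_ttl_cache(reqs, lambda c: random.expovariate(1 / 5)))
```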

Proceedings ArticleDOI
24 Aug 2014
TL;DR: It is established that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance; a small SRAM tag cache with a simple technique to throttle the number of sets prefetched can satisfy over 60% of DRAM cache tag accesses on average.
Abstract: 3D-stacking technology has enabled the option of embedding a large DRAM onto the processor. Prior works have proposed to use this as a DRAM cache. Because of its large size (a DRAM cache can be in the order of hundreds of megabytes), the total size of the tags associated with it can also be quite large (in the order of tens of megabytes). The large size of the tags has created a problem. Should we maintain the tags in the DRAM and incur a costly tag access in the critical path? Or should we maintain the tags in the faster SRAM by paying the area cost of a large SRAM for this purpose? Prior works have primarily chosen the former and proposed a variety of techniques for reducing the cost of a DRAM tag access. In this paper, we first establish (with the help of a study) that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance. Motivated by this study, we ask if it is possible to maintain tags in SRAM without incurring high area overhead. Our key idea is simple. We propose to cache the tags in a small SRAM tag cache - we show that there is enough spatial and temporal locality amongst tag accesses to merit this idea. We propose the ATCache which is a small SRAM tag cache. Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In order to avoid the high miss latency and cache pollution caused by excessive prefetching, we use a simple technique to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.
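
As a software analogy of the idea, a toy tag cache keyed by DRAM-cache set index is sketched below; the prefetch direction, the fixed prefetch degree, and the absence of a real throttling mechanism are simplifying assumptions rather than the paper's design.

```python
from collections import OrderedDict

class TagCache:
    """Toy SRAM tag cache: holds tag metadata for a small number of
    DRAM-cache sets; on a miss it fetches the requested set's tags and
    prefetches the tags of a few following sets."""
    def __init__(self, capacity_sets, prefetch_degree=2):
        self.entries = OrderedDict()          # set index -> tags, LRU order
        self.capacity = capacity_sets
        self.prefetch_degree = prefetch_degree

    def lookup(self, set_index, read_set_tags):
        if set_index in self.entries:
            self.entries.move_to_end(set_index)
            return self.entries[set_index], True      # SRAM tag hit
        # Tag miss: prefetch neighboring sets' tags, then fetch this set's,
        # so the requested set ends up most recently used.
        for s in range(set_index + self.prefetch_degree, set_index - 1, -1):
            if s not in self.entries:
                self.entries[s] = read_set_tags(s)
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)  # evict the LRU set
            else:
                self.entries.move_to_end(s)
        return self.entries[set_index], False
```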

Journal ArticleDOI
TL;DR: This paper advances the state-of-the-art analysis of TTL-based cache networks by developing two exact methods with orthogonal generality and computational complexity.

Proceedings ArticleDOI
24 Sep 2014
TL;DR: Gains can be obtained provided that ideal Nearest Replica Routing forwarding and Leave a Copy Down meta-caching are jointly in use, and two alternative implementations that arbitrarily closely approximate iNRR behavior are provided.
Abstract: A recent debate revolves around the usefulness of pervasive caching, i.e., adding caching capabilities to possibly every router of the future Internet. Recent research argues against it, on the ground that it provides only very limited gain with respect to the current CDN scenario, where caching only happens at the network edge. In this paper, we instead show that advantages of ubiquitous caching appear only when meta-caching (i.e., whether or not to cache the incoming object) and forwarding (i.e., where to direct requests in case of a cache miss) decisions are tightly coupled. Summarizing our contributions, we (i) show that gains can be obtained provided that ideal Nearest Replica Routing (iNRR) forwarding and Leave a Copy Down (LCD) meta-caching are jointly in use, (ii) model the iNRR forwarding policy, (iii) provide two alternative implementations that arbitrarily closely approximate iNRR behavior, and (iv) promote cross-comparison by making our code available to the community.
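
The Leave a Copy Down rule referenced above is compact enough to state directly; the sketch below covers only the meta-caching decision (where to leave a copy), not the iNRR forwarding side or any eviction policy.

```python
def serve_request(path_to_hit, obj, caches):
    """path_to_hit: node ids from the requester up to (and including) the
    node where `obj` was found (the origin server if no cache hit).
    caches: dict mapping node id -> set of cached objects.
    LCD meta-caching: only the node one hop below the hit keeps a copy."""
    hit_node = path_to_hit[-1]
    if len(path_to_hit) >= 2:
        downstream = path_to_hit[-2]     # one hop toward the requester
        caches[downstream].add(obj)      # (eviction policy omitted here)
    return hit_node
```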

Proceedings ArticleDOI
11 Aug 2014
TL;DR: The main contribution of this work is the derivation of an information-theoretic outer bound for the multi-level setup and the demonstration that, under some natural regularity conditions, a memory-sharing scheme, which operates each level in isolation according to a single-level coded caching scheme, is in fact order-optimal with respect to this outer bound.
Abstract: Recent work has demonstrated that for content caching, joint design of storage and delivery can yield significant benefits over conventional caching approaches. This is based on storing content in the caches, so as to create coded-multicast opportunities even amongst users with different demands. Such a coded-caching scheme has been shown to be order-optimal for a caching system with single-level content, i.e., all content is uniformly popular. In this work, we consider a system with content divided into multiple levels, based on varying degrees of popularity. The main contribution of this work is the derivation of an information-theoretic outer bound for the multi-level setup, and the demonstration that under some natural regularity conditions, a memory-sharing scheme which operates each level in isolation according to a single-level coded caching scheme, is in fact order-optimal with respect to this outer bound.

Posted Content
TL;DR: In this article, the authors derived an order-optimal scheme which judiciously shares cache memory among files with different popularities, and derived new information-theoretic lower bounds, which use a sliding-window entropy inequality.
Abstract: To address the exponentially rising demand for wireless content, use of caching is emerging as a potential solution. It has been recently established that joint design of content delivery and storage (coded caching) can significantly improve performance over conventional caching. Coded caching is well suited to emerging heterogeneous wireless architectures which consist of a dense deployment of local-coverage wireless access points (APs) with high data rates, along with sparsely-distributed, large-coverage macro-cell base stations (BS). This enables design of coded caching-and-delivery schemes that equip APs with storage, and place content in them in a way that creates coded-multicast opportunities for combining with macro-cell broadcast to satisfy users even with different demands. Such coded-caching schemes have been shown to be order-optimal with respect to the BS transmission rate, for a system with single-level content, i.e., one where all content is uniformly popular. In this work, we consider a system with non-uniform popularity content which is divided into multiple levels, based on varying degrees of popularity. The main contribution of this work is the derivation of an order-optimal scheme which judiciously shares cache memory among files with different popularities. To show order-optimality we derive new information-theoretic lower bounds, which use a sliding-window entropy inequality, effectively creating a non-cutset bound. We also extend the ideas to when users can access multiple caches along with the broadcast. Finally we consider two extreme cases of user distribution across caches for the multi-level popularity model: a single user per cache (single-user setup) versus a large number of users per cache (multi-user setup), and demonstrate a dichotomy in the order-optimal strategies for these two extreme cases.

Proceedings ArticleDOI
24 Sep 2014
TL;DR: This work investigates network cache management and search policies that account for path-level (content-server to content-requestor) congestion and file popularity in order to directly minimize user-centric, content-download delay.
Abstract: The performance of in-network caching in information-centric networks, and of cache networks more generally, is typically characterized by network-centric performance metrics such as hit rate and hop count, with approaches to locating and caching content evaluated and optimized for these metrics. We believe that user-centric performance metrics, in particular the delay from when a content request is made by the user to the time at which the requested content has been completely downloaded, are also important. For such metrics, performance is often determined by link capacity constraints and network congestion. We investigate network cache management and search policies that account for path-level (content-server to content-requestor) congestion and file popularity in order to directly minimize user-centric, content-download delay. Through simulation, we find that our policies yield significantly better download delay performance than existing policies, even though these existing policies provide better performance according to traditional metrics such as cache hit rate and hop count.

Proceedings ArticleDOI
11 Aug 2014
TL;DR: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory, formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented.
Abstract: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory. WIN provides content-level selective offloading by delivering high data rate contents stored in its cache memory to the users through a broadband connection. The goal of the WIN central controller (CC) is to store the most popular content in the cache memory of the WIN such that the maximum amount of data can be fetched directly from the cache rather than being downloaded from the core network. If the popularity profile of the available set of contents is known in advance, the optimization of the cache content reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known, and only the instantaneous demands for those contents stored in the cache can be observed. Hence, the cache content placement is optimised based on the demand history, and on the cost associated with placing each content in the cache. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while at the same time exploiting the limited cache capacity in the best way possible. This problem is formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented. The performance of the algorithm is measured in terms of regret, which is proven to be logarithmic and sub-linear uniformly over time for a specific and a general case, respectively.

Proceedings ArticleDOI
19 Mar 2014
TL;DR: By adaptively controlling the rate at which the client downloads video segments from the cache, the approach can reduce bitrate oscillations, prevent sudden rate changes, provide traffic savings, and improve the quality of experience of clients.
Abstract: Video streaming is a major source of Internet traffic today and usage continues to grow at a rapid rate. To cope with this new and massive source of traffic, ISPs use methods such as caching to reduce the amount of traffic traversing their networks and serve customers better. However, the presence of a standard cache server in the video transfer path may result in bitrate oscillations and sudden rate changes for Dynamic Adaptive Streaming over HTTP (DASH) clients. In this paper, we investigate the interactions between a client and a cache that result in these problems, and propose an approach to solve them. By adaptively controlling the rate at which the client downloads video segments from the cache, we can ensure that clients will get smooth video. We verify our results using simulation and show that, compared to a standard cache, our approach (1) reduces bitrate oscillations and (2) prevents sudden rate changes, and, compared to a no-cache scenario, (3) provides traffic savings and (4) improves the quality of experience of clients.

Proceedings ArticleDOI
08 Jul 2014
TL;DR: This paper solves the cache deployment optimization (CaDeOp) problem of determining how much server, energy, and bandwidth resources to provision in each cache AS, i.e., each AS chosen for cache deployment, as a mixed integer program (MIP).
Abstract: Content delivery networks (CDNs) deploy globally distributed systems of caches in a large number of autonomous systems (ASes). It is important for a CDN operator to satisfy the performance requirements of end users, while minimizing the cache deployment cost. In this paper, we study the cache deployment optimization (CaDeOp) problem of determining how much server, energy, and bandwidth resources to provision in each cache AS, i.e., each AS chosen for cache deployment. The CaDeOp objective is to minimize the total cost incurred by the CDN, subject to meeting the end-user performance requirements. We formulate the CaDeOp problem as a mixed integer program (MIP) and solve it for realistic AS-level topologies, traffic demands, and non-linear energy and bandwidth costs. We also evaluate the sensitivity of the results to our parametric assumptions. When the end-user performance requirements become more stringent, the CDN footprint rapidly expands, requiring cache deployments in additional ASes and geographical regions. Also, the CDN cost increases several times, with the cost balance shifting toward bandwidth and energy costs. On the other hand, the traffic distribution among the cache ASes stays relatively even, with the top 20% of the cache ASes serving around 30% of the overall traffic.
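
To make the shape of such a formulation concrete, a schematic and deliberately simplified MIP in invented notation is given below: x_a marks whether AS a hosts a cache, s_a and b_a are the provisioned servers and bandwidth there, y_{ua} is the traffic of user group u served from AS a, d_u its demand, l_{ua} the user-to-AS latency, L the latency requirement, and mu the per-server capacity. The paper's actual formulation additionally handles non-linear energy and bandwidth costs.

```latex
\begin{aligned}
\min_{x,\,s,\,b,\,y}\quad & \sum_{a}\Bigl(C^{\mathrm{srv}}_a\,s_a + C^{\mathrm{eng}}_a(s_a) + C^{\mathrm{bw}}_a(b_a)\Bigr)\\
\text{s.t.}\quad & \sum_{a} y_{ua} = d_u \quad \forall u,
&& \sum_{u} y_{ua} \le \mu\, s_a \quad \forall a,\\
& \sum_{u} y_{ua} \le b_a \quad \forall a,
&& y_{ua} \le d_u\, x_a \quad \forall u,a,\\
& y_{ua} = 0 \ \text{ whenever } \ell_{ua} > L,
&& x_a \in \{0,1\},\quad s_a,\, b_a,\, y_{ua} \ge 0 .
\end{aligned}
```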

Proceedings ArticleDOI
08 Jul 2014
TL;DR: This paper proposes a distributed caching strategy along the data delivery path, called MAGIC (MAx-Gain In-network Caching), which aims to reduce bandwidth consumption by jointly considering the content popularity and hop reduction and takes the cache replacement penalty into account when making cache placement decisions.
Abstract: Information centric networks (ICNs) allow content objects to be cached within the network, so as to provide efficient data delivery. Existing works on in-network caches mainly focus on minimizing the redundancy of caches to improve the cache hit ratio, which may not lead to significant bandwidth saving. On the other hand, it could result in too frequent caching operations, i.e., cache placement and replacement, causing more power consumption at nodes, which shall be avoided in energy-limited data delivery environments, e.g., wireless networks. In this paper, we propose a distributed caching strategy along the data delivery path, called MAGIC (MAx-Gain In-network Caching). MAGIC aims to reduce bandwidth consumption by jointly considering the content popularity and hop reduction. We also take the cache replacement penalty into account when making cache placement decisions to reduce the number of caching operations. We compare our caching strategy with several state-of-the-art caching strategies in ICNs. Our results show that the MAGIC strategy can reduce up to 34.50% bandwidth consumption, reduce the server hit ratio by up to 17.91%, and reduce up to 38.84% caching operations compared with the existing best caching strategy when the cache size is small, which is a significant improvement in wireless networks with limited cache size at each wireless node.
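
The placement decision can be sketched as a per-path scoring rule: the expected benefit of caching at a node is approximated by the object's popularity times the hops saved, minus a penalty for what must be evicted there. The exact scoring terms and units here are illustrative simplifications of the gain defined in the paper.

```python
def choose_cache_node(path, obj, popularity, eviction_penalty):
    """path: node ids from the content server (index 0) to the requester.
    Returns the node where `obj` should be cached, or None if no node
    yields a positive gain.
    Gain ~ popularity(obj) * hops saved on future requests
           - penalty for replacing the node's current occupant."""
    best_node, best_gain = None, 0.0
    for i, node in enumerate(path):
        hop_reduction = i                      # hops no longer traversed
        gain = popularity(obj) * hop_reduction - eviction_penalty(node, obj)
        if gain > best_gain:
            best_node, best_gain = node, gain
    return best_node
```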

Proceedings ArticleDOI
01 Feb 2014
TL;DR: A Read-Write Partitioning (RWP) policy is proposed that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests.
Abstract: Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.
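
A drastically simplified per-set model of the partitioning idea: each line is labelled clean or dirty, the dirty partition has a target size, and the replacement victim comes from whichever partition is over its target. How the target is adapted at runtime (the paper predicts which partition is more likely to receive future reads) is left as a fixed parameter here.

```python
from collections import OrderedDict

class RWPSet:
    """One cache set managed with a clean/dirty partition target."""
    def __init__(self, ways, dirty_target):
        self.ways = ways
        self.dirty_target = dirty_target     # desired number of dirty lines
        self.lines = OrderedDict()           # addr -> 'clean'/'dirty', LRU order

    def access(self, addr, is_write):
        state = 'dirty' if is_write or self.lines.get(addr) == 'dirty' else 'clean'
        if addr in self.lines:
            del self.lines[addr]
        elif len(self.lines) >= self.ways:
            self._evict()
        self.lines[addr] = state             # (re)insert as MRU

    def _evict(self):
        n_dirty = sum(1 for s in self.lines.values() if s == 'dirty')
        # Evict the LRU line of the partition that exceeds its target size.
        victim_state = 'dirty' if n_dirty > self.dirty_target else 'clean'
        for addr, state in self.lines.items():   # iterates LRU -> MRU
            if state == victim_state:
                del self.lines[addr]
                return
        self.lines.popitem(last=False)           # fallback: plain LRU
```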

Patent
30 Sep 2014
TL;DR: Methods and systems for distributed caching of information using extended caching optimization are presented, in which a mobile device operating in a wireless network monitors requests issued from an application within the device to an external entity, stores responses to the monitored requests in a local cache, and, upon identifying a request that meets a first criterion for optimization, applies an extended caching optimization that prevents the identified request from being transmitted to the external entity and serves the response from the local cache.
Abstract: Methods and systems for distributed caching of information using extended caching optimization are provided. According to one aspect, a method for distributed caching of information using extended caching optimization includes, at a mobile device for operating in a wireless network, monitoring requests issued from an application located within the device to an external entity not located within the device; storing, in a local cache, responses to the monitored requests received from the external entity; and, in response to identifying a request as one that meets a first criterion for optimization, applying an extended caching optimization, including preventing the identified request from being transmitted to the external entity and providing a response to the identified request from the local cache.

Proceedings ArticleDOI
06 Apr 2014
TL;DR: In this paper, the authors proposed and analyzed a novel caching approach that can achieve significantly lower traffic compared to the traditional caching schemes, taking into account the fact that an operator can serve the requests for the same file that happen at nearby times via a single multicast transmission.
Abstract: The deployment of small cells is expected to gain huge momentum in the near future, as a solution for managing the skyrocketing mobile data demand growth. Local caching of popular files at the small cell base stations has been recently proposed, aiming at reducing the traffic incurred when transferring the requested content from the core network to the users. In this paper, we propose and analyze a novel caching approach that can achieve significantly lower traffic compared to the traditional caching schemes. Our cache design policy carefully takes into account the fact that an operator can serve the requests for the same file that happen at nearby times via a single multicast transmission. The latter incurs less traffic as the requested file is transmitted to the users only once, rather than with many unicast transmissions. Systematic experiments demonstrate the effectiveness of our approach, as compared to the existing caching schemes.
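
The traffic saving from multicast aggregation can be illustrated with a small counting sketch: requests for the same uncached file arriving within a batching window are assumed to be served by a single multicast transmission, so backhaul traffic counts distinct (file, window) pairs rather than individual requests. The fixed-window batching below is a simplification of the policy analyzed in the paper.

```python
def multicast_transmissions(requests, window, cached):
    """requests: list of (time, file) pairs; window: batching interval W;
    cached: set of files served locally (no backhaul traffic needed).
    Returns the number of backhaul (multicast) transmissions required."""
    slots = set()
    for t, f in requests:
        if f in cached:
            continue
        slots.add((f, int(t // window)))   # one transmission per file per window
    return len(slots)

# Example: three requests for 'x' inside one 10 s window need 1 transmission.
print(multicast_transmissions([(1, 'x'), (4, 'x'), (9, 'x'), (12, 'y')],
                              window=10, cached=set()))   # -> 2
```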

Proceedings ArticleDOI
13 Dec 2014
TL;DR: The Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance, is proposed using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account.
Abstract: Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC does this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: This work proposes cache deduplication, which detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses, effectively increasing last-level cache capacity.
Abstract: Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication that effectively increases last-level cache capacity. Rather than exploit specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
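
The bookkeeping behind deduplication can be sketched as a content-signature index with reference counts, so that many block addresses share one stored copy; the hashing scheme, the replacement policy, and the hardware address-translation details of the actual design are omitted or simplified here.

```python
import hashlib

class DedupCache:
    """Toy deduplicated cache: many addresses can map to one stored copy."""
    def __init__(self):
        self.addr_to_sig = {}     # block address -> content signature
        self.store = {}           # signature -> (data, reference count)

    def insert(self, addr, data):
        """data: bytes of one cache block."""
        sig = hashlib.sha1(data).digest()
        self.evict(addr)                          # drop any stale mapping
        payload, refs = self.store.get(sig, (data, 0))
        self.store[sig] = (payload, refs + 1)     # share the single copy
        self.addr_to_sig[addr] = sig

    def read(self, addr):
        sig = self.addr_to_sig.get(addr)
        return None if sig is None else self.store[sig][0]

    def evict(self, addr):
        sig = self.addr_to_sig.pop(addr, None)
        if sig is not None:
            data, refs = self.store[sig]
            if refs <= 1:
                del self.store[sig]               # last reference gone
            else:
                self.store[sig] = (data, refs - 1)
```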

Patent
17 Mar 2014
TL;DR: In this article, a hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM, which is used as the main storage, providing lowest cost per unit of storage memory.
Abstract: A hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM. The rotational drives are used as the main storage, providing the lowest cost per unit of storage memory. Flash memory is used as a higher-level cache for the rotational drives. Methods for managing multiple levels of cache for this storage system are provided, with a very fast Level 1 cache consisting of volatile memory (SRAM or SDRAM) and a non-volatile Level 2 cache using an array of flash devices. A method of distributing the data across the rotational drives is described to make caching more efficient. Efficient techniques are also described for flushing data from the L1 cache and L2 cache to the rotational drives, taking advantage of concurrent flash device operations and concurrent rotational drive operations, and maximizing sequential accesses in the rotational drives rather than random accesses, which are relatively slower. The methods provided here may be extended for systems that have more than two cache levels.

Proceedings ArticleDOI
02 Jun 2014
TL;DR: A Hierarchical Adaptive Replacement Cache (H-ARC) policy is proposed that considers all four factors of a page's status (dirty, clean, recency, and frequency), addressing the challenge of decreasing storage write traffic while also increasing the cache hit ratio for better system performance.
Abstract: With the rapid development of new types of nonvolatile memory (NVM), one of these technologies may replace DRAM as the main memory in the near future. Some drawbacks of DRAM, such as data loss due to power failure or a system crash, can be remedied by NVM's non-volatile nature. In the meantime, solid state drives (SSDs) are becoming widely deployed as storage devices for faster random access speed compared with traditional hard disk drives (HDDs). For applications demanding higher reliability and better performance, using NVM as the main memory and SSDs as storage devices becomes a promising architecture. Although SSDs have better performance than HDDs, SSDs cannot support in-place updates (i.e., an erase operation has to be performed before a page can be updated) and suffer from a low endurance problem in that each unit will wear out after a certain number of erase operations. In an NVM based main memory, any updated pages, called dirty pages, can be kept longer without the urgent need to be flushed to SSDs. This difference opens an opportunity to design new cache policies that help extend the lifespan of SSDs by wisely choosing cache eviction victims to decrease storage write traffic. However, it is very challenging to design a policy that can also increase the cache hit ratio for better system performance. Most existing DRAM-based cache policies have mainly concentrated on the recency or frequency status of a page. On the other hand, most existing NVM-based cache policies have mainly focused on the dirty or clean status of a page. In this paper, by extending the concept of the Adaptive Replacement Cache (ARC), we propose a Hierarchical Adaptive Replacement Cache (H-ARC) policy that considers all four factors of a page's status: dirty, clean, recency, and frequency. Specifically, at the higher level, H-ARC adaptively splits the whole cache space into a dirty-page cache and a clean-page cache. At the lower level, inside the dirty-page cache and the clean-page cache, H-ARC splits them into a recency-page cache and a frequency-page cache separately. During the page eviction process, all parts of the cache will be balanced towards their desired sizes.
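
As a rough software illustration of the top-level split described above, and only of that split: the cache budget is divided between a dirty-page list and a clean-page list with an adjustable target, whereas the real H-ARC further divides each side into recency and frequency lists and adapts all targets using ARC-style ghost lists.

```python
from collections import OrderedDict

class SimplifiedHARC:
    """Very reduced sketch of the two-level idea: one adaptive target
    splits the budget between dirty and clean pages; each list is plain
    LRU here instead of a full ARC with recency/frequency ghost lists."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.dirty_target = capacity // 2    # adapted dynamically in H-ARC
        self.dirty = OrderedDict()           # page -> None, LRU order
        self.clean = OrderedDict()

    def access(self, page, is_write):
        was_dirty = page in self.dirty
        self.dirty.pop(page, None)
        self.clean.pop(page, None)
        dst = self.dirty if (is_write or was_dirty) else self.clean
        dst[page] = None                     # (re)insert as MRU
        while len(self.dirty) + len(self.clean) > self.capacity:
            self._evict()

    def _evict(self):
        # Evict from whichever list exceeds its target; preferring clean
        # victims when the dirty list is within target reduces SSD writes.
        if len(self.dirty) > self.dirty_target or not self.clean:
            self.dirty.popitem(last=False)
        else:
            self.clean.popitem(last=False)
```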

Patent
Sanjeev N. Trika1
19 Feb 2014
TL;DR: In this paper, the authors propose a method and system to allow power fail-safe write-back or write-through caching of data in a persistent storage device into one or more cache lines of a caching device.
Abstract: A method and system to allow power fail-safe write-back or write-through caching of data in a persistent storage device into one or more cache lines of a caching device. No metadata associated with any of the cache lines is written atomically into the caching device when the data in the storage device is cached. As such, specialized cache hardware to allow atomic writing of metadata during the caching of data is not required.