
Showing papers on "Cache algorithms published in 2017"


Proceedings Article
01 Feb 2017
TL;DR: This paper presents the Compute Cache architecture that enables in-place computation in caches, which uses emerging bit-line SRAM circuit technology to repurpose existing cache elements and transforms them into active very large vector computational units.
Abstract: This paper presents the Compute Cache architecture that enables in-place computation in caches. Compute Caches uses emerging bit-line SRAM circuit technology to repurpose existing cache elements and transforms them into active very large vector computational units. Also, it significantly reduces the overheads in moving data between different levels in the cache hierarchy. Solutions to satisfy new constraints imposed by Compute Caches, such as operand locality, are discussed. Also discussed are simple solutions to problems in integrating them into a conventional cache hierarchy while preserving properties such as coherence, consistency, and reliability. Compute Caches increase performance by 1.9× and reduce energy by 2.4× for a suite of data-centric applications, including text and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with a larger fraction of Compute Cache operations could benefit even more, as our micro-benchmarks indicate (54× throughput, 9× dynamic energy savings).
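For intuition, the operations Compute Caches performs directly on SRAM bit-lines are bulk, cache-line-wide logical operations and searches. The sketch below is only a software illustration of those operation semantics, not of the circuit technique; the line size and helper names are chosen for the example.

```python
# Software illustration of the bulk, cache-line-wide operations that
# in-cache computing targets (logical ops and search over wide vectors).
# Names and sizes here are illustrative only.

LINE_BYTES = 64  # a typical cache-line size

def bulk_and(a: bytes, b: bytes) -> bytes:
    """Element-wise AND over two equally sized byte vectors."""
    assert len(a) == len(b)
    return bytes(x & y for x, y in zip(a, b))

def bulk_search(haystack: bytes, pattern: bytes) -> list:
    """Return offsets of cache-line-sized chunks equal to `pattern`."""
    assert len(pattern) == LINE_BYTES
    return [off for off in range(0, len(haystack), LINE_BYTES)
            if haystack[off:off + LINE_BYTES] == pattern]

if __name__ == "__main__":
    data = bytes([0xAB] * LINE_BYTES) * 4 + bytes(LINE_BYTES)
    print(bulk_search(data, bytes([0xAB] * LINE_BYTES)))  # -> [0, 64, 128, 192]
    print(bulk_and(data[:LINE_BYTES], bytes([0x0F] * LINE_BYTES))[:4])
```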

225 citations


Proceedings Article
16 Aug 2017
TL;DR: Cloak, a new technique that uses hardware transactional memory to prevent adversarial observation of cache misses on sensitive code and data, provides strong protection against all known cache-based side-channel attacks with low performance overhead.
Abstract: Cache-based side-channel attacks are a serious problem in multi-tenant environments, for example, modern cloud data centers. We address this problem with Cloak, a new technique that uses hardware transactional memory to prevent adversarial observation of cache misses on sensitive code and data. We show that Cloak provides strong protection against all known cache-based side-channel attacks with low performance overhead. We demonstrate the efficacy of our approach by retrofitting vulnerable code with Cloak and experimentally confirming immunity against state-of-the-art attacks. We also show that by applying Cloak to code running inside Intel SGX enclaves we can effectively block information leakage through cache side channels from enclaves, thus addressing one of the main weaknesses of SGX.

194 citations


Proceedings ArticleDOI
25 Jun 2017
TL;DR: This work considers a system where a local cache maintains a collection of N dynamic content items that are randomly requested by local users and shows that an asymptotically optimal policy updates a cached item in proportion to the square root of the item's popularity.
Abstract: We consider a system where a local cache maintains a collection of N dynamic content items that are randomly requested by local users. A capacity-constrained link to a remote network server limits the ability of the cache to hold the latest version of each item at all times, making it necessary to design an update policy. Using an age of information metric, we show under a relaxed problem formulation that an asymptotically optimal policy updates a cached item in proportion to the square root of the item's popularity. We then show experimentally that a physically realizable policy closely approximates the asymptotically optimal policy.
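A minimal numeric sketch of the square-root rule: given item popularities and a total update-rate budget for the cache-to-server link (both inputs are assumptions for the example), each item's update rate is set proportional to the square root of its popularity and normalized to the budget.

```python
import math

def sqrt_update_rates(popularities, total_rate):
    """Allocate per-item update rates proportional to sqrt(popularity),
    normalized so they sum to the link's total update budget."""
    weights = [math.sqrt(p) for p in popularities]
    scale = total_rate / sum(weights)
    return [scale * w for w in weights]

# Example: Zipf-like popularities for N = 5 items, budget of 10 updates/s.
pops = [1 / k for k in range(1, 6)]
print([round(r, 3) for r in sqrt_update_rates(pops, 10.0)])
```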

169 citations


Journal ArticleDOI
TL;DR: In this paper, the authors studied the cache placement problem in fog radio access networks (Fog-RANs), by taking into account flexible physical-layer transmission schemes and diverse content preferences of different users.
Abstract: To deal with the rapid growth of high-speed and/or ultra-low latency data traffic for massive mobile users, fog radio access networks (Fog-RANs) have emerged as a promising architecture for next-generation wireless networks. In Fog-RANs, the edge nodes and user terminals possess storage, computation and communication functionalities to various degrees, which provide high flexibility for network operation, i.e., from fully centralized to fully distributed operation. In this paper, we study the cache placement problem in Fog-RANs, by taking into account flexible physical-layer transmission schemes and diverse content preferences of different users. We develop both centralized and distributed transmission aware cache placement strategies to minimize users’ average download delay subject to the storage capacity constraints. In the centralized mode, the cache placement problem is transformed into a matroid constrained submodular maximization problem, and an approximation algorithm is proposed to find a solution within a constant factor to the optimum. In the distributed mode, a belief propagation-based distributed algorithm is proposed to provide a suboptimal solution, with iterative updates at each BS based on locally collected information. Simulation results show that by exploiting caching and cooperation gains, the proposed transmission aware caching algorithms can greatly reduce the users’ average download delay.
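The centralized placement step fits the standard greedy template for matroid-constrained submodular maximization: repeatedly add the (file, node) placement with the largest marginal reduction in average download delay until every node's storage is full. The sketch below shows that generic template with a caller-supplied, hypothetical marginal_gain oracle and unit-size files; it is not the paper's exact formulation.

```python
def greedy_placement(files, nodes, capacity, marginal_gain):
    """Greedy template for matroid-constrained submodular maximization.

    files, nodes : iterables of identifiers
    capacity[n]  : number of (unit-size) files node n can store
    marginal_gain(placement, f, n) : gain (e.g., delay reduction) of
        adding file f to node n given the current placement, a set of
        (f, n) pairs. Supplied by the caller; hypothetical here.
    """
    placement = set()
    load = {n: 0 for n in nodes}
    while True:
        best, best_gain = None, 0.0
        for n in nodes:
            if load[n] >= capacity[n]:
                continue
            for f in files:
                if (f, n) in placement:
                    continue
                g = marginal_gain(placement, f, n)
                if g > best_gain:
                    best, best_gain = (f, n), g
        if best is None:  # no remaining placement improves the objective
            return placement
        placement.add(best)
        load[best[1]] += 1
```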

126 citations


Proceedings ArticleDOI
24 Jun 2017
TL;DR: This paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes, and calls it SHARP (Secure Hierarchy-Aware cache Replacement Policy).
Abstract: In cache-based side channel attacks, a spy that shares a cache with a victim probes cache locations to extract information on the victim's access patterns. For example, in evict+reload, the spy repeatedly evicts and then reloads a probe address, checking if the victim has accessed the address in between the two operations. While there are many proposals to combat these cache attacks, they all have limitations: they either hurt performance, require programmer intervention, or can only defend against some types of attacks. This paper makes the following observation for an environment with an inclusive cache hierarchy: when the spy evicts the probe address from the shared cache, the address will also be evicted from the private cache of the victim process, creating an inclusion victim. Consequently, to disable cache attacks, this paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes. By enforcing this rule, the spy cannot evict the probe address from the shared cache and, hence, cannot glimpse any information on the victim's access patterns. We call our proposal SHARP (Secure Hierarchy-Aware cache Replacement Policy). SHARP efficiently defends against all existing cross-core shared-cache attacks, needs only minimal hardware modifications, and requires no code modifications. We implement SHARP in a cycle-level full-system simulator. We show that it protects against real-world attacks, and that it introduces negligible average performance degradation.
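A minimal sketch of the replacement rule described above, assuming the shared-cache controller can tell (e.g., through core-valid bits) whether a candidate line is still held in some other core's private cache; the bookkeeping and the fallback behavior of the real SHARP design are simplified away.

```python
def sharp_pick_victim(candidate_lines, requesting_core, cached_in_private):
    """Pick an LLC victim that does not create an inclusion victim in
    another core's private cache.

    candidate_lines         : lines of the set, ordered from LRU to MRU
    cached_in_private(line) : set of core ids whose private caches hold `line`
    Returns a line to evict, or None if every candidate would victimize
    another core (a fallback policy would then be needed).
    """
    for line in candidate_lines:  # prefer the LRU-most safe candidate
        owners = cached_in_private(line)
        if not owners or owners == {requesting_core}:
            return line
    return None
```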

109 citations


Journal ArticleDOI
TL;DR: This work investigates the problem of developing optimal joint routing and caching policies in a network supporting in-network caching with the goal of minimizing expected content-access delay and identifies the structural property of the user-cache graph that makes the problem NP-complete.
Abstract: In-network content caching has been deployed in both the Internet and cellular networks to reduce content-access delay. We investigate the problem of developing optimal joint routing and caching policies in a network supporting in-network caching with the goal of minimizing expected content-access delay. Here, needed content can either be accessed directly from a back-end server (where content resides permanently) or be obtained from one of multiple in-network caches. To access content, users must thus decide whether to route their requests to a cache or to the back-end server. In addition, caches must decide which content to cache. We investigate two variants of the problem, where the paths to the back-end server can be considered as either congestion-sensitive or congestion-insensitive, reflecting whether or not the delay experienced by a request sent to the back-end server depends on the request load, respectively. We show that the problem of optimal joint caching and routing is NP-complete in both cases. We prove that under the congestion-insensitive delay model, the problem can be solved optimally in polynomial time if each piece of content is requested by only one user, or when there are at most two caches in the network. We also identify the structural property of the user-cache graph that makes the problem NP-complete. For the congestion-sensitive delay model, we prove that the problem remains NP-complete even if there is only one cache in the network and each content is requested by only one user. We show that approximate solutions can be found for both cases within a $(1-1/e)$ factor from the optimal, and demonstrate a greedy solution that is numerically shown to be within 1% of optimal for small problem sizes. Through trace-driven simulations, we evaluate the performance of our greedy solutions to joint caching and routing, which show up to 50% reduction in average delay over the solution of optimized routing to least recently used caches.

107 citations


Posted Content
TL;DR: In this article, the authors studied the cache placement problem in fog-RANs, by taking into account flexible physical-layer transmission schemes and diverse content preferences of different users, and developed both centralized and distributed transmission aware cache placement strategies to minimize users' average download delay subject to the storage capacity constraints.
Abstract: To deal with the rapid growth of high-speed and/or ultra-low latency data traffic for massive mobile users, fog radio access networks (Fog-RANs) have emerged as a promising architecture for next-generation wireless networks. In Fog-RANs, the edge nodes and user terminals possess storage, computation and communication functionalities to various degrees, which provides high flexibility for network operation, i.e., from fully centralized to fully distributed operation. In this paper, we study the cache placement problem in Fog-RANs, by taking into account flexible physical-layer transmission schemes and diverse content preferences of different users. We develop both centralized and distributed transmission aware cache placement strategies to minimize users' average download delay subject to the storage capacity constraints. In the centralized mode, the cache placement problem is transformed into a matroid constrained submodular maximization problem, and an approximation algorithm is proposed to find a solution within a constant factor to the optimum. In the distributed mode, a belief propagation based distributed algorithm is proposed to provide a suboptimal solution, with iterative updates at each BS based on locally collected information. Simulation results show that by exploiting caching and cooperation gains, the proposed transmission aware caching algorithms can greatly reduce the users' average download delay.

94 citations


Proceedings ArticleDOI
25 Jun 2017
TL;DR: In this article, the authors considered a basic caching system, where a single server with a database of N files (e.g. movies) is connected to a set of K users through a shared bottleneck link.
Abstract: We consider a basic caching system, where a single server with a database of N files (e.g. movies) is connected to a set of K users through a shared bottleneck link. Each user has a local cache memory with a size of M files. The system operates in two phases: a placement phase, where each cache memory is populated up to its size from the database, and a following delivery phase, where each user requests a file from the database, and the server is responsible for delivering the requested contents. The objective is to design the two phases to minimize the load (peak or average) of the bottleneck link. We characterize the rate-memory tradeoff of the above caching system within a factor of 2.00884 for both the peak rate and the average rate (under uniform file popularity), where the best proved characterization in the current literature gives a factor of 4 and 4.7 respectively. Moreover, in the practically important case where the number of files (N) is large, we exactly characterize the tradeoff for systems with no more than 5 users, and characterize the tradeoff within a factor of 2 otherwise. We establish these results by developing novel information theoretic outer-bounds for the caching problem, which improves the state of the art and gives tight characterization in various cases.
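For reference, the classic Maddah-Ali–Niesen placement-and-delivery scheme (often the baseline such constant-factor characterizations are contrasted with, though not the scheme analyzed in this paper) achieves a peak delivery rate of

$R_{\mathrm{MN}}(M) = K\left(1-\frac{M}{N}\right)\cdot\frac{1}{1+KM/N}$ for integer $t = KM/N$,

with memory sharing between neighboring integer points used for non-integer $t$.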

81 citations


Journal ArticleDOI
TL;DR: Simulation results show that the proposed method outperforms the existing counterparts with a higher hit ratio and lower delay in delivering video contents, and leveraging the backward induction method, the optimal strategy of each player in the game model is proposed.
Abstract: To improve the performance of mobile video delivery, caching layered videos at a site close to mobile end users (e.g., at the edge of mobile service provider's backbone) was advocated because cached videos can be delivered to mobile users with a high quality of experience, e.g., a short latency. How to optimally cache layered videos based on caching price, the available capacity of cache nodes, and the social features of mobile users, however, is still a challenging issue. In this paper, we propose a novel edge caching scheme to cache layered videos. First, a framework to cache layered videos is presented in which a cache node stores layered videos for multiple social groups, formed by mobile users based on their requests. Due to the limited capacity of the cache node, these social groups compete with each other for the number of layers they request to cache, aiming at maximizing their utilities while all mobile users in each group share the cost involved in the cache of video contents. Second, a Stackelberg game model is developed to study the interaction among multiple social groups and the cache node, and a noncooperative game model is introduced to analyze the competition among mobile users in different social groups. Third, leveraging the backward induction method, the optimal strategy of each player in the game model is proposed. Finally, simulation results show that the proposed method outperforms the existing counterparts with a higher hit ratio and lower delay in delivering video contents.

74 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: A novel probabilistic information flow graph is proposed to model the interaction between the victim program, the attacker program and the cache architecture, and a new metric, the Probability of Attack Success (PAS), is derived, which gives a quantitative measure for evaluating a cache’s resilience against a given class of cache side-channel attacks.
Abstract: Security-critical data can leak through very unexpected side channels, making side-channel attacks very dangerous threats to information security. Of these, cache-based side-channel attacks are some of the most problematic. This is because caches are essential for the performance of modern computers, but an intrinsic property of all caches – the different access times for cache hits and misses – is the property exploited to leak information in time-based cache side-channel attacks. Recently, different secure cache architectures have been proposed to defend against these attacks. However, we do not have a reliable method for evaluating a cache's resilience against different classes of cache side-channel attacks, which is the goal of this paper. We first propose a novel probabilistic information flow graph (PIFG) to model the interaction between the victim program, the attacker program and the cache architecture. From this model, we derive a new metric, the Probability of Attack Success (PAS), which gives a quantitative measure for evaluating a cache's resilience against a given class of cache side-channel attacks. We show the generality of our model and metric by applying them to evaluate nine different cache architectures against all four classes of cache side-channel attacks. Our new methodology, model and metric can help verify the security provided by different proposed secure cache architectures, and compare them in terms of their resilience to cache side-channel attacks, without the need for simulation or taping out a chip. CCS Concepts: • Security and privacy → Side-channel analysis and countermeasures; • General and reference → Evaluation; • Computer systems organization → Processors and memory architectures.

72 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: Banshee is a new DRAM cache design that optimizes for both in-package and off-package DRAM bandwidth efficiency without degrading access latency, and reduces unnecessary DRAM cache replacement traffic with a new bandwidth-aware frequency-based replacement policy.
Abstract: Placing the DRAM in the same package as a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs optimize mainly for cache hit latency and do not consider bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM. We propose a new DRAM cache design, Banshee, that optimizes for both in-package and off-package DRAM bandwidth efficiency without degrading access latency. Banshee is based on two key ideas. First, it eliminates the tag lookup overhead by tracking the contents of the DRAM cache using TLBs and page table entries, which is efficiently enabled by a new lightweight TLB coherence protocol we introduce. Second, it reduces unnecessary DRAM cache replacement traffic with a new bandwidth-aware frequency-based replacement policy. Our evaluations show that Banshee significantly improves performance (15% on average) and reduces DRAM traffic (35.8% on average) over the best-previous latency-optimized DRAM cache design. CCS Concepts: • Computer systems organization → Multicore architectures; Heterogeneous (hybrid) systems.
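A toy sketch of the flavor of bandwidth-aware, frequency-based replacement: a candidate page replaces a cached victim only when its observed access count exceeds the victim's by some margin, which suppresses replacement traffic for pages with little reuse. The counters, threshold, and class name below are invented for the illustration; Banshee's actual mechanism (sampling plus TLB/PTE-based tracking) is more involved.

```python
from collections import defaultdict

class FrequencyAdmission:
    """Bandwidth-aware admission sketch: replace a cached page only when
    the requesting page has been accessed noticeably more often."""

    def __init__(self, threshold=2):
        self.count = defaultdict(int)   # per-page access counters
        self.threshold = threshold      # illustrative margin

    def access(self, page):
        self.count[page] += 1

    def should_replace(self, candidate, victim):
        # Admit only clearly hotter pages, limiting replacement traffic.
        return self.count[candidate] >= self.count[victim] + self.threshold
```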

Journal ArticleDOI
TL;DR: A popularity prediction-based cooperative cache replacement mechanism, which predicts and ranks popular content during a period of time is put forward, which aims to lower the cache replacement overhead and reduce the cache redundancy.
Abstract: Information centric networking (ICN) has been recently proposed as a prominent solution for content delivery in vehicular ad hoc networks. By caching the data packets in vehicular unused storage space, vehicles can obtain replicas of contents from other vehicles instead of the original content provider, which reduces the access pressure on the content provider and increases the response speed of content requests. In this paper, we propose a community similarity and population-based cache policy in an ICN vehicle-to-vehicle scenario. First, a dynamic probability caching scheme is designed by evaluating the community similarity and privacy rating of vehicles. Then, a caching vehicle selection method with hop numbers based on content popularity is proposed to reduce the cache redundancy. Moreover, to lower the cache replacement overhead, we put forward a popularity prediction-based cooperative cache replacement mechanism, which predicts and ranks popular content during a period of time. Simulation results show that our proposed mechanisms perform markedly well, reducing the average time delay while increasing the cache hit ratio and the cache hit distance.

Proceedings Article
12 Jul 2017
TL;DR: This work designs a new caching algorithm for web applications called hyperbolic caching, which decays item priorities at variable rates and continuously reorders many items at once and introduces the notion of a cost class in order to measure the costs and manipulate the priorities of all items belonging to a related group.
Abstract: Today's web applications rely heavily on caching to reduce latency and backend load, using services like Redis or Memcached that employ inflexible caching algorithms. But the needs of each application vary, and significant performance gains can be achieved with a tailored strategy, e.g., incorporating cost of fetching, expiration time, and so forth. Existing strategies are fundamentally limited, however, because they rely on data structures to maintain a total ordering of the cached items. Inspired by Redis's use of random sampling for eviction (in lieu of a data structure) and recent theoretical justification for this approach, we design a new caching algorithm for web applications called hyperbolic caching. Unlike prior schemes, hyperbolic caching decays item priorities at variable rates and continuously reorders many items at once. By combining random sampling with lazy evaluation of the hyperbolic priority function, we gain complete flexibility in customizing the function. For example, we describe extensions that incorporate item cost, expiration time, and windowing. We also introduce the notion of a cost class in order to measure the costs and manipulate the priorities of all items belonging to a related group. We design a hyperbolic caching variant for several production systems from leading cloud providers. We implement our scheme in Redis and the Django web framework. Using real and simulated traces, we show that hyperbolic caching reduces miss rates by ∼10-20% over competitive baselines tailored to the application, and improves end-to-end throughput by ∼5-10%.
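A compact sketch of the basic (cost-free) hyperbolic priority: an item's priority is its hit count divided by its time in the cache, priorities are evaluated lazily, and eviction takes the minimum over a small random sample instead of maintaining a total order. Class and parameter names are illustrative; the cost, expiration, and windowing extensions are omitted.

```python
import random
import time

class HyperbolicCache:
    """Eviction by sampling: evict the sampled item with the lowest
    hits / time-in-cache priority (the basic hyperbolic function)."""

    def __init__(self, capacity, sample_size=64):
        self.capacity = capacity
        self.sample_size = sample_size
        self.items = {}  # key -> (value, hit_count, insert_time)

    def _priority(self, key, now):
        _, hits, t0 = self.items[key]
        return hits / max(now - t0, 1e-9)

    def get(self, key):
        if key in self.items:
            value, hits, t0 = self.items[key]
            self.items[key] = (value, hits + 1, t0)
            return value
        return None

    def put(self, key, value):
        if key in self.items:  # update value, keep hit count and age
            _, hits, t0 = self.items[key]
            self.items[key] = (value, hits, t0)
            return
        if len(self.items) >= self.capacity:
            now = time.monotonic()
            sample = random.sample(list(self.items),
                                   min(self.sample_size, len(self.items)))
            victim = min(sample, key=lambda k: self._priority(k, now))
            del self.items[victim]
        self.items[key] = (value, 1, time.monotonic())
```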

Proceedings ArticleDOI
14 Oct 2017
TL;DR: Cache Automaton as discussed by the authors extends a conventional last-level cache architecture with components to accelerate two phases in NFA processing: state-match and state-transition, which is made efficient using a sense-amplifier cycling technique that exploits spatial locality in symbol matches.
Abstract: Finite State Automata are widely used to accelerate pattern matching in many emerging application domains like DNA sequencing and XML parsing. Conventional CPUs and compute-centric accelerators are bottlenecked by memory bandwidth and irregular memory access patterns in automata processing. We present Cache Automaton, which repurposes last-level cache for automata processing, and a compiler that automates the process of mapping large real world Non-Deterministic Finite Automata (NFAs) to the proposed architecture. Cache Automaton extends a conventional last-level cache architecture with components to accelerate two phases in NFA processing: state-match and state-transition. State-matching is made efficient using a sense-amplifier cycling technique that exploits spatial locality in symbol matches. State-transition is made efficient using a new compact switch architecture. By overlapping these two phases for adjacent symbols we realize an efficient pipelined design. We evaluate two designs, one optimized for performance and the other optimized for space, across a set of 20 diverse benchmarks. The performance-optimized design provides a speedup of 15× over Micron's DRAM-based Automata Processor (AP) and a 3840× speedup over processing in a conventional x86 CPU. The proposed design utilizes on average 1.2 MB of cache space across benchmarks, while consuming 2.3 nJ of energy per input symbol. Our space-optimized design can reduce the cache utilization to 0.72 MB, while still providing a speedup of 9× over the AP. CCS Concepts: • Hardware → Emerging architectures; • Theory of computation → Formal languages and automata theory.

Journal ArticleDOI
TL;DR: This letter studies the optimization for cache content placement to minimize the backhaul load subject to cache capacity constraints for caching enabled small cell networks with heterogeneous file and cache sizes.
Abstract: In this letter, we study the optimization for cache content placement to minimize the backhaul load subject to cache capacity constraints for caching enabled small cell networks with heterogeneous file and cache sizes. Multicast content delivery is adopted to reduce the backhaul rate exploiting the independence among maximum distance separable coded packets.

Proceedings ArticleDOI
Meng Xu, Linh Thi Xuan Phan, Hyon-Young Choi, Insup Lee
01 Apr 2017
TL;DR: In this paper, the authors present vCAT, a novel design for dynamic shared cache management on multicore virtualization platforms based on Intel's cache allocation technology (CAT), which achieves strong isolation at both task and VM levels through cache partition virtualization.
Abstract: This paper presents vCAT, a novel design for dynamic shared cache management on multicore virtualization platforms based on Intel's Cache Allocation Technology (CAT). Our design achieves strong isolation at both task and VM levels through cache partition virtualization, which works in a similar way as memory virtualization, but has challenges that are unique to cache and CAT. To demonstrate the feasibility and benefits of our design, we provide a prototype implementation of vCAT, and we present an extensive set of microbenchmarks and performance evaluation results on the PARSEC benchmarks and synthetic workloads, for both static and dynamic allocations. The evaluation results show that (i) vCAT can be implemented with minimal overhead, (ii) it can be used to mitigate shared cache interference, which could otherwise increase task WCET by up to 7.2×, (iii) static management in vCAT can increase system utilization by up to 7× compared to a system without cache management, and (iv) dynamic management substantially outperforms static management in terms of schedulable utilization (an increase of up to 3× in our multi-mode example use case).
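As background for what vCAT virtualizes: Intel CAT partitions the shared LLC with per-class capacity bitmasks that must be contiguous runs of cache ways. The helper below only computes such contiguous way-masks for a list of partition sizes; it is a loose illustration, not the paper's hypervisor mechanism, and the associativity value is just an example.

```python
def contiguous_way_masks(way_counts, total_ways=20):
    """Pack cache partitions into contiguous way bitmasks, as Intel CAT
    requires (class-of-service capacity masks must be contiguous).
    Returns integers whose binary forms are the per-partition masks.
    total_ways=20 is just an example LLC associativity."""
    if sum(way_counts) > total_ways:
        raise ValueError("partitions exceed available cache ways")
    masks, base = [], 0
    for ways in way_counts:
        masks.append(((1 << ways) - 1) << base)
        base += ways
    return masks

# Example: three partitions of 8, 8, and 4 ways.
print([bin(m) for m in contiguous_way_masks([8, 8, 4])])
```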

Proceedings ArticleDOI
19 Mar 2017
TL;DR: This paper proposes an optimization framework for cache placement and delivery schemes which explicitly accounts for the heterogeneity of the cache sizes, and characterize explicitly the optimal caching scheme, for the case where the sum of the users' cache sizes is smaller than or equal to the library size.
Abstract: Coded caching can improve fundamental limits of communication, utilizing storage memory at individual users. This paper considers a centralized coded caching system, introducing heterogeneous cache sizes at the users, i.e., the users' cache memories are of different size. The goal is to design cache placement and delivery policies that minimize the worst-case delivery load on the server. To that end, the paper proposes an optimization framework for cache placement and delivery schemes which explicitly accounts for the heterogeneity of the cache sizes. We also characterize explicitly the optimal caching scheme, for the case where the sum of the users' cache sizes is smaller than or equal to the library size.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: This paper proposes Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp, and uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on load locality characterization.
Abstract: Long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (a collection of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought by a warp load instruction is used only once, which is classified as streaming data; (2) data brought by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought by a warp is reused multiple times but across different warps, called inter-warp locality; and (4) some data exhibit a mix of intra- and inter-warp locality. Furthermore, each load instruction exhibits consistently the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must be done using per-load locality type information, rather than applying warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on load locality characterization. Using an extensive set of simulations we show that APCM improves performance of GPUs by 34% for cache sensitive applications while saving 27% of energy consumption over a baseline GPU.
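A simplified software sketch of the per-load classification APCM relies on: observe which warps touch the data brought in by each load PC and label the PC as streaming, intra-warp, inter-warp, or mixed. The structures and trace format are invented for the illustration, and unlike APCM the sketch monitors all warps rather than one exemplary warp.

```python
from collections import defaultdict

def classify_loads(trace):
    """trace: iterable of (load_pc, warp_id, address) in program order.
    Returns a locality label per load PC, in the spirit of APCM's
    per-load classification (simplified)."""
    warps = defaultdict(lambda: defaultdict(set))   # pc -> addr -> warp ids
    counts = defaultdict(lambda: defaultdict(int))  # pc -> addr -> accesses
    for pc, warp, addr in trace:
        warps[pc][addr].add(warp)
        counts[pc][addr] += 1

    labels = {}
    for pc in warps:
        intra = any(c > len(warps[pc][a]) for a, c in counts[pc].items())
        inter = any(len(w) > 1 for w in warps[pc].values())
        if intra and inter:
            labels[pc] = "mixed"
        elif intra:
            labels[pc] = "intra-warp"
        elif inter:
            labels[pc] = "inter-warp"
        else:
            labels[pc] = "streaming"
    return labels

# Example: PC 0x10 streams, PC 0x20 is reused across warps.
trace = [(0x10, 0, 100), (0x10, 1, 200), (0x20, 0, 300), (0x20, 1, 300)]
print(classify_loads(trace))
```

Given such labels, APCM would bypass the cache for streaming loads and pin lines for loads with reuse, per the characterization above.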

Proceedings Article
12 Jul 2017
TL;DR: Web application performance heavily relies on the hit rate of DRAM key-value caches, and Memshare provides a resource sharing model that guarantees reserved memory to different applications while dynamically pooling and sharing the remaining memory to optimize overall hit rate.
Abstract: Web application performance heavily relies on the hit rate of DRAM key-value caches. Current DRAM caches statically partition memory across applications that share the cache. This results in under utilization and limits cache hit rates. We present Memshare, a DRAM key-value cache that dynamically manages memory across applications. Memshare provides a resource sharing model that guarantees reserved memory to different applications while dynamically pooling and sharing the remaining memory to optimize overall hit rate. Key-value caches are typically memory capacity bound, which leaves cache server CPU and memory bandwidth idle. Memshare leverages these resources with a log-structured design that allows it to provide better hit rates than conventional caches by dynamically repartitioning memory among applications. We implemented Memshare and ran it on a week-long trace from a commercial memcached provider. Memshare increases the combined hit rate of the applications in the trace from 84.7% to 90.8%, and it reduces the total number of misses by 39.7% without significantly affecting cache throughput or latency. Even for single-tenant applications, Memshare increases the average hit rate of the state-of-the-art key-value cache by an additional 2.7%.

Journal ArticleDOI
TL;DR: The CLCE replication scheme reduces the redundant caching of contents; hence improves the cache space utilization and LFRU approximates the least frequently used scheme coupled with the least recently used scheme and is practically implementable for rapidly changing cache networks like ICNs.
Abstract: To cope with the ongoing changing demands of the internet, ‘in-network caching’ has been presented as an application solution for two decades. With the advent of information-centric network (ICN) architecture, ‘in-network caching’ becomes a network level solution. Some unique features of the ICNs, e.g., rapidly changing cache states, higher request arrival rates, smaller cache sizes, and other factors, impose diverse requirements on the content eviction policies. In particular, eviction policies should be fast and lightweight. In this paper, we propose cache replication and eviction schemes, conditional leave copy everywhere (CLCE) and least frequent recently used (LFRU), which are well suited for the ICN type of cache networks (CNs). The CLCE replication scheme reduces the redundant caching of contents; hence it improves the cache space utilization. LFRU approximates the least frequently used scheme coupled with the least recently used scheme and is practically implementable for rapidly changing cache networks like ICNs.
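One loose reading of the LFRU eviction rule (evict the least frequently used item among the recently used ones) can be sketched as follows; the window size and bookkeeping are illustrative, and the paper's actual design is more structured than this single-structure sketch.

```python
from collections import OrderedDict

class LFRUSketch:
    """Loose sketch of 'least frequent recently used' eviction: among the
    W least-recently-used cached items, evict the one with the lowest
    access frequency. W, the structure, and the names are illustrative."""

    def __init__(self, capacity, window=4):
        self.capacity = capacity
        self.window = window
        self.items = OrderedDict()  # key -> [value, freq]; order = recency

    def get(self, key):
        if key not in self.items:
            return None
        value, freq = self.items.pop(key)
        self.items[key] = [value, freq + 1]  # move to the MRU end
        return value

    def put(self, key, value):
        if key in self.items:
            _, freq = self.items.pop(key)
            self.items[key] = [value, freq + 1]
            return
        if len(self.items) >= self.capacity:
            oldest = list(self.items)[:self.window]  # the LRU-most W keys
            victim = min(oldest, key=lambda k: self.items[k][1])
            del self.items[victim]
        self.items[key] = [value, 1]
```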

Posted Content
TL;DR: This paper investigates multi-layer caching where both base station and users are capable of storing content data in their local cache and analyzes the performance of edge-caching wireless networks under two notable uncoded and coded caching strategies.
Abstract: Edge-caching has received much attention as an efficient technique to reduce delivery latency and network congestion during peak-traffic times by bringing data closer to end users. Existing works usually design caching algorithms separately from physical layer design. In this paper, we analyse edge-caching wireless networks by taking into account the caching capability when designing the signal transmission. Particularly, we investigate multi-layer caching where both base station (BS) and users are capable of storing content data in their local cache and analyse the performance of edge-caching wireless networks under two notable uncoded and coded caching strategies. Firstly, we propose a coded caching strategy that is applied to arbitrary values of cache size. The required backhaul and access rates are derived as a function of the BS and user cache size. Secondly, closed-form expressions for the system energy efficiency (EE) corresponding to the two caching methods are derived. Based on the derived formulas, the system EE is maximized via precoding vectors design and optimization while satisfying a predefined user request rate. Thirdly, two optimization problems are proposed to minimize the content delivery time for the two caching strategies. Finally, numerical results are presented to verify the effectiveness of the two caching methods.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: These results show that formalizing cache replacement yields practical benefits, and propose that practical policies should replace lines based on their economic value added (EVA), the difference of their expected hits from the average.
Abstract: Much prior work has studied cache replacement, but a large gap remains between theory and practice. The design of many practical policies is guided by the optimal policy, Belady's MIN. However, MIN assumes perfect knowledge of the future that is unavailable in practice, and the obvious generalizations of MIN are suboptimal with imperfect information. What, then, is the right metric for practical cache replacement? We propose that practical policies should replace lines based on their economic value added (EVA), the difference of their expected hits from the average. Drawing on the theory of Markov decision processes, we discuss why this metric maximizes the cache's hit rate. We present an inexpensive implementation of EVA and evaluate it exhaustively. EVA outperforms several prior policies and saves area at iso-performance. These results show that formalizing cache replacement yields practical benefits.
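A rough numeric sketch of the EVA idea under simplifying assumptions: given, for each line age, the probability that a line's current lifetime ends with a hit or an eviction at that age, a line's EVA is its expected hit minus the opportunity cost of the cache space it keeps occupying (the cache's average hit rate per line per unit time, times its expected remaining lifetime). The recurrence below is a simplified reconstruction for intuition, not the paper's implementation.

```python
def eva_by_age(hit_prob, evict_prob):
    """Simplified per-age EVA. hit_prob[a] / evict_prob[a]: probability
    that a line's lifetime ends at age a with a hit / an eviction.
    eva[a] = P(hit | age >= a) - g * E[remaining lifetime | age >= a],
    where g is the average hit rate per line per unit time (the
    opportunity cost of occupying cache space)."""
    end_prob = [h + e for h, e in zip(hit_prob, evict_prob)]
    avg_lifetime = sum(a * p for a, p in enumerate(end_prob))
    g = sum(hit_prob) / avg_lifetime  # hits per line per unit time

    eva = [0.0] * len(end_prob)
    hits_tail = time_tail = mass_tail = 0.0
    for a in reversed(range(len(end_prob))):
        hits_tail += hit_prob[a]
        time_tail += a * end_prob[a]
        mass_tail += end_prob[a]
        if mass_tail > 0:
            p_hit = hits_tail / mass_tail
            remaining = time_tail / mass_tail - a
            eva[a] = p_hit - g * remaining
    return eva

# Toy distribution: most hits happen young; lines that age past the reuse
# window get low or negative EVA and would be preferred eviction victims.
print(eva_by_age([0.0, 0.5, 0.2, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1, 0.2]))
```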

Journal ArticleDOI
TL;DR: A dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed and results have shown that the DARP improved the memory access efficiency by 25.4%.
Abstract: The increasing demand on the main memory capacity is one of the main big data challenges. Dynamic random access memory (DRAM) does not represent the best choice for a main memory, due to high power consumption and low density. However, nonvolatile memory, such as the phase-change memory (PCM), represents an additional choice because of its low power consumption and high-density characteristics. Nevertheless, high access latency and limited write endurance currently prevent the PCM from replacing the DRAM outright. Therefore, a hybrid memory, which combines both the DRAM and the PCM, has become a good alternative to the traditional DRAM memory. The disadvantages of both DRAM and PCM are challenges for the hybrid memory. In this paper, a dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed. The DARP distinguishes the cache data into the PCM data and the DRAM data; then, the algorithm adopts different replacement policies for each data type. Specifically, for the PCM data, the least recently used (LRU) replacement policy is adopted, and for the DRAM data, the DARP is employed according to the process behavior. Experimental results have shown that the DARP improved the memory access efficiency by 25.4%.
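A skeleton of the type-aware victim selection the abstract describes: candidate lines in a set are split by the memory that backs them, PCM-backed candidates are ranked by plain LRU, and DRAM-backed candidates by a caller-supplied adaptive policy. How the two per-type candidates are arbitrated is an assumption of this sketch (it simply prefers a DRAM-backed victim, on the reasoning that a DRAM refill is cheaper than a PCM refill); the paper's behavior-driven policy is more nuanced.

```python
def pick_victim(set_lines, is_pcm_backed, lru_victim, adaptive_victim):
    """Type-aware replacement skeleton for a DRAM/PCM hybrid main memory.

    set_lines        : candidate lines of the cache set
    is_pcm_backed(l) : True if line l caches data resident in PCM
    lru_victim       : policy applied to PCM-backed lines (plain LRU)
    adaptive_victim  : policy applied to DRAM-backed lines
    Preferring a DRAM-backed victim is an assumption of this sketch."""
    dram = [l for l in set_lines if not is_pcm_backed(l)]
    pcm = [l for l in set_lines if is_pcm_backed(l)]
    if dram:
        return adaptive_victim(dram)
    return lru_victim(pcm)
```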

Proceedings ArticleDOI
24 Jun 2017
TL;DR: Jenga is proposed, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications, and builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime.
Abstract: Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications. We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.

Proceedings ArticleDOI
14 Oct 2017
TL;DR: The technique is demonstrated using a placement, promotion, and bypass optimization that outperforms state-of-the-art policies using a low overhead and the accuracy of the multiperspective technique is superior to previous work.
Abstract: The disparity between last-level cache and memory latencies motivates the search for efficient cache management policies. Recent work in predicting reuse of cache blocks enables optimizations that significantly improve cache performance and efficiency. However, the accuracy of the prediction mechanisms limits the scope of optimization. This paper introduces multiperspective reuse prediction, a technique that predicts the future reuse of cache blocks using several different types of features. The accuracy of the multiperspective technique is superior to previous work. We demonstrate the technique using a placement, promotion, and bypass optimization that outperforms state-of-the-art policies using a low overhead. On a set of single-thread benchmarks, the technique yields a geometric mean 9.0% speedup over LRU, compared with 5.1% for Hawkeye and 6.3% for Perceptron. On multi-programmed workloads, the technique gives a geometric mean weighted speedup of 8.3% over LRU, compared with 5.2% for Hawkeye and 5.8% for Perceptron. CCS Concepts: • Computer systems organization → Multicore architectures; • Hardware → Static memory.
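The flavor of a multi-feature reuse predictor can be shown with a small perceptron-style sketch: several features of the access (e.g., PC bits, address bits, insertion state) each index a table of weights, the weights are summed, and the sign of the sum predicts whether the block will be reused; on a hit or eviction, the contributing weights are nudged. Table sizes, features, and thresholds below are invented for the illustration, not the paper's exact feature set or training rules.

```python
class ReusePredictor:
    """Perceptron-style reuse predictor over several hashed features
    (a simplified sketch of multi-feature prediction)."""

    def __init__(self, n_features=3, table_size=256, threshold=0):
        self.tables = [[0] * table_size for _ in range(n_features)]
        self.table_size = table_size
        self.threshold = threshold

    def _indices(self, features):
        return [hash(f) % self.table_size for f in features]

    def predict_reuse(self, features):
        """features: one value per perspective, e.g. (pc, addr >> 6, depth)."""
        total = sum(t[i] for t, i in zip(self.tables, self._indices(features)))
        return total >= self.threshold

    def train(self, features, was_reused, step=1, clamp=31):
        delta = step if was_reused else -step
        for t, i in zip(self.tables, self._indices(features)):
            t[i] = max(-clamp, min(clamp, t[i] + delta))

# Usage: predict at insertion; train when the block later hits or is evicted.
p = ReusePredictor()
feats = (0x400812, 0x7FFE12 >> 6, 2)
print(p.predict_reuse(feats))
p.train(feats, was_reused=True)
```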

Proceedings ArticleDOI
05 Jun 2017
TL;DR: Extensive evaluation shows that compared with existing wireless network caching algorithms, the proposed algorithms significantly improve data caching fairness, while keeping the contention induced latency similar to the best existing algorithms.
Abstract: Edge devices (e.g., smartphones, tablets, connected vehicles, IoT nodes) with sensing, storage and communication resources are increasingly penetrating our environments. Many novel applications can be created when nearby peer edge devices share data. Caching can greatly improve the data availability, retrieval robustness and latency. In this paper, we study the unique issue of caching fairness in edge environment. Due to distinct ownership of peer devices, caching load balance is critical. We consider fairness metrics and formulate an integer linear programming problem, which is shown as summation of multiple Connected Facility Location (ConFL) problems. We propose an approximation algorithm leveraging an existing ConFL approximation algorithm, and prove that it preserves a 6.55 approximation ratio. We further develop a distributed algorithm where devices exchange data reachability and identify popular candidates as caching nodes. Extensive evaluation shows that compared with existing wireless network caching algorithms, our algorithms significantly improve data caching fairness, while keeping the contention induced latency similar to the best existing algorithms.

Journal ArticleDOI
TL;DR: It is found that although room for substantial improvement exists when comparing performance to that of a perfect “oracle” policy, such improvements are unlikely to be achievable in practice.
Abstract: The ephemeral content popularity seen with many content delivery applications can make indiscriminate on-demand caching in edge networks highly inefficient, since many of the content items that are added to the cache will not be requested again from that network. In this paper, we address the problem of designing and evaluating more selective edge-network caching policies. The need for such policies is demonstrated through an analysis of a dataset recording YouTube video requests from users on an edge network over a 20-month period. We then develop a novel workload modelling approach for such applications and apply it to study the performance of alternative edge caching policies, including indiscriminate caching and cache on $k$th request for different $k$. The latter policies are found able to greatly reduce the fraction of the requested items that are inserted into the cache, at the cost of only modest increases in cache miss rate. Finally, we quantify and explore the potential room for improvement from use of other possible predictors of further requests. We find that although room for substantial improvement exists when comparing performance to that of a perfect “oracle” policy, such improvements are unlikely to be achievable in practice.
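The cache-on-$k$th-request policies evaluated here amount to an admission filter placed in front of any eviction policy: an item is inserted only once it has been requested $k$ times. A minimal sketch (the request bookkeeping is kept deliberately naive and the names are illustrative):

```python
class KthRequestFilter:
    """Admit an item into the edge cache only on its k-th request.
    Request counting is unbounded here for simplicity; a real deployment
    would age or bound the counters."""

    def __init__(self, k=2):
        self.k = k
        self.seen = {}

    def should_admit(self, item_id):
        self.seen[item_id] = self.seen.get(item_id, 0) + 1
        return self.seen[item_id] >= self.k

# On a miss: serve the request from the origin; insert the item into the
# cache only if should_admit(item_id) returns True.
```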

Journal ArticleDOI
TL;DR: An integer programming problem is formulated to minimize the average download time under the constraint of cache size at each SBS, and it is shown that finding the optimal caching placement strategy is NP-hard.
Abstract: To alleviate the pressure brought by the explosion of mobile video traffic on present cellular networks, small cell base stations (SBSs) with caching ability are introduced. In this letter, we consider the caching strategy of scalable video coding streaming over a heterogeneous wireless network containing SBSs. We formulate an integer programming problem to minimize the average download time under the constraint of cache size at each SBS, and show that finding the optimal caching placement strategy is NP-hard. A heuristic solution is proposed based on convex programming relaxation, which reveals the structural properties of cache allocation for each video. Simulation results demonstrate that our proposed caching strategies achieve significant performance gains compared with the conventional caching policy.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: DICE is proposed, a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data, and low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access in order to avoid probing both indices for retrieving a given cache line.
Abstract: This paper investigates compression for DRAM caches. As the capacity of DRAM cache is typically large, prior techniques on cache compression, which solely focus on improving cache capacity, provide only a marginal benefit. We show that more performance benefit can be obtained if the compression of the DRAM cache is tailored to provide higher bandwidth. If a DRAM cache can provide two compressed lines in a single access, and both lines are useful, the effective bandwidth of the DRAM cache would double. Unfortunately, it is not straightforward to compress DRAM caches for bandwidth. The typically used Traditional Set Indexing (TSI) maps consecutive lines to consecutive sets, so the multiple compressed lines obtained from the set are from spatially distant locations and unlikely to be used within a short period of each other. We can change the indexing of the cache to place consecutive lines in the same set to improve bandwidth; however, when the data is incompressible, such spatial indexing reduces effective capacity and causes significant slowdown. Ideally, we would like to have spatial indexing when the data is compressible and TSI otherwise. To this end, we propose Dynamic-Indexing Cache comprEssion (DICE), a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data. We also propose low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access in order to avoid probing both indices for retrieving a given cache line. Our studies with a 1GB DRAM cache, on a wide range of workloads (including SPEC and Graph), show that DICE improves performance by 19.0% and reduces energy-delay-product by 36% on average. DICE is within 3% of a design that has double the capacity and double the bandwidth. DICE incurs a storage overhead of less than 1KB and does not rely on any OS support.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: This paper proposes a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms and removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance.
Abstract: Data prefetching and cache replacement algorithms have been intensively studied in the design of high performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not consider demand and prefetch requests as different types of requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests since the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC cache management has three novel contributions. First, a prefetcher which approximates the future use distance of prefetch requests based on its prediction confidence. Second, a simple replacement policy provides similar or better performance than current state-of-the-art PC-based prediction using global hysteresis. Third, KPC integrates prefetching and replacement policy into a whole system which is greater than the sum of its parts. Information from the prefetcher is used to improve the performance of the replacement policy and vice-versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC- and non-PC-based schemes. Our evaluation shows that KPC provides 8% better performance than the best combination of existing prefetcher and replacement policy for multi-core workloads.