
Showing papers on "Smart Cache published in 2016"


Proceedings ArticleDOI
12 Mar 2016
TL;DR: CATalyst, a pseudo-locking mechanism which uses CAT to partition the LLC into a hybrid hardware-software managed cache, is presented, and it is shown that LLC side channel attacks can be defeated.
Abstract: Cache side channel attacks are serious threats to multi-tenant public cloud platforms. Past work showed how secret information in one virtual machine (VM) can be extracted by another co-resident VM using such attacks. Recent research demonstrated the feasibility of high-bandwidth, low-noise side channel attacks on the last-level cache (LLC), which is shared by all the cores in the processor package, enabling attacks even when VMs are scheduled on different cores. This paper shows how such LLC side channel attacks can be defeated using a performance optimization feature recently introduced in commodity processors. Since most cloud servers use Intel processors, we show how the Intel Cache Allocation Technology (CAT) can be used to provide a system-level protection mechanism to defend from side channel attacks on the shared LLC. CAT is a way-based hardware cache-partitioning mechanism for enforcing quality-of-service with respect to LLC occupancy. However, it cannot be directly used to defeat cache side channel attacks due to the very limited number of partitions it provides. We present CATalyst, a pseudo-locking mechanism which uses CAT to partition the LLC into a hybrid hardware-software managed cache. We implement a proof-of-concept system using Xen and Linux running on a server with Intel processors, and show that LLC side channel attacks can be defeated. Furthermore, CATalyst only causes very small performance overhead when used for security, and has negligible impact on legacy applications.
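For readers who want to experiment with way-based LLC partitioning, CAT is exposed on recent Linux kernels through the resctrl filesystem. The sketch below is a minimal illustration of plain CAT partitioning only (not CATalyst's pseudo-locking scheme); the group name, way mask, PID, and single-socket assumption are all illustrative.

```python
# Hedged sketch: LLC way partitioning with Intel CAT via Linux resctrl.
# Requires resctrl to be mounted: mount -t resctrl resctrl /sys/fs/resctrl
import os

RESCTRL = "/sys/fs/resctrl"

def create_partition(name: str, l3_mask: str) -> str:
    """Create a resctrl group whose tasks may only fill the LLC ways in l3_mask (hex bitmask)."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # One 'schemata' line per resource; this assumes a single L3 cache domain (id 0).
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask}\n")
    return group

def assign_task(group: str, pid: int) -> None:
    """Move a process into the partition; its LLC fills then stay inside the mask."""
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(pid))

if __name__ == "__main__":
    secure = create_partition("secure_vm", "000f")  # reserve 4 ways (illustrative mask)
    assign_task(secure, pid=1234)                   # hypothetical protected PID
```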

360 citations


Journal ArticleDOI
TL;DR: A new caching scheme is proposed that combines two basic approaches to provide coded multicasting opportunities within each layer and across multiple layers, and it achieves the optimal communication rates to within a constant multiplicative and additive gap.
Abstract: Caching of popular content during off-peak hours is a strategy to reduce network loads during peak hours. Recent work has shown significant benefits of designing such caching strategies not only to locally deliver part of the content, but also to provide coded multicasting opportunities even among users with different demands. Exploiting both of these gains was shown to be approximately optimal for caching systems with a single layer of caches. Motivated by practical scenarios, we consider, in this paper, a hierarchical content delivery network with two layers of caches. We propose a new caching scheme that combines two basic approaches. The first approach provides coded multicasting opportunities within each layer; the second approach provides coded multicasting opportunities across multiple layers. By striking the right balance between these two approaches, we show that the proposed scheme achieves the optimal communication rates to within a constant multiplicative and additive gap. We further show that there is no tension between the rates in each of the two layers up to the aforementioned gap. Thus, both the layers can simultaneously operate at approximately the minimum rate.

197 citations


Proceedings ArticleDOI
11 Sep 2016
TL;DR: In this article, when the cache contents and the user demands are fixed, the authors connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache).
Abstract: Caching is an effective way to reduce peak-hour network traffic congestion by storing some contents at the user's local cache. Maddah-Ali and Niesen (MAN) initiated a fundamental study of caching systems by proposing a scheme (with uncoded cache placement and linear network coding delivery) that is provably optimal to within a factor 4.7. In this paper, when the cache contents and the user demands are fixed, we connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that (i) the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache), and (ii) the number of users is no more than the number of files. As a consequence, further improvements to the MAN scheme are only possible through the use of coded cache placement.
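For reference, the memory-rate tradeoff achieved by the MAN scheme (shown here to be optimal under uncoded placement when the number of users K is at most the number of files N) is commonly written as below; this standard form is quoted for context and is not taken from the abstract itself.

```latex
% MAN scheme with N files, K users, per-user cache size M (in files).
% Achievable (memory, rate) points; the full tradeoff is their lower convex envelope.
R_{\mathrm{MAN}}(M) = \frac{K - t}{1 + t},
\qquad t = \frac{KM}{N} \in \{0, 1, \dots, K\}.
```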

188 citations


Proceedings ArticleDOI
10 Apr 2016
TL;DR: It is proved that the learning regret of PopCaching is sublinear in the number of content requests, and that it converges fast and asymptotically achieves the optimal cache hit rate.
Abstract: This paper presents a novel cache replacement method — Popularity-Driven Content Caching (PopCaching). PopCaching learns the popularity of content and uses it to determine which content it should store and which it should evict from the cache. Popularity is learned in an online fashion, requires no training phase and hence, it is more responsive to continuously changing trends of content popularity. We prove that the learning regret of PopCaching (i.e., the gap between the hit rate achieved by PopCaching and that by the optimal caching policy with hindsight) is sublinear in the number of content requests. Therefore, PopCaching converges fast and asymptotically achieves the optimal cache hit rate. We further demonstrate the effectiveness of PopCaching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. In addition, PopCaching has low complexity.
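As a rough illustration of the idea only (not the authors' algorithm, which comes with regret guarantees), the sketch below keeps a sliding-window popularity estimate per content and evicts the cached item with the lowest estimate; the window size is an arbitrary assumption.

```python
# Hedged sketch of popularity-driven eviction: popularity is estimated online
# from a sliding window of recent requests, with no training phase.
from collections import deque, Counter

class PopularityCache:
    def __init__(self, capacity: int, window_size: int = 10_000):
        self.capacity = capacity
        self.history = deque(maxlen=window_size)  # sliding window of recent requests
        self.counts = Counter()                   # per-content counts inside the window
        self.cache = set()

    def _record(self, item) -> None:
        if len(self.history) == self.history.maxlen:
            self.counts[self.history[0]] -= 1     # the append below drops this entry
        self.history.append(item)
        self.counts[item] += 1

    def request(self, item) -> bool:
        """Record the request and return True on a cache hit."""
        self._record(item)
        if item in self.cache:
            return True
        if len(self.cache) >= self.capacity:      # evict the least popular cached item
            victim = min(self.cache, key=lambda x: self.counts[x])
            self.cache.remove(victim)
        self.cache.add(item)
        return False
```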

146 citations


Proceedings ArticleDOI
10 Apr 2016
TL;DR: This paper proposes utility-driven caching, where a utility is associated with each content as a function of the corresponding content hit probability, and develops online algorithms that can be used by service providers to implement various caching policies based on arbitrary utility functions.
Abstract: In any caching system, the admission and eviction policies determine which contents are added and removed from a cache when a miss occurs. Usually, these policies are devised so as to mitigate staleness and increase the hit probability. Nonetheless, the utility of having a high hit probability can vary across contents. This occurs, for instance, when service level agreements must be met, or if certain contents are more difficult to obtain than others. In this paper, we propose utility-driven caching, where we associate with each content a utility, which is a function of the corresponding content hit probability. We formulate optimization problems where the objectives are to maximize the sum of utilities over all contents. These problems differ according to the stringency of the cache capacity constraint. Our framework enables us to reverse engineer classical replacement policies such as LRU and FIFO, by computing the utility functions that they maximize. We also develop online algorithms that can be used by service providers to implement various caching policies based on arbitrary utility functions.
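A hedged sketch of the kind of program the paper describes (the exact constraint set may differ from the authors' formulation): with h_i the hit probability of content i, U_i its utility function, and B the cache size in unit-size contents, the hit probabilities are chosen to solve the following; reverse engineering LRU or FIFO then amounts to finding utilities U_i whose maximizer matches that policy's hit probabilities.

```latex
\max_{0 \le h_i \le 1}\; \sum_{i=1}^{n} U_i(h_i)
\quad \text{subject to} \quad \sum_{i=1}^{n} h_i \le B .
```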

142 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family) and describe two key technologies: Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention.
Abstract: Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been a growing research topic. QoS techniques have been proposed to address the shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation, performance determinism and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis, and no commercial platforms were available that integrated such techniques and provided a practical solution. In this paper, we will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We will describe two key technologies: (i) Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and (ii) Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention. This is the first paper to describe these techniques as they moved from concept to reality, starting from early research to product implementation. We will also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters and communications platforms. Finally, we will describe initial software infrastructure and enabling for industry practitioners and researchers to take advantage of these technologies for their QoS needs.
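On recent Linux kernels both features are exposed through the resctrl filesystem: CAT through each group's schemata file, and CMT through per-group monitoring files. The snippet below is a minimal illustration of reading CMT's LLC occupancy counter; it assumes CMT-capable hardware, a kernel with resctrl monitoring, and a single L3 monitoring domain (mon_L3_00).

```python
# Hedged sketch: reading per-group LLC occupancy (CMT) via Linux resctrl.
from pathlib import Path

RESCTRL = Path("/sys/fs/resctrl")

def llc_occupancy_bytes(group: str = "") -> int:
    """Current LLC occupancy in bytes for a resctrl group ('' = the root group)."""
    base = RESCTRL / group if group else RESCTRL
    return int((base / "mon_data" / "mon_L3_00" / "llc_occupancy").read_text())

if __name__ == "__main__":
    print("root group LLC occupancy:", llc_occupancy_bytes(), "bytes")
```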

136 citations


Proceedings ArticleDOI
10 Apr 2016
TL;DR: It is shown that the problem of finding the caching configuration of video encoding layers that minimizes average delay for a network operator is NP-Hard, and a pseudopolynomial-time optimal solution is established using a connection with the multiple-choice knapsack problem.
Abstract: Distributed caching architectures have been proposed for bringing content close to requesters and the key problem is to design caching algorithms for reducing content delivery delay. The problem obtains an interesting new twist with the advent of advanced layered-video encoding techniques such as Scalable Video Coding (SVC). We show that the problem of finding the caching configuration of video encoding layers that minimizes average delay for a network operator is NP-Hard, and we establish a pseudopolynomial-time optimal solution using a connection with the multiple-choice knapsack problem. We also design caching algorithms for multiple operators that cooperate by pooling together their co-located caches, in an effort to aid each other, so as to avoid large delays due to downloading content from distant servers. We derive an approximate solution to this cooperative caching problem using a technique that partitions the cache capacity into amounts dedicated to own and others' caching needs. Numerical results based on real traces of SVC-encoded videos demonstrate up to 25% reduction in delay over existing (layer-agnostic) caching schemes, with increasing gains as the video popularity distribution gets steeper, and cache capacity increases.
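To make the knapsack connection concrete, here is a hedged sketch of a multiple-choice knapsack DP in the shape the paper's formulation suggests: each video is a group whose options are "cache the first k encoding layers", each with a cumulative size and a delay-saving value. The example numbers are invented.

```python
# Hedged sketch: pseudopolynomial multiple-choice knapsack DP over cache capacity.
def mckp(groups, capacity):
    """groups: list of [(size, value), ...] options per video (exactly one is chosen).
    Returns the maximum total value achievable within the cache capacity."""
    NEG = float("-inf")
    dp = [NEG] * (capacity + 1)
    dp[0] = 0.0
    for options in groups:
        new = [NEG] * (capacity + 1)
        for c in range(capacity + 1):
            if dp[c] == NEG:
                continue
            for size, value in options:
                if c + size <= capacity:
                    new[c + size] = max(new[c + size], dp[c] + value)
        dp = new
    return max(v for v in dp if v != NEG)

# Example: two videos; options are caching 0, 1, or 2 layers of each.
videos = [
    [(0, 0.0), (2, 3.0), (5, 4.0)],   # video A: (size, delay saving) per option
    [(0, 0.0), (3, 5.0), (6, 7.0)],   # video B
]
print(mckp(videos, capacity=7))       # -> 8.0 (cache 1 layer of A and 1 layer of B)
```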

129 citations


Journal ArticleDOI
TL;DR: Both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios are discussed.
Abstract: Caching is an essential technique to improve throughput and latency in a vast variety of applications. The core idea is to duplicate content in memories distributed across the network, which can then be exploited to deliver requested content with less congestion and delay. The traditional role of cache memories is to deliver the maximal amount of requested content locally rather than from a remote server. While this approach is optimal for single-cache systems, it has recently been shown to be significantly suboptimal for systems with multiple caches (i.e., cache networks). Instead, cache memories should be used to enable a coded multicasting gain. In this article, we survey these recent developments. We discuss both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios.

107 citations


Proceedings ArticleDOI
Jun Rao1, Hao Feng1, Chenchen Yang1, Zhiyong Chen1, Bin Xia1 
22 May 2016
TL;DR: The placement problem for the helper-tier caching network is reduced to a convex problem, and can be effectively solved by the classical water-filling method; it is observed that users and helpers prefer to cache popular contents under low node density and prefer to cache different contents evenly under high node density.
Abstract: In this paper, we devise the optimal caching placement to maximize the offloading probability for a two-tier wireless caching system, where the helpers and a part of the users have caching ability. The offloading comes from the local caching, D2D sharing and the helper transmission. In particular, to maximize the offloading probability we reformulate the caching placement problem for users and helpers into a difference of convex (DC) problem which can be effectively solved by DC programming. Moreover, we analyze the two extreme cases where there is only the helper-tier caching network and only the user-tier. Specifically, the placement problem for the helper-tier caching network is reduced to a convex problem, and can be effectively solved by the classical water-filling method. We notice that users and helpers prefer to cache popular contents under low node density and prefer to cache different contents evenly under high node density. Simulation results indicate a great performance gain of the proposed caching placement over existing approaches.

106 citations


Proceedings Article
16 Mar 2016
TL;DR: Cliffhanger is developed, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads and designs a novel technique for dealing with performance cliffs incrementally and locally.
Abstract: Web-scale applications are heavily reliant on memory cache systems such as Memcached to improve throughput and reduce user latency. Small performance improvements in these systems can result in large end-to-end gains. For example, a marginal increase in hit rate of 1% can reduce the application layer latency by over 35%. However, existing web cache resource allocation policies are workload oblivious and first-come-first-serve. By analyzing measurements from a widely used caching service, Memcachier, we demonstrate that existing cache allocation techniques leave significant room for improvement. We develop Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads. It has been shown that cache allocation algorithms underperform when there are performance cliffs, in which minor changes in cache allocation cause large changes in the hit rate. We design a novel technique for dealing with performance cliffs incrementally and locally. We demonstrate that for the Memcachier applications, on average, Cliffhanger increases the overall hit rate by 1.2%, reduces the total number of cache misses by 36.7% and achieves the same hit rate with 45% less memory capacity.

99 citations


Journal ArticleDOI
TL;DR: It is proved that the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate.
Abstract: This paper presents Trend-Caching, a novel cache replacement method that optimizes cache performance according to the trends of video content. Trend-Caching explicitly learns the popularity trend of video content and uses it to determine which video it should store and which it should evict from the cache. Popularity is learned in an online fashion and requires no training phase, hence it is more responsive to continuously changing trends of videos. We prove that the learning regret of Trend-Caching (i.e., the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight) is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate. We further validate the effectiveness of Trend-Caching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. Furthermore, Trend-Caching has low complexity.

Journal ArticleDOI
TL;DR: An improved design of Newcache is presented, in terms of security, circuit design and simplicity, and it is shown that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
Abstract: Newcache is a secure cache that can thwart cache side-channel attacks to prevent the leakage of secret information. All caches today are susceptible to cache side-channel attacks, despite software isolation of memory pages in virtual address spaces or virtual machines. These cache attacks can leak secret encryption keys or private identity keys, nullifying any protection provided by strong cryptography. Newcache uses a novel dynamic, randomized memory-to-cache mapping to thwart contention-based side-channel attacks, rather than the static mapping used by conventional set-associative caches. In this article, the authors present an improved design of Newcache, in terms of security, circuit design and simplicity. They show Newcache's security against a suite of cache side-channel attacks. They evaluate Newcache's system performance for cloud computing, smartphone, and SPEC benchmarks and find that Newcache performs as well as conventional set-associative caches, and sometimes better. They also designed a VLSI test chip with a 32-Kbyte Newcache and a 32-Kbyte, eight-way, set-associative cache and verified that the access latency, power, and area of the two caches are comparable. These results show that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
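As a toy illustration of the mapping idea only (not Newcache's actual hardware design, which uses a dynamically remapped, logically direct-mapped structure), the sketch below indexes a cache through a keyed pseudorandom function of the address, so an attacker cannot predict which addresses contend for the same line.

```python
# Hedged toy model: cache indexing through a keyed pseudorandom mapping
# instead of fixed address bits; re-drawing the key re-randomizes the mapping.
import hashlib, os

class RandomizedMappedCache:
    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.key = os.urandom(16)        # secret remapping key
        self.lines = [None] * num_lines  # stored tags (toy model, no data)

    def _index(self, addr: int) -> int:
        digest = hashlib.blake2b(addr.to_bytes(8, "little"), key=self.key).digest()
        return int.from_bytes(digest[:4], "little") % self.num_lines

    def access(self, addr: int) -> bool:
        idx = self._index(addr)
        hit = self.lines[idx] == addr
        self.lines[idx] = addr           # fill on miss / refresh on hit
        return hit
```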

Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes adding just enough functionality to dynamically identify instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM, allowing memory requests issued by the new Enhanced Memory Controller to experience a 20% lower latency than if issued by the core.
Abstract: On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.

Proceedings ArticleDOI
05 Jun 2016
TL;DR: The proposed SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes, and improves performance by up to 43% and by an average of 12.5% over static cache partitioning.
Abstract: In today's multicore processors, the last-level cache is often shared by multiple concurrently running processes to make efficient use of hardware resources. However, previous studies have shown that a shared cache is vulnerable to timing channel attacks that leak confidential information from one process to another. Static cache partitioning can eliminate the cache timing channels but incurs significant performance overhead. In this paper, we propose Secure Dynamic Cache Partitioning (SecDCP), a partitioning technique that defeats cache timing channel attacks. The SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes. For cache-sensitive multiprogram workloads, our experimental results show that SecDCP improves performance by up to 43% and by an average of 12.5% over static cache partitioning.

Proceedings ArticleDOI
22 May 2016
TL;DR: CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack.
Abstract: Recognizing the pressing demands to secure embedded applications, ARM TrustZone has been adopted in both academic research and commercial products to protect sensitive code and data in a privileged, isolated execution environment. However, the design of TrustZone cannot prevent physical memory disclosure attacks such as cold boot attack from gaining unrestricted read access to the sensitive contents in the dynamic random access memory (DRAM). A number of system-on-chip (SoC) bound execution solutions have been proposed to thwart the cold boot attack by storing sensitive data only in CPU registers, CPU cache or internal RAM. However, when the operating system, which is responsible for creating and maintaining the SoC-bound execution environment, is compromised, all the sensitive data is leaked. In this paper, we present the design and development of a cache-assisted secure execution framework, called CaSE, on ARM processors to defend against sophisticated attackers who can launch multi-vector attacks including software attacks and hardware memory disclosure attacks. CaSE utilizes TrustZone and the Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack. To protect the sensitive code and data against cold boot attack, applications are encrypted in memory and decrypted only within the processor for execution. The memory separation and the cache separation provided by TrustZone are used to protect the cached applications against a compromised OS. We implement a prototype of CaSE on the i.MX53 running an ARM Cortex-A8 processor. The experimental results show that CaSE incurs small impacts on system performance when executing cryptographic algorithms including AES, RSA, and SHA1.

Journal ArticleDOI
TL;DR: This work focuses on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network and proposes a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing.
Abstract: Content-centric networking (CCN) is a promising framework to rebuild the Internet's forwarding substrate around the concept of content. CCN advocates ubiquitous in-network caching to enhance content delivery, and thus each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We first formulate this problem as a content placement problem and obtain the optimal solution by a two-step method. We then propose a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing. We investigate through simulations the factors that affect the optimal cache allocation, and perhaps more importantly we use a real-life Internet topology and video access logs from a large scale Internet video provider to evaluate the performance of various cache allocation methods. We observe that network topology and content popularity are two important factors that affect where exactly cache capacity should be placed. Further, the heuristic method comes with only a very limited performance penalty compared to the optimal allocation. Finally, using our findings, we provide recommendations for network operators on the best deployment of CCN cache capacity over routers.
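A hedged sketch of a centrality-weighted allocation in the spirit of the heuristic described above; the choice of betweenness centrality, the toy topology, and the budget are assumptions rather than the paper's exact method.

```python
# Hedged sketch: split a total cache budget across routers in proportion
# to their betweenness centrality in the network graph.
import networkx as nx

def allocate_cache(graph: nx.Graph, total_budget: float) -> dict:
    """Return per-node cache capacity proportional to betweenness centrality."""
    centrality = nx.betweenness_centrality(graph)
    total = sum(centrality.values()) or 1.0          # avoid division by zero
    return {node: total_budget * c / total for node, c in centrality.items()}

if __name__ == "__main__":
    g = nx.barabasi_albert_graph(n=50, m=2, seed=1)  # toy router topology
    allocation = allocate_cache(g, total_budget=1000.0)
    print(sorted(allocation.items(), key=lambda kv: -kv[1])[:5])  # top-5 routers
```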

Posted Content
TL;DR: In this article, a cooperative hierarchical caching (CHC) framework is introduced where contents are jointly cached at the Base Band Unit (BBU) and at the Radio Remote Heads (RRHs).
Abstract: Over the last few years, Cloud Radio Access Network (C-RAN) has arisen as a transformative architecture for 5G cellular networks that brings the flexibility and agility of cloud computing to wireless communications. At the same time, content caching in wireless networks has become an essential solution to lower the content-access latency and backhaul traffic loading, which translate into user Quality of Experience (QoE) improvement and network cost reduction. In this article, a novel Cooperative Hierarchical Caching (CHC) framework in C-RAN is introduced where contents are jointly cached at the BaseBand Unit (BBU) and at the Radio Remote Heads (RRHs). Unlike in traditional approaches, the cache at the BBU, cloud cache, presents a new layer in the cache hierarchy, bridging the latency/capacity gap between the traditional edge-based and core-based caching schemes. Trace-driven simulations reveal that CHC yields up to 80% improvement in cache hit ratio, 21% decrease in average content-access latency, and 20% reduction in backhaul traffic load compared to the edge-only caching scheme with the same total cache capacity. Before closing the article, several challenges and promising opportunities for deploying content caching in C-RAN are highlighted towards a content-centric mobile wireless network.

Proceedings ArticleDOI
22 May 2016
TL;DR: Simulation results show that the proposed low-complexity distributed algorithm can greatly reduce the average download delay by collaborative caching and transmissions.
Abstract: Caching popular contents at the edge of wireless networks has recently emerged as a promising technology to improve the quality of service for mobile users, while balancing the peak-to-average transmissions over backhaul links. In contrast to existing works, where a central coordinator is required to design the cache placement strategy, we consider a distributed caching problem which is highly relevant in dense network settings. In the considered scenario, each Base Station (BS) has a cache storage of finite capacity, and each user will be served by one or multiple BSs depending on the employed transmission scheme. A belief propagation based distributed algorithm is proposed to solve the cache placement problem, where the parallel computations are performed by individual BSs based on limited local information and very few messages passed between neighboring BSs. Thus, no central coordinator is required to collect the information of the whole network, which significantly saves signaling overhead. Simulation results show that the proposed low-complexity distributed algorithm can greatly reduce the average download delay by collaborative caching and transmissions.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper proposes CacheOptimizer, a lightweight approach that helps developers optimize the configuration of caching frameworks for web applications implemented using Hibernate; it improves the throughput by 27--138%, and after considering both the memory cost and throughput improvement, CacheOptimizer still brings statistically significant gains.
Abstract: To help improve the performance of database-centric cloud-based web applications, developers usually use caching frameworks to speed up database accesses. Such caching frameworks require extensive knowledge of the application to operate effectively. However, all too often developers have limited knowledge about the intricate details of their own application. Hence, most developers find configuring caching frameworks a challenging and time-consuming task that requires extensive and scattered code changes. Furthermore, developers may also need to frequently change such configurations to accommodate the ever changing workload. In this paper, we propose CacheOptimizer, a lightweight approach that helps developers optimize the configuration of caching frameworks for web applications that are implemented using Hibernate. CacheOptimizer leverages readily-available web logs to create mappings between a workload and database accesses. Given the mappings, CacheOptimizer discovers the optimal cache configuration using coloured Petri nets, and automatically adds the appropriate cache configurations to the application. We evaluate CacheOptimizer on three open-source web applications. We find that i) CacheOptimizer improves the throughput by 27--138%; and ii) after considering both the memory cost and throughput improvement, CacheOptimizer still brings statistically significant gains (with mostly large effect sizes) in comparison to the application's default cache configuration and to blindly enabling all possible caches.

Proceedings ArticleDOI
24 Oct 2016
TL;DR: A novel construction of flush-reload side channels on last-level caches of ARM processors, which, particularly, exploits return-oriented programming techniques to reload instructions is demonstrated.
Abstract: Cache side-channel attacks have been extensively studied on x86 architectures, but much less so on ARM processors. The technical challenges to conduct side-channel attacks on ARM, presumably, stem from the poorly documented ARM cache implementations, such as cache coherence protocols and cache flush operations, and also the lack of understanding of how different cache implementations will affect side-channel attacks. This paper presents a systematic exploration of vectors for flush-reload attacks on ARM processors. flush-reload attacks are among the most well-known cache side-channel attacks on x86. It has been shown in previous work that they are capable of exfiltrating sensitive information with high fidelity. We demonstrate in this work a novel construction of flush-reload side channels on last-level caches of ARM processors, which, particularly, exploits return-oriented programming techniques to reload instructions. We also demonstrate several attacks on Android OS (e.g., detecting hardware events and tracing software execution paths) to highlight the implications of such attacks for Android devices.

Proceedings ArticleDOI
03 Jul 2016
TL;DR: This paper model a wireless cellular network using stochastic geometry and analyzes the performance of two network architectures, namely caching at the mobile device allowing device-to-device (D2D) connectivity and local caches at the radio access network edge (small cells).
Abstract: Proximity-based content caching and distribution in wireless networks has been identified as a promising traffic offloading solution for improving the capacity and the quality of experience (QoE) by exploiting content popularity and spatiotemporal request correlation. In this paper, we address the following question: where should popular content be cached in a wireless network? For that, we model a wireless cellular network using stochastic geometry and analyze the performance of two network architectures, namely caching at the mobile device allowing device-to-device (D2D) connectivity and local caching at the radio access network edge (small cells). We provide analytical and numerical results to compare their performance in terms of the cache hit probability, the density of cache-served requests and average power consumption. Our results reveal that the performance of cache-enabled networks with either D2D caching or small cell caching heavily depends on the user density and the content popularity distribution.

Journal ArticleDOI
TL;DR: This article provides a survey on static cache analysis for real-time systems, presenting the challenges and static analysis techniques for independent programs with respect to different cache features, followed by a survey of existing tools based on static techniques for cache analysis.
Abstract: Real-time systems are reactive computer systems that must produce their reaction to a stimulus within given time bounds. A vital verification requirement is to estimate the Worst-Case Execution Time (WCET) of programs. These estimates are then used to predict the timing behavior of the overall system. The execution time of a program heavily depends on the underlying hardware, among which cache has the biggest influence. Analyzing cache behavior is very challenging due to the versatile cache features and complex execution environment. This article provides a survey on static cache analysis for real-time systems. We first present the challenges and static analysis techniques for independent programs with respect to different cache features. Then, the discussion is extended to cache analysis in complex execution environment, followed by a survey of existing tools based on static techniques for cache analysis. An outlook for future research is provided at last.

Proceedings Article
22 Feb 2016
TL;DR: The paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand.
Abstract: Host-side flash caching has emerged as a promising solution to the scalability problem of virtual machine (VM) storage in cloud computing systems, but it still faces serious limitations in capacity and endurance. This paper presents CloudCache, an on-demand cache management solution to meet VM cache demands and minimize cache wear-out. First, to support on-demand cache allocation, the paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand. By predicting the RWSS online and admitting only RWS into the cache, CloudCache satisfies the workload's actual cache demand and minimizes the induced wear-out. Second, to handle situations where a cache is insufficient for the VMs' demands, the paper proposes a dynamic cache migration approach to balance cache load across hosts by live migrating cached data along with the VMs. It includes both on-demand migration of dirty data and background migration of RWS to optimize the performance of the migrating VM. It also supports rate limiting on the cache data transfer to limit the impact to the co-hosted VMs. Finally, the paper presents comprehensive experimental evaluations using real-world traces to demonstrate the effectiveness of CloudCache.

Journal ArticleDOI
TL;DR: A novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC), filters out seldom accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: For years, the increasing popularity of flash memory has been changing storage systems. Flash-based solid-state drives (SSDs) are widely used as a new cache tier on top of hard disk drives (HDDs) to speed up data-intensive applications. However, the endurance problem of flash memory remains a concern and is getting worse with the adoption of MLC and TLC flash. In this article, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC adopts the idea of selective caching to filter out seldom accessed blocks and prevent them from entering the cache. This avoids cache pollution and preserves popular blocks in cache for a longer period of time, leading to a higher hit rate. Meanwhile, by avoiding unnecessary cache replacements, LARC reduces the volume of data written to the SSD and yields an SSD-friendly access pattern. In this way, LARC improves the performance and endurance of the SSD at the same time. LARC is self-tuning and incurs little overhead. It has been extensively evaluated by both trace-driven simulations and synthetic benchmarks on a prototype implementation. Our experiments show that LARC outperforms state-of-the-art algorithms for different kinds of workloads and extends SSD lifetime by up to 15.7 times.
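The selective-caching idea can be illustrated with a short sketch (simplified; the real LARC adapts its candidate-queue size online): a block is admitted to the flash cache only on its second recent miss, so one-shot blocks never pollute the cache or cost an SSD write.

```python
# Hedged sketch of selective admission: a ghost LRU queue holds only addresses
# of recently missed blocks; a block enters the real cache on its second miss.
from collections import OrderedDict

class SelectiveAdmissionCache:
    def __init__(self, capacity: int, ghost_capacity: int):
        self.cache = OrderedDict()       # block -> data, maintained in LRU order
        self.ghost = OrderedDict()       # block -> None, addresses only (no data)
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity

    def access(self, block, data=None) -> bool:
        if block in self.cache:
            self.cache.move_to_end(block)            # LRU hit
            return True
        if block in self.ghost:                      # second miss: admit to cache
            del self.ghost[block]
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)       # evict the LRU block
            self.cache[block] = data
        else:                                        # first miss: remember address only
            self.ghost[block] = None
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return False
```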

Journal ArticleDOI
TL;DR: This paper uses a cover-set approach to solve the rule dependency problem and cache important rules to TCAM and proposes a rule cache replacement algorithm considering the temporal and spatial traffic localities.
Abstract: In software-defined networking, flow tables of OpenFlow switches are implemented by ternary content addressable memory (TCAM). Although TCAM can process input packets in high speed, it is a scarce and expensive resource providing only a few thousands of rule entries on a network switch. Rules caching is a technique to solve the TCAM capacity problem. However, the rule dependency problem is a challenging issue for wildcard rules caching where packets can mismatch rules. In this paper, we use a cover-set approach to solve the rule dependency problem and cache important rules to TCAM. We also propose a rule cache replacement algorithm considering the temporal and spatial traffic localities. Simulation results show that our algorithms have better cache hit ratio than previous works.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: A new probabilistic cache model designed for high-performance replacement policies is presented that uses absolute reuse distances instead of stack distances and models replacement policies as abstract ranking functions, which lets it model arbitrary age-based replacement policies.
Abstract: Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
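For concreteness, the quantity the model is built on can be computed from an access trace as follows; this is a generic illustration of absolute reuse distance (the number of accesses between consecutive references to the same line, rather than the number of distinct lines, which would be the LRU stack distance), not the paper's model code.

```python
# Hedged sketch: absolute reuse distance histogram from an access trace.
from collections import Counter

def reuse_distance_histogram(trace):
    last_seen = {}           # address -> index of its previous access
    hist = Counter()
    for i, addr in enumerate(trace):
        if addr in last_seen:
            hist[i - last_seen[addr]] += 1
        last_seen[addr] = i
    return hist

print(reuse_distance_histogram(["A", "B", "A", "C", "B", "A"]))
# Counter({3: 2, 2: 1})  -> A reused at distances 2 and 3, B at distance 3
```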

Posted Content
TL;DR: In this paper, a coded delivery strategy is proposed to profit from the multicasting opportunities that arise because a file may be demanded by multiple users, and the proposed delivery method is proved to be optimal under the constraint of uncoded placement for centralized systems with two files, moreover it outperforms known caching strategies for both centralized and decentralized systems.
Abstract: Caching appears to be an efficient way to reduce peak hour network traffic congestion by storing some content at the user's cache without knowledge of later demands. Recently, Maddah-Ali and Niesen proposed a two-phase, placement and delivery phase, coded caching strategy for centralized systems (where coordination among users is possible in the placement phase), and for decentralized systems. This paper investigates the same setup under the further assumption that the number of users is larger than the number of files. By using the same uncoded placement strategy of Maddah-Ali and Niesen, a novel coded delivery strategy is proposed to profit from the multicasting opportunities that arise because a file may be demanded by multiple users. The proposed delivery method is proved to be optimal under the constraint of uncoded placement for centralized systems with two files, moreover it is shown to outperform known caching strategies for both centralized and decentralized systems.

Proceedings Article
22 Feb 2016
TL;DR: The paper presents a rigorous analysis of the algorithms to prove that they do not waste valuable cache space and that they are efficient in time and space usage, and shows that they can be extended to support the combination of compression and deduplication for flash caching and improve its performance and endurance.
Abstract: Flash caching has emerged as a promising solution to the scalability problems of storage systems by using fast flash memory devices as the cache for slower primary storage. But its adoption faces serious obstacles due to the limited capacity and endurance of flash devices. This paper presents CacheDedup, a solution that addresses these limitations using in-line deduplication. First, it proposes a novel architecture that integrates the caching of data and deduplication metadata (source addresses and fingerprints of the data) and efficiently manages these two components. Second, it proposes duplication-aware cache replacement algorithms (D-LRU, DARC) to optimize both cache performance and endurance. The paper presents a rigorous analysis of the algorithms to prove that they do not waste valuable cache space and that they are efficient in time and space usage. The paper also includes an experimental evaluation using real-world traces, which confirms that CacheDedup substantially improves I/O performance (up to 20% reduction in miss ratio and 51% in latency) and flash endurance (up to 89% reduction in writes sent to the cache device) compared to traditional cache management. It also shows that the proposed architecture and algorithms can be extended to support the combination of compression and deduplication for flash caching and improve its performance and endurance.
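A minimal sketch of the data path behind deduplicated caching (much simpler than CacheDedup's D-LRU/D-ARC; there is no eviction policy or metadata persistence here): data is stored once per fingerprint, and many source addresses can share the same cached copy.

```python
# Hedged sketch: fingerprint-indexed cache with a source-address index,
# so duplicate content is stored (and written to flash) only once.
import hashlib

class DedupCache:
    def __init__(self):
        self.addr_to_fp = {}     # source address -> fingerprint
        self.fp_to_data = {}     # fingerprint -> single cached copy
        self.fp_refs = {}        # fingerprint -> number of addresses using it

    def put(self, addr: int, data: bytes) -> None:
        fp = hashlib.sha1(data).hexdigest()
        old = self.addr_to_fp.get(addr)
        if old == fp:
            return                               # same content already mapped
        if old is not None:
            self._release(old)
        self.addr_to_fp[addr] = fp
        if fp not in self.fp_to_data:
            self.fp_to_data[fp] = data           # only unique content is stored
            self.fp_refs[fp] = 0
        self.fp_refs[fp] += 1

    def get(self, addr: int):
        fp = self.addr_to_fp.get(addr)
        return self.fp_to_data.get(fp) if fp else None

    def _release(self, fp: str) -> None:
        self.fp_refs[fp] -= 1
        if self.fp_refs[fp] == 0:                # drop data no address refers to
            del self.fp_to_data[fp], self.fp_refs[fp]
```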

Proceedings Article
22 Jun 2016
TL;DR: Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8× compared with state-of-the-art static and dynamic algorithms.
Abstract: Cloud providers must dynamically allocate their physical resources to the right client to maximize the benefit that they can get out of given hardware. Cache Allocation Technology (CAT) makes it possible for the provider to allocate last level cache to virtual machines to prevent cache pollution. The provider can also allocate the cache to optimize client benefit. But how should it optimize client benefit, when it does not even know what the client plans to do? We present an auction-based mechanism that dynamically allocates cache while optimizing client benefit and improving hardware utilization. We evaluate our mechanism on benchmarks from the Phoronix Test Suite. Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8× compared with state-of-the-art static and dynamic algorithms.

Journal ArticleDOI
TL;DR: A write intensity predictor is designed that realizes the idea by exploiting a correlation between the write intensity of blocks and the memory access instructions that incur those blocks' cache misses, along with a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region.
Abstract: Spin-transfer torque RAM (STT-RAM) has emerged as an energy-efficient and high-density alternative to SRAM for large on-chip caches. However, its high write energy has been considered as a serious drawback. Hybrid caches mitigate this problem by incorporating a small SRAM cache for write-intensive data along with an STT-RAM cache. In such architectures, choosing cache blocks to be placed into the SRAM cache is the key to their energy efficiency. This paper proposes a new hybrid cache architecture called prediction hybrid cache. The key idea is to predict the write intensity of cache blocks at the time of cache misses and determine block placement based on the prediction. We design a write intensity predictor that realizes the idea by exploiting a correlation between the write intensity of blocks and the memory access instructions that incur cache misses of those blocks. It includes a mechanism to dynamically adapt the predictor to application characteristics. We also design a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region. Evaluations show that our scheme reduces energy consumption of hybrid caches by 28 percent (31 percent) on average compared to the existing hybrid cache architecture in a single-core (quad-core) system.
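A hedged sketch of a PC-indexed write-intensity predictor with saturating counters, in the spirit of (but not identical to) the mechanism described above; the table size and threshold are assumptions, not the paper's parameters.

```python
# Hedged sketch: predict, at miss time, whether the missing block will be
# write-intensive (place in SRAM) or not (place in STT-RAM), indexed by the
# program counter of the instruction that caused the miss.
class WriteIntensityPredictor:
    def __init__(self, entries: int = 1024, max_count: int = 7, threshold: int = 4):
        self.table = [0] * entries       # saturating counters
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold

    def _idx(self, pc: int) -> int:
        return pc % self.entries

    def predict_write_intensive(self, miss_pc: int) -> bool:
        """Called on a cache miss: True means place the block in the SRAM region."""
        return self.table[self._idx(miss_pc)] >= self.threshold

    def train(self, miss_pc: int, block_was_write_intensive: bool) -> None:
        """Called at eviction, once the block's observed write behaviour is known."""
        i = self._idx(miss_pc)
        if block_was_write_intensive:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
```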