
Showing papers on "Cache invalidation published in 2016"


Book ChapterDOI
07 Jul 2016
TL;DR: The Flush+Flush attack, as presented in this paper, relies only on the execution time of the flush instruction, which depends on whether data is cached or not; because it makes no memory accesses, it causes no cache misses and evades detection mechanisms based on performance counters.
Abstract: Research on cache attacks has shown that CPU caches leak significant information. Proposed detection mechanisms assume that all cache attacks cause more cache hits and cache misses than benign applications and use hardware performance counters for detection. In this article, we show that this assumption does not hold by developing a novel attack technique: the Flush+Flush attack. The Flush+Flush attack only relies on the execution time of the flush instruction, which depends on whether data is cached or not. Flush+Flush does not make any memory accesses, contrary to any other cache attack. Thus, it causes no cache misses at all and the number of cache hits is reduced to a minimum due to the constant cache flushes. Therefore, Flush+Flush attacks are stealthy, i.e., the spy process cannot be detected based on cache hits and misses, or state-of-the-art detection mechanisms. The Flush+Flush attack runs at a higher frequency and thus is faster than any existing cache attack. With 496 KB/s in a cross-core covert channel it is 6.7 times faster than any previously published cache covert channel.

416 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors considered a cache-aided wireless network with a library of files and showed that the sum degrees-of-freedom (sum-DoF) of the network is within a factor of 2 of the optimum under one-shot linear schemes.
Abstract: We consider a system comprising a library of $N$ files (e.g., movies) and a wireless network with $K_T$ transmitters, each equipped with a local cache of size of $M_T$ files, and $K_R$ receivers, each equipped with a local cache of size of $M_R$ files. Each receiver will ask for one of the $N$ files in the library, which needs to be delivered. The objective is to design the cache placement (without prior knowledge of receivers' future requests) and the communication scheme to maximize the throughput of the delivery. In this setting, we show that the sum degrees-of-freedom (sum-DoF) of $\min\left\{\frac{K_T M_T+K_R M_R}{N},K_R\right\}$ is achievable, and this is within a factor of 2 of the optimum, under one-shot linear schemes. This result shows that (i) the one-shot sum-DoF scales linearly with the aggregate cache size in the network (i.e., the cumulative memory available at all nodes), (ii) the transmitters' and receivers' caches contribute equally in the one-shot sum-DoF, and (iii) caching can offer a throughput gain that scales linearly with the size of the network. To prove the result, we propose an achievable scheme that exploits the redundancy of the content at transmitters' caches to cooperatively zero-force some outgoing interference and availability of the unintended content at receivers' caches to cancel (subtract) some of the incoming interference. We develop a particular pattern for cache placement that maximizes the overall gains of cache-aided transmit and receive interference cancellations. For the converse, we present an integer optimization problem which minimizes the number of communication blocks needed to deliver any set of requested files to the receivers. We then provide a lower bound on the value of this optimization problem, hence leading to an upper bound on the linear one-shot sum-DoF of the network, which is within a factor of 2 of the achievable sum-DoF.

190 citations
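
As a quick sanity check on the achievable sum-DoF expression quoted above, one can plug in illustrative numbers (the values below are hypothetical, not taken from the paper):

```latex
% Illustrative evaluation of the achievable one-shot linear sum-DoF
% (parameter values chosen for illustration only)
\[
N = 100,\quad K_T = 4,\; M_T = 25,\quad K_R = 8,\; M_R = 25
\;\Longrightarrow\;
\min\!\left\{\frac{K_T M_T + K_R M_R}{N},\, K_R\right\}
= \min\!\left\{\frac{4\cdot 25 + 8\cdot 25}{100},\, 8\right\} = 3 .
\]
```

So three interference-free streams per channel use are achievable with one-shot linear schemes in this example, and by the paper's converse the linear one-shot optimum is at most twice that.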


Proceedings ArticleDOI
11 Sep 2016
TL;DR: In this article, when the cache contents and the user demands are fixed, the authors connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the condition that the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the users' caches).
Abstract: Caching is an effective way to reduce peak-hour network traffic congestion by storing some contents at the user's local cache. Maddah-Ali and Niesen (MAN) initiated a fundamental study of caching systems by proposing a scheme (with uncoded cache placement and linear network coding delivery) that is provably optimal to within a factor of 4.7. In this paper, when the cache contents and the user demands are fixed, we connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that (i) the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the users' caches), and (ii) the number of users is no more than the number of files. As a consequence, further improvements to the MAN scheme are only possible through the use of coded cache placement.

188 citations
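
For context, the MAN scheme referenced above combines uncoded placement with coded multicast delivery; its delivery rate at the integer memory points is commonly written as follows (recalled from the original Maddah-Ali–Niesen result, not stated in the abstract above):

```latex
% MAN delivery rate for N files, K users, per-user cache of M files,
% at the points M = tN/K, t = 0, 1, ..., K (memory sharing in between)
\[
R_{\mathrm{MAN}}(M) \;=\; K\left(1-\frac{M}{N}\right)\cdot\frac{1}{1+\tfrac{KM}{N}} .
\]
```

The result summarized above implies that, with uncoded placement and no more users than files, this rate cannot be improved upon.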


Proceedings ArticleDOI
10 Apr 2016
TL;DR: It is proved that the learning regret of PopCaching is sublinear in the number of content requests, so it converges fast and asymptotically achieves the optimal cache hit rate.
Abstract: This paper presents a novel cache replacement method — Popularity-Driven Content Caching (PopCaching). PopCaching learns the popularity of content and uses it to determine which content it should store and which it should evict from the cache. Popularity is learned in an online fashion, requires no training phase and hence, it is more responsive to continuously changing trends of content popularity. We prove that the learning regret of PopCaching (i.e., the gap between the hit rate achieved by PopCaching and that by the optimal caching policy with hindsight) is sublinear in the number of content requests. Therefore, PopCaching converges fast and asymptotically achieves the optimal cache hit rate. We further demonstrate the effectiveness of PopCaching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. In addition, PopCaching has low complexity.

146 citations
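
The sketch below illustrates the general idea of popularity-driven admission and eviction with an online, decayed popularity estimate; it is a hypothetical simplification, not the authors' PopCaching algorithm, which learns popularity from richer request-context features with regret guarantees:

```python
# Hypothetical popularity-driven cache sketch: popularity is estimated online with an
# exponentially decayed request counter; the least popular cached item is the eviction
# candidate, and a new item is admitted only if it is at least as popular.
import time

class PopularityCache:
    def __init__(self, capacity, half_life=3600.0):
        self.capacity = capacity
        self.half_life = half_life      # seconds for a popularity score to halve
        self.store = {}                 # content_id -> cached object
        self.score = {}                 # content_id -> (decayed count, last update time)

    def _popularity(self, content_id, now):
        count, last = self.score.get(content_id, (0.0, now))
        return count * 0.5 ** ((now - last) / self.half_life)

    def request(self, content_id, fetch_fn):
        now = time.time()
        # Online update of the popularity estimate (no training phase).
        self.score[content_id] = (self._popularity(content_id, now) + 1.0, now)
        if content_id in self.store:
            return self.store[content_id]                      # cache hit
        obj = fetch_fn(content_id)                             # cache miss
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=lambda c: self._popularity(c, now))
            if self._popularity(content_id, now) < self._popularity(victim, now):
                return obj                                     # do not admit unpopular content
            del self.store[victim]                             # evict least popular content
        self.store[content_id] = obj
        return obj
```

A service would call `cache.request(content_id, fetch_fn)` on every request; the half-life controls how quickly the estimate tracks changing popularity trends.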


Proceedings ArticleDOI
10 Apr 2016
TL;DR: This paper proposes utility-driven caching, which associates with each content a utility that is a function of the corresponding content hit probability, and develops online algorithms that can be used by service providers to implement various caching policies based on arbitrary utility functions.
Abstract: In any caching system, the admission and eviction policies determine which contents are added and removed from a cache when a miss occurs. Usually, these policies are devised so as to mitigate staleness and increase the hit probability. Nonetheless, the utility of having a high hit probability can vary across contents. This occurs, for instance, when service level agreements must be met, or if certain contents are more difficult to obtain than others. In this paper, we propose utility-driven caching, where we associate with each content a utility, which is a function of the corresponding content hit probability. We formulate optimization problems where the objectives are to maximize the sum of utilities over all contents. These problems differ according to the stringency of the cache capacity constraint. Our framework enables us to reverse engineer classical replacement policies such as LRU and FIFO, by computing the utility functions that they maximize. We also develop online algorithms that can be used by service providers to implement various caching policies based on arbitrary utility functions.

142 citations
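
A hedged sketch of the optimization described above, written here for unit-size contents with hit probabilities $h_i$ (the paper formulates several variants that differ in how strictly the capacity constraint is enforced, e.g., in expectation versus as a hard constraint):

```latex
% Utility-driven caching: choose per-content hit probabilities to maximize total
% utility subject to a cache of size B (unit-size contents assumed for illustration)
\[
\max_{h_1,\dots,h_n}\;\sum_{i=1}^{n} U_i(h_i)
\qquad \text{s.t.} \qquad
\sum_{i=1}^{n} h_i \le B, \qquad 0 \le h_i \le 1 .
\]
```

Reverse engineering a policy then amounts to finding utility functions $U_i$ whose maximizer reproduces the hit probabilities that the policy (e.g., LRU or FIFO) induces.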


Journal ArticleDOI
TL;DR: Both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios are discussed.
Abstract: Caching is an essential technique to improve throughput and latency in a vast variety of applications. The core idea is to duplicate content in memories distributed across the network, which can then be exploited to deliver requested content with less congestion and delay. The traditional role of cache memories is to deliver the maximal amount of requested content locally rather than from a remote server. While this approach is optimal for single-cache systems, it has recently been shown to be significantly suboptimal for systems with multiple caches (i.e., cache networks). Instead, cache memories should be used to enable a coded multicasting gain. In this article, we survey these recent developments. We discuss both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios.

107 citations


Proceedings Article
16 Mar 2016
TL;DR: Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads, and introduces a novel technique for dealing with performance cliffs incrementally and locally.
Abstract: Web-scale applications are heavily reliant on memory cache systems such as Memcached to improve throughput and reduce user latency. Small performance improvements in these systems can result in large end-to-end gains. For example, a marginal increase in hit rate of 1% can reduce the application layer latency by over 35%. However, existing web cache resource allocation policies are workload oblivious and first-come-first-serve. By analyzing measurements from a widely used caching service, Memcachier, we demonstrate that existing cache allocation techniques leave significant room for improvement. We develop Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads. It has been shown that cache allocation algorithms underperform when there are performance cliffs, in which minor changes in cache allocation cause large changes in the hit rate. We design a novel technique for dealing with performance cliffs incrementally and locally. We demonstrate that for the Memcachier applications, on average, Cliffhanger increases the overall hit rate by 1.2%, reduces the total number of cache misses by 36.7% and achieves the same hit rate with 45% less memory capacity.

99 citations
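
The sketch below shows only the incremental reallocation idea in isolation, assuming per-application estimates of the marginal hit-rate gain from extra cache (Cliffhanger obtains such estimates with shadow queues and additionally handles performance cliffs, which this sketch omits):

```python
# Hypothetical incremental cache reallocation step: repeatedly move a small slice of
# memory from the application that benefits least from cache to the one that benefits
# most, based on externally supplied marginal-gain estimates.
def rebalance(allocations, marginal_gain, step=64 * 1024):
    """allocations: dict app -> cache bytes; marginal_gain: function app -> estimated
    hit-rate gain per extra byte of cache (e.g., derived from shadow-queue hits)."""
    winner = max(allocations, key=marginal_gain)    # gains most from more cache
    loser = min(allocations, key=marginal_gain)     # loses least from less cache
    if winner != loser and allocations[loser] >= step:
        allocations[loser] -= step
        allocations[winner] += step
    return allocations
```

Calling `rebalance` periodically lets the allocation track dynamically changing workloads without a global optimization pass.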


Journal ArticleDOI
TL;DR: It is proved that the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate.
Abstract: This paper presents Trend-Caching, a novel cache replacement method that optimizes cache performance according to the trends of video content. Trend-Caching explicitly learns the popularity trend of video content and uses it to determine which video it should store and which it should evict from the cache. Popularity is learned in an online fashion and requires no training phase, hence it is more responsive to continuously changing trends of videos. We prove that the learning regret of Trend-Caching (i.e., the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight) is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate. We further validate the effectiveness of Trend-Caching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. Furthermore, Trend-Caching has low complexity.

97 citations


Journal ArticleDOI
TL;DR: An improved design of Newcache is presented, in terms of security, circuit design and simplicity, and it is shown that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
Abstract: Newcache is a secure cache that can thwart cache side-channel attacks to prevent the leakage of secret information. All caches today are susceptible to cache side-channel attacks, despite software isolation of memory pages in virtual address spaces or virtual machines. These cache attacks can leak secret encryption keys or private identity keys, nullifying any protection provided by strong cryptography. Newcache uses a novel dynamic, randomized memory-to-cache mapping to thwart contention-based side-channel attacks, rather than the static mapping used by conventional set-associative caches. In this article, the authors present an improved design of Newcache, in terms of security, circuit design, and simplicity. They show Newcache's security against a suite of cache side-channel attacks. They evaluate Newcache's system performance for cloud computing, smartphone, and SPEC benchmarks and find that Newcache performs as well as conventional set-associative caches, and sometimes better. They also designed a VLSI test chip with a 32-Kbyte Newcache and a 32-Kbyte, eight-way, set-associative cache and verified that the access latency, power, and area of the two caches are comparable. These results show that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.

96 citations
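
The toy model below illustrates only the concept of a dynamic, randomized memory-to-cache mapping with random replacement; it is not the published Newcache design, which implements the mapping in hardware with line-number registers and additional security and performance logic:

```python
# Toy software model of a dynamically randomized memory-to-cache mapping: a memory
# line can be placed in any cache slot, and the victim slot is chosen at random, so
# the mapping exposes no address-dependent contention pattern to an attacker.
import random

class RandomMappedCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.slot_of = {}    # memory line address -> physical cache slot
        self.line_in = {}    # physical cache slot -> memory line address

    def access(self, line_addr):
        if line_addr in self.slot_of:
            return True                              # hit
        slot = random.randrange(self.num_lines)      # random victim slot on a miss
        old = self.line_in.get(slot)
        if old is not None:
            del self.slot_of[old]                    # remap: evict the old occupant
        self.line_in[slot] = line_addr
        self.slot_of[line_addr] = slot
        return False                                 # miss
```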


Proceedings ArticleDOI
05 Jun 2016
TL;DR: The proposed SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes, and improves performance by up to 43% and by an average of 12.5% over static cache partitioning.
Abstract: In today's multicore processors, the last-level cache is often shared by multiple concurrently running processes to make efficient use of hardware resources. However, previous studies have shown that a shared cache is vulnerable to timing channel attacks that leak confidential information from one process to another. Static cache partitioning can eliminate the cache timing channels but incurs significant performance overhead. In this paper, we propose Secure Dynamic Cache Partitioning (SecDCP), a partitioning technique that defeats cache timing channel attacks. The SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes. For cache-sensitive multiprogram workloads, our experimental results show that SecDCP improves performance by up to 43% and by an average of 12.5% over static cache partitioning.

92 citations


Proceedings ArticleDOI
22 May 2016
TL;DR: CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack.
Abstract: Recognizing the pressing demands to secure embedded applications, ARM TrustZone has been adopted in both academic research and commercial products to protect sensitive code and data in a privileged, isolated execution environment. However, the design of TrustZone cannot prevent physical memory disclosure attacks such as cold boot attack from gaining unrestricted read access to the sensitive contents in the dynamic random access memory (DRAM). A number of system-on-chip (SoC) bound execution solutions have been proposed to thaw the cold boot attack by storing sensitive data only in CPU registers, CPU cache or internal RAM. However, when the operating system, which is responsible for creating and maintaining the SoC-bound execution environment, is compromised, all the sensitive data is leaked. In this paper, we present the design and development of a cache-assisted secure execution framework, called CaSE, on ARM processors to defend against sophisticated attackers who can launch multi-vector attacks including software attacks and hardware memory disclosure attacks. CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack. To protect the sensitive code and data against cold boot attack, applications are encrypted in memory and decrypted only within the processor for execution. The memory separation and the cache separation provided by TrustZone are used to protect the cached applications against compromised OS. We implement a prototype of CaSE on the i.MX53 running ARM Cortex-A8 processor. The experimental results show that CaSE incurs small impacts on system performance when executing cryptographic algorithms including AES, RSA, and SHA1.

Journal ArticleDOI
TL;DR: This work focuses on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network, and proposes a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing.
Abstract: Content-centric networking (CCN) is a promising framework to rebuild the Internet's forwarding substrate around the concept of content. CCN advocates ubiquitous in-network caching to enhance content delivery, and thus each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We first formulate this problem as a content placement problem and obtain the optimal solution by a two-step method. We then propose a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing. We investigate through simulations the factors that affect the optimal cache allocation, and perhaps more importantly we use a real-life Internet topology and video access logs from a large scale Internet video provider to evaluate the performance of various cache allocation methods. We observe that network topology and content popularity are two important factors that affect where exactly cache capacity should be placed. Further, the heuristic method comes with only a very limited performance penalty compared to the optimal allocation. Finally, using our findings, we provide recommendations for network operators on the best deployment of CCN cache capacity over routers.
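
A minimal sketch of the centrality-based heuristic is shown below, assuming the budget is split in proportion to betweenness centrality (the paper's exact centrality metric and normalization may differ); networkx is used purely for illustration:

```python
# Hypothetical centrality-based cache allocation: give each router a share of the
# total cache budget proportional to its betweenness centrality in the topology.
import networkx as nx

def allocate_cache(graph, total_budget):
    centrality = nx.betweenness_centrality(graph)
    total = sum(centrality.values()) or 1.0          # avoid division by zero
    return {node: total_budget * c / total for node, c in centrality.items()}

# Example: a 5-router line topology; the central routers receive most of the budget.
G = nx.path_graph(5)
print(allocate_cache(G, total_budget=1000))
```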

Proceedings ArticleDOI
24 Oct 2016
TL;DR: A novel construction of flush-reload side channels on last-level caches of ARM processors, which, particularly, exploits return-oriented programming techniques to reload instructions is demonstrated.
Abstract: Cache side-channel attacks have been extensively studied on x86 architectures, but much less so on ARM processors. The technical challenges to conduct side-channel attacks on ARM, presumably, stem from the poorly documented ARM cache implementations, such as cache coherence protocols and cache flush operations, and also the lack of understanding of how different cache implementations will affect side-channel attacks. This paper presents a systematic exploration of vectors for flush-reload attacks on ARM processors. flush-reload attacks are among the most well-known cache side-channel attacks on x86. It has been shown in previous work that they are capable of exfiltrating sensitive information with high fidelity. We demonstrate in this work a novel construction of flush-reload side channels on last-level caches of ARM processors, which, particularly, exploits return-oriented programming techniques to reload instructions. We also demonstrate several attacks on Android OS (e.g., detecting hardware events and tracing software execution paths) to highlight the implications of such attacks for Android devices.

Journal ArticleDOI
TL;DR: This article provides a survey on static cache analysis for real-time systems, presenting the challenges and static analysis techniques for independent programs with respect to different cache features, followed by a survey of existing tools based on static techniques for cache analysis.
Abstract: Real-time systems are reactive computer systems that must produce their reaction to a stimulus within given time bounds. A vital verification requirement is to estimate the Worst-Case Execution Time (WCET) of programs. These estimates are then used to predict the timing behavior of the overall system. The execution time of a program heavily depends on the underlying hardware, among which the cache has the biggest influence. Analyzing cache behavior is very challenging due to the versatile cache features and complex execution environment. This article provides a survey on static cache analysis for real-time systems. We first present the challenges and static analysis techniques for independent programs with respect to different cache features. Then, the discussion is extended to cache analysis in complex execution environments, followed by a survey of existing tools based on static techniques for cache analysis. An outlook on future research is provided at the end.

Proceedings Article
22 Feb 2016
TL;DR: The paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand.
Abstract: Host-side flash caching has emerged as a promising solution to the scalability problem of virtual machine (VM) storage in cloud computing systems, but it still faces serious limitations in capacity and endurance. This paper presents CloudCache, an on-demand cache management solution to meet VM cache demands and minimize cache wear-out. First, to support on-demand cache allocation, the paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand. By predicting the RWSS online and admitting only RWS into the cache, CloudCache satisfies the workload's actual cache demand and minimizes the induced wear-out. Second, to handle situations where a cache is insufficient for the VMs' demands, the paper proposes a dynamic cache migration approach to balance cache load across hosts by live migrating cached data along with the VMs. It includes both on-demand migration of dirty data and background migration of RWS to optimize the performance of the migrating VM. It also supports rate limiting on the cache data transfer to limit the impact to the co-hosted VMs. Finally, the paper presents comprehensive experimental evaluations using real-world traces to demonstrate the effectiveness of CloudCache.
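
The sketch below estimates a Reuse Working Set Size (RWSS) offline from a block trace by counting blocks reused within a time window; it is a simplification of the paper's RWS model, which predicts the RWSS online to drive cache allocation:

```python
# Hypothetical RWSS estimate: blocks accessed at least `min_reuses` times within a
# sliding time window form the reuse working set; the reported size is the peak.
from collections import Counter

def rwss(trace, window, min_reuses=2):
    """trace: list of (timestamp, block_id) in time order; window: window length."""
    sizes, start = [], 0
    for end in range(len(trace)):
        while trace[end][0] - trace[start][0] > window:
            start += 1                                    # slide the window forward
        counts = Counter(block for _, block in trace[start:end + 1])
        sizes.append(sum(1 for c in counts.values() if c >= min_reuses))
    return max(sizes) if sizes else 0
```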

Patent
14 Nov 2016
TL;DR: In this paper, the cache servers are selected for providing the requested content item to the first client, and instructions are transmitted to the client to determine whether the selected one or more cache servers have a local copy of the desired content item.
Abstract: Systems, methods, and computer-readable media provide content items to clients. In one implementation, a system stores data identifying a plurality of cache servers, the cache servers storing the content items for download by a plurality of clients. The system receives a request from a first one of the clients to download one of the content items. The system selects one or more of the cache servers for providing the requested content item to the first client. The system transmits identifiers of the selected one or more cache servers to the first client, and transmits instructions to the first client. The instructions are operable, when executed by the first client, to determine whether the selected one or more cache servers have a local copy of the requested content item. When the first client determines that a first one of the selected one or more cache servers has a local copy of the requested content item, the first client downloads the requested content item from the first selected cache server.

Journal ArticleDOI
TL;DR: A novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC), filters out seldom accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: For years, the increasing popularity of flash memory has been changing storage systems. Flash-based solid-state drives (SSDs) are widely used as a new cache tier on top of hard disk drives (HDDs) to speed up data-intensive applications. However, the endurance problem of flash memory remains a concern and is getting worse with the adoption of MLC and TLC flash. In this article, we propose a novel cache management algorithm for flash-based disk cache named Lazy Adaptive Replacement Cache (LARC). LARC adopts the idea of selective caching to filter out seldom accessed blocks and prevent them from entering cache. This avoids cache pollution and preserves popular blocks in cache for a longer period of time, leading to a higher hit rate. Meanwhile, by avoiding unnecessary cache replacements, LARC reduces the volume of data written to the SSD and yields an SSD-friendly access pattern. In this way, LARC improves the performance and endurance of the SSD at the same time. LARC is self-tuning and incurs little overhead. It has been extensively evaluated by both trace-driven simulations and synthetic benchmarks on a prototype implementation. Our experiments show that LARC outperforms state-of-the-art algorithms for different kinds of workloads and extends SSD lifetime by up to 15.7 times.
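
A minimal sketch of the selective-caching idea follows: a block is admitted to the flash cache only on its second recent access, so one-time accesses never cause SSD writes. The real LARC additionally adapts the size of the candidate (ghost) queue, which is omitted here:

```python
# Hypothetical LARC-style selective caching: a first access records only the block ID
# in a ghost LRU queue; a second access while the ID is still in the queue admits the
# block into the flash cache, which is managed with plain LRU.
from collections import OrderedDict

class LazyCache:
    def __init__(self, capacity, ghost_capacity):
        self.cache = OrderedDict()      # block_id -> data (LRU order)
        self.ghost = OrderedDict()      # block_id -> None (IDs only, no data)
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity

    def read(self, block_id, fetch_fn):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)       # hit: refresh LRU position
            return self.cache[block_id]
        data = fetch_fn(block_id)                  # miss: read from the HDD
        if block_id in self.ghost:
            del self.ghost[block_id]               # second access: admit to the cache
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)     # evict the LRU block
            self.cache[block_id] = data            # the only path that writes the SSD
        else:
            self.ghost[block_id] = None            # first access: remember the ID only
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return data
```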

Proceedings ArticleDOI
12 Mar 2016
TL;DR: A new probabilistic cache model designed for high-performance replacement policies is presented that uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions, which let us model arbitrary age-based replacement policies.
Abstract: Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.

Proceedings Article
22 Feb 2016
TL;DR: The paper presents a rigorous analysis of the algorithms to prove that they do not waste valuable cache space and that they are efficient in time and space usage, and shows that they can be extended to support the combination of compression and deduplication for flash caching and improve its performance and endurance.
Abstract: Flash caching has emerged as a promising solution to the scalability problems of storage systems by using fast flash memory devices as the cache for slower primary storage. But its adoption faces serious obstacles due to the limited capacity and endurance of flash devices. This paper presents CacheDedup, a solution that addresses these limitations using in-line deduplication. First, it proposes a novel architecture that integrates the caching of data and deduplication metadata (source addresses and fingerprints of the data) and efficiently manages these two components. Second, it proposes duplication-aware cache replacement algorithms (D-LRU, DARC) to optimize both cache performance and endurance. The paper presents a rigorous analysis of the algorithms to prove that they do not waste valuable cache space and that they are efficient in time and space usage. The paper also includes an experimental evaluation using real-world traces, which confirms that CacheDedup substantially improves I/O performance (up to 20% reduction in miss ratio and 51% in latency) and flash endurance (up to 89% reduction in writes sent to the cache device) compared to traditional cache management. It also shows that the proposed architecture and algorithms can be extended to support the combination of compression and deduplication for flash caching and improve its performance and endurance.
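
The sketch below shows the core of deduplication-aware insertion in the spirit of D-LRU, assuming a source-address metadata cache and a fingerprint-keyed data cache, each managed with LRU; dirty-data handling, the D-ARC variant, and the paper's space analysis are omitted:

```python
# Hypothetical dedup-aware cache: addresses map to content fingerprints, and data is
# stored once per fingerprint, so duplicate blocks consume a single cache entry and a
# single flash write. A fingerprint evicted from the data cache simply yields a miss
# even if stale metadata still references it.
import hashlib
from collections import OrderedDict

class DedupCache:
    def __init__(self, meta_capacity, data_capacity):
        self.meta = OrderedDict()       # source address -> fingerprint (LRU)
        self.data = OrderedDict()       # fingerprint -> block contents (LRU)
        self.meta_capacity = meta_capacity
        self.data_capacity = data_capacity

    def put(self, addr, block):
        fp = hashlib.sha1(block).hexdigest()        # content fingerprint
        self.meta[addr] = fp
        self.meta.move_to_end(addr)
        if len(self.meta) > self.meta_capacity:
            self.meta.popitem(last=False)
        if fp in self.data:
            self.data.move_to_end(fp)               # duplicate content: no new write
        else:
            if len(self.data) >= self.data_capacity:
                self.data.popitem(last=False)
            self.data[fp] = block                   # unique content: one flash write

    def get(self, addr):
        fp = self.meta.get(addr)
        return self.data.get(fp) if fp else None
```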

Proceedings ArticleDOI
01 Dec 2016
TL;DR: A novel decentralized asynchronous SGD algorithm called HogWild++ is proposed that outperforms state-of-the-art parallel SGD implementations in terms of efficiency and scalability and shows almost linear speedup on multi-socket NUMA systems.
Abstract: Stochastic Gradient Descent (SGD) is a popular technique for solving large-scale machine learning problems. In order to parallelize SGD on multi-core machines, asynchronous SGD (Hogwild!) has been proposed, where each core updates a global model vector stored in a shared memory simultaneously, without using explicit locks. We show that the scalability of Hogwild! on modern multi-socket CPUs is severely limited, especially on NUMA (Non-Uniform Memory Access) system, due to the excessive cache invalidation requests and false sharing. In this paper we propose a novel decentralized asynchronous SGD algorithm called HogWild++ that overcomes these drawbacks and shows almost linear speedup on multi-socket NUMA systems. The main idea in HogWild++ is to replace the global model vector with a set of local model vectors that are shared by a cluster (a set of cores), keep them synchronized through a decentralized token-based protocol that minimizes remote memory access conflicts and ensures convergence. We present the design and experimental evaluation of HogWild++ on a variety of datasets and show that it outperforms state-of-the-art parallel SGD implementations in terms of efficiency and scalability.

Journal ArticleDOI
TL;DR: A write intensity predictor is designed that exploits a correlation between the write intensity of blocks and the memory access instructions that incur cache misses on those blocks, together with a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region.
Abstract: Spin-transfer torque RAM (STT-RAM) has emerged as an energy-efficient and high-density alternative to SRAM for large on-chip caches. However, its high write energy has been considered as a serious drawback. Hybrid caches mitigate this problem by incorporating a small SRAM cache for write-intensive data along with an STT-RAM cache. In such architectures, choosing cache blocks to be placed into the SRAM cache is the key to their energy efficiency. This paper proposes a new hybrid cache architecture called prediction hybrid cache. The key idea is to predict the write intensity of cache blocks at the time of cache misses and determine block placement based on the prediction. We design a write intensity predictor that realizes the idea by exploiting a correlation between the write intensity of blocks and the memory access instructions that incur cache misses of those blocks. It includes a mechanism to dynamically adapt the predictor to application characteristics. We also design a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region. Evaluations show that our scheme reduces energy consumption of hybrid caches by 28 percent (31 percent) on average compared to the existing hybrid cache architecture in a single-core (quad-core) system.
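
A minimal sketch of the miss-PC-indexed prediction idea follows; the table organization, training thresholds, and the dynamic adaptation mechanism described in the paper are replaced by simple assumptions here:

```python
# Hypothetical write-intensity predictor for a hybrid SRAM/STT-RAM cache: the program
# counter (PC) of the instruction that caused a miss indexes a table of saturating
# counters; blocks predicted to be write-intensive are placed in the SRAM region.
class WriteIntensityPredictor:
    def __init__(self, threshold=2, max_count=3):
        self.counters = {}              # miss PC -> saturating counter
        self.threshold = threshold
        self.max_count = max_count

    def place_in_sram(self, miss_pc):
        """Called on a cache miss: True -> SRAM region, False -> STT-RAM region."""
        return self.counters.get(miss_pc, 0) >= self.threshold

    def train(self, miss_pc, was_write_intensive):
        """Called when the block is evicted, based on how many writes it received."""
        c = self.counters.get(miss_pc, 0)
        self.counters[miss_pc] = min(c + 1, self.max_count) if was_write_intensive \
            else max(c - 1, 0)
```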

Proceedings ArticleDOI
26 Sep 2016
TL;DR: This paper proposes a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where each provider is associated with each content provider a utility that is a function of the hit rate to its content.
Abstract: In-network cache deployment is recognized as an effective technique for reducing content access delay. Caches serve content from multiple content providers, and wish to provide them differentiated services due to monetary incentives and legal obligations. Partitioning is a common approach in providing differentiated storage services. In this paper, we propose a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where we associate with each content provider a utility that is a function of the hit rate to its content. A cache is partitioned into slices with each partition being dedicated to a particular content provider. We formulate an optimization problem where the objective is to maximize the sum of weighted utilities over all content providers through proper cache partitioning, and mathematically show its convexity. We also give a formal proof that partitioning the cache yields better performance compared to sharing it. We validate the effectiveness of cache partitioning through numerical evaluations, and investigate the impact of various factors (e.g., content popularity, request rate) on the hit rates observed by contending content providers.
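
A hedged sketch of the weighted-utility partitioning problem described above, where $C_i$ is the cache slice dedicated to provider $i$ and $h_i(C_i)$ the resulting hit rate (concavity of the utilities and hit-rate functions is assumed here, consistent with the convexity claim):

```latex
% Utility-driven cache partitioning across K content providers
\[
\max_{C_1,\dots,C_K \ge 0}\;\sum_{i=1}^{K} w_i\, U_i\!\bigl(h_i(C_i)\bigr)
\qquad \text{s.t.} \qquad
\sum_{i=1}^{K} C_i \le C .
\]
```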

Proceedings ArticleDOI
15 Oct 2016
TL;DR: The Bunker Cache is proposed, a design that maps similar data to the same cache storage location based solely on their memory addresses, trading some loss in application quality for greater efficiency.
Abstract: The cost of moving and storing data is still a fundamental concern for computer architects. Inefficient handling of data can be attributed to conventional architectures being oblivious to the nature of the values that these data bits carry. We observe the phenomenon of spatio-value similarity, where data elements that are approximately similar in value exhibit spatial regularity in memory. This is inherent to 1) the data values of real-world applications, and 2) the way we store data structures in memory. We propose the Bunker Cache, a design that maps similar data to the same cache storage location based solely on their memory address, trading some loss in application quality for greater efficiency. The Bunker Cache enables performance gains (ranging from 1.08x to 1.19x) via reduced cache misses and energy savings (ranging from 1.18x to 1.39x) via reduced off-chip memory accesses and lower cache storage requirements. The Bunker Cache requires only modest changes to cache indexing hardware, integrating easily into commodity systems.

Journal ArticleDOI
TL;DR: This paper presents a new buffer cache architecture that adopts a small amount of non-volatile memory to maintain modified data and improves the buffer cache performance through space-efficient management techniques, such as delta-write and fragment-grouping.
Abstract: File I/O buffer caching plays an important role to narrow the wide speed gap between main memory and secondary storage. However, data loss or inconsistency may occur if the system crashes before updated data in the buffer cache is flushed to storage. Thus, most operating systems adopt a daemon that periodically flushes dirty data to secondary storage. This periodic flush degrades the caching efficiency seriously because most write requests lead to direct storage accesses. We show that periodic flush accounts for 30-70 percent of the total write traffic to storage. To remove this inefficiency, this paper presents a new buffer cache architecture that adopts a small amount of non-volatile memory to maintain modified data. This novel buffer cache architecture removes almost all storage accesses due to periodic flush operations without any loss of reliability. It also improves the buffer cache performance through space-efficient management techniques, such as delta-write and fragment-grouping. Our experimental results show that the proposed buffer cache reduces the storage write traffic by 44.3 percent and also improves the throughput by 23.6 percent on average.

Journal ArticleDOI
TL;DR: This paper presents a survey of cache bypassing techniques for CPUs, GPUs and CPU-GPU heterogeneous systems, and for caches designed with SRAM, non-volatile memory (NVM) and die-stacked DRAM, and underscores their differences and similarities.
Abstract: With increasing core-count, the cache demand of modern processors has also increased. However, due to strict area/power budgets and presence of poor data-locality workloads, blindly scaling cache capacity is both infeasible and ineffective. Cache bypassing is a promising technique to increase effective cache capacity without incurring power/area costs of a larger sized cache. However, injudicious use of cache bypassing can lead to bandwidth congestion and increased miss-rate and hence, intelligent techniques are required to harness its full potential. This paper presents a survey of cache bypassing techniques for CPUs, GPUs and CPU-GPU heterogeneous systems, and for caches designed with SRAM, non-volatile memory (NVM) and die-stacked DRAM. By classifying the techniques based on key parameters, it underscores their differences and similarities. We hope that this paper will provide insights into cache bypassing techniques and associated tradeoffs and will be useful for computer architects, system designers and other researchers.

Posted Content
TL;DR: It is shown that joint cache-channel coding offers new global caching gains that scale with the number of strong receivers in the network and significantly improves over traditional schemes based on separate cache-channel coding.
Abstract: We study noisy broadcast networks with local cache memories at the receivers, where the transmitter can pre-store information even before learning the receivers' requests. We mostly focus on packet-erasure broadcast networks with two disjoint sets of receivers: a set of weak receivers with all-equal erasure probabilities and equal cache sizes, and a set of strong receivers with all-equal erasure probabilities and no cache memories. We present lower and upper bounds on the capacity-memory tradeoff of this network. The lower bound is achieved by a new joint cache-channel coding idea and significantly improves on schemes that are based on separate cache-channel coding. We discuss how this coding idea could be extended to more general discrete memoryless broadcast channels and to unequal cache sizes. Our upper bound holds for all stochastically degraded broadcast channels. For the described packet-erasure broadcast network, our lower and upper bounds are tight when there is a single weak receiver (and any number of strong receivers) and the cache memory size does not exceed a given threshold. When there is a single weak receiver, a single strong receiver, and two files, we can strengthen our upper and lower bounds so that they coincide over a wide regime of cache sizes. Finally, we completely characterise the rate-memory tradeoff for general discrete-memoryless broadcast channels with arbitrary cache memory sizes and arbitrary (asymmetric) rates when all receivers always demand exactly the same file.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: In this paper, the authors propose replacement and checkpoint policies for SRAM- and NVM-based hybrid caches in NVPs whose execution is interrupted frequently, and the experimental results show that the proposed architectures and policies outperform existing cache architectures for NVPs.
Abstract: Energy harvesting is one of the most promising battery alternatives to power future generation embedded systems in the Internet of Things (IoT). However, energy harvesting powered embedded systems suffer from frequent execution interruption due to unstable energy supply. To bridge intermittent program execution across different power cycles, the non-volatile processor (NVP) was proposed to checkpoint register contents during power failure. Together with register contents, the cache contents also need to be preserved during power failure. While a pure non-volatile memory (NVM) based cache is an intuitive option, it suffers from inferior performance due to high write latency and energy overhead. In this paper, we propose replacement and checkpoint policies for SRAM and NVM based hybrid caches in NVPs whose execution is interrupted frequently. Checkpointing-aware cache replacement policies and smart checkpointing policies are proposed to achieve satisfactory performance and efficient checkpointing upon a power failure and fast resumption when power returns. The experimental results show that the proposed architectures and policies outperform existing cache architectures for NVPs.

Proceedings ArticleDOI
Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel H. T. Hack, Song Jiang
18 Apr 2016
TL;DR: By leveraging highly skewed data access pattern common in real-world KV cache workloads, this paper shows that zExpander can both reduce miss ratio through improved memory efficiency and maintain high performance for a KV Cache.
Abstract: While key-value (KV) cache, such as memcached, dedicates a large volume of expensive memory to holding performance-critical data, it is important to improve memory efficiency, or to reduce cache miss ratio without adding more memory. As we find that optimizing replacement algorithms is of limited effect for this purpose, a promising approach is to use a compact data organization and data compression to increase effective cache size. However, this approach has the risk of degrading the cache's performance due to additional computation cost. A common perception is that a high-performance KV cache is not compatible with use of data compacting techniques. In this paper, we show that, by leveraging highly skewed data access pattern common in real-world KV cache workloads, we can both reduce miss ratio through improved memory efficiency and maintain high performance for a KV cache. Specifically, we design and implement a KV cache system, named zExpander, which dynamically partitions the cache into two sub-caches. One serves frequently accessed data for high performance, and the other compacts data and metadata for high memory efficiency to reduce misses. Experiments show that zExpander can increase memcached's effective cache size by up to 2x and reduce miss ratio by up to 46%. When integrated with a cache of a higher performance, its advantages remain. For example, with 24 threads on a YCSB workload zExpander can achieve throughput of 32 million RPS with 36% of its cache misses removed.
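
The sketch below captures only the two-tier structure in the spirit of zExpander: a small uncompressed tier for hot keys and a compressed tier for the rest. The batched, compact data organization, background management, and admission policy of the real system are not modelled:

```python
# Hypothetical two-tier KV cache: hot keys are served uncompressed for speed; items
# demoted from the hot tier are zlib-compressed to raise effective capacity, and are
# promoted (decompressed) back on access.
import zlib
from collections import OrderedDict

class TwoTierKVCache:
    def __init__(self, hot_items, cold_items):
        self.hot = OrderedDict()        # key -> value bytes (LRU)
        self.cold = OrderedDict()       # key -> compressed value bytes (LRU)
        self.hot_items = hot_items
        self.cold_items = cold_items

    def set(self, key, value):
        self.cold.pop(key, None)
        self.hot[key] = value
        self.hot.move_to_end(key)
        if len(self.hot) > self.hot_items:
            old_key, old_val = self.hot.popitem(last=False)
            self.cold[old_key] = zlib.compress(old_val)   # demote with compression
            if len(self.cold) > self.cold_items:
                self.cold.popitem(last=False)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            value = zlib.decompress(self.cold.pop(key))
            self.set(key, value)                          # promote back to the hot tier
            return value
        return None                                       # miss
```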

Journal ArticleDOI
TL;DR: A proactive approach to a comprehensive measurement study of client-side cache performance in mobile web browsing, which crawls resources from hundreds of websites periodically at a fine-grained time interval and identifies two major problems, Redundant Transfer and Miscached Resource, that lead to unsatisfactory cache performance.
Abstract: The web browser is one of the most significant applications on mobile devices such as smartphones. However, the user experience of mobile web browsing is undesirable because of the slow resource loading. To improve the performance of web resource loading, client-side cache has been adopted as a key mechanism. However, the existing passive measurement studies cannot comprehensively characterize the “client-side” cache performance of mobile web browsing. For example, most of these studies mainly focus on client-side implementations but not server-side configurations, suffer from biased user behaviors, and fail to study “miscached” resources. To address these issues, in this article, we present a proactive approach to making a comprehensive measurement study on client-side cache performance. The key idea of our approach is to proactively crawl resources from hundreds of websites periodically with a fine-grained time interval. Thus, we are able to uncover the resource update history and cache configurations at the server side, and analyze the cache performance in various time granularities. Based on our collected data, we build a new cache analysis model and study the upper bound of how high a percentage of resources could potentially be cached and how effectively the caching works in practice. We report detailed analysis results of different websites and various types of web resources, and identify the problems caused by unsatisfactory cache performance. In particular, we identify two major problems, Redundant Transfer and Miscached Resource, which lead to unsatisfactory cache performance. We investigate three main root causes: Same Content, Heuristic Expiration, and Conservative Expiration Time, and discuss what mobile web developers can do to mitigate those problems.
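
In the spirit of the proactive crawling methodology, the sketch below fetches a resource and inspects its caching headers; a response with neither a Cache-Control max-age nor an Expires header falls back to heuristic expiration (per RFC 7234), one of the root causes identified above. The URL is a placeholder:

```python
# Hypothetical probe of a resource's server-side cache configuration.
import urllib.request

def cache_policy(url):
    with urllib.request.urlopen(url) as resp:
        headers = resp.headers
    cache_control = headers.get("Cache-Control")
    expires = headers.get("Expires")
    if cache_control and "max-age" in cache_control:
        return "explicit freshness via Cache-Control: " + cache_control
    if expires:
        return "explicit freshness via Expires: " + expires
    return "no explicit expiration -> heuristic expiration (based on Last-Modified)"

print(cache_policy("https://example.com/"))
```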

Patent
03 Mar 2016
TL;DR: In this article, the authors present a system where a cache has one or more memories and a metadata service, which is able to receive requests for data stored in the cache from a first client and from a second client.
Abstract: In one embodiment, a computer system includes a cache having one or more memories and a metadata service. The metadata service is able to receive requests for data stored in the cache from a first client and from a second client. The metadata service is further able to determine whether the performance of the cache would be improved by relocating the data stored in the cache. The metadata service is further operable to relocate the data stored in the cache when such relocation would improve the performance of the cache.