
Showing papers on "Cache coloring published in 2016"


Proceedings ArticleDOI
TL;DR: In this article, the authors considered a cache-aided wireless network with a library of files and showed that the sum degrees-of-freedom (sum-DoF) of the network is within a factor of 2 of the optimum under one-shot linear schemes.
Abstract: We consider a system comprising a library of $N$ files (e.g., movies) and a wireless network with $K_T$ transmitters, each equipped with a local cache of size of $M_T$ files, and $K_R$ receivers, each equipped with a local cache of size of $M_R$ files. Each receiver will ask for one of the $N$ files in the library, which needs to be delivered. The objective is to design the cache placement (without prior knowledge of receivers' future requests) and the communication scheme to maximize the throughput of the delivery. In this setting, we show that the sum degrees-of-freedom (sum-DoF) of $\min\left\{\frac{K_T M_T+K_R M_R}{N},K_R\right\}$ is achievable, and this is within a factor of 2 of the optimum, under one-shot linear schemes. This result shows that (i) the one-shot sum-DoF scales linearly with the aggregate cache size in the network (i.e., the cumulative memory available at all nodes), (ii) the transmitters' and receivers' caches contribute equally in the one-shot sum-DoF, and (iii) caching can offer a throughput gain that scales linearly with the size of the network. To prove the result, we propose an achievable scheme that exploits the redundancy of the content at transmitters' caches to cooperatively zero-force some outgoing interference and availability of the unintended content at receivers' caches to cancel (subtract) some of the incoming interference. We develop a particular pattern for cache placement that maximizes the overall gains of cache-aided transmit and receive interference cancellations. For the converse, we present an integer optimization problem which minimizes the number of communication blocks needed to deliver any set of requested files to the receivers. We then provide a lower bound on the value of this optimization problem, hence leading to an upper bound on the linear one-shot sum-DoF of the network, which is within a factor of 2 of the achievable sum-DoF.
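A minimal sketch of the achievable one-shot sum-DoF expression stated in the abstract; the parameter values below are hypothetical and chosen only for illustration.

```python
# Achievable one-shot linear sum-DoF from the paper's main result:
#   sum-DoF = min((K_T * M_T + K_R * M_R) / N, K_R)

def one_shot_sum_dof(k_t: int, m_t: float, k_r: int, m_r: float, n: int) -> float:
    """Achievable sum degrees-of-freedom for a cache-aided wireless network."""
    return min((k_t * m_t + k_r * m_r) / n, k_r)

# Hypothetical example: 4 transmitters and 4 receivers, each caching 10 of N = 40 files.
# Aggregate cache = 4*10 + 4*10 = 80, so the achievable sum-DoF is min(80/40, 4) = 2.
print(one_shot_sum_dof(k_t=4, m_t=10, k_r=4, m_r=10, n=40))  # 2.0
```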

190 citations


Proceedings ArticleDOI
11 Sep 2016
TL;DR: In this article, when the cache contents and the user demands are fixed, the authors connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the condition that the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache).
Abstract: Caching is an effective way to reduce peak-hour network traffic congestion by storing some contents at the user's local cache. Maddah-Ali and Niesen (MAN) initiated a fundamental study of caching systems by proposing a scheme (with uncoded cache placement and linear network coding delivery) that is provably optimal to within a factor of 4.7. In this paper, when the cache contents and the user demands are fixed, we connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that (i) the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache), and (ii) the number of users is no more than the number of files. As a consequence, further improvements to the MAN scheme are only possible through the use of coded cache placement.
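As context for the optimality claim, a small sketch of the well-known Maddah-Ali and Niesen worst-case delivery rate under uncoded placement, R(M) = K(1 - M/N) / (1 + KM/N); the parameter values are illustrative and not taken from this paper.

```python
# MAN delivery rate with uncoded placement, for K users, N files and cache size M
# (exact at the corner points M = t*N/K, t = 0..K; other points via memory sharing):
#   R(M) = K * (1 - M/N) / (1 + K*M/N)

def man_rate(k: int, n: int, m: float) -> float:
    """Worst-case delivery load (in file units) of the MAN coded caching scheme."""
    return k * (1 - m / n) / (1 + k * m / n)

# Hypothetical example: K = 4 users, N = 4 files.
for m in (0, 1, 2, 4):
    print(f"M = {m}: R = {man_rate(4, 4, m):.3f}")
# M = 0 gives R = 4 (plain unicast); M = 4 gives R = 0 (everything is cached).
```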

188 citations


Proceedings ArticleDOI
10 Apr 2016
TL;DR: It is proved that the learning regret of PopCaching is sublinear in the number of content requests; hence it converges fast and asymptotically achieves the optimal cache hit rate.
Abstract: This paper presents a novel cache replacement method — Popularity-Driven Content Caching (PopCaching). PopCaching learns the popularity of content and uses it to determine which content it should store and which it should evict from the cache. Popularity is learned in an online fashion, requires no training phase and hence, it is more responsive to continuously changing trends of content popularity. We prove that the learning regret of PopCaching (i.e., the gap between the hit rate achieved by PopCaching and that by the optimal caching policy with hindsight) is sublinear in the number of content requests. Therefore, PopCaching converges fast and asymptotically achieves the optimal cache hit rate. We further demonstrate the effectiveness of PopCaching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. In addition, PopCaching has low complexity.
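A heavily simplified sketch of the general idea of popularity-driven eviction: learn a per-item popularity estimate online from the request stream and evict the item whose estimate is lowest. The exponential-decay estimator and the parameters below are assumptions for illustration only; they are not PopCaching's actual learning algorithm.

```python
import time

class PopularityCache:
    """Toy popularity-driven cache: evicts the item with the lowest learned popularity.

    The exponentially decayed request counter is only a stand-in for PopCaching's
    online popularity learner.
    """

    def __init__(self, capacity: int, half_life_s: float = 3600.0):
        self.capacity = capacity
        self.decay = 0.5 ** (1.0 / half_life_s)   # per-second decay factor
        self.store = {}                            # key -> [value, score, last_ts]

    def _touch(self, key):
        entry = self.store[key]
        now = time.time()
        entry[1] = entry[1] * (self.decay ** (now - entry[2])) + 1.0
        entry[2] = now

    def get(self, key):
        if key in self.store:
            self._touch(key)
            return self.store[key][0]
        return None  # cache miss

    def put(self, key, value):
        if key in self.store:
            self._touch(key)
            self.store[key][0] = value
            return
        if len(self.store) >= self.capacity:
            # Evict the entry whose decayed popularity score is lowest.
            victim = min(self.store, key=lambda k: self.store[k][1])
            del self.store[victim]
        self.store[key] = [value, 1.0, time.time()]
```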

146 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper presents the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family) and describes two key technologies: Cache Monitoring Technology (CMT), which enables monitoring of shared cache usage by different applications, and Cache Allocation Technology (CAT), which enables redistribution of shared cache space between applications to address contention.
Abstract: Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been a growing research topic. QoS techniques have been proposed to address the shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation, performance determinism and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis, and no commercial platforms were available that integrated such techniques and provided a practical solution. In this paper, we will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We will describe two key technologies: (i) Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and (ii) Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention. This is the first paper to describe these techniques as they moved from concept to reality, starting from early research to product implementation. We will also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters and communications platforms. Finally, we will describe the initial software infrastructure and enabling work for industry practitioners and researchers to take advantage of these technologies for their QoS needs.
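For readers who want to try CAT, a minimal sketch of driving it from user space through the Linux resctrl filesystem (kernel 4.10+), which is one of the software interfaces that grew out of this enabling work. Paths, cache IDs and bitmasks below are illustrative; the exact number of L3 ways and the presence of CDP depend on the platform, and resctrl is assumed to be mounted already (mount -t resctrl resctrl /sys/fs/resctrl).

```python
import os

RESCTRL = "/sys/fs/resctrl"

def create_cat_group(name: str, l3_mask: str, pids) -> None:
    """Create a resctrl group, restrict it to an L3 way bitmask, and assign PIDs to it."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # Capacity bitmask must be a contiguous run of ways, e.g. "f" = 4 ways on cache id 0.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask}\n")
    for pid in pids:
        # The tasks file accepts one PID per write.
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))

# Hypothetical usage: confine a noisy batch job (pid 1234) to 4 of the L3 ways.
# create_cat_group("batch", "f", [1234])
```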

136 citations


Journal ArticleDOI
TL;DR: Both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios are discussed.
Abstract: Caching is an essential technique to improve throughput and latency in a vast variety of applications. The core idea is to duplicate content in memories distributed across the network, which can then be exploited to deliver requested content with less congestion and delay. The traditional role of cache memories is to deliver the maximal amount of requested content locally rather than from a remote server. While this approach is optimal for single-cache systems, it has recently been shown to be significantly suboptimal for systems with multiple caches (i.e., cache networks). Instead, cache memories should be used to enable a coded multicasting gain. In this article, we survey these recent developments. We discuss both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios.

107 citations


Proceedings Article
16 Mar 2016
TL;DR: Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, is developed; it incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads and introduces a novel technique for dealing with performance cliffs incrementally and locally.
Abstract: Web-scale applications are heavily reliant on memory cache systems such as Memcached to improve throughput and reduce user latency. Small performance improvements in these systems can result in large end-to-end gains. For example, a marginal increase in hit rate of 1% can reduce the application layer latency by over 35%. However, existing web cache resource allocation policies are workload oblivious and first-come-first-served. By analyzing measurements from a widely used caching service, Memcachier, we demonstrate that existing cache allocation techniques leave significant room for improvement. We develop Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads. It has been shown that cache allocation algorithms underperform when there are performance cliffs, in which minor changes in cache allocation cause large changes in the hit rate. We design a novel technique for dealing with performance cliffs incrementally and locally. We demonstrate that for the Memcachier applications, on average, Cliffhanger increases the overall hit rate by 1.2%, reduces the total number of cache misses by 36.7% and achieves the same hit rate with 45% less memory capacity.
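A rough sketch of the incremental flavor described above: estimate the marginal hit-rate gain of giving each application one more unit of cache (Cliffhanger derives such estimates from shadow queues) and repeatedly move a small amount of memory from the application with the smallest gradient to the one with the largest. This is a generic hill-climbing sketch, not Cliffhanger's actual algorithm or its cliff-handling technique.

```python
def rebalance(allocations: dict, hit_rate_gradient, step: int = 1) -> dict:
    """One iteration of gradient-based cache rebalancing across applications.

    `hit_rate_gradient(app, alloc)` is assumed to return an estimate of the extra
    hits gained per additional unit of cache for `app` at its current allocation.
    """
    grads = {app: hit_rate_gradient(app, size) for app, size in allocations.items()}
    winner = max(grads, key=grads.get)   # benefits most from more cache
    loser = min(grads, key=grads.get)    # hurt least by losing cache
    if winner != loser and allocations[loser] >= step:
        allocations = dict(allocations)
        allocations[loser] -= step
        allocations[winner] += step
    return allocations
```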

99 citations


Journal ArticleDOI
TL;DR: It is proved that the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate.
Abstract: This paper presents Trend-Caching, a novel cache replacement method that optimizes cache performance according to the trends of video content. Trend-Caching explicitly learns the popularity trend of video content and uses it to determine which video it should store and which it should evict from the cache. Popularity is learned in an online fashion and requires no training phase, hence it is more responsive to continuously changing trends of videos. We prove that the learning regret of Trend-Caching (i.e., the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight) is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate. We further validate the effectiveness of Trend-Caching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. Furthermore, Trend-Caching has low complexity.

97 citations


Journal ArticleDOI
TL;DR: An improved design of Newcache is presented, in terms of security, circuit design and simplicity, and it is shown that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
Abstract: Newcache is a secure cache that can thwart cache side-channel attacks to prevent the leakage of secret information. All caches today are susceptible to cache side-channel attacks, despite software isolation of memory pages in virtual address spaces or virtual machines. These cache attacks can leak secret encryption keys or private identity keys, nullifying any protection provided by strong cryptography. Newcache uses a novel dynamic, randomized memory-to-cache mapping to thwart contention-based side-channel attacks, rather than the static mapping used by conventional set-associative caches. In this article, the authors present an improved design of Newcache, in terms of security, circuit design and simplicity. They show Newcache's security against a suite of cache side-channel attacks. They evaluate Newcache's system performance for cloud computing, smartphone, and SPEC benchmarks and find that Newcache performs as well as conventional set-associative caches, and sometimes better. They also designed a VLSI test chip with a 32-Kbyte Newcache and a 32-Kbyte, eight-way, set-associative cache and verified that the access latency, power, and area of the two caches are comparable. These results show that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
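To make the contrast concrete, a toy sketch of the difference between the static set index of a conventional set-associative cache and a dynamic, randomized memory-to-cache mapping. The table-based remapping below only illustrates the concept; it is not Newcache's actual mapping hardware.

```python
import secrets

OFFSET_BITS = 6        # 64-byte lines (illustrative geometry)
NUM_SETS = 256

def conventional_index(addr: int) -> int:
    """Static mapping used by set-associative caches: address bits pick the set."""
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1)

# Randomized mapping: a table translates the address-derived index into a random
# cache location and can be re-randomized to break contention-based attacks.
remap_table = list(range(NUM_SETS))

def rerandomize() -> None:
    """Fisher-Yates shuffle of the remapping table."""
    for i in range(NUM_SETS - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        remap_table[i], remap_table[j] = remap_table[j], remap_table[i]

def randomized_index(addr: int) -> int:
    return remap_table[conventional_index(addr)]

rerandomize()
```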

96 citations


Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes adding just enough functionality to dynamically identify instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM, allowing memory requests issued by the new Enhanced Memory Controller to experience a 20% lower latency than if issued by the core.
Abstract: On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.

94 citations


Proceedings ArticleDOI
05 Jun 2016
TL;DR: The proposed SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes, and improves performance by up to 43% and by an average of 12.5% over static cache partitioning.
Abstract: In today's multicore processors, the last-level cache is often shared by multiple concurrently running processes to make efficient use of hardware resources. However, previous studies have shown that a shared cache is vulnerable to timing channel attacks that leak confidential information from one process to another. Static cache partitioning can eliminate the cache timing channels but incurs significant performance overhead. In this paper, we propose Secure Dynamic Cache Partitioning (SecDCP), a partitioning technique that defeats cache timing channel attacks. The SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes. For cache-sensitive multiprogram workloads, our experimental results show that SecDCP improves performance by up to 43% and by an average of 12.5% over static cache partitioning.

92 citations


Proceedings ArticleDOI
22 May 2016
TL;DR: CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack.
Abstract: Recognizing the pressing demands to secure embedded applications, ARM TrustZone has been adopted in both academic research and commercial products to protect sensitive code and data in a privileged, isolated execution environment. However, the design of TrustZone cannot prevent physical memory disclosure attacks such as cold boot attack from gaining unrestricted read access to the sensitive contents in the dynamic random access memory (DRAM). A number of system-on-chip (SoC) bound execution solutions have been proposed to thaw the cold boot attack by storing sensitive data only in CPU registers, CPU cache or internal RAM. However, when the operating system, which is responsible for creating and maintaining the SoC-bound execution environment, is compromised, all the sensitive data is leaked. In this paper, we present the design and development of a cache-assisted secure execution framework, called CaSE, on ARM processors to defend against sophisticated attackers who can launch multi-vector attacks including software attacks and hardware memory disclosure attacks. CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack. To protect the sensitive code and data against cold boot attack, applications are encrypted in memory and decrypted only within the processor for execution. The memory separation and the cache separation provided by TrustZone are used to protect the cached applications against compromised OS. We implement a prototype of CaSE on the i.MX53 running ARM Cortex-A8 processor. The experimental results show that CaSE incurs small impacts on system performance when executing cryptographic algorithms including AES, RSA, and SHA1.

Journal ArticleDOI
TL;DR: This work focuses on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network, and proposes a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing.
Abstract: Content-centric networking (CCN) is a promising framework to rebuild the Internet's forwarding substrate around the concept of content. CCN advocates ubiquitous in-network caching to enhance content delivery, and thus each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We first formulate this problem as a content placement problem and obtain the optimal solution by a two-step method. We then propose a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing. We investigate through simulations the factors that affect the optimal cache allocation, and perhaps more importantly we use a real-life Internet topology and video access logs from a large-scale Internet video provider to evaluate the performance of various cache allocation methods. We observe that network topology and content popularity are two important factors that affect where exactly cache capacity should be placed. Further, the heuristic method comes with only a very limited performance penalty compared to the optimal allocation. Finally, using our findings, we provide recommendations for network operators on the best deployment of CCN cache capacity over routers.
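A small sketch of the heuristic flavor described above: distribute a total cache budget across routers in proportion to a centrality metric. The choice of betweenness centrality and the proportional split are assumptions for illustration, not necessarily the exact heuristic of the paper.

```python
import networkx as nx

def centrality_cache_allocation(graph: nx.Graph, total_budget: int) -> dict:
    """Split a total cache budget across routers proportionally to betweenness centrality."""
    centrality = nx.betweenness_centrality(graph)
    total = sum(centrality.values()) or 1.0
    # Rounding means the allocations may not sum exactly to the budget; good enough here.
    return {node: int(round(total_budget * c / total)) for node, c in centrality.items()}

# Hypothetical 6-router line topology with a total budget of 600 content-store slots.
g = nx.path_graph(6)
print(centrality_cache_allocation(g, 600))
```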

Proceedings ArticleDOI
Luna Xu, Min Li, Li Zhang, Ali R. Butt, Yandong Wang, Zane Zhenhua Hu
23 May 2016
TL;DR: MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs, and, if needed, leverages the scheduling information from the analytic framework to evict data that will not be needed in the near future.
Abstract: Memory is a crucial resource for big data processing frameworks such as Spark and M3R, where the memory is used both for computation and for caching intermediate storage data. Consequently, optimizing memory is the key to extracting high performance. The extant approach is to statically split the memory for computation and caching based on workload profiling. This approach is unable to capture the varying workload characteristics and dynamic memory demands. Another factor that affects caching efficiency is the choice of data placement and eviction policy. The extant LRU policy is oblivious of task scheduling information from the analytic frameworks, and thus can lead to lost optimization opportunities. In this paper, we address the above issues by designing MEMTUNE, a dynamic memory manager for in-memory data analytics. MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs. Moreover, if needed, the scheduling information from the analytic framework is leveraged to evict data that will not be needed in the near future. Finally, MEMTUNE also supports task-level data prefetching with a configurable window size to more effectively overlap computation with I/O. Our experiments show that MEMTUNE improves memory utilization, yields an overall performance gain of up to 46%, and achieves a cache hit ratio of up to 41% compared to standard Spark.

Proceedings Article
16 Mar 2016
TL;DR: It is found that the only way to achieve both isolation-guarantee and strategy-proofness is through blocking, which is efficiently adapted in a new policy called FairRide that can lead to better cache efficiency and fairness in many scenarios.
Abstract: Memory caches continue to be a critical component of many systems. In recent years, larger amounts of data have been moving into main memory, especially in shared environments such as the cloud. The nature of such environments requires resource allocations to provide both performance isolation for multiple users/applications and high utilization for the systems. We study the problem of fair allocation of memory cache for multiple users with shared files. We find that, surprisingly, no memory allocation policy can provide all three desirable properties (isolation-guarantee, strategy-proofness and Pareto-efficiency) that are typically achievable by other types of resources, e.g., CPU or network. We also show that there exist policies that achieve any two of the three properties. We find that the only way to achieve both isolation-guarantee and strategy-proofness is through blocking, which we efficiently adapt in a new policy called FairRide. We implement FairRide in a popular memory-centric storage system using an efficient form of blocking, named expected delaying, and demonstrate that FairRide can lead to better cache efficiency (2.6× over isolated caches) and fairness in many scenarios.

Journal ArticleDOI
TL;DR: This article provides a survey on static cache analysis for real-time systems, presenting the challenges and static analysis techniques for independent programs with respect to different cache features, followed by a survey of existing tools based on static techniques for cache analysis.
Abstract: Real-time systems are reactive computer systems that must produce their reaction to a stimulus within given time bounds. A vital verification requirement is to estimate the Worst-Case Execution Time (WCET) of programs. These estimates are then used to predict the timing behavior of the overall system. The execution time of a program heavily depends on the underlying hardware, among which the cache has the biggest influence. Analyzing cache behavior is very challenging due to the versatile cache features and complex execution environment. This article provides a survey on static cache analysis for real-time systems. We first present the challenges and static analysis techniques for independent programs with respect to different cache features. Then, the discussion is extended to cache analysis in complex execution environments, followed by a survey of existing tools based on static techniques for cache analysis. Finally, an outlook for future research is provided.

Proceedings Article
22 Feb 2016
TL;DR: The paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand.
Abstract: Host-side flash caching has emerged as a promising solution to the scalability problem of virtual machine (VM) storage in cloud computing systems, but it still faces serious limitations in capacity and endurance. This paper presents CloudCache, an on-demand cache management solution to meet VM cache demands and minimize cache wear-out. First, to support on-demand cache allocation, the paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand. By predicting the RWSS online and admitting only RWS into the cache, CloudCache satisfies the workload's actual cache demand and minimizes the induced wear-out. Second, to handle situations where a cache is insufficient for the VMs' demands, the paper proposes a dynamic cache migration approach to balance cache load across hosts by live migrating cached data along with the VMs. It includes both on-demand migration of dirty data and background migration of RWS to optimize the performance of the migrating VM. It also supports rate limiting on the cache data transfer to limit the impact to the co-hosted VMs. Finally, the paper presents comprehensive experimental evaluations using real-world traces to demonstrate the effectiveness of CloudCache.
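A simplified sketch of the Reuse Working Set idea: within a time window, count only the blocks that are actually reused, and use that count as the cache demand estimate. The single fixed window and the reuse threshold of two accesses are assumptions for illustration; CloudCache's online RWSS prediction is more involved.

```python
from collections import Counter

def reuse_working_set_size(trace: list, window: int) -> list:
    """Per-window RWSS estimate: number of distinct blocks referenced at least twice."""
    sizes = []
    for start in range(0, len(trace), window):
        counts = Counter(trace[start:start + window])
        sizes.append(sum(1 for c in counts.values() if c >= 2))
    return sizes

# Hypothetical block-address trace: blocks 1 and 2 are reused, 7/8/9 are touched once.
print(reuse_working_set_size([1, 2, 1, 2, 7, 8, 9, 1], window=8))  # [2]
```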

Journal ArticleDOI
TL;DR: A novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC), filters out seldom accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: For years, the increasing popularity of flash memory has been changing storage systems. Flash-based solid-state drives (SSDs) are widely used as a new cache tier on top of hard disk drives (HDDs) to speed up data-intensive applications. However, the endurance problem of flash memory remains a concern and is getting worse with the adoption of MLC and TLC flash. In this article, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC adopts the idea of selective caching to filter out seldom accessed blocks and prevent them from entering the cache. This avoids cache pollution and preserves popular blocks in the cache for a longer period of time, leading to a higher hit rate. Meanwhile, by avoiding unnecessary cache replacements, LARC reduces the volume of data written to the SSD and yields an SSD-friendly access pattern. In this way, LARC improves the performance and endurance of the SSD at the same time. LARC is self-tuning and incurs little overhead. It has been extensively evaluated by both trace-driven simulations and synthetic benchmarks on a prototype implementation. Our experiments show that LARC outperforms state-of-the-art algorithms for different kinds of workloads and extends SSD lifetime by up to 15.7 times.
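A compact sketch of the selective-caching idea: a block is admitted into the SSD cache only after it is referenced again while sitting in a small ghost (candidate) queue, which filters out one-touch blocks and avoids the corresponding SSD writes. Queue sizing and details here are simplifications, not LARC's exact policy.

```python
from collections import OrderedDict

class LazySelectiveCache:
    """Toy selective cache: admit a block only on its second recent reference."""

    def __init__(self, capacity: int, ghost_capacity: int):
        self.cache = OrderedDict()   # admitted blocks (would live on the SSD)
        self.ghost = OrderedDict()   # candidate block IDs only, no data stored
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity

    def access(self, block: int) -> bool:
        """Return True on a cache hit."""
        if block in self.cache:
            self.cache.move_to_end(block)
            return True
        if block in self.ghost:
            # Second reference: promote into the real cache (the only SSD write).
            del self.ghost[block]
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)        # evict LRU block
            self.cache[block] = True
        else:
            # First reference: remember it as a candidate, do not pollute the cache.
            self.ghost[block] = True
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return False
```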

Proceedings ArticleDOI
25 Feb 2016
TL;DR: An STT-MRAM-based last-level cache memory (LLC) is presented, using novel power optimization with high-speed power gating (HS-PG) that considers processor architectures and cache memory accesses, and achieving high reliability by reducing the write-error rate with a novel write-verify-write scheme.
Abstract: Two performance gaps in the memory hierarchy, between CPU cache and main memory, and main memory and mass storage, will become increasingly severe bottlenecks for computing-system performance. Although it is necessary to increase memory capacity to fill these gaps, power also increases when conventional volatile memories are used. A new nonvolatile memory for this purpose has been anticipated. Storage class memory is used to fill the second gap. Many candidates exist: ReRAM, PRAM, and 3D-cross point type with resistive change RAM. However, a nonvolatile last-level cache (LLC) is used to fill the first gap. Advanced STT-MRAM has achieved sub-4ns read and write accesses with perpendicular magnetic tunnel junctions (p-MTJ) [1–2]. Furthermore, mature integration processes have been developed and 8Mb STT-MRAM with sub-5ns operation has shown high reliability [3]. Moreover, because of its non-volatility, STT-MRAM can reduce operation energy by more than 81% compared to SRAM for cache [1]. This paper presents an STT-MRAM-based last-level cache memory (LLC) including the MRAM memory core, peripherals and cache logic circuits, using novel power optimization with high-speed power gating (HS-PG), considering processor architectures and cache memory accesses. The STT-MRAM-based cache has high reliability to reduce the write-error rate with a novel write-verify-write scheme. Furthermore, a read-modify-write scheme is implemented to reduce active power without penalty. Figure 7.2.1 presents a block diagram of a 4Mb STT-MRAM based cache.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: A new probabilistic cache model designed for high-performance replacement policies is presented that uses absolute reuse distances instead of stack distances and models replacement policies as abstract ranking functions, which together allow arbitrary age-based replacement policies to be modeled.
Abstract: Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
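To illustrate the distinction the model relies on, a small sketch that computes both quantities from an access trace: the absolute reuse distance counts all intervening accesses since the previous access to the same line, while the stack distance counts only the distinct intervening lines.

```python
def reuse_and_stack_distances(trace):
    """Return (absolute reuse distance, stack distance) per access; None on first touch."""
    last_pos = {}
    results = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            prev = last_pos[addr]
            absolute = i - prev - 1                      # all intervening accesses
            stack = len(set(trace[prev + 1:i]))          # distinct intervening lines
            results.append((absolute, stack))
        else:
            results.append((None, None))
        last_pos[addr] = i
    return results

# Example: for the trace A B B A, the second A has reuse distance 2 but stack distance 1.
print(reuse_and_stack_distances(["A", "B", "B", "A"]))
```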

Proceedings Article
22 Jun 2016
TL;DR: Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8× compared with state-of-the-art static and dynamic algorithms.
Abstract: Cloud providers must dynamically allocate their physical resources to the right client to maximize the benefit that they can get out of given hardware. Cache Allocation Technology (CAT) makes it possible for the provider to allocate last level cache to virtual machines to prevent cache pollution. The provider can also allocate the cache to optimize client benefit. But how should it optimize client benefit, when it does not even know what the client plans to do? We present an auction-based mechanism that dynamically allocates cache while optimizing client benefit and improving hardware utilization. We evaluate our mechanism on benchmarks from the Phoronix Test Suite. Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8× compared with state-of-the-art static and dynamic algorithms.

Patent
12 Sep 2016
TL;DR: A cache memory has cache memory circuitry comprising a nonvolatile memory cell to store at least a portion of a data which is stored or is to be stored in a lower-level memory than the cache memory as mentioned in this paper.
Abstract: A cache memory has cache memory circuitry comprising a nonvolatile memory cell to store at least a portion of a data which is stored or is to be stored in a lower-level memory than the cache memory circuitry, a first redundancy code storage comprising a nonvolatile memory cell capable of storing a redundancy code of the data stored in the cache memory circuitry, and a second redundancy code storage comprising a volatile memory cell capable of storing the redundancy code.

Journal ArticleDOI
TL;DR: A write intensity predictor is designed that realizes the idea by exploiting a correlation between the write intensity of blocks and the memory access instructions that incur cache misses of those blocks, along with a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region.
Abstract: Spin-transfer torque RAM (STT-RAM) has emerged as an energy-efficient and high-density alternative to SRAM for large on-chip caches. However, its high write energy has been considered a serious drawback. Hybrid caches mitigate this problem by incorporating a small SRAM cache for write-intensive data along with an STT-RAM cache. In such architectures, choosing cache blocks to be placed into the SRAM cache is the key to their energy efficiency. This paper proposes a new hybrid cache architecture called prediction hybrid cache. The key idea is to predict the write intensity of cache blocks at the time of cache misses and determine block placement based on the prediction. We design a write intensity predictor that realizes the idea by exploiting a correlation between write intensity of blocks and memory access instructions that incur cache misses of those blocks. It includes a mechanism to dynamically adapt the predictor to application characteristics. We also design a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region. Evaluations show that our scheme reduces energy consumption of hybrid caches by 28 percent (31 percent) on average compared to the existing hybrid cache architecture in a single-core (quad-core) system.
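A toy sketch of the prediction mechanism described above: a small table of saturating counters indexed by (a hash of) the PC of the instruction that caused the miss; the counter state then decides whether the incoming block goes to the SRAM or the STT-RAM region. Table size, hashing, and thresholds are illustrative assumptions, not the paper's exact design.

```python
class WriteIntensityPredictor:
    """PC-indexed table of 2-bit saturating counters predicting write-intensive blocks."""

    def __init__(self, entries: int = 1024, threshold: int = 2):
        self.entries = entries
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, miss_pc: int) -> int:
        return (miss_pc >> 2) % self.entries       # drop instruction-alignment bits

    def predict_sram(self, miss_pc: int) -> bool:
        """On a cache miss: place the block in SRAM if predicted write-intensive."""
        return self.table[self._index(miss_pc)] >= self.threshold

    def train(self, miss_pc: int, was_write_intensive: bool) -> None:
        """On eviction: update the counter of the PC that originally fetched the block."""
        i = self._index(miss_pc)
        if was_write_intensive:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
```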

Proceedings ArticleDOI
26 Sep 2016
TL;DR: This paper proposes a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where each provider is associated with each content provider a utility that is a function of the hit rate to its content.
Abstract: In-network cache deployment is recognized as an effective technique for reducing content access delay. Caches serve content from multiple content providers, and wish to provide them differentiated services due to monetary incentives and legal obligations. Partitioning is a common approach in providing differentiated storage services. In this paper, we propose a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where we associate with each content provider a utility that is a function of the hit rate to its content. A cache is partitioned into slices with each partition being dedicated to a particular content provider. We formulate an optimization problem where the objective is to maximize the sum of weighted utilities over all content providers through proper cache partitioning, and mathematically show its convexity. We also give a formal proof that partitioning the cache yields better performance compared to sharing it. We validate the effectiveness of cache partitioning through numerical evaluations, and investigate the impact of various factors (e.g., content popularity, request rate) on the hit rates observed by contending content providers.
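In the notation of the abstract, the partitioning problem can be written as a weighted utility maximization over slice sizes; a sketch of that formulation follows (the symbol names are ours, not the paper's).

```latex
% Cache of total size C shared by P content providers; provider i receives a slice of
% size c_i and derives utility U_i(h_i(c_i)) from the resulting hit rate h_i(c_i),
% weighted by w_i.
\[
\begin{aligned}
\max_{c_1,\dots,c_P}\quad & \sum_{i=1}^{P} w_i\, U_i\bigl(h_i(c_i)\bigr) \\
\text{s.t.}\quad          & \sum_{i=1}^{P} c_i \le C, \qquad c_i \ge 0,\ i = 1,\dots,P.
\end{aligned}
\]
```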

Proceedings ArticleDOI
15 Oct 2016
TL;DR: OSCAR is proposed to Orchestrate STT-RAM Caches traffic for heterogeneous ARchitectures, with an integration of asynchronous batch scheduling and priority-based allocation for the on-chip interconnect to maximize the potential of STT-RAM based LLC.
Abstract: As we integrate data-parallel GPUs with general-purpose CPUs on a single chip, the enormous cache traffic generated by GPUs will not only exhaust the limited cache capacity, but also severely interfere with CPU requests. Such heterogeneous multicores pose significant challenges to the design of shared last-level cache (LLC). This problem can be mitigated by replacing SRAM LLC with emerging non-volatile memories like Spin-Transfer Torque RAM (STT-RAM), which provides larger cache capacity and near-zero leakage power. However, without careful design, the slow write operations of STT-RAM may offset the capacity benefit, and the system may still suffer from contention in the shared LLC and on-chip interconnects. While there are cache optimization techniques to alleviate such problems, we reveal that the true potential of STT-RAM LLC may still be limited because now that the cache hit rate has been improved by the increased capacity, the on-chip network can become a performance bottleneck. CPU and GPU packets contend with each other for the shared network bandwidth. Moreover, the mixed-criticality read/write packets to STT-RAM add another layer of complexity to the network resource allocation. Therefore, being aware of the disparate latency tolerance of CPU/GPU applications and the asymmetric read/write latency of STT-RAM, we propose OSCAR to Orchestrate STT-RAM Caches traffic for heterogeneous ARchitectures. Specifically, an integration of asynchronous batch scheduling and priority based allocation for on-chip interconnect is proposed to maximize the potential of STT-RAM based LLC. Simulation results on a 28-GPU and 14-CPU system demonstrate an average of 17.4% performance improvement for CPUs, 10.8% performance improvement for GPUs, and 28.9% LLC energy saving compared to SRAM based LLC design.

Proceedings ArticleDOI
15 Mar 2016
TL;DR: This paper proposes a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts and employs an asymmetric low-power 5T-SRAM structure which has high reliability for majority 'one' data.
Abstract: Static Random Access Memories (SRAMs) occupy a large area of today's microprocessors, and are a prime source of leakage power in highly scaled technologies. Low leakage and high density Spin-Transfer Torque RAMs (STT-RAMs) are ideal candidates for a power-efficient memory. However, STT-RAM suffers from high write energy and latency, especially when writing ‘one’ data. In this paper we propose a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts. To exploit the new resultant data distribution in the SRAM partition, we employ an asymmetric low-power 5T-SRAM structure which has high reliability for majority ‘one’ data. The proposed design significantly reduces the number of writes and hence dynamic energy in both STT-RAM and SRAM partitions. We employed a write cache policy and a small swap memory to control data migration between cache partitions. Our evaluation on UltraSPARC-III processor shows that utilizing STT-RAM/6T-SRAM and STT-RAM/5T-SRAM architectures for the L2 cache results in 42% and 53% energy efficiency, 9.3% and 9.1% performance improvement and 16.9% and 20.3% area efficiency respectively, with respect to SRAM-based cache running SPEC CPU 2006 benchmarks.
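A minimal sketch of the data-aware placement rule described above: count the one-bits of the incoming block and steer majority-'one' blocks toward the 5T-SRAM partition (reliable for ones) and the rest toward STT-RAM (where writing ones is expensive). The half-capacity threshold is an illustrative assumption.

```python
def count_ones(block: bytes) -> int:
    """Number of set bits in a cache block."""
    return sum(bin(b).count("1") for b in block)

def choose_partition(block: bytes) -> str:
    """Steer majority-'one' blocks to the 5T-SRAM partition, others to STT-RAM."""
    total_bits = 8 * len(block)
    return "SRAM" if count_ones(block) > total_bits // 2 else "STT-RAM"

# Hypothetical 8-byte blocks: an all-ones block goes to SRAM, a sparse one to STT-RAM.
print(choose_partition(b"\xff" * 8))        # SRAM
print(choose_partition(b"\x01" * 8))        # STT-RAM
```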

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This work proposes selective caching, wherein GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU is disallowed, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract: Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.

Proceedings ArticleDOI
15 Oct 2016
TL;DR: The Bunker Cache is proposed, a design that maps similar data to the same cache storage location based solely on their memory address, trading some application quality for greater efficiency.
Abstract: The cost of moving and storing data is still a fundamental concern for computer architects. Inefficient handling of data can be attributed to conventional architectures being oblivious to the nature of the values that these data bits carry. We observe the phenomenon of spatio-value similarity, where data elements that are approximately similar in value exhibit spatial regularity in memory. This is inherent to 1) the data values of real-world applications, and 2) the way we store data structures in memory. We propose the Bunker Cache, a design that maps similar data to the same cache storage location based solely on their memory address, sacrificing some application quality loss for greater efficiency. The Bunker Cache enables performance gains (ranging from 1.08x to 1.19x) via reduced cache misses and energy savings (ranging from 1.18x to 1.39x) via reduced off-chip memory accesses and lower cache storage requirements. The Bunker Cache requires only modest changes to cache indexing hardware, integrating easily into commodity systems.

Proceedings ArticleDOI
Ning Zhang, He Sun, Kun Sun, Wenjing Lou, Y. Thomas Hou
21 Mar 2016
TL;DR: A new rootkit called CacheKit is developed that hides in the cache of the normal world and is able to evade memory introspection from the secure world and has small performance impacts on the rich OS.
Abstract: With the growing importance of networked embedded devices in the upcoming Internet of Things, new attacks targeting embedded OSes are emerging. ARM processors, which power over 60% of embedded devices, introduce a hardware security extension called TrustZone to protect secure applications in an isolated secure world that cannot be manipulated by a compromised OS in the normal world. Leveraging TrustZone technology, a number of memory integrity checking schemes have been proposed in the secure world to introspect malicious memory modification of the normal world. In this paper, we first discover and verify an ARM TrustZone cache incoherence behavior, which results in the cache contents of the two worlds, secure and non-secure, potentially being different even when they are mapped to the same physical address. Furthermore, code in one TrustZone world cannot access the cache content in the other world. Based on this observation, we develop a new rootkit called CacheKit that hides in the cache of the normal world and is able to evade memory introspection from the secure world. We implement a CacheKit prototype on Cortex-A8 processors after solving a number of challenges. First, we employ the Cache-as-RAM technique to ensure that the malicious code is only loaded into the CPU cache and not RAM. Thus, the secure world cannot detect the existence of the malicious code by examining the RAM. Second, we use the ARM processor's hardware support on cache settings to keep the malicious code persistent in the cache. Third, to evade introspection that flushes cache content back into RAM, we utilize physical addresses from the I/O address range that is not backed by any real I/O devices or RAM. The experimental results show that CacheKit can successfully evade memory introspection from the secure world and has small performance impacts on the rich OS. We discuss potential countermeasures to detect this type of rootkit attack.

Journal ArticleDOI
TL;DR: This work proposes a novel, simple, and effective wear-leveling technique with negligible performance overhead of <0.4% for memory-intensive workloads and shows that the lifetime of the NV-cache is boosted by up to 13× for different cache configurations.
Abstract: Emerging nonvolatile memory technologies, such as spin-transfer torque RAM or resistive RAM, can increase the capacity of the last-level cache (LLC) in a latency and power-efficient manner. These technologies endure $10^{9}$–$10^{12}$ writes per cell, making a nonvolatile cache (NV-cache) with a lifetime of dozens of years under ideal working conditions. However, nonuniformity in writes to different cache lines considerably reduces the NV-cache lifetime to a few months. Writes to cache lines can be made uniform by wear-leveling. A suitable wear-leveling scheme for an NV-cache should not incur high storage and performance overheads. We propose a novel, simple, and effective wear-leveling technique with negligible performance overhead (<0.4% for memory-intensive workloads), which boosts the lifetime of the NV-cache by up to 13× for different cache configurations.

Proceedings ArticleDOI
04 Aug 2016
TL;DR: NVMcached, a KV cache for non-volatile byte-addressable memory that can significantly reduce the use of cache flushes and minimize data loss by leveraging consistency-friendly data structures and batched space allocation and reclamation, is designed and evaluated.
Abstract: As byte-addressable, high-density, and non-volatile memory (NVM) is around the corner to be equipped alongside the DRAM memory, issues in enabling important key-value (KV) cache services, such as memcached, on the new storage medium must be addressed. While NVM allows data in a KV cache to survive power outages and system crashes, in practice its integrity and accessibility depend on data consistency enforced during writes to NVM. Though techniques for enforcing the consistency, such as journaling, COW, or checkpointing, are available, they are often too expensive because they frequently use CPU cache flushes to ensure crash consistency, leading to (much) reduced performance and an excessively compromised NVM lifetime. In this paper we design and evaluate NVMcached, a KV cache for non-volatile byte-addressable memory that can significantly reduce the use of flushes and minimize data loss by leveraging consistency-friendly data structures and batched space allocation and reclamation. Experiments show that NVMcached can improve its system throughput by up to 2.8x for write-intensive real-world workloads, compared to a non-volatile memcached.