
Showing papers on "Cache coloring published in 2016"


Proceedings ArticleDOI
TL;DR: In this article, the authors considered a cache-aided wireless network with a library of files and showed that the sum degrees-of-freedom (sum-DoF) of the network is within a factor of 2 of the optimum under one-shot linear schemes.
Abstract: We consider a system comprising a library of $N$ files (e.g., movies) and a wireless network with $K_T$ transmitters, each equipped with a local cache of size of $M_T$ files, and $K_R$ receivers, each equipped with a local cache of size of $M_R$ files. Each receiver will ask for one of the $N$ files in the library, which needs to be delivered. The objective is to design the cache placement (without prior knowledge of receivers' future requests) and the communication scheme to maximize the throughput of the delivery. In this setting, we show that the sum degrees-of-freedom (sum-DoF) of $\min\left\{\frac{K_T M_T+K_R M_R}{N},K_R\right\}$ is achievable, and this is within a factor of 2 of the optimum, under one-shot linear schemes. This result shows that (i) the one-shot sum-DoF scales linearly with the aggregate cache size in the network (i.e., the cumulative memory available at all nodes), (ii) the transmitters' and receivers' caches contribute equally in the one-shot sum-DoF, and (iii) caching can offer a throughput gain that scales linearly with the size of the network. To prove the result, we propose an achievable scheme that exploits the redundancy of the content at transmitters' caches to cooperatively zero-force some outgoing interference and availability of the unintended content at receivers' caches to cancel (subtract) some of the incoming interference. We develop a particular pattern for cache placement that maximizes the overall gains of cache-aided transmit and receive interference cancellations. For the converse, we present an integer optimization problem which minimizes the number of communication blocks needed to deliver any set of requested files to the receivers. We then provide a lower bound on the value of this optimization problem, hence leading to an upper bound on the linear one-shot sum-DoF of the network, which is within a factor of 2 of the achievable sum-DoF.
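A minimal sketch of the achievable one-shot sum-DoF expression stated in the abstract; the parameter values below are hypothetical and chosen only for illustration.

```python
# Achievable one-shot linear sum-DoF from the paper's main result:
#   sum-DoF = min((K_T * M_T + K_R * M_R) / N, K_R)

def one_shot_sum_dof(k_t: int, m_t: float, k_r: int, m_r: float, n: int) -> float:
    """Achievable sum degrees-of-freedom for a cache-aided wireless network."""
    return min((k_t * m_t + k_r * m_r) / n, k_r)

# Hypothetical example: 4 transmitters and 4 receivers, each caching 10 of N = 40 files.
# Aggregate cache = 4*10 + 4*10 = 80, so the achievable sum-DoF is min(80/40, 4) = 2.
print(one_shot_sum_dof(k_t=4, m_t=10, k_r=4, m_r=10, n=40))  # 2.0
```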

190 citations


Proceedings ArticleDOI
11 Sep 2016
TL;DR: In this article, when the cache contents and the user demands are fixed, the authors connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the condition that the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache).
Abstract: Caching is an effective way to reduce peak-hour network traffic congestion by storing some contents at the user's local cache. Maddah-Ali and Niesen (MAN) initiated a fundamental study of caching systems by proposing a scheme (with uncoded cache placement and linear network coding delivery) that is provably optimal to within a factor of 4.7. In this paper, when the cache contents and the user demands are fixed, we connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that (i) the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache), and (ii) the number of users is no more than the number of files. As a consequence, further improvements to the MAN scheme are only possible through the use of coded cache placement.
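As context for the optimality claim, a small sketch of the well-known Maddah-Ali and Niesen worst-case delivery rate under uncoded placement, R(M) = K(1 - M/N) / (1 + KM/N); the parameter values are illustrative and not taken from this paper.

```python
# MAN delivery rate with uncoded placement, for K users, N files and cache size M
# (exact at the corner points M = t*N/K, t = 0..K; other points via memory sharing):
#   R(M) = K * (1 - M/N) / (1 + K*M/N)

def man_rate(k: int, n: int, m: float) -> float:
    """Worst-case delivery load (in file units) of the MAN coded caching scheme."""
    return k * (1 - m / n) / (1 + k * m / n)

# Hypothetical example: K = 4 users, N = 4 files.
for m in (0, 1, 2, 4):
    print(f"M = {m}: R = {man_rate(4, 4, m):.3f}")
# M = 0 gives R = 4 (plain unicast); M = 4 gives R = 0 (everything is cached).
```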

188 citations


Proceedings ArticleDOI
10 Apr 2016
TL;DR: It is proved that the learning regret of PopCaching is sublinear in the number of content requests; hence it converges fast and asymptotically achieves the optimal cache hit rate.
Abstract: This paper presents a novel cache replacement method — Popularity-Driven Content Caching (PopCaching). PopCaching learns the popularity of content and uses it to determine which content it should store and which it should evict from the cache. Popularity is learned in an online fashion, requires no training phase and hence, it is more responsive to continuously changing trends of content popularity. We prove that the learning regret of PopCaching (i.e., the gap between the hit rate achieved by PopCaching and that by the optimal caching policy with hindsight) is sublinear in the number of content requests. Therefore, PopCaching converges fast and asymptotically achieves the optimal cache hit rate. We further demonstrate the effectiveness of PopCaching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. In addition, PopCaching has low complexity.
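A heavily simplified sketch of the general idea of popularity-driven eviction: learn a per-item popularity estimate online from the request stream and evict the item whose estimate is lowest. The exponential-decay estimator and the parameters below are assumptions for illustration only; they are not PopCaching's actual learning algorithm.

```python
import time

class PopularityCache:
    """Toy popularity-driven cache: evicts the item with the lowest learned popularity.

    The exponentially decayed request counter is only a stand-in for PopCaching's
    online popularity learner.
    """

    def __init__(self, capacity: int, half_life_s: float = 3600.0):
        self.capacity = capacity
        self.decay = 0.5 ** (1.0 / half_life_s)   # per-second decay factor
        self.store = {}                            # key -> [value, score, last_ts]

    def _touch(self, key):
        entry = self.store[key]
        now = time.time()
        entry[1] = entry[1] * (self.decay ** (now - entry[2])) + 1.0
        entry[2] = now

    def get(self, key):
        if key in self.store:
            self._touch(key)
            return self.store[key][0]
        return None  # cache miss

    def put(self, key, value):
        if key in self.store:
            self._touch(key)
            self.store[key][0] = value
            return
        if len(self.store) >= self.capacity:
            # Evict the entry whose decayed popularity score is lowest.
            victim = min(self.store, key=lambda k: self.store[k][1])
            del self.store[victim]
        self.store[key] = [value, 1.0, time.time()]
```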

146 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper presents the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family) and describes two key technologies: Cache Monitoring Technology (CMT), which enables monitoring of shared cache usage by different applications, and Cache Allocation Technology (CAT), which enables redistribution of shared cache space between applications to address contention.
Abstract: Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been a growing research topic. QoS techniques have been proposed to address the shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation, performance determinism and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis, and no commercial platforms were available that integrated such techniques and provided a practical solution. In this paper, we will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We will describe two key technologies: (i) Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and (ii) Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention. This is the first paper to describe these techniques as they moved from concept to reality, starting from early research to product implementation. We will also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters and communications platforms. Finally, we will describe the initial software infrastructure and enabling work for industry practitioners and researchers to take advantage of these technologies for their QoS needs.
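For readers who want to try CAT, a minimal sketch of driving it from user space through the Linux resctrl filesystem (kernel 4.10+), which is one of the software interfaces that grew out of this enabling work. Paths, cache IDs and bitmasks below are illustrative; the exact number of L3 ways and the presence of CDP depend on the platform, and resctrl is assumed to be mounted already (mount -t resctrl resctrl /sys/fs/resctrl).

```python
import os

RESCTRL = "/sys/fs/resctrl"

def create_cat_group(name: str, l3_mask: str, pids) -> None:
    """Create a resctrl group, restrict it to an L3 way bitmask, and assign PIDs to it."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # Capacity bitmask must be a contiguous run of ways, e.g. "f" = 4 ways on cache id 0.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask}\n")
    for pid in pids:
        # The tasks file accepts one PID per write.
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))

# Hypothetical usage: confine a noisy batch job (pid 1234) to 4 of the L3 ways.
# create_cat_group("batch", "f", [1234])
```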

136 citations


Journal ArticleDOI
TL;DR: Both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios are discussed.
Abstract: Caching is an essential technique to improve throughput and latency in a vast variety of applications. The core idea is to duplicate content in memories distributed across the network, which can then be exploited to deliver requested content with less congestion and delay. The traditional role of cache memories is to deliver the maximal amount of requested content locally rather than from a remote server. While this approach is optimal for single-cache systems, it has recently been shown to be significantly suboptimal for systems with multiple caches (i.e., cache networks). Instead, cache memories should be used to enable a coded multicasting gain. In this article, we survey these recent developments. We discuss both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios.

107 citations


Proceedings Article
16 Mar 2016
TL;DR: Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, is developed; it incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads and introduces a novel technique for dealing with performance cliffs incrementally and locally.
Abstract: Web-scale applications are heavily reliant on memory cache systems such as Memcached to improve throughput and reduce user latency. Small performance improvements in these systems can result in large end-to-end gains. For example, a marginal increase in hit rate of 1% can reduce the application layer latency by over 35%. However, existing web cache resource allocation policies are workload oblivious and first-come-first-served. By analyzing measurements from a widely used caching service, Memcachier, we demonstrate that existing cache allocation techniques leave significant room for improvement. We develop Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads. It has been shown that cache allocation algorithms underperform when there are performance cliffs, in which minor changes in cache allocation cause large changes in the hit rate. We design a novel technique for dealing with performance cliffs incrementally and locally. We demonstrate that for the Memcachier applications, on average, Cliffhanger increases the overall hit rate by 1.2%, reduces the total number of cache misses by 36.7% and achieves the same hit rate with 45% less memory capacity.
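A rough sketch of the incremental flavor described above: estimate the marginal hit-rate gain of giving each application one more unit of cache (Cliffhanger derives such estimates from shadow queues) and repeatedly move a small amount of memory from the application with the smallest gradient to the one with the largest. This is a generic hill-climbing sketch, not Cliffhanger's actual algorithm or its cliff-handling technique.

```python
def rebalance(allocations: dict, hit_rate_gradient, step: int = 1) -> dict:
    """One iteration of gradient-based cache rebalancing across applications.

    `hit_rate_gradient(app, alloc)` is assumed to return an estimate of the extra
    hits gained per additional unit of cache for `app` at its current allocation.
    """
    grads = {app: hit_rate_gradient(app, size) for app, size in allocations.items()}
    winner = max(grads, key=grads.get)   # benefits most from more cache
    loser = min(grads, key=grads.get)    # hurt least by losing cache
    if winner != loser and allocations[loser] >= step:
        allocations = dict(allocations)
        allocations[loser] -= step
        allocations[winner] += step
    return allocations
```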

99 citations


Journal ArticleDOI
TL;DR: It is proved that the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate.
Abstract: This paper presents Trend-Caching, a novel cache replacement method that optimizes cache performance according to the trends of video content. Trend-Caching explicitly learns the popularity trend of video content and uses it to determine which video it should store and which it should evict from the cache. Popularity is learned in an online fashion and requires no training phase, hence it is more responsive to continuously changing trends of videos. We prove that the learning regret of Trend-Caching (i.e., the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight) is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate. We further validate the effectiveness of Trend-Caching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. Furthermore, Trend-Caching has low complexity.

97 citations


Journal ArticleDOI
TL;DR: An improved design of Newcache is presented, in terms of security, circuit design and simplicity, and it is shown that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
Abstract: Newcache is a secure cache that can thwart cache side-channel attacks to prevent the leakage of secret information. All caches today are susceptible to cache side-channel attacks, despite software isolation of memory pages in virtual address spaces or virtual machines. These cache attacks can leak secret encryption keys or private identity keys, nullifying any protection provided by strong cryptography. Newcache uses a novel dynamic, randomized memory-to-cache mapping to thwart contention-based side-channel attacks, rather than the static mapping used by conventional set-associative caches. In this article, the authors present an improved design of Newcache, in terms of security, circuit design and simplicity. They show Newcache's security against a suite of cache side-channel attacks. They evaluate Newcache's system performance for cloud computing, smartphone, and SPEC benchmarks and find that Newcache performs as well as conventional set-associative caches, and sometimes better. They also designed a VLSI test chip with a 32-Kbyte Newcache and a 32-Kbyte, eight-way, set-associative cache and verified that the access latency, power, and area of the two caches are comparable. These results show that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
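To make the contrast concrete, a toy sketch of the difference between the static set index of a conventional set-associative cache and a dynamic, randomized memory-to-cache mapping. The table-based remapping below only illustrates the concept; it is not Newcache's actual mapping hardware.

```python
import secrets

OFFSET_BITS = 6        # 64-byte lines (illustrative geometry)
NUM_SETS = 256

def conventional_index(addr: int) -> int:
    """Static mapping used by set-associative caches: address bits pick the set."""
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1)

# Randomized mapping: a table translates the address-derived index into a random
# cache location and can be re-randomized to break contention-based attacks.
remap_table = list(range(NUM_SETS))

def rerandomize() -> None:
    """Fisher-Yates shuffle of the remapping table."""
    for i in range(NUM_SETS - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        remap_table[i], remap_table[j] = remap_table[j], remap_table[i]

def randomized_index(addr: int) -> int:
    return remap_table[conventional_index(addr)]

rerandomize()
```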

96 citations


Journal ArticleDOI
18 Jun 2016
TL;DR: This work proposes adding just enough functionality to dynamically identify instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM, allowing memory requests issued by the new Enhanced Memory Controller to experience a 20% lower latency than if issued by the core.
Abstract: On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.

94 citations


Proceedings ArticleDOI
05 Jun 2016
TL;DR: The proposed SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes, and improves performance by up to 43% and by an average of 12.5% over static cache partitioning.
Abstract: In today's multicore processors, the last-level cache is often shared by multiple concurrently running processes to make efficient use of hardware resources. However, previous studies have shown that a shared cache is vulnerable to timing channel attacks that leak confidential information from one process to another. Static cache partitioning can eliminate the cache timing channels but incurs significant performance overhead. In this paper, we propose Secure Dynamic Cache Partitioning (SecDCP), a partitioning technique that defeats cache timing channel attacks. The SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes. For cache-sensitive multiprogram workloads, our experimental results show that SecDCP improves performance by up to 43% and by an average of 12.5% over static cache partitioning.

92 citations


Proceedings ArticleDOI
22 May 2016
TL;DR: CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack.
Abstract: Recognizing the pressing demands to secure embedded applications, ARM TrustZone has been adopted in both academic research and commercial products to protect sensitive code and data in a privileged, isolated execution environment. However, the design of TrustZone cannot prevent physical memory disclosure attacks such as cold boot attack from gaining unrestricted read access to the sensitive contents in the dynamic random access memory (DRAM). A number of system-on-chip (SoC) bound execution solutions have been proposed to thaw the cold boot attack by storing sensitive data only in CPU registers, CPU cache or internal RAM. However, when the operating system, which is responsible for creating and maintaining the SoC-bound execution environment, is compromised, all the sensitive data is leaked. In this paper, we present the design and development of a cache-assisted secure execution framework, called CaSE, on ARM processors to defend against sophisticated attackers who can launch multi-vector attacks including software attacks and hardware memory disclosure attacks. CaSE utilizes TrustZone and Cache-as-RAM technique to create a cache-based isolated execution environment, which can protect both code and data of security-sensitive applications against the compromised OS and the cold boot attack. To protect the sensitive code and data against cold boot attack, applications are encrypted in memory and decrypted only within the processor for execution. The memory separation and the cache separation provided by TrustZone are used to protect the cached applications against compromised OS. We implement a prototype of CaSE on the i.MX53 running ARM Cortex-A8 processor. The experimental results show that CaSE incurs small impacts on system performance when executing cryptographic algorithms including AES, RSA, and SHA1.

Journal ArticleDOI
TL;DR: This work focuses on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network, and proposes a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing.
Abstract: Content-centric networking (CCN) is a promising framework to rebuild the Internet's forwarding substrate around the concept of content. CCN advocates ubiquitous in-network caching to enhance content delivery, and thus each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem, namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We first formulate this problem as a content placement problem and obtain the optimal solution by a two-step method. We then propose a suboptimal heuristic method based on node centrality, which is more practical in dynamic networks with frequent content publishing. We investigate through simulations the factors that affect the optimal cache allocation, and perhaps more importantly we use a real-life Internet topology and video access logs from a large-scale Internet video provider to evaluate the performance of various cache allocation methods. We observe that network topology and content popularity are two important factors that affect where exactly cache capacity should be placed. Further, the heuristic method comes with only a very limited performance penalty compared to the optimal allocation. Finally, using our findings, we provide recommendations for network operators on the best deployment of CCN cache capacity over routers.
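A small sketch of the heuristic flavor described above: distribute a total cache budget across routers in proportion to a centrality metric. The choice of betweenness centrality and the proportional split are assumptions for illustration, not necessarily the exact heuristic of the paper.

```python
import networkx as nx

def centrality_cache_allocation(graph: nx.Graph, total_budget: int) -> dict:
    """Split a total cache budget across routers proportionally to betweenness centrality."""
    centrality = nx.betweenness_centrality(graph)
    total = sum(centrality.values()) or 1.0
    # Rounding means the allocations may not sum exactly to the budget; good enough here.
    return {node: int(round(total_budget * c / total)) for node, c in centrality.items()}

# Hypothetical 6-router line topology with a total budget of 600 content-store slots.
g = nx.path_graph(6)
print(centrality_cache_allocation(g, 600))
```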

Proceedings ArticleDOI
Luna Xu, Min Li, Li Zhang, Ali R. Butt, Yandong Wang, Zane Zhenhua Hu
23 May 2016
TL;DR: MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs, and, if needed, leverages the scheduling information from the analytic framework to evict data that will not be needed in the near future.
Abstract: Memory is a crucial resource for big data processing frameworks such as Spark and M3R, where the memory is used both for computation and for caching intermediate storage data. Consequently, optimizing memory is the key to extracting high performance. The extant approach is to statically split the memory for computation and caching based on workload profiling. This approach is unable to capture the varying workload characteristics and dynamic memory demands. Another factor that affects caching efficiency is the choice of data placement and eviction policy. The extant LRU policy is oblivious of task scheduling information from the analytic frameworks, and thus can lead to lost optimization opportunities. In this paper, we address the above issues by designing MEMTUNE, a dynamic memory manager for in-memory data analytics. MEMTUNE dynamically tunes computation/caching memory partitions at runtime based on workload memory demand and in-memory data cache needs. Moreover, if needed, the scheduling information from the analytic framework is leveraged to evict data that will not be needed in the near future. Finally, MEMTUNE also supports task-level data prefetching with a configurable window size to more effectively overlap computation with I/O. Our experiments show that MEMTUNE improves memory utilization, yields an overall performance gain of up to 46%, and achieves a cache hit ratio of up to 41% compared to standard Spark.

Proceedings Article
16 Mar 2016
TL;DR: It is found that the only way to achieve both isolation-guarantee and strategy-proofness is through blocking, which is efficiently adapted in a new policy called FairRide that can lead to better cache efficiency and fairness in many scenarios.
Abstract: Memory caches continue to be a critical component of many systems. In recent years, larger amounts of data have been moving into main memory, especially in shared environments such as the cloud. The nature of such environments requires resource allocations to provide both performance isolation for multiple users/applications and high utilization for the systems. We study the problem of fair allocation of memory cache for multiple users with shared files. We find that, surprisingly, no memory allocation policy can provide all three desirable properties (isolation-guarantee, strategy-proofness and Pareto-efficiency) that are typically achievable by other types of resources, e.g., CPU or network. We also show that there exist policies that achieve any two of the three properties. We find that the only way to achieve both isolation-guarantee and strategy-proofness is through blocking, which we efficiently adapt in a new policy called FairRide. We implement FairRide in a popular memory-centric storage system using an efficient form of blocking, named expected delaying, and demonstrate that FairRide can lead to better cache efficiency (2.6× over isolated caches) and fairness in many scenarios.

Journal ArticleDOI
TL;DR: This article provides a survey on static cache analysis for real-time systems, presenting the challenges and static analysis techniques for independent programs with respect to different cache features, followed by a survey of existing tools based on static techniques for cache analysis.
Abstract: Real-time systems are reactive computer systems that must produce their reaction to a stimulus within given time bounds. A vital verification requirement is to estimate the Worst-Case Execution Time (WCET) of programs. These estimates are then used to predict the timing behavior of the overall system. The execution time of a program heavily depends on the underlying hardware, among which the cache has the biggest influence. Analyzing cache behavior is very challenging due to the versatile cache features and complex execution environment. This article provides a survey on static cache analysis for real-time systems. We first present the challenges and static analysis techniques for independent programs with respect to different cache features. Then, the discussion is extended to cache analysis in complex execution environments, followed by a survey of existing tools based on static techniques for cache analysis. Finally, an outlook for future research is provided.

Proceedings Article
22 Feb 2016
TL;DR: The paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand.
Abstract: Host-side flash caching has emerged as a promising solution to the scalability problem of virtual machine (VM) storage in cloud computing systems, but it still faces serious limitations in capacity and endurance. This paper presents CloudCache, an on-demand cache management solution to meet VM cache demands and minimize cache wear-out. First, to support on-demand cache allocation, the paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand. By predicting the RWSS online and admitting only RWS into the cache, CloudCache satisfies the workload's actual cache demand and minimizes the induced wear-out. Second, to handle situations where a cache is insufficient for the VMs' demands, the paper proposes a dynamic cache migration approach to balance cache load across hosts by live migrating cached data along with the VMs. It includes both on-demand migration of dirty data and background migration of RWS to optimize the performance of the migrating VM. It also supports rate limiting on the cache data transfer to limit the impact to the co-hosted VMs. Finally, the paper presents comprehensive experimental evaluations using real-world traces to demonstrate the effectiveness of CloudCache.
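A simplified sketch of the Reuse Working Set idea: within a time window, count only the blocks that are actually reused, and use that count as the cache demand estimate. The single fixed window and the reuse threshold of two accesses are assumptions for illustration; CloudCache's online RWSS prediction is more involved.

```python
from collections import Counter

def reuse_working_set_size(trace: list, window: int) -> list:
    """Per-window RWSS estimate: number of distinct blocks referenced at least twice."""
    sizes = []
    for start in range(0, len(trace), window):
        counts = Counter(trace[start:start + window])
        sizes.append(sum(1 for c in counts.values() if c >= 2))
    return sizes

# Hypothetical block-address trace: blocks 1 and 2 are reused, 7/8/9 are touched once.
print(reuse_working_set_size([1, 2, 1, 2, 7, 8, 9, 1], window=8))  # [2]
```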

Journal ArticleDOI
TL;DR: A novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC), filters out seldom accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: For years, the increasing popularity of flash memory has been changing storage systems. Flash-based solid-state drives (SSDs) are widely used as a new cache tier on top of hard disk drives (HDDs) to speed up data-intensive applications. However, the endurance problem of flash memory remains a concern and is getting worse with the adoption of MLC and TLC flash. In this article, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC adopts the idea of selective caching to filter out seldom accessed blocks and prevent them from entering the cache. This avoids cache pollution and preserves popular blocks in the cache for a longer period of time, leading to a higher hit rate. Meanwhile, by avoiding unnecessary cache replacements, LARC reduces the volume of data written to the SSD and yields an SSD-friendly access pattern. In this way, LARC improves the performance and endurance of the SSD at the same time. LARC is self-tuning and incurs little overhead. It has been extensively evaluated by both trace-driven simulations and synthetic benchmarks on a prototype implementation. Our experiments show that LARC outperforms state-of-the-art algorithms for different kinds of workloads and extends SSD lifetime by up to 15.7 times.
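A compact sketch of the selective-caching idea: a block is admitted into the SSD cache only after it is referenced again while sitting in a small ghost (candidate) queue, which filters out one-touch blocks and avoids the corresponding SSD writes. Queue sizing and details here are simplifications, not LARC's exact policy.

```python
from collections import OrderedDict

class LazySelectiveCache:
    """Toy selective cache: admit a block only on its second recent reference."""

    def __init__(self, capacity: int, ghost_capacity: int):
        self.cache = OrderedDict()   # admitted blocks (would live on the SSD)
        self.ghost = OrderedDict()   # candidate block IDs only, no data stored
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity

    def access(self, block: int) -> bool:
        """Return True on a cache hit."""
        if block in self.cache:
            self.cache.move_to_end(block)
            return True
        if block in self.ghost:
            # Second reference: promote into the real cache (the only SSD write).
            del self.ghost[block]
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)        # evict LRU block
            self.cache[block] = True
        else:
            # First reference: remember it as a candidate, do not pollute the cache.
            self.ghost[block] = True
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return False
```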

Proceedings ArticleDOI
25 Feb 2016
TL;DR: An STT-MRAM-based last-level cache memory (LLC) is presented, using novel power optimization with high-speed power gating (HS-PG) that considers processor architectures and cache memory accesses, and achieving high reliability by reducing the write-error rate with a novel write-verify-write scheme.
Abstract: Two performance gaps in the memory hierarchy, between CPU cache and main memory, and main memory and mass storage, will become increasingly severe bottlenecks for computing-system performance. Although it is necessary to increase memory capacity to fill these gaps, power also increases when conventional volatile memories are used. A new nonvolatile memory for this purpose has been anticipated. Storage class memory is used to fill the second gap. Many candidates exist: ReRAM, PRAM, and 3D-cross point type with resistive change RAM. However, a nonvolatile last-level cache (LLC) is used to fill the first gap. Advanced STT-MRAM has achieved sub-4ns read and write accesses with perpendicular magnetic tunnel junctions (p-MTJ) [1–2]. Furthermore, mature integration processes have been developed and 8Mb STT-MRAM with sub-5ns operation has shown high reliability [3]. Moreover, because of its non-volatility, STT-MRAM can reduce operation energy by more than 81% compared to SRAM for cache [1]. This paper presents an STT-MRAM-based last-level cache memory (LLC) including the MRAM memory core, peripherals and cache logic circuits, using novel power optimization with high-speed power gating (HS-PG), considering processor architectures and cache memory accesses. The STT-MRAM-based cache has high reliability to reduce the write-error rate with a novel write-verify-write scheme. Furthermore, a read-modify-write scheme is implemented to reduce active power without penalty. Figure 7.2.1 presents a block diagram of a 4Mb STT-MRAM based cache.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: A new probabilistic cache model designed for high-performance replacement policies is presented that uses absolute reuse distances instead of stack distances and models replacement policies as abstract ranking functions, which together allow arbitrary age-based replacement policies to be modeled.
Abstract: Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
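To illustrate the distinction the model relies on, a small sketch that computes both quantities from an access trace: the absolute reuse distance counts all intervening accesses since the previous access to the same line, while the stack distance counts only the distinct intervening lines.

```python
def reuse_and_stack_distances(trace):
    """Return (absolute reuse distance, stack distance) per access; None on first touch."""
    last_pos = {}
    results = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            prev = last_pos[addr]
            absolute = i - prev - 1                      # all intervening accesses
            stack = len(set(trace[prev + 1:i]))          # distinct intervening lines
            results.append((absolute, stack))
        else:
            results.append((None, None))
        last_pos[addr] = i
    return results

# Example: for the trace A B B A, the second A has reuse distance 2 but stack distance 1.
print(reuse_and_stack_distances(["A", "B", "B", "A"]))
```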

Proceedings Article
22 Jun 2016
TL;DR: Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8× compared with state-of-the-art static and dynamic algorithms.
Abstract: Cloud providers must dynamically allocate their physical resources to the right client to maximize the benefit that they can get out of given hardware. Cache Allocation Technology (CAT) makes it possible for the provider to allocate last level cache to virtual machines to prevent cache pollution. The provider can also allocate the cache to optimize client benefit. But how should it optimize client benefit, when it does not even know what the client plans to do? We present an auction-based mechanism that dynamically allocates cache while optimizing client benefit and improving hardware utilization. We evaluate our mechanism on benchmarks from the Phoronix Test Suite. Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8× compared with state-of-the-art static and dynamic algorithms.

Patent
12 Sep 2016
TL;DR: A cache memory has cache memory circuitry comprising a nonvolatile memory cell to store at least a portion of a data which is stored or is to be stored in a lower-level memory than the cache memory as mentioned in this paper.
Abstract: A cache memory has cache memory circuitry comprising a nonvolatile memory cell to store at least a portion of a data which is stored or is to be stored in a lower-level memory than the cache memory circuitry, a first redundancy code storage comprising a nonvolatile memory cell capable of storing a redundancy code of the data stored in the cache memory circuitry, and a second redundancy code storage comprising a volatile memory cell capable of storing the redundancy code.

Journal ArticleDOI
TL;DR: A write intensity predictor is designed that realizes the idea by exploiting a correlation between the write intensity of blocks and the memory access instructions that incur cache misses of those blocks, along with a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region.
Abstract: Spin-transfer torque RAM (STT-RAM) has emerged as an energy-efficient and high-density alternative to SRAM for large on-chip caches. However, its high write energy has been considered a serious drawback. Hybrid caches mitigate this problem by incorporating a small SRAM cache for write-intensive data along with an STT-RAM cache. In such architectures, choosing cache blocks to be placed into the SRAM cache is the key to their energy efficiency. This paper proposes a new hybrid cache architecture called prediction hybrid cache. The key idea is to predict the write intensity of cache blocks at the time of cache misses and determine block placement based on the prediction. We design a write intensity predictor that realizes the idea by exploiting a correlation between write intensity of blocks and memory access instructions that incur cache misses of those blocks. It includes a mechanism to dynamically adapt the predictor to application characteristics. We also design a hybrid cache architecture in which write-intensive blocks identified by the predictor are placed into the SRAM region. Evaluations show that our scheme reduces energy consumption of hybrid caches by 28 percent (31 percent) on average compared to the existing hybrid cache architecture in a single-core (quad-core) system.
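A toy sketch of the prediction mechanism described above: a small table of saturating counters indexed by (a hash of) the PC of the instruction that caused the miss; the counter state then decides whether the incoming block goes to the SRAM or the STT-RAM region. Table size, hashing, and thresholds are illustrative assumptions, not the paper's exact design.

```python
class WriteIntensityPredictor:
    """PC-indexed table of 2-bit saturating counters predicting write-intensive blocks."""

    def __init__(self, entries: int = 1024, threshold: int = 2):
        self.entries = entries
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, miss_pc: int) -> int:
        return (miss_pc >> 2) % self.entries       # drop instruction-alignment bits

    def predict_sram(self, miss_pc: int) -> bool:
        """On a cache miss: place the block in SRAM if predicted write-intensive."""
        return self.table[self._index(miss_pc)] >= self.threshold

    def train(self, miss_pc: int, was_write_intensive: bool) -> None:
        """On eviction: update the counter of the PC that originally fetched the block."""
        i = self._index(miss_pc)
        if was_write_intensive:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
```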

Proceedings ArticleDOI
26 Sep 2016
TL;DR: This paper proposes a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where each provider is associated with each content provider a utility that is a function of the hit rate to its content.
Abstract: In-network cache deployment is recognized as an effective technique for reducing content access delay. Caches serve content from multiple content providers, and wish to provide them differentiated services due to monetary incentives and legal obligations. Partitioning is a common approach in providing differentiated storage services. In this paper, we propose a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where we associate with each content provider a utility that is a function of the hit rate to its content. A cache is partitioned into slices with each partition being dedicated to a particular content provider. We formulate an optimization problem where the objective is to maximize the sum of weighted utilities over all content providers through proper cache partitioning, and mathematically show its convexity. We also give a formal proof that partitioning the cache yields better performance compared to sharing it. We validate the effectiveness of cache partitioning through numerical evaluations, and investigate the impact of various factors (e.g., content popularity, request rate) on the hit rates observed by contending content providers.
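In the notation of the abstract, the partitioning problem can be written as a weighted utility maximization over slice sizes; a sketch of that formulation follows (the symbol names are ours, not the paper's).

```latex
% Cache of total size C shared by P content providers; provider i receives a slice of
% size c_i and derives utility U_i(h_i(c_i)) from the resulting hit rate h_i(c_i),
% weighted by w_i.
\[
\begin{aligned}
\max_{c_1,\dots,c_P}\quad & \sum_{i=1}^{P} w_i\, U_i\bigl(h_i(c_i)\bigr) \\
\text{s.t.}\quad          & \sum_{i=1}^{P} c_i \le C, \qquad c_i \ge 0,\ i = 1,\dots,P.
\end{aligned}
\]
```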

Proceedings ArticleDOI
15 Oct 2016
TL;DR: OSCAR is proposed to Orchestrate STT-RAM Caches traffic for heterogeneous ARchitectures, with an integration of asynchronous batch scheduling and priority-based allocation for the on-chip interconnect to maximize the potential of STT-RAM based LLC.
Abstract: As we integrate data-parallel GPUs with general-purpose CPUs on a single chip, the enormous cache traffic generated by GPUs will not only exhaust the limited cache capacity, but also severely interfere with CPU requests. Such heterogeneous multicores pose significant challenges to the design of shared last-level cache (LLC). This problem can be mitigated by replacing SRAM LLC with emerging non-volatile memories like Spin-Transfer Torque RAM (STT-RAM), which provides larger cache capacity and near-zero leakage power. However, without careful design, the slow write operations of STT-RAM may offset the capacity benefit, and the system may still suffer from contention in the shared LLC and on-chip interconnects. While there are cache optimization techniques to alleviate such problems, we reveal that the true potential of STT-RAM LLC may still be limited because now that the cache hit rate has been improved by the increased capacity, the on-chip network can become a performance bottleneck. CPU and GPU packets contend with each other for the shared network bandwidth. Moreover, the mixed-criticality read/write packets to STT-RAM add another layer of complexity to the network resource allocation. Therefore, being aware of the disparate latency tolerance of CPU/GPU applications and the asymmetric read/write latency of STT-RAM, we propose OSCAR to Orchestrate STT-RAM Caches traffic for heterogeneous ARchitectures. Specifically, an integration of asynchronous batch scheduling and priority based allocation for on-chip interconnect is proposed to maximize the potential of STT-RAM based LLC. Simulation results on a 28-GPU and 14-CPU system demonstrate an average of 17.4% performance improvement for CPUs, 10.8% performance improvement for GPUs, and 28.9% LLC energy saving compared to SRAM based LLC design.

Proceedings ArticleDOI
15 Mar 2016
TL;DR: This paper proposes a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts and employs an asymmetric low-power 5T-SRAM structure which has high reliability for majority 'one' data.
Abstract: Static Random Access Memories (SRAMs) occupy a large area of today's microprocessors, and are a prime source of leakage power in highly scaled technologies. Low leakage and high density Spin-Transfer Torque RAMs (STT-RAMs) are ideal candidates for a power-efficient memory. However, STT-RAM suffers from high write energy and latency, especially when writing ‘one’ data. In this paper we propose a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts. To exploit the new resultant data distribution in the SRAM partition, we employ an asymmetric low-power 5T-SRAM structure which has high reliability for majority ‘one’ data. The proposed design significantly reduces the number of writes and hence dynamic energy in both STT-RAM and SRAM partitions. We employed a write cache policy and a small swap memory to control data migration between cache partitions. Our evaluation on UltraSPARC-III processor shows that utilizing STT-RAM/6T-SRAM and STT-RAM/5T-SRAM architectures for the L2 cache results in 42% and 53% energy efficiency, 9.3% and 9.1% performance improvement and 16.9% and 20.3% area efficiency respectively, with respect to SRAM-based cache running SPEC CPU 2006 benchmarks.
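A minimal sketch of the data-aware placement rule described above: count the one-bits of the incoming block and steer majority-'one' blocks toward the 5T-SRAM partition (reliable for ones) and the rest toward STT-RAM (where writing ones is expensive). The half-capacity threshold is an illustrative assumption.

```python
def count_ones(block: bytes) -> int:
    """Number of set bits in a cache block."""
    return sum(bin(b).count("1") for b in block)

def choose_partition(block: bytes) -> str:
    """Steer majority-'one' blocks to the 5T-SRAM partition, others to STT-RAM."""
    total_bits = 8 * len(block)
    return "SRAM" if count_ones(block) > total_bits // 2 else "STT-RAM"

# Hypothetical 8-byte blocks: an all-ones block goes to SRAM, a sparse one to STT-RAM.
print(choose_partition(b"\xff" * 8))        # SRAM
print(choose_partition(b"\x01" * 8))        # STT-RAM
```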

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This work proposes selective caching, wherein GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU is disallowed, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract: Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.

Proceedings ArticleDOI
15 Oct 2016
TL;DR: The Bunker Cache is proposed, a design that maps similar data to the same cache storage location based solely on their memory address, trading some application quality for greater efficiency.
Abstract: The cost of moving and storing data is still a fundamental concern for computer architects. Inefficient handling of data can be attributed to conventional architectures being oblivious to the nature of the values that these data bits carry. We observe the phenomenon of spatio-value similarity, where data elements that are approximately similar in value exhibit spatial regularity in memory. This is inherent to 1) the data values of real-world applications, and 2) the way we store data structures in memory. We propose the Bunker Cache, a design that maps similar data to the same cache storage location based solely on their memory address, sacrificing some application quality loss for greater efficiency. The Bunker Cache enables performance gains (ranging from 1.08x to 1.19x) via reduced cache misses and energy savings (ranging from 1.18x to 1.39x) via reduced off-chip memory accesses and lower cache storage requirements. The Bunker Cache requires only modest changes to cache indexing hardware, integrating easily into commodity systems.

Proceedings ArticleDOI
Ning Zhang, He Sun, Kun Sun, Wenjing Lou, Y. Thomas Hou
21 Mar 2016
TL;DR: A new rootkit called CacheKit is developed that hides in the cache of the normal world and is able to evade memory introspection from the secure world and has small performance impacts on the rich OS.
Abstract: With the growing importance of networked embedded devices in the upcoming Internet of Things, new attacks targeting embedded OSes are emerging. ARM processors, which power over 60% of embedded devices, introduce a hardware security extension called TrustZone to protect secure applications in an isolated secure world that cannot be manipulated by a compromised OS in the normal world. Leveraging TrustZone technology, a number of memory integrity checking schemes have been proposed in the secure world to introspect malicious memory modification of the normal world. In this paper, we first discover and verify an ARM TrustZone cache incoherence behavior, which results in the cache contents of the two worlds, secure and non-secure, potentially being different even when they are mapped to the same physical address. Furthermore, code in one TrustZone world cannot access the cache content in the other world. Based on this observation, we develop a new rootkit called CacheKit that hides in the cache of the normal world and is able to evade memory introspection from the secure world. We implement a CacheKit prototype on Cortex-A8 processors after solving a number of challenges. First, we employ the Cache-as-RAM technique to ensure that the malicious code is only loaded into the CPU cache and not RAM. Thus, the secure world cannot detect the existence of the malicious code by examining the RAM. Second, we use the ARM processor's hardware support on cache settings to keep the malicious code persistent in the cache. Third, to evade introspection that flushes cache content back into RAM, we utilize physical addresses from the I/O address range that is not backed by any real I/O devices or RAM. The experimental results show that CacheKit can successfully evade memory introspection from the secure world and has small performance impacts on the rich OS. We discuss potential countermeasures to detect this type of rootkit attack.

Journal ArticleDOI
TL;DR: This work proposes a novel, simple, and effective wear-leveling technique with negligible performance overhead of <0.4% for memory-intensive workloads and shows that the lifetime of the NV-cache is boosted by up to 13× for different cache configurations.
Abstract: Emerging nonvolatile memory technologies, such as spin-transfer torque RAM or resistive RAM, can increase the capacity of the last-level cache (LLC) in a latency and power-efficient manner. These technologies endure $10^{9}$–$10^{12}$ writes per cell, making a nonvolatile cache (NV-cache) with a lifetime of dozens of years under ideal working conditions. However, nonuniformity in writes to different cache lines considerably reduces the NV-cache lifetime to a few months. Writes to cache lines can be made uniform by wear-leveling. A suitable wear-leveling scheme for an NV-cache should not incur high storage and performance overheads. We propose a novel, simple, and effective wear-leveling technique with negligible performance overhead (<0.4% for memory-intensive workloads), which boosts the lifetime of the NV-cache by up to 13× for different cache configurations.

Proceedings ArticleDOI
04 Aug 2016
TL;DR: NVMcached, a KV cache for non-volatile byte-addressable memory that can significantly reduce the use of cache flushes and minimize data loss by leveraging consistency-friendly data structures and batched space allocation and reclamation, is designed and evaluated.
Abstract: As byte-addressable, high-density, and non-volatile memory (NVM) is around the corner to be equipped alongside the DRAM memory, issues in enabling important key-value (KV) cache services, such as memcached, on the new storage medium must be addressed. While NVM allows data in a KV cache to survive power outages and system crashes, in practice its integrity and accessibility depend on data consistency enforced during writes to NVM. Though techniques for enforcing the consistency, such as journaling, COW, or checkpointing, are available, they are often too expensive because they frequently use CPU cache flushes to ensure crash consistency, leading to (much) reduced performance and an excessively compromised NVM lifetime. In this paper we design and evaluate NVMcached, a KV cache for non-volatile byte-addressable memory that can significantly reduce the use of flushes and minimize data loss by leveraging consistency-friendly data structures and batched space allocation and reclamation. Experiments show that NVMcached can improve its system throughput by up to 2.8x for write-intensive real-world workloads, compared to a non-volatile memcached.