
Showing papers on "Cache pollution" published in 2017


Proceedings Article
01 Feb 2017
TL;DR: This paper presents the Compute Cache architecture that enables in-place computation in caches, which uses emerging bit-line SRAM circuit technology to repurpose existing cache elements and transforms them into active very large vector computational units.
Abstract: This paper presents the Compute Cache architecture that enables in-place computation in caches. Compute Caches use emerging bit-line SRAM circuit technology to repurpose existing cache elements and transform them into active very large vector computational units. They also significantly reduce the overheads in moving data between different levels in the cache hierarchy. Solutions to satisfy new constraints imposed by Compute Caches, such as operand locality, are discussed, as are simple solutions to problems in integrating them into a conventional cache hierarchy while preserving properties such as coherence, consistency, and reliability. Compute Caches increase performance by 1.9× and reduce energy by 2.4× for a suite of data-centric applications, including text and database query processing, cryptographic kernels, and in-memory checkpointing. Applications with a larger fraction of Compute Cache operations could benefit even more, as our micro-benchmarks indicate (54× throughput, 9× dynamic energy savings).

225 citations


Proceedings Article
16 Aug 2017
TL;DR: Cloak, a new technique that uses hardware transactional memory to prevent adversarial observation of cache misses on sensitive code and data, provides strong protection against all known cache-based side-channel attacks with low performance overhead.
Abstract: Cache-based side-channel attacks are a serious problem in multi-tenant environments, for example, modern cloud data centers. We address this problem with Cloak, a new technique that uses hardware transactional memory to prevent adversarial observation of cache misses on sensitive code and data. We show that Cloak provides strong protection against all known cache-based side-channel attacks with low performance overhead. We demonstrate the efficacy of our approach by retrofitting vulnerable code with Cloak and experimentally confirming immunity against state-of-the-art attacks. We also show that by applying Cloak to code running inside Intel SGX enclaves we can effectively block information leakage through cache side channels from enclaves, thus addressing one of the main weaknesses of SGX.

194 citations


Proceedings ArticleDOI
24 Jun 2017
TL;DR: This paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes, and calls it SHARP (Secure Hierarchy-Aware cache Replacement Policy).
Abstract: In cache-based side channel attacks, a spy that shares a cache with a victim probes cache locations to extract information on the victim's access patterns. For example, in evict+reload, the spy repeatedly evicts and then reloads a probe address, checking if the victim has accessed the address in between the two operations. While there are many proposals to combat these cache attacks, they all have limitations: they either hurt performance, require programmer intervention, or can only defend against some types of attacks. This paper makes the following observation for an environment with an inclusive cache hierarchy: when the spy evicts the probe address from the shared cache, the address will also be evicted from the private cache of the victim process, creating an inclusion victim. Consequently, to disable cache attacks, this paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes. By enforcing this rule, the spy cannot evict the probe address from the shared cache and, hence, cannot glimpse any information on the victim's access patterns. We call our proposal SHARP (Secure Hierarchy-Aware cache Replacement Policy). SHARP efficiently defends against all existing cross-core shared-cache attacks, needs only minimal hardware modifications, and requires no code modifications. We implement SHARP in a cycle-level full-system simulator. We show that it protects against real-world attacks, and that it introduces negligible average performance degradation.

109 citations
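
As a rough illustration of the replacement rule described above, here is a simplified sketch; the data structures, the priority order, and the random fallback are assumptions for illustration, not SHARP's actual hardware implementation.

import random

def pick_victim(lru_order, requesting_core, private_presence):
    """lru_order: line tags of one shared-cache set, least recently used first.
    private_presence[tag]: set of core ids whose private caches currently hold the line."""
    # 1) Prefer a line cached in no private cache: evicting it creates no inclusion victim.
    for tag in lru_order:
        if not private_presence.get(tag):
            return tag
    # 2) Otherwise prefer a line cached only by the requesting core itself.
    for tag in lru_order:
        if private_presence.get(tag) == {requesting_core}:
            return tag
    # 3) Last resort: pick randomly rather than letting the requester deterministically
    #    evict another core's line (and thereby observe that core's access pattern).
    return random.choice(lru_order)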


Proceedings ArticleDOI
25 Jun 2017
TL;DR: In this article, the authors considered a basic caching system, where a single server with a database of N files (e.g. movies) is connected to a set of K users through a shared bottleneck link.
Abstract: We consider a basic caching system, where a single server with a database of N files (e.g., movies) is connected to a set of K users through a shared bottleneck link. Each user has a local cache memory with a size of M files. The system operates in two phases: a placement phase, where each cache memory is populated up to its size from the database, and a following delivery phase, where each user requests a file from the database, and the server is responsible for delivering the requested contents. The objective is to design the two phases to minimize the load (peak or average) of the bottleneck link. We characterize the rate-memory tradeoff of the above caching system within a factor of 2.00884 for both the peak rate and the average rate (under uniform file popularity), where the best previously proved characterizations give factors of 4 and 4.7, respectively. Moreover, in the practically important case where the number of files (N) is large, we exactly characterize the tradeoff for systems with no more than 5 users, and characterize the tradeoff within a factor of 2 otherwise. We establish these results by developing novel information theoretic outer-bounds for the caching problem, which improve the state of the art and give tight characterizations in various cases.

81 citations
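
For background (standard material, not a result quoted from the abstract): in this N-file, K-user setting with per-user cache size M, the classic Maddah-Ali–Niesen coded caching scheme achieves the peak delivery rate

\[
R(M) \;=\; K\left(1-\frac{M}{N}\right)\cdot\frac{1}{1+KM/N},
\qquad M \in \left\{0,\tfrac{N}{K},\tfrac{2N}{K},\dots,N\right\},
\]

with intermediate memory sizes handled by memory sharing. Results such as the one summarized above tighten the provable gap between achievable rates of this form and information-theoretic lower bounds.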


Proceedings ArticleDOI
14 Oct 2017
TL;DR: A novel probabilistic information flow graph is proposed to model the interaction between the victim program, the attacker program and the cache architecture, and a new metric, the Probability of Attack Success (PAS), is derived, which gives a quantitative measure for evaluating a cache’s resilience against a given class of cache side-channel attacks.
Abstract: Security-critical data can leak through very unexpected side channels, making side-channel attacks very dangerous threats to information security. Of these, cache-based side-channel attacks are some of the most problematic. This is because caches are essential for the performance of modern computers, but an intrinsic property of all caches – the different access times for cache hits and misses – is the property exploited to leak information in time-based cache side-channel attacks. Recently, different secure cache architectures have been proposed to defend against these attacks. However, we do not have a reliable method for evaluating a cache's resilience against different classes of cache side-channel attacks, which is the goal of this paper. We first propose a novel probabilistic information flow graph (PIFG) to model the interaction between the victim program, the attacker program and the cache architecture. From this model, we derive a new metric, the Probability of Attack Success (PAS), which gives a quantitative measure for evaluating a cache's resilience against a given class of cache side-channel attacks. We show the generality of our model and metric by applying them to evaluate nine different cache architectures against all four classes of cache side-channel attacks. Our new methodology, model and metric can help verify the security provided by different proposed secure cache architectures, and compare them in terms of their resilience to cache side-channel attacks, without the need for simulation or taping out a chip. CCS Concepts: Security and privacy → Side-channel analysis and countermeasures; General and reference → Evaluation; Computer systems organization → Processors and memory architectures.

72 citations


Proceedings ArticleDOI
Meng Xu, Linh Thi Xuan Phan, Hyon-Young Choi, Insup Lee
01 Apr 2017
TL;DR: In this paper, the authors present vCAT, a novel design for dynamic shared cache management on multicore virtualization platforms based on Intel's cache allocation technology (CAT), which achieves strong isolation at both task and VM levels through cache partition virtualization.
Abstract: This paper presents vCAT, a novel design for dynamic shared cache management on multicore virtualization platforms based on Intel's Cache Allocation Technology (CAT). Our design achieves strong isolation at both task and VM levels through cache partition virtualization, which works in a similar way as memory virtualization, but has challenges that are unique to cache and CAT. To demonstrate the feasibility and benefits of our design, we provide a prototype implementation of vCAT, and we present an extensive set of microbenchmarks and performance evaluation results on the PARSEC benchmarks and synthetic workloads, for both static and dynamic allocations. The evaluation results show that (i) vCAT can be implemented with minimal overhead, (ii) it can be used to mitigate shared cache interference, which could otherwise have increased task WCETs by up to 7.2×, (iii) static management in vCAT can increase system utilization by up to 7× compared to a system without cache management, and (iv) dynamic management substantially outperforms static management in terms of schedulable utilization (an increase of up to 3× in our multi-mode example use case).

60 citations
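
For orientation (this is not vCAT itself): the Intel CAT mechanism that vCAT virtualizes is exposed on bare-metal Linux through the resctrl filesystem. A minimal sketch, assuming resctrl is mounted and the script runs as root; the group name, bitmask, and PIDs are made-up examples.

import os

RESCTRL = "/sys/fs/resctrl"   # requires: mount -t resctrl resctrl /sys/fs/resctrl

def create_cat_partition(name, l3_cbm_hex, pids):
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # Limit this group to the L3 ways selected by the capacity bitmask (CBM).
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write("L3:0=%s\n" % l3_cbm_hex)   # cache id 0; e.g. "00ff" = the low 8 ways
    # Assign tasks to the partition (one PID per write, as resctrl expects).
    for pid in pids:
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))

# Hypothetical usage: confine a VM's vCPU threads to 8 ways of L3 cache 0.
# create_cat_partition("vm1", "00ff", [12345, 12346])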


Proceedings ArticleDOI
19 Mar 2017
TL;DR: This paper proposes an optimization framework for cache placement and delivery schemes which explicitly accounts for the heterogeneity of the cache sizes, and explicitly characterizes the optimal caching scheme for the case where the sum of the users' cache sizes is smaller than or equal to the library size.
Abstract: Coded caching can improve fundamental limits of communication, utilizing storage memory at individual users. This paper considers a centralized coded caching system, introducing heterogeneous cache sizes at the users, i.e., the users' cache memories are of different size. The goal is to design cache placement and delivery policies that minimize the worst-case delivery load on the server. To that end, the paper proposes an optimization framework for cache placement and delivery schemes which explicitly accounts for the heterogeneity of the cache sizes. We also characterize explicitly the optimal caching scheme, for the case where the sum of the users' cache sizes is smaller than or equal to the library size.

59 citations


Proceedings ArticleDOI
24 Jun 2017
TL;DR: This paper proposes Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp, and uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on load locality characterization.
Abstract: Long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (a collection of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought by a warp load instruction is used only once, which is classified as streaming data; (2) data brought by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought by a warp is reused multiple times but across different warps, called inter-warp locality; and (4) some data exhibit a mix of both intra- and inter-warp locality. Furthermore, each load instruction consistently exhibits the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must be done using per-load locality type information, rather than applying warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on load locality characterization. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache sensitive applications while saving 27% of energy consumption over a baseline GPU.

57 citations
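
A toy sketch of the per-load classification idea described above. The counter names, the decision rules, and the resulting cache actions are illustrative assumptions, not the paper's hardware design.

def classify_load(intra_warp_reuses, inter_warp_reuses):
    # Reuse counts observed while monitoring one exemplary warp's instance of a load.
    if intra_warp_reuses == 0 and inter_warp_reuses == 0:
        return "STREAMING"      # data used once
    if intra_warp_reuses > 0 and inter_warp_reuses == 0:
        return "INTRA_WARP"     # reused within the same warp
    if intra_warp_reuses == 0 and inter_warp_reuses > 0:
        return "INTER_WARP"     # reused across different warps
    return "MIXED"              # both reuse patterns present

def cache_action(locality):
    # Per-load policy applied to all warps executing that load.
    return {"STREAMING": "bypass",      # do not pollute the cache
            "INTRA_WARP": "normal",
            "INTER_WARP": "pin",        # keep resident so other warps can hit
            "MIXED": "pin"}[locality]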


Proceedings Article
12 Jul 2017
TL;DR: Web application performance heavily relies on the hit rate of DRAM key-value caches, and Memshare provides a resource sharing model that guarantees reserved memory to different applications while dynamically pooling and sharing the remaining memory to optimize overall hit rate.
Abstract: Web application performance heavily relies on the hit rate of DRAM key-value caches. Current DRAM caches statically partition memory across applications that share the cache. This results in underutilization and limits cache hit rates. We present Memshare, a DRAM key-value cache that dynamically manages memory across applications. Memshare provides a resource sharing model that guarantees reserved memory to different applications while dynamically pooling and sharing the remaining memory to optimize the overall hit rate. Key-value caches are typically memory capacity bound, which leaves cache server CPU and memory bandwidth idle. Memshare leverages these resources with a log-structured design that allows it to provide better hit rates than conventional caches by dynamically repartitioning memory among applications. We implemented Memshare and ran it on a week-long trace from a commercial memcached provider. Memshare increases the combined hit rate of the applications in the trace from 84.7% to 90.8%, and it reduces the total number of misses by 39.7% without significantly affecting cache throughput or latency. Even for single-tenant applications, Memshare increases the average hit rate of the state-of-the-art key-value cache by an additional 2.7%.

57 citations
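
A toy sketch of the resource-sharing model described above: every application keeps its reserved memory, and the pooled remainder is granted to whichever application is estimated to benefit most. The benefit estimator, slab granularity, and greedy loop are assumptions for illustration, not Memshare's actual log-structured implementation.

def partition_memory(total_bytes, reserved, marginal_hit_gain, slab=1 << 20):
    """reserved: dict app -> guaranteed bytes.
    marginal_hit_gain(app, current_bytes): estimated extra hits/s from one more slab."""
    alloc = dict(reserved)                       # every app keeps its reservation
    pooled = total_bytes - sum(reserved.values())
    while pooled >= slab:
        # Grant the next slab of pooled memory to the app with the highest marginal benefit.
        best = max(alloc, key=lambda a: marginal_hit_gain(a, alloc[a]))
        alloc[best] += slab
        pooled -= slab
    return alloc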


Journal ArticleDOI
TL;DR: The CLCE replication scheme reduces redundant caching of contents and hence improves cache space utilization, while LFRU approximates the least frequently used scheme coupled with the least recently used scheme and is practically implementable for rapidly changing cache networks like ICNs.
Abstract: To cope with the ongoing changing demands of the internet, ‘in-network caching’ has been presented as an application solution for two decades. With the advent of the information-centric network (ICN) architecture, ‘in-network caching’ becomes a network-level solution. Some unique features of ICNs, e.g., rapidly changing cache states, higher request arrival rates, smaller cache sizes, and other factors, impose diverse requirements on the content eviction policies. In particular, eviction policies should be fast and lightweight. In this paper, we propose cache replication and eviction schemes, conditional leave copy everywhere (CLCE) and least frequent recently used (LFRU), which are well suited for the ICN type of cache networks (CNs). The CLCE replication scheme reduces the redundant caching of contents and hence improves cache space utilization. LFRU approximates the least frequently used scheme coupled with the least recently used scheme and is practically implementable for rapidly changing cache networks like ICNs.

57 citations
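
A rough single-list illustration of combining frequency with recency in the spirit of LFRU. The actual scheme partitions the cache into privileged and unprivileged parts; this simplified version and its window parameter are assumptions for illustration only.

from collections import OrderedDict

class LfruLikeCache:
    def __init__(self, capacity, window=4):
        self.capacity = capacity
        self.window = window          # how many LRU candidates to consider on eviction
        self.data = OrderedDict()     # key -> (value, freq); insertion order = recency

    def get(self, key):
        if key not in self.data:
            return None
        value, freq = self.data.pop(key)
        self.data[key] = (value, freq + 1)   # move to MRU position, bump frequency
        return value

    def put(self, key, value):
        if key in self.data:
            _, freq = self.data.pop(key)
            self.data[key] = (value, freq + 1)
            return
        if len(self.data) >= self.capacity:
            # Among the `window` least recently used entries, evict the least frequent one.
            candidates = list(self.data.items())[: self.window]
            victim = min(candidates, key=lambda kv: kv[1][1])[0]
            del self.data[victim]
        self.data[key] = (value, 1)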


Proceedings ArticleDOI
24 Jun 2017
TL;DR: Jenga is proposed, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications, and builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime.
Abstract: Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications. We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: DICE is proposed, a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data, and low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access in order to avoid probing both indices for retrieving a given cache line.
Abstract: This paper investigates compression for DRAM caches. As the capacity of DRAM cache is typically large, prior techniques on cache compression, which solely focus on improving cache capacity, provide only a marginal benefit. We show that more performance benefit can be obtained if the compression of the DRAM cache is tailored to provide higher bandwidth. If a DRAM cache can provide two compressed lines in a single access, and both lines are useful, the effective bandwidth of the DRAM cache would double. Unfortunately, it is not straightforward to compress DRAM caches for bandwidth. The typically used Traditional Set Indexing (TSI) maps consecutive lines to consecutive sets, so the multiple compressed lines obtained from the set are from spatially distant locations and unlikely to be used within a short period of each other. We can change the indexing of the cache to place consecutive lines in the same set to improve bandwidth; however, when the data is incompressible, such spatial indexing reduces effective capacity and causes significant slowdown. Ideally, we would like to have spatial indexing when the data is compressible and TSI otherwise. To this end, we propose Dynamic-Indexing Cache comprEssion (DICE), a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data. We also propose low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access in order to avoid probing both indices for retrieving a given cache line. Our studies with a 1GB DRAM cache, on a wide range of workloads (including SPEC and Graph), show that DICE improves performance by 19.0% and reduces energy-delay-product by 36% on average. DICE is within 3% of a design that has double the capacity and double the bandwidth. DICE incurs a storage overhead of less than 1KB and does not rely on any OS support.
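
A simplified sketch of the two indexing schemes contrasted above. The set count and the exact bit selection are assumed example values, not DICE's design.

LINE_BYTES = 64
NUM_SETS   = 1 << 14          # example DRAM-cache set count

def tsi_index(addr):
    # Traditional Set Indexing: consecutive 64B lines map to consecutive sets.
    return (addr // LINE_BYTES) % NUM_SETS

def spatial_index(addr):
    # Spatial indexing: drop one index bit so two adjacent lines share a set,
    # letting one access return both lines when they compress into a single slot.
    return (addr // (2 * LINE_BYTES)) % NUM_SETS

a, b = 0x1000, 0x1040                         # two consecutive cache lines
assert tsi_index(a) != tsi_index(b)           # different sets under TSI
assert spatial_index(a) == spatial_index(b)   # same set under spatial indexing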

Proceedings ArticleDOI
04 Apr 2017
TL;DR: This paper proposes a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms and removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance.
Abstract: Data prefetching and cache replacement algorithms have been intensively studied in the design of high performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not consider demand and prefetch requests as different types of requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests since the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC cache management has three novel contributions. First, a prefetcher which approximates the future use distance of prefetch requests based on its prediction confidence. Second, a simple replacement policy provides similar or better performance than current state-of-the-art PC-based prediction using global hysteresis. Third, KPC integrates prefetching and replacement policy into a whole system which is greater than the sum of its parts. Information from the prefetcher is used to improve the performance of the replacement policy and vice-versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC- and non-PC-based schemes. Our evaluation shows that KPC provides 8% better performance than the best combination of existing prefetcher and replacement policy for multi-core workloads.

Proceedings Article
27 Feb 2017
TL;DR: This paper proposes a co-design approach to bridge the semantic gap between the key-value cache manager and the underlying flash devices, which can maximize the efficiency of key-value caching on flash devices while minimizing its weaknesses.
Abstract: In recent years, flash-based key-value cache systems have raised high interest in industry, such as Facebook's McDipper and Twitter's Fatcache. These cache systems typically use commercial SSDs to store and manage key-value cache data in flash. Such a practice, though simple, is inefficient due to the huge semantic gap between the key-value cache manager and the underlying flash devices. In this paper, we advocate to reconsider the cache system design and directly open device-level details of the underlying flash storage for key-value caching. This co-design approach bridges the semantic gap and well connects the two layers together, which allows us to leverage both the domain knowledge of key-value caches and the unique device properties. In this way, we can maximize the efficiency of key-value caching on flash devices while minimizing its weakness. We implemented a prototype, called DIDACache, based on the Open-Channel SSD platform. Our experiments on real hardware show that we can significantly increase the throughput by 35.5%, reduce the latency by 23.6%, and remove unnecessary erase operations by 28%.

Journal ArticleDOI
TL;DR: This article presents a survey of techniques for partitioning shared caches in multicore processors, categorizes the techniques based on important characteristics, and provides a bird’s eye view of the field of cache partitioning.
Abstract: As the number of on-chip cores and memory demands of applications increase, judicious management of cache resources has become not merely attractive but imperative. Cache partitioning, that is, dividing cache space between applications based on their memory demands, is a promising approach to provide capacity benefits of shared cache with performance isolation of private caches. However, naively partitioning the cache may lead to performance loss, unfairness, and lack of quality-of-service guarantees. It is clear that intelligent techniques are required for realizing the full potential of cache partitioning. In this article, we present a survey of techniques for partitioning shared caches in multicore processors. We categorize the techniques based on important characteristics and provide a bird’s eye view of the field of cache partitioning.
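
One representative class of techniques covered by such surveys is utility-based way partitioning: each extra cache way goes to the application whose miss count would drop the most. The greedy sketch below uses hypothetical miss curves and illustrates the general idea only, not any specific paper's algorithm.

def utility_partition(total_ways, misses):
    """misses[app][w] = misses of app when given w ways (w = 0..total_ways)."""
    ways = {app: 0 for app in misses}
    for _ in range(total_ways):
        # Marginal utility of one more way for each application.
        gain = {app: misses[app][ways[app]] - misses[app][ways[app] + 1] for app in misses}
        best = max(gain, key=gain.get)
        ways[best] += 1
    return ways

# Example with two hypothetical miss curves over 0..4 ways:
# utility_partition(4, {"A": [100, 60, 40, 35, 33], "B": [80, 70, 65, 62, 60]})
# -> {'A': 3, 'B': 1}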

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper presents CoMon++, a framework for lightweight coordination that protects against cache pollution and further attacks in NDN and efficiently and effectively prevents cache pollution, remarkably outperforming a notable state-of-the-art solution.
Abstract: Defending against cache pollution attacks, highly detrimental attacks that are easy to implement in Named-Data Networking (NDN), currently suffers from a lack of coordination. Solving cache pollution attacks is a prerequisite for the deployment of NDN, which is widely considered to be the basis for the future Internet. To this end, we present CoMon++, a framework for lightweight coordination that protects against cache pollution and further attacks in NDN. Our simulation studies demonstrate that CoMon++ efficiently and effectively prevents cache pollution, remarkably outperforming a notable state-of-the-art solution.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: A new metric – Live Distance – is introduced that uses the stack distance to learn the temporal reuse characteristics of cache blocks, thus enabling a dead block predictor that is robust to variability in generational behavior.
Abstract: The looming breakdown of Moore's Law and the end of voltage scaling are ushering in a new era where neither transistors nor the energy to operate them is free. This calls for a new regime in computer systems, one in which every transistor counts. Caches are essential for processor performance and represent the bulk of a modern processor's transistor budget. To get more performance out of the cache hierarchy, future processors will rely on effective cache management policies. This paper identifies variability in generational behavior of cache blocks as a key challenge for cache management policies that aim to identify dead blocks as early and as accurately as possible to maximize cache efficiency. We show that existing management policies are limited by the metrics they use to identify dead blocks, leading to low coverage and/or low accuracy in the face of variability. In response, we introduce a new metric – Live Distance – that uses the stack distance to learn the temporal reuse characteristics of cache blocks, thus enabling a dead block predictor that is robust to variability in generational behavior. Based on the reuse characteristics of an application's cache blocks, our predictor – Leeway – classifies the application's behavior as streaming-oriented or reuse-oriented and dynamically selects an appropriate cache management policy. By leveraging live distance for LLC management, Leeway outperforms state-of-the-art approaches on single- and multi-core SPEC and manycore CloudSuite workloads.
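
As background for the Live Distance metric, which builds on the classic stack distance: the stack distance of an access is the number of distinct blocks touched since the previous access to the same block (infinite on first use). The brute-force sketch below is illustrative only and is not the paper's predictor.

def stack_distances(trace):
    stack = []                      # most recently used block at the front
    out = []
    for block in trace:
        if block in stack:
            d = stack.index(block)  # distinct blocks accessed since the last use
            stack.pop(d)
        else:
            d = float("inf")        # cold (first-use) access
        stack.insert(0, block)
        out.append(d)
    return out

# stack_distances(["A", "B", "C", "A", "B", "B"]) -> [inf, inf, inf, 2, 2, 0]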

Proceedings ArticleDOI
01 May 2017
TL;DR: This paper considers centralized coded caching, where the server not only designs the users' cache contents, but also assigns their cache sizes under a total cache memory budget.
Abstract: This paper considers centralized coded caching, where the server not only designs the users' cache contents, but also assigns their cache sizes under a total cache memory budget. The server is connected to each user via a link of given finite capacity. For given link capacities and total memory budget, we minimize the worst-case delivery completion time by jointly optimizing the cache sizes, the cache placement and delivery schemes. The optimal memory allocation and caching scheme are characterized explicitly for the case where the total memory budget is smaller than that of the server library. Numerical results confirm the savings in delivery time obtained by optimizing the memory allocation.

Journal ArticleDOI
21 Nov 2017
TL;DR: This paper presents a new cache replacement policy that takes advantage of a hierarchical caching architecture, and in particular of access-time difference between memory and disk, and significantly reduces the hard-disk load.
Abstract: Most of the caching algorithms are oblivious to requests’ timescale, but caching systems are capacity constrained and, in practical cases, the hit rate may be limited by the cache’s impossibility to serve requests fast enough. In particular, the hard-disk access time can be the key factor capping cache performance. In this article, we present a new cache replacement policy that takes advantage of a hierarchical caching architecture, and in particular of access-time difference between memory and disk. Our policy is optimal when requests follow the independent reference model and significantly reduces the hard-disk load, as shown also by our realistic, trace-driven evaluation. Moreover, we show that our policy can be considered in a more general context, since it can be easily adapted to minimize any retrieval cost, as far as costs add over cache misses.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper proposes two techniques that use data compression to optimize MLC STT-RAM cache design; the second technique increases cache capacity by enabling the remaining hard-bit region to store another compressed cache line, which can improve system performance for memory-intensive workloads.
Abstract: Spin-transfer torque magnetic random access memory (STT-RAM) technology has emerged as a potential replacement of SRAM in cache design, especially for building large-scale and energy-efficient last level caches. Compared with single-level cell (SLC), multi-level cell (MLC) STT-RAM is expected to double cache capacity and increase system performance. However, the two-step read/write access schemes incur considerable energy consumption and performance degradation. In this paper, we propose two techniques using data compression to optimize MLC STT-RAM cache design. The first technique tries to compress a cache line and fit it into only the soft-bit region of the cells, so that reading or writing this cache line takes only one step which is fast and energy-efficient. We introduce a second technique to increase the cache capacity by enabling the left hard-bit region to store another compressed cache line, which can improve the system performance for memory intensive workloads. The experimental results show that, compared with a conventional MLC STT-RAM last level cache design, our overhead-minimized technique reduces the dynamic energy consumption by 38.2% on average with the same system performance, and our capacity-augmented technique boosts the system performance by 6.1% with 19.2% dynamic energy saving on average, across the evaluated multi-programmed benchmarks.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: NetContainer is proposed, a software framework that achieves fine-grained hardware resource management for containerized NFV platforms and outperforms a conventional page coloring-based memory allocator by 48% in terms of successful call rate.
Abstract: With exploding traffic stuffing existing network infrastructure, today's telecommunication and cloud service providers resort to Network Function Virtualization (NFV) for greater agility and economics. Pioneering service providers such as AT&T propose to adopt containers in NFV to achieve shorter Virtualized Network Function (VNF) provisioning time and better runtime performance. However, we characterize typical NFV workloads on containers and find that the performance is unsatisfactory. We observe that the shared host OS network stack is the main bottleneck, where the traffic flow processing involves a large amount of intermediate memory buffers and results in significant last level cache pollution. Existing OS memory allocation policies fail to exploit the locality and data sharing information among buffers. In this paper, we propose NetContainer, a software framework that achieves fine-grained hardware resource management for containerized NFV platforms. NetContainer employs a cache access overheads guided page coloring scheme to coordinately address the inter-flow cache access overheads and intra-flow cache access overheads. It maps the memory buffer pages that manifest low cache access overheads (across a flow or among the flows) to the same last level cache partition. NetContainer exploits a footprint theory based method to estimate the cache access overheads and a Min-Cost Max-Flow model to guide the memory buffer mappings. We implement NetContainer in the Linux kernel and extensively evaluate it with real NFV workloads. Experimental results show that NetContainer outperforms a conventional page coloring-based memory allocator by 48% in terms of successful call rate.
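
For background on the page coloring that NetContainer's allocator builds on (the cache geometry below is an assumed example configuration, not the paper's platform): the color of a physical page identifies the slice of last-level-cache sets its lines can occupy.

PAGE_SIZE     = 4096
LINE_BYTES    = 64
LLC_BYTES     = 16 * 1024 * 1024
LLC_WAYS      = 16

SETS          = LLC_BYTES // (LINE_BYTES * LLC_WAYS)   # 16384 sets
SETS_PER_PAGE = PAGE_SIZE // LINE_BYTES                # 64 sets spanned by one page
NUM_COLORS    = SETS // SETS_PER_PAGE                  # 256 colors

def page_color(phys_addr):
    # The color is given by the physical-address bits that index cache sets
    # above the page offset; pages of the same color compete for the same sets.
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

# An allocator can then keep buffers of unrelated flows in disjoint colors
# (disjoint LLC partitions), or co-locate buffers that share data in the same colors.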

Proceedings ArticleDOI
26 Jan 2017
TL;DR: A new metric called the shared footprint is presented to mathematically compute the amount of data shared by any group of threads in any size cache, together with a linear-time algorithm that measures the shared footprint by scanning the memory trace of a multi-threaded program.
Abstract: On modern multi-core processors, independent workloads often interfere with each other by competing for shared cache space. However, for multi-threaded workloads, where a single copy of data can be accessed by multiple threads, the threads can cooperatively share cache. Because data sharing consolidates the collective working set of threads, the effective size of shared cache becomes larger than it would have been when data are not shared. This paper presents a new theory of data sharing. It includes (1) a new metric called the shared footprint to mathematically compute the amount of data shared by any group of threads in any size cache, and (2) a linear-time algorithm to measure shared footprint by scanning the memory trace of a multi-threaded program. The paper presents the practical implementation and evaluates the new theory using 14 PARSEC and SPEC OMP benchmarks, including an example use of shared footprint in program optimization.
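
As background for the footprint notion that the shared footprint generalizes to thread groups: the footprint fp(w) is the average number of distinct blocks touched in a window of w consecutive accesses. The brute-force O(n·w) sketch below handles a single trace; the paper's contribution is a linear-time measurement for groups of threads.

def footprint(trace, w):
    # Average working-set size over all length-w windows of the access trace.
    windows = [trace[i:i + w] for i in range(len(trace) - w + 1)]
    return sum(len(set(win)) for win in windows) / len(windows)

# footprint(["A", "B", "A", "C", "B", "A"], 3) -> (2 + 3 + 3 + 3) / 4 = 2.75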

Proceedings ArticleDOI
27 Jun 2017
TL;DR: This paper evaluates the overhead of PEBS and shows that the measured CPU overhead can be used to accurately predict the actual overhead incurred by complex workloads, including multi-threaded ones, and that PEBS incurs cache pollution and extra memory IO since it writes data into the CPU cache.
Abstract: Analyzing the system noise imposed on high-throughput systems (e.g., Spark, RDBMS) by the underlying machines must be done at the granularity of individual messages or requests to find the root causes of performance anomalies, because messages are passed through many components in very short periods. To this end, using Precise Event Based Sampling (PEBS), available in Intel CPUs, at higher sampling rates than normally used is promising. It saves context information (e.g., the general purpose registers) at occurrences of various hardware events such as cache misses. The information can be used to associate performance anomalies caused by system noise with specific messages. One challenge is that quantitative analysis of PEBS overhead at high sampling rates has not yet been studied. This is critical because high sampling rates can cause severe overhead, but performance problems are often reproducible only in real environments. In this paper, we evaluate the overhead of PEBS and show: (1) every time PEBS saves context information, the target workload slows down by 200-300 ns due to the CPU overhead of PEBS, (2) the CPU overhead can be used to predict the actual overhead incurred with complex workloads, including multi-threaded ones, with high accuracy, and (3) PEBS incurs cache pollution and extra memory IO since PEBS writes data into the CPU cache, and the severity of cache pollution is affected both by the sampling rate and the buffer size allocated for PEBS. To the best of our knowledge, we are the first to quantitatively analyze the overhead of PEBS.

Patent
27 Jul 2017
TL;DR: In this article, a region migration cache is used to improve performance in sparsely-used memory systems by migrating regions of main memory corresponding to the working footprint of the main memory to the cache.
Abstract: A memory access profiling and region migration technique makes allocation and replacement decisions for periodic migration of most frequently accessed regions of main memory to least frequently accessed regions of a region migration cache, in background operations. The technique improves performance in sparsely-used memory systems by migrating regions of main memory corresponding to the working footprint of main memory to the region migration cache. A method includes profiling a stream of memory accesses to generate an access frequency ranked list of address ranges of main memory and corresponding access frequencies based on memory addresses in the stream of memory accesses. The method includes periodically migrating to a region migration cache contents of a region of main memory selected based on the access frequency ranked list. The method includes storing a memory address range corresponding to the contents of the region migration cache in a tag map.

Journal ArticleDOI
Chen Yang, Leibo Liu, Luo Kai, Shouyi Yin, Shaojun Wei
TL;DR: Experimental results showed that CIACP outperformed state-of-the-art utility-based cache partitioning techniques by up to 16 percent in performance.
Abstract: Multiple coarse-grained reconfigurable arrays (CGRAs), which are organized in parallel or in a pipeline to complete applications, have become a productive solution to balance performance with flexibility. One of the keys to obtaining high performance from multiple CGRAs is to manage the shared on-chip cache efficiently to reduce off-chip memory bandwidth requirements. Cache partitioning has been viewed as a promising technique to enhance the efficiency of a shared cache. However, the majority of prior partitioning techniques were developed for multi-core platforms and aimed at multi-programmed workloads. They cannot directly address the adverse impacts of data correlation and computation imbalance among competing CGRAs in a multi-CGRA platform. This paper proposes a correlation- and iteration-aware cache partitioning (CIACP) mechanism for shared cache partitioning in multi-CGRA systems. This mechanism employs correlation monitors (CMONs) to trace the amount of overlapping data among parallel CGRAs, and iteration monitors (IMONs) to track the computation load of each CGRA. Using the information collected by the CMONs and IMONs, the CIACP mechanism can eliminate redundant cache utilization of the overlapping data and can also shorten the total execution time of pipelined CGRAs. Experimental results showed that CIACP outperformed state-of-the-art utility-based cache partitioning techniques by up to 16 percent in performance.

Patent
30 May 2017
TL;DR: In this paper, the authors present a memory having a static cache and a dynamic cache, where the memory includes a first portion configured to operate as a static single level cell (SLC) cache, and a second portion configured as a dynamic SLC cache when the entire first portion of the memory has data stored therein.
Abstract: The present disclosure includes memory having a static cache and a dynamic cache. A number of embodiments include a memory, wherein the memory includes a first portion configured to operate as a static single level cell (SLC) cache and a second portion configured to operate as a dynamic SLC cache when the entire first portion of the memory has data stored therein.

Journal ArticleDOI
TL;DR: A new QoE (quality of experience)-driven video cache management scheme is proposed that considers parameters from the three parties involved in video provisioning, together with statistics of video popularities, under limited cache capacity.
Abstract: With the development of wireless cloud computing, video caching in the radio access network (RAN) of cellular networks has attracted extensive attention due to its lower delay and higher resource utilization efficiency. Nevertheless, existing video cache management mostly makes decisions only according to the video coding requirements, without considering users' individual requirements for the video service and without making full use of the abundant network-side information available in real time or from statistics. In this paper, we propose a new QoE (quality of experience)-driven video cache management scheme that considers parameters from the three parties involved in video provisioning (i.e., client, base station, and RAN cache server), together with statistics of video popularities, under limited cache capacity. Specifically, through experiments we establish the mapping relationship between the QoE value and three key parameters (i.e., the request rate from the client, the bandwidth of the air interface, and the response rate of the cache server). First, we allocate different gross caches for different video clips according to their popularities. Second, we optimize the cache space allocation for each individual video clip based on the QoE mapping relationship and the different models of the request rate and the bandwidth, using the convex optimization method and the Lagrange multiplier solution. The experimental results indicate that the proposed video cache scheme has better QoE performance under the constraints of the total cache capacity and specific distributions of the request rate and the bandwidth.

Journal ArticleDOI
TL;DR: This work designs provably optimal policies for jointly minimizing the content retrieval delay and the flash damage and numerically compares them against prior policies.
Abstract: Caches in Content-Centric Networks (CCN) are increasingly adopting flash memory based storage. The current flash cache technology stores all files with the largest possible “expiry date,” i.e., the files are written in the memory so that they are retained for as long as possible. This, however, does not leverage the CCN data characteristics where content is typically short-lived and has a distinct popularity profile. Writing files in a cache using the longest retention time damages the memory device thus reducing its lifetime. However, writing using a small retention time can increase the content retrieval delay, since, at the time a file is requested, the file may already have been expired from the memory. This motivates us to consider a joint optimization wherein we obtain optimal policies for jointly minimizing the content retrieval delay (which is a network-centric objective) and the flash damage (which is a device-centric objective). Caching decisions now not only involve what to cache but also for how long to cache each file. We design provably optimal policies and numerically compare them against prior policies.

Proceedings ArticleDOI
05 Dec 2017
TL;DR: This paper identifies situations where considering CRPD and CPRO separately might result in overestimating the total memory overhead suffered by tasks, and derives new analyses that integrate the calculation of CR PD and C PRO.
Abstract: Schedulability analysis for tasks running on microprocessors with cache memory is incomplete without a treatment of Cache Related Preemption Delays (CRPD) and Cache Persistence Reload Overheads (CPRO). State-of-the-art analyses compute CRPD and CPRO independently, which might result in counting the same overhead more than once. In this paper, we analyze the pessimism associated with the independent calculation of CRPD and CPRO in comparison to an integrated approach. We answer two main questions: (1) Is it beneficial to integrate the calculation of CRPD and CPRO? (2) When and to what extent can we gain in terms of schedulability by integrating the calculation of CRPD and CPRO? To achieve this, we (i) identify situations where considering CRPD and CPRO separately might result in overestimating the total memory overhead suffered by tasks, (ii) derive new analyses that integrate the calculation of CRPD and CPRO, and (iii) perform a thorough experimental evaluation using benchmarks to compare the performance of the integrated analysis against the separate calculation of CRPD and CPRO.
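
For background (generic fixed-priority response-time analysis with CRPD, in standard notation rather than this paper's): the response time R_i of a task \tau_i with worst-case execution time C_i is the smallest fixed point of

\[
R_i \;=\; C_i \;+\; \sum_{\tau_j \in hp(i)} \left\lceil \frac{R_i}{T_j} \right\rceil \bigl(C_j + \gamma_{i,j}\bigr),
\]

where hp(i) is the set of higher-priority tasks, T_j their periods, and \gamma_{i,j} an upper bound on the cache reload cost charged per preemption by \tau_j. Cache persistence reload overheads enter as an additional reload term per job; the pessimism analyzed above arises when the CRPD and CPRO terms each independently charge for reloading the same cache blocks.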

Journal ArticleDOI
TL;DR: A hybrid-memory-aware cache partitioning technique (HAP) is proposed to dynamically adjust the cache space for DRAM and NVM data based on TMPKI, a metric that can exactly reflect LLC performance on top of hybrid memories.
Abstract: Data-center servers benefit from large-capacity memory systems to run multiple processes simultaneously. Hybrid DRAM-NVM memory is attractive for increasing memory capacity by exploiting the scalability of Non-Volatile Memory (NVM). However, current LLC policies are unaware of hybrid memory. Cache misses to NVM introduce high cost due to long NVM latency. Moreover, evicting dirty NVM data suffers from long write latency. We propose hybrid memory aware cache partitioning to dynamically adjust cache spaces and give dirty NVM data more chances to reside in the LLC. Experimental results show that Hybrid-memory-Aware Partition (HAP) improves performance by 46.7% and reduces energy consumption by 21.9% on average against LRU management. Moreover, HAP on average improves performance by 9.3% and reduces energy consumption by 6.4% against a state-of-the-art cache mechanism.