Showing papers on "Cache pollution published in 2013"


Proceedings ArticleDOI
08 Apr 2013
TL;DR: Through the analysis, light is shed on how modern hardware affects the implementation of data operators and the fastest implementation of radix join to date is provided, reaching close to 200 million tuples per second.
Abstract: The architectural changes introduced with multi-core CPUs have triggered a redesign of main-memory join algorithms. In the last few years, two diverging views have appeared. One approach advocates careful tailoring of the algorithm to the architectural parameters (cache sizes, TLB, and memory bandwidth). The other approach argues that modern hardware is good enough at hiding cache and TLB miss latencies and, consequently, the careful tailoring can be omitted without sacrificing performance. In this paper, we demonstrate through experimental analysis of different algorithms and architectures that hardware still matters. Join algorithms that are hardware conscious perform better than hardware-oblivious approaches. The analysis and comparisons in the paper show that many of the claims regarding the behavior of join algorithms that have appeared in literature are due to selection effects (relative table sizes, tuple sizes, the underlying architecture, using sorted data, etc.) and are not supported by experiments run under different parameter settings. Through the analysis, we shed light on how modern hardware affects the implementation of data operators and provide the fastest implementation of radix join to date, reaching close to 200 million tuples per second.
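
The cache-conscious idea at the heart of radix join fits in a short sketch. The Python below is illustrative only (our sketch, not the authors' implementation; the radix width is an assumed tuning knob): both inputs are partitioned on low-order hash bits so that each hash table built during the join stays small enough to be cache-resident.

# Illustrative sketch of one-pass radix partitioning and join.
# RADIX_BITS is an assumed knob; real implementations size it from
# cache and TLB parameters, and may partition in multiple passes.
RADIX_BITS = 4

def radix_partition(tuples):
    fanout = 1 << RADIX_BITS
    parts = [[] for _ in range(fanout)]
    for t in tuples:
        parts[hash(t[0]) & (fanout - 1)].append(t)  # t[0] is the join key
    return parts

def radix_join(R, S):
    out = []
    for pr, ps in zip(radix_partition(R), radix_partition(S)):
        ht = {}
        for r in pr:                     # build a small, cache-resident table
            ht.setdefault(r[0], []).append(r)
        for s in ps:                     # probe within the same partition
            for r in ht.get(s[0], []):
                out.append((r, s))
    return out

print(radix_join([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')]))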

265 citations


Patent
25 Jan 2013
TL;DR: In this paper, a de-duplication cache is configured to cache data for access by a plurality of different storage clients, such as virtual machines, and metadata pertaining to the contents of the cache may be persisted and/or transferred with respective storage clients.
Abstract: A de-duplication cache is configured to cache data for access by a plurality of different storage clients, such as virtual machines. A virtual machine may comprise a virtual machine de-duplication module configured to identify data for admission into the de-duplication cache. Data admitted into the de-duplication cache may be accessible by two or more storage clients. Metadata pertaining to the contents of the de-duplication cache may be persisted and/or transferred with respective storage clients such that the storage clients may access the contents of the de-duplication cache after rebooting, being power cycled, and/or being transferred between hosts.
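
A minimal sketch of the mechanism (hypothetical interfaces, not the patented implementation): payloads are indexed by content fingerprint, so identical blocks admitted by different storage clients occupy a single cache entry, and the per-client index is exactly the kind of metadata that could be persisted or transferred with a client.

import hashlib

class DedupCache:
    def __init__(self):
        self.by_fingerprint = {}   # fingerprint -> data, shared by all clients
        self.index = {}            # (client_id, logical_addr) -> fingerprint

    def admit(self, client_id, addr, data):
        fp = hashlib.sha256(data).hexdigest()
        self.by_fingerprint.setdefault(fp, data)   # payload stored once
        self.index[(client_id, addr)] = fp         # per-client metadata

    def read(self, client_id, addr):
        fp = self.index.get((client_id, addr))
        return None if fp is None else self.by_fingerprint[fp]

cache = DedupCache()
cache.admit("vm1", 0, b"hello")
cache.admit("vm2", 7, b"hello")    # de-duplicated: one shared copy
assert cache.read("vm2", 7) == b"hello"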

223 citations


Proceedings ArticleDOI
09 Apr 2013
TL;DR: A complete framework to analyze and profile task memory access patterns and a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas are proposed.
Abstract: Multi-core architectures are shaking the fundamental assumption of real-time systems that the WCET used to analyze the schedulability of the complete system can be calculated on individual tasks in isolation. This no longer holds, even approximately, on a modern multi-core chip, due to interference caused by hardware resource sharing. In this work we propose (1) a complete framework to analyze and profile task memory access patterns and (2) a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas. In this way, we provide a powerful tool to address one of the main sources of interference in a system where the last level of cache is shared among two or more CPUs. The technique has been implemented on commercial hardware and our evaluations show that it can be used to significantly improve the predictability of a given set of critical tasks.

207 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.
Abstract: Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
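
Conceptually, the footprint mechanism amounts to remembering which blocks of a page were touched during its last residency and fetching only those next time. A toy sketch (the real design predicts footprints from code and access-pattern signatures rather than the raw page number used here):

BLOCKS_PER_PAGE = 64   # 4KB page of 64B blocks

class FootprintPredictor:
    def __init__(self):
        self.history = {}   # page -> bitmask of blocks touched last residency

    def blocks_to_fetch(self, page):
        # No history yet: fetch conservatively (here, the whole page).
        return self.history.get(page, (1 << BLOCKS_PER_PAGE) - 1)

    def record_residency(self, page, touched_mask):
        self.history[page] = touched_mask

pred = FootprintPredictor()
pred.record_residency(page=42, touched_mask=0b1011)  # blocks 0, 1, 3 used
assert pred.blocks_to_fetch(42) == 0b1011            # fetch only the footprint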

207 citations


Journal ArticleDOI
TL;DR: This paper focuses on cache pollution attacks, where the adversary's goal is to disrupt cache locality to increase link utilization and cache misses for honest consumers, and illustrates that existing proactive countermeasures are ineffective against realistic adversaries.

139 citations


Proceedings ArticleDOI
23 Feb 2013
TL;DR: By adopting i2WAP, a new cache management policy that can reduce both inter- and intra-set write variations, this work can improve the lifetime of on-chip non-volatile caches by 75% on average and up to 224%.
Abstract: Modern computers require large on-chip caches, but the scalability of traditional SRAM and eDRAM caches is constrained by leakage and cell density. Emerging non-volatile memory (NVM) is a promising alternative to build large on-chip caches. However, limited write endurance is a common problem for non-volatile memory technologies. In addition, today's cache management might result in unbalanced write traffic to cache blocks causing heavily-written cache blocks to fail much earlier than others. Unfortunately, existing wear-leveling techniques for NVM-based main memories cannot be simply applied to NVM-based on-chip caches because cache writes have intra-set variations as well as inter-set variations. To solve this problem, we propose i2WAP, a new cache management policy that can reduce both inter- and intra-set write variations. i2WAP has two features: (1) Swap-Shift, an enhancement based on previous main memory wear-leveling to reduce cache inter-set write variations; (2) Probabilistic Set Line Flush, a novel technique to reduce cache intra-set write variations. Implementing i2WAP only needs two global counters and two global registers. By adopting i2WAP, we can improve the lifetime of on-chip non-volatile caches by 75% on average and up to 224%.
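
Of the two features, Probabilistic Set Line Flush is simple enough to sketch: on a small fraction of cache hits, the hit line is written back and invalidated, so the block refills into a (likely different) way and write wear spreads within the set. A minimal sketch under assumed parameters (the flush probability and line bookkeeping are placeholders):

import random

FLUSH_PROBABILITY = 1 / 4096   # assumed tuning knob

class Line:
    def __init__(self):
        self.valid = self.dirty = False

def write_back(line):
    pass   # placeholder: write the dirty data to the next level

def on_cache_hit(cache_set, way):
    # With small probability, invalidate the hit line so the block
    # refills into another way, balancing intra-set write variation.
    if random.random() < FLUSH_PROBABILITY:
        line = cache_set[way]
        if line.dirty:
            write_back(line)
        line.valid = False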

126 citations


Proceedings ArticleDOI
09 Jul 2013
TL;DR: A practical OS-level cache management scheme for multi-core real-time systems that provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set is proposed.
Abstract: Many modern multi-core processors sport a large shared cache with the primary goal of enhancing the average performance of computing workloads. However, due to resulting cache interference among tasks, the uncontrolled use of such a shared cache can significantly hamper the predictability and analyzability of multi-core real-time systems. Software cache partitioning has been considered as an attractive approach to address this issue because it does not require any hardware support beyond that available on many modern processors. However, the state-of-the-art software cache partitioning techniques face two challenges: (1) the memory co-partitioning problem, which results in page swapping or waste of memory, and (2) the availability of a limited number of cache partitions, which causes degraded performance. These are major impediments to the practical adoption of software cache partitioning. In this paper, we propose a practical OS-level cache management scheme for multi-core real-time systems. Our scheme provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set. We have implemented and evaluated our scheme in Linux/RK running on the Intel Core i7 quad-core processor. Experimental results indicate that, compared to the traditional approaches, our scheme is up to 39% more memory space efficient and consumes up to 25% fewer cache partitions while maintaining cache predictability. Our scheme also yields a significant utilization benefit that increases with the number of tasks.
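
Software cache partitioning of this kind rests on page-coloring arithmetic: physical pages that map to the same last-level cache sets share a color, so granting a task pages of disjoint colors partitions the cache. A sketch with example geometry (not the paper's evaluation platform):

CACHE_SIZE = 8 * 1024 * 1024   # example 8MB shared LLC
WAYS       = 16
LINE_SIZE  = 64
PAGE_SIZE  = 4096

SETS          = CACHE_SIZE // (WAYS * LINE_SIZE)   # 8192 sets
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE             # 64 consecutive sets
NUM_COLORS    = SETS // SETS_PER_PAGE              # 128 cache colors

def page_color(phys_addr):
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

# A task confined to 32 of the 128 colors can occupy at most a quarter
# of the cache sets; the catch, the memory co-partitioning problem the
# paper addresses, is that naive coloring then also confines the task
# to a quarter of physical memory.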

125 citations


Journal ArticleDOI
01 Sep 2013
TL;DR: The results show that for highly skewed workloads the anti-caching architecture has a performance advantage over either of the other architectures tested of up to 9× for a data size 8× larger than memory.
Abstract: The traditional wisdom for building disk-based relational database management systems (DBMS) is to organize data in heavily-encoded blocks stored on disk, with a main memory block cache. In order to improve performance given high disk latency, these systems use a multi-threaded architecture with dynamic record-level locking that allows multiple transactions to access the database at the same time. Previous research has shown that this results in substantial overhead for on-line transaction processing (OLTP) applications [15]. The next generation of DBMSs seeks to overcome these limitations with an architecture based on main-memory-resident data. To overcome the restriction that all data fit in main memory, we propose a new technique, called anti-caching, where cold data is moved to disk in a transactionally-safe manner as the database grows in size. Because data initially resides in memory, an anti-caching architecture reverses the traditional storage hierarchy of disk-based systems. Main memory is now the primary storage device. We implemented a prototype of our anti-caching proposal in a high-performance, main memory OLTP DBMS and performed a series of experiments across a range of database sizes, workload skews, and read/write mixes. We compared its performance with an open-source, disk-based DBMS optionally fronted by a distributed main memory cache. Our results show that for highly skewed workloads the anti-caching architecture has a performance advantage over either of the other architectures tested of up to 9× for a data size 8× larger than memory.
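
A minimal sketch of the anti-caching control flow, with hypothetical names and a plain dict standing in for the disk store (the real system evicts cold tuples in transactionally-safe batches and un-evicts them within the transaction machinery):

from collections import OrderedDict

class AntiCache:
    def __init__(self, capacity, disk):
        self.hot = OrderedDict()   # in-memory tuples, coldest first
        self.capacity = capacity
        self.disk = disk           # stand-in for the on-disk cold store

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)        # mark as recently used
            return self.hot[key]
        value = self.disk.pop(key)           # cold access: un-evict
        self.put(key, value)
        return value

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity: # spill coldest tuples to disk
            cold_key, cold_val = self.hot.popitem(last=False)
            self.disk[cold_key] = cold_val

db = AntiCache(capacity=2, disk={})
db.put("a", 1); db.put("b", 2); db.put("c", 3)   # "a" spills to disk
assert db.get("a") == 1                          # transparently reloaded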

117 citations


Patent
17 Oct 2013
TL;DR: In this article, the authors describe a write-anywhere storage system with a plurality of clients coupled with an array of storage nodes, each storage node having multiple primary storage devices coupled with cache memory.
Abstract: Systems and methods for reducing metadata in a write-anywhere storage system are disclosed herein. The system includes a plurality of clients coupled with a plurality of storage nodes, each storage node having a plurality of primary storage devices coupled thereto. A memory management unit including cache memory is included in the client. The memory management unit serves as a cache for data produced by the clients before the data is stored in the primary storage. The cache includes an extent cache, an extent index, a commit cache and a commit index. Data and metadata movement is managed by an interval tree. Methods for reducing data in the interval tree increase the data storage and data retrieval performance of the system.

109 citations


Proceedings ArticleDOI
14 Apr 2013
TL;DR: This work demonstrates that certain cache networks are non-ergodic in that their steady-state characterization depends on the initial state of the system, and establishes several important properties of cache networks, in the form of three independently-sufficient conditions for a cache network to comprise a single ergodic component.
Abstract: Over the past few years Content-Centric Networking, a networking model in which host-to-content communication protocols are introduced, has been gaining much attention. A central component of such an architecture is a large-scale interconnected caching system. To date, the way these Cache Networks operate and perform is still poorly understood. In this work, we demonstrate that certain cache networks are non-ergodic in that their steady-state characterization depends on the initial state of the system. We then establish several important properties of cache networks, in the form of three independently-sufficient conditions for a cache network to comprise a single ergodic component. Each property targets a different aspect of the system - topology, admission control and cache replacement policies. Perhaps most importantly we demonstrate that cache replacement can be grouped into equivalence classes, such that the ergodicity (or lack-thereof) of one policy implies the same property holds for all policies in the class.

103 citations


Proceedings ArticleDOI
06 May 2013
TL;DR: A novel cache management algorithm for flash-based disk cache, named Lazy Adaptive Replacement Cache (LARC), which can filter out seldom accessed blocks and prevent them from entering cache and improves performance and extends SSD lifetime at the same time.
Abstract: The increasing popularity of flash memory has changed storage systems. Flash-based solid-state drives (SSD) are now widely deployed as caches for magnetic hard disk drives (HDD) to speed up data-intensive applications. However, existing cache algorithms focus exclusively on performance improvements and ignore the write endurance of SSD. In this paper, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC filters out seldom-accessed blocks and prevents them from entering the cache. This avoids cache pollution and keeps popular blocks in cache for a longer period of time, leading to a higher hit rate. Meanwhile, LARC reduces the number of cache replacements and thus incurs less write traffic to the SSD, especially for read-dominant workloads. In this way, LARC improves performance and extends SSD lifetime at the same time. LARC is self-tuning and low-overhead. It has been extensively evaluated by both trace-driven simulations and a prototype implementation in flashcache. Our experiments show that LARC outperforms state-of-the-art algorithms and reduces write traffic to SSD by up to 94.5% for read-dominant workloads and 11.2-40.8% for write-dominant workloads.
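
LARC's admission filter is easy to sketch: a block enters the SSD cache only on its second recent miss, tracked through a ghost queue that holds block IDs but no data, so one-hit-wonders never cost an SSD write. The sketch below fixes the ghost queue size, whereas the actual algorithm adapts it at runtime:

from collections import OrderedDict

class LARC:
    def __init__(self, cache_size, ghost_size):
        self.cache = OrderedDict()   # block id -> data (the SSD cache)
        self.ghost = OrderedDict()   # block id -> None (IDs only, no data)
        self.cache_size, self.ghost_size = cache_size, ghost_size

    def access(self, block_id, fetch):
        if block_id in self.cache:                 # SSD hit
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = fetch(block_id)                     # miss: read from HDD
        if block_id in self.ghost:                 # second miss: admit
            del self.ghost[block_id]
            self.cache[block_id] = data            # one SSD write per admit
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)
        else:                                      # first miss: remember ID
            self.ghost[block_id] = None
            if len(self.ghost) > self.ghost_size:
                self.ghost.popitem(last=False)
        return data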

Proceedings ArticleDOI
18 Mar 2013
TL;DR: This work proposes DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy, and proposes pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM.
Abstract: Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach -- shift based write -- that offers a fast and energy-efficient alternative to performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, which are tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves 8.2X improvement in energy and 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around 1.6X improvement in both area and energy under iso-performance conditions.

Patent
05 Dec 2013
TL;DR: In this article, a cache and/or storage module may be configured to reduce write amplification in a cache storage, which may occur due to an over-permissive admission policy, or it may arise due to the write-once properties of the storage medium.
Abstract: A cache and/or storage module may be configured to reduce write amplification in a cache storage. Cache layer write amplification (CLWA) may occur due to an over-permissive admission policy. The cache module may be configured to reduce CLWA by configuring admission policies to avoid unnecessary writes. Admission policies may be predicated on access and/or sequentiality metrics. Flash layer write amplification (FLWA) may arise due to the write-once properties of the storage medium. FLWA may be reduced by delegating cache eviction functionality to the underlying storage layer. The cache and storage layers may be configured to communicate coordination information, which may be leveraged to improve the performance of cache and/or storage operations.

Proceedings ArticleDOI
18 Nov 2013
TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
Abstract: Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or the compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that, compared to cache-all and bypass-all solutions, our techniques achieve considerable performance improvement.
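
Stripped of the compiler machinery, the central decision is a per-load cost comparison; the metrics and threshold below are stand-ins for the paper's models of reuse and traffic. In PTX, the outcome becomes the load's cache operator, e.g. ld.global.ca (cache in L1) versus ld.global.cg (bypass L1):

def choose_bypass(loads):
    """loads: list of (name, est_reuses, est_extra_traffic) tuples."""
    decisions = {}
    for name, reuses, extra_traffic in loads:
        # Cache a load only if its expected reuse outweighs the extra
        # memory traffic caching it would cause (e.g., via evictions).
        decisions[name] = "cache" if reuses > extra_traffic else "bypass"
    return decisions

# A regular, reused access is cached; an irregular gather is bypassed.
print(choose_bypass([("A[i]", 4.0, 1.0), ("B[col[i]]", 0.2, 1.0)]))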

Proceedings ArticleDOI
Tian Luo, Siyuan Ma, Rubao Lee, Xiaodong Zhang, Deng Liu, Li Zhou
07 Oct 2013
TL;DR: The design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices is presented.
Abstract: A unique challenge for SSD storage caching management in a virtual machine (VM) environment is to accomplish the dual objectives: maximizing utilization of shared SSD cache devices and ensuring performance isolation among VMs. In this paper, we present our design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices. Due to a hypervisor's unique position between VMs and hardware resources, S-CAVE does not require any modification to guest OSes, user applications, or the underlying storage system. A critical issue to address in S-CAVE is how to allocate limited and shared SSD cache space among multiple VMs to achieve the dual goals. This is accomplished in two steps. First, we propose an effective metric to determine the demand for SSD cache space of each VM. Next, by incorporating this cache demand information into a dynamic control mechanism, S-CAVE is able to efficiently provide a fair share of cache space to each VM while best utilizing the shared SSD cache device. Operating within the constraints of a hypervisor's functionality, S-CAVE incurs minimal overhead in both memory space and computing time. We have implemented S-CAVE in vSphere ESX, a widely used commercial hypervisor from VMware. Our extensive experiments have shown its strong effectiveness for various data-intensive applications.

Proceedings ArticleDOI
07 Dec 2013
TL;DR: The Decoupled Compressed Cache (DCC) is proposed, which exploits spatial locality to improve both the performance and energy-efficiency of cache compression and nearly doubles the benefits of previous compressed caches with similar area overhead.
Abstract: In multicore processor systems, last-level caches (LLCs) play a crucial role in reducing system energy by i) filtering out expensive accesses to main memory and ii) reducing the time spent executing in high-power states. Cache compression can increase effective cache capacity and reduce misses, improve performance, and potentially reduce system energy. However, previous compressed cache designs have demonstrated only limited benefits due to internal fragmentation and limited tags. In this paper, we propose the Decoupled Compressed Cache (DCC), which exploits spatial locality to improve both the performance and energy-efficiency of cache compression. DCC uses decoupled super-blocks and non-contiguous sub-block allocation to decrease tag overhead without increasing internal fragmentation. Non-contiguous sub-blocks also eliminate the need for energy-expensive re-compaction when a block's size changes. Compared to earlier compressed caches, DCC increases normalized effective capacity to a maximum of 4 and an average of 2.2 for a wide range of workloads. A further optimized Co-DCC (Co-Compacted DCC) design improves the average normalized effective capacity to 2.6 by co-compacting the compressed blocks in a super-block. Our simulations show that DCC nearly doubles the benefits of previous compressed caches with similar area overhead. We also demonstrate a practical DCC design based on a recent commercial LLC design.

Patent
07 Jan 2013
TL;DR: In this paper, a cache controller determines whether the region in main memory from which a cache block originates is configured in write-back or write-through mode, and then performs a corresponding write operation in the cache memory.
Abstract: The described embodiments include a main memory and a cache memory (or "cache") with a cache controller that includes a mode-setting mechanism. In some embodiments, the mode-setting mechanism is configured to dynamically determine an access pattern for the main memory. Based on the determined access pattern, the mode-setting mechanism configures at least one region of the main memory in a write-back mode and configures other regions of the main memory in a write-through mode. In these embodiments, when performing a write operation in the cache memory, the cache controller determines whether the region in the main memory from which the cache block originates is configured in the write-back mode or the write-through mode, and then performs a corresponding write operation in the cache memory.
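
One plausible reading of the mode-setting mechanism, sketched with an assumed write-ratio heuristic (the patent leaves the concrete access-pattern test open): regions that absorb many writes go into write-back mode so the cache can coalesce them, while read-mostly regions stay write-through.

from collections import Counter

REGION_SIZE = 1 << 20          # example 1MB region granularity
WRITE_RATIO_THRESHOLD = 0.25   # assumed tuning knob

class ModeSetter:
    def __init__(self):
        self.writes, self.accesses = Counter(), Counter()

    def record(self, addr, is_write):
        region = addr // REGION_SIZE
        self.accesses[region] += 1
        self.writes[region] += int(is_write)

    def mode(self, addr):
        region = addr // REGION_SIZE
        total = self.accesses[region]
        if total and self.writes[region] / total > WRITE_RATIO_THRESHOLD:
            return "write-back"
        return "write-through"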

Proceedings ArticleDOI
07 Oct 2013
TL;DR: HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications and outperforms LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
Abstract: Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as GPU cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of threads supported. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For cache sensitive CPU applications, a reduced share of the LLC could lead to significant performance degradation. In contrast, GPU applications can often tolerate increased memory access latency in the presence of LLC misses when there is sufficient thread-level parallelism. In this work, we propose Heterogeneous LLC Management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications. GPU LLC access throttling is achieved by allowing GPU threads that can tolerate longer memory access latencies to bypass the LLC. The latency tolerance of a GPU application is determined by the availability of thread-level parallelism, which can be measured at runtime as the average number of threads that are available for issuing. Our heterogeneous LLC management scheme outperforms the LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
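
Reduced to its core test, HeLM's bypass decision can be sketched in a couple of lines; the threshold and the sampled statistic are assumptions here, since the paper derives them from runtime measurement of available threads:

TLP_THRESHOLD = 16   # assumed; the paper tunes this from runtime sampling

def gpu_request_bypasses_llc(avg_ready_threads):
    # Enough ready threads to hide memory latency: bypass the LLC and
    # leave its capacity to cache-sensitive CPU applications.
    return avg_ready_threads >= TLP_THRESHOLD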

Proceedings ArticleDOI
03 Dec 2013
TL;DR: A coordinated cache and bank coloring scheme that is designed to prevent cache and bank interference simultaneously is presented and implemented in the Linux kernel.
Abstract: In commercial-off-the-shelf (COTS) multi-core systems, the execution times of tasks become hard to predict because of contention on shared resources in the memory hierarchy. In particular, a task running in one processor core can delay the execution of another task running in another processor core. This is due to the fact that tasks can access data in the same cache set shared among processor cores or in the same memory bank in the DRAM memory (or both). Such cache and bank interference effects have motivated the need to create isolation mechanisms for resources accessed by more than one task. One popular isolation mechanism is cache coloring that divides the cache into multiple partitions. With cache coloring, each task can be assigned exclusive cache partitions, thereby preventing cache interference from other tasks. Similarly, bank coloring allows assigning exclusive bank partitions to tasks. While cache coloring and some bank coloring mechanisms have been studied separately, interactions between the two schemes have not been studied. Specifically, while memory accesses to two different bank colors do not interfere with each other at the bank level, they may interact at the cache level. Similarly, two different cache colors avoid cache interference but may not prevent bank interference. Therefore it is necessary to coordinate cache and bank coloring approaches. In this paper, we present a coordinated cache and bank coloring scheme that is designed to prevent cache and bank interference simultaneously. We also developed color allocation algorithms for configuring a virtual memory system to support our scheme which has been implemented in the Linux kernel. In our experiments, we observed that the execution time can increase by 60% due to inter-task interference when we use only cache coloring. Our coordinated approach can reduce this figure down to 12% (an 80% reduction).
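
The need for coordination follows from the address arithmetic: cache color and bank color are determined by different physical-address bits, so an allocator must hand a task pages that match both of its assigned colors at once. A sketch with illustrative bit positions (real positions depend on the LLC geometry and the DRAM controller's mapping):

CACHE_COLOR_BITS = 5   # e.g., 32 cache colors (illustrative)
BANK_COLOR_BITS  = 3   # e.g., 8 bank colors (illustrative)

def cache_color(pfn):   # pfn: physical frame number
    return pfn & ((1 << CACHE_COLOR_BITS) - 1)

def bank_color(pfn):
    return (pfn >> CACHE_COLOR_BITS) & ((1 << BANK_COLOR_BITS) - 1)

def page_ok_for(pfn, task_cache_color, task_bank_color):
    # Two pages can share a bank color yet differ in cache color, and
    # vice versa, which is why the two schemes must be allocated jointly.
    return (cache_color(pfn) == task_cache_color and
            bank_color(pfn) == task_bank_color)

print(cache_color(0x12345), bank_color(0x12345))   # -> 5 2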

Patent
11 Mar 2013
TL;DR: In this article, a multi-core cache coherence system with a local/shared cache hierarchy is described. The system includes multiple processor cores, a main memory, and a local cache memory associated with each core for storing cache lines accessible only by the associated core.
Abstract: Systems and methods for cache coherence in a multi-core processing environment having a local/shared cache hierarchy. The system includes multiple processor cores, a main memory, and a local cache memory associated with each core for storing cache lines accessible only by the associated core. Cache lines are classified as either private or shared. A shared cache memory is coupled to the local cache memories and main memory for storing cache lines. The cores follow a write-back policy to the local memory for private cache lines, and a write-through policy to the shared memory for shared cache lines. Shared cache lines in local cache memory enter a transient dirty state when written by the core. Shared cache lines transition from a transient dirty to a valid state with a self-initiated write-through to the shared memory. The write-through to shared memory can include only data that was modified in the transient dirty state.

Proceedings ArticleDOI
07 Jul 2013
TL;DR: This paper proposes a novel caching approach that can achieve a significantly larger reduction in peak rate compared to previously known caching schemes, and argues that the performance of the proposed scheme is within a constant factor from the information-theoretic optimum for all values of the problem parameters.
Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content in memories at the end users. This paper proposes a novel caching approach that can achieve a significantly larger reduction in peak rate compared to previously known caching schemes. In particular, the improvement can be on the order of the number of end users in the network. Conventionally, cache memories are exploited by delivering requested contents in part locally rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e., the cache available at each individual user). In this paper, we introduce and exploit a second, global, caching gain, which is not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative cache available at all users), even though there is no cooperation among the caches. To evaluate and isolate these two gains, we introduce a new, information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, the proposed scheme exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared to previously known schemes. Moreover, we argue that the performance of the proposed scheme is within a constant factor from the information-theoretic optimum for all values of the problem parameters.
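
For concreteness, the scheme's achievable peak rate (in units of file size), for N files and K users each holding a cache of M files, is:

R(M) = K \left(1 - \frac{M}{N}\right) \cdot \frac{1}{1 + KM/N}

The factor (1 - M/N) is the local caching gain available to conventional prefetching; the extra factor 1/(1 + KM/N), which improves with the aggregate cache size KM even though the caches do not cooperate, is the global caching gain that coded delivery unlocks.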

Patent
30 Sep 2013
TL;DR: In this paper, the authors present a method and a system for improving performance of flash cache memory, such as in a host of a storage environment, by preventing a cache cell from reaching an operation limit.
Abstract: Example embodiments of the present invention relate to a method and a system for improving the performance of flash cache memory, such as in a host of a storage environment, by preventing a cache cell from reaching an operation limit. The method includes determining that a number of operations to a first cell of a flash memory has reached a threshold and managing the flash memory according to the determination to prevent a failure of a second cell of the flash memory.

Proceedings ArticleDOI
07 Oct 2013
TL;DR: Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations: the latency and energy of shared caches degrade as the system scales up, and co-running workloads interfere in shared cache accesses; prior NUCA and cache partitioning techniques each address only one of these issues.
Abstract: Shared last-level caches, widely used in chip-multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency. We present Jigsaw, a technique that jointly addresses the scalability and interference problems of shared caches. Hardware lets software define shares, collections of cache bank partitions that act as virtual caches, and map data to shares. Shares give software full control over both data placement and capacity allocation. Jigsaw implements efficient hardware support for share management, monitoring, and adaptation. We propose novel resource-management algorithms and use them to develop a system-level runtime that leverages Jigsaw to both maximize cache utilization and place data close to where it is used. We evaluate Jigsaw using extensive simulations of 16- and 64-core tiled CMPs. Jigsaw improves performance by up to 2.2x (18% avg) over a conventional shared cache, and significantly outperforms state-of-the-art NUCA and partitioning techniques.

Patent
01 Apr 2013
TL;DR: In this article, a method and apparatus for dynamically controlling the cache size of a processor is presented. But the method does not specify how to dynamically change the operating point of the processor, nor how to selectively remove power from different ways of the cache memory.
Abstract: A method and apparatus for dynamically controlling a cache size is disclosed. In one embodiment, a method includes changing an operating point of a processor from a first operating point to a second operating point, and selectively removing power from one or more ways of a cache memory responsive to changing the operating point. The method further includes processing one or more instructions in the processor subsequent to removing power from the one or more ways of the cache memory, wherein said processing includes accessing one or more ways of the cache memory from which power was not removed.

Proceedings Article
26 Jun 2013
TL;DR: It is found that the chief benefit of the flash cache is its size, not its persistence, and for some workloads a large flash cache allows using minuscule amounts of RAM for file caching, leaving more memory available for application use.
Abstract: Flash memory has recently become popular as a caching medium. Most uses to date are on the storage server side. We investigate a different structure: flash as a cache on the client side of a networked storage environment. We use trace-driven simulation to explore the design space. We consider a wide range of configurations and policies to determine the potential client-side caches might offer and how best to arrange them. Our results show that the flash cache write-back policy does not significantly affect performance. Write-through is sufficient; this greatly simplifies cache consistency handling. We also find that the chief benefit of the flash cache is its size, not its persistence. Cache persistence offers additional performance benefits at system restart at essentially no runtime cost. Finally, for some workloads a large flash cache allows using minuscule amounts of RAM for file caching (e.g., 256 KB), leaving more memory available for application use.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: In-memory object caches such as memcached are critical to the success of popular web sites, reducing database load and improving scalability, but cache configuration is poorly understood.
Abstract: Large-scale in-memory object caches such as memcached are widely used to accelerate popular web sites and to reduce burden on backend databases. Yet current cache systems give cache operators limited information on what resources are required to optimally accommodate the present workload. This paper focuses on a key question for cache operators: how much total memory should be allocated to the in-memory cache tier to achieve desired performance? We present our Mimir system: a lightweight online profiler that hooks into the replacement policy of each cache server and produces graphs of the overall cache hit rate as a function of memory size. The profiler enables cache operators to dynamically project the cost and performance impact from adding or removing memory resources within a distributed in-memory cache, allowing "what-if" questions about cache performance to be answered without laborious offline tuning. Internally, Mimir uses a novel lock-free algorithm and lookup filters for quickly and dynamically estimating the hit rate of LRU caches. Running Mimir as a profiler requires minimal changes to the cache server, thanks to a lean API. Our experiments show that Mimir produces dynamic hit rate curves with over 98% accuracy and 2--5% overhead on request latency and throughput when Mimir is run in tandem with memcached, suggesting online cache profiling can be a practical tool for improving provisioning of large caches.
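
Conceptually, the curves Mimir produces are the classic LRU stack-distance profile: the histogram of reuse distances, accumulated in one pass over the accesses, yields the hit rate at every cache size simultaneously. A naive sketch of that computation (Mimir's contribution is doing this cheaply with lock-free buckets instead of the O(n) stack scan per access used here):

from collections import Counter

def hit_rate_curve(trace):
    distances = Counter()   # stack distance -> access count
    stack = []              # LRU stack, most recent at the end
    for key in trace:
        if key in stack:
            distances[len(stack) - stack.index(key) - 1] += 1
            stack.remove(key)
        else:
            distances["inf"] += 1   # cold miss at any cache size
        stack.append(key)
    total = sum(distances.values())
    # Hit rate of an LRU cache of size c = share of accesses with distance < c.
    return [sum(n for d, n in distances.items() if d != "inf" and d < c) / total
            for c in range(1, len(stack) + 1)]

print(hit_rate_curve(list("abcabca")))   # hit rate grows with cache size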

Proceedings ArticleDOI
07 Dec 2013
TL;DR: A novel last-level cache replacement algorithm with approximately the same complexity and storage requirements as tree-based PseudoLRU, but with performance matching state of the art techniques such as dynamic re-reference interval prediction (DRRIP) and protecting distance policy (PDP).
Abstract: Last-level caches mitigate the high latency of main memory. A good cache replacement policy enables high performance for memory intensive programs. To be useful to industry, a cache replacement policy must deliver high performance without high complexity or cost. For instance, the costly least-recently-used (LRU) replacement policy is not used in highly associative caches; rather, inexpensive policies with similar performance such as PseudoLRU are used. We propose a novel last-level cache replacement algorithm with approximately the same complexity and storage requirements as tree-based PseudoLRU, but with performance matching state-of-the-art techniques such as dynamic re-reference interval prediction (DRRIP) and protecting distance policy (PDP). The algorithm is based on PseudoLRU, but uses set-dueling to dynamically adapt its insertion and promotion policy. It has slightly less than one bit of overhead per cache block, compared with two or more bits per cache block for competing policies. In this paper, we give the motivation behind the algorithm in the context of LRU with improved placement and promotion, then develop this motivation into a PseudoLRU-based algorithm, and finally give a version using set-dueling to allow adaptivity to changing program behavior. We show that, with a 16-way set-associative 4MB last-level cache, our adaptive PseudoLRU insertion and promotion algorithm yields a geometric mean speedup of 5.6% over true LRU over all the SPEC CPU 2006 benchmarks using far less overhead than LRU or other algorithms. On a memory-intensive subset of SPEC, the technique gives a geometric mean speedup of 15.6%. We show that the performance is comparable to state-of-the-art replacement policies that consume more than twice the area of our technique.
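
For reference, the tree-based PseudoLRU baseline can be sketched compactly: one pointer bit per internal node of a binary tree over the ways; victim selection follows the bits to a leaf, and each access flips the bits along its path to point away from the touched way. (The paper's contribution, set-dueling-based adaptive insertion and promotion, is layered on top of this structure.)

WAYS = 8   # a set with 8 ways needs WAYS - 1 tree bits

class PseudoLRUSet:
    def __init__(self):
        self.bits = [0] * (WAYS - 1)   # heap layout: children of i are 2i+1, 2i+2

    def victim(self):
        i = 0
        while i < WAYS - 1:
            i = 2 * i + 1 + self.bits[i]   # follow the pointer bits to a leaf
        return i - (WAYS - 1)              # leaf index -> way number

    def touch(self, way):
        i = way + (WAYS - 1)
        while i > 0:
            parent = (i - 1) // 2
            # Point the parent away from the subtree just accessed.
            self.bits[parent] = 0 if i == 2 * parent + 2 else 1
            i = parent

s = PseudoLRUSet()
for w in range(WAYS):
    s.touch(w)
print(s.victim())   # -> 0, the pseudo-least-recently-used way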

Proceedings ArticleDOI
30 Jun 2013
TL;DR: This paper analyzes the added write pressures that cache workloads place on flash devices and proposes optimizations at both the cache and flash management layers to improve endurance while maintaining or increasing cache hit rate.
Abstract: Flash memory is widely used for its fast random I/O access performance in a gamut of enterprise storage applications. However, due to the limited endurance and asymmetric write performance of flash memory, minimizing writes to a flash device is critical for both performance and endurance. Previous studies have focused on flash memory as a candidate for primary storage devices; little is known about its behavior as a Solid State Cache (SSC) device. In this paper, we propose HEC, a High Endurance Cache that aims to improve overall device endurance via reduced media writes and erases while maximizing cache hit rate performance. We analyze the added write pressures that cache workloads place on flash devices and propose optimizations at both the cache and flash management layers to improve endurance while maintaining or increasing cache hit rate. We demonstrate the individual and cumulative contributions of cache admission policy, cache eviction policy, flash garbage collection policy, and flash device configuration on a) hit rate, b) overall writes, and c) erases as seen by the SSC device. Through our improved cache and flash optimizations, 83% of the analyzed workload ensembles achieved increased or maintained hit rate with write reductions up to 20x, and erase count reductions up to 6x.

Patent
08 May 2013
TL;DR: In this article, a way table is provided for identifying which of a plurality of cache ways stores require data, and each way table entry corresponds to one of the translation look aside buffer (TLB) entries of the TLB and identifies, for each memory location, which cache way stores the data associated with that memory location.
Abstract: A data processing apparatus has a cache and a translation look aside buffer (TLB). A way table is provided for identifying which of a plurality of cache ways stores require data. Each way table entry corresponds to one of the TLB entries of the TLB and identifies, for each memory location of the page associated with the corresponding TLB entry, which cache way stores the data associated with that memory location. Also, the cache may be capable of servicing M access requests in the same processing cycle. An arbiter may select pending access requests for servicing by the cache in a way that ensures that the selected pending access requests specify a maximum of N different virtual page addresses, where N < M.

Patent
04 Feb 2013
TL;DR: In this paper, a cache controller is coupled to at least two flash bricks, each comprising a flash memory, and metadata is updated to indicate the flash brick having the flash memory on which data units are cached, wherein the metadata is used to determine the flash bricks on which the cache controller caches received data units.
Abstract: Provided is a method for managing cache memory to cache data units in at least one storage device. A cache controller is coupled to at least two flash bricks, each comprising a flash memory. Metadata indicates a mapping of the data units to the flash bricks caching the data units, wherein the metadata is used to determine the flash bricks on which the cache controller caches received data units. The metadata is updated to indicate the flash brick having the flash memory on which data units are cached.