
Showing papers on "Cache algorithms published in 2013"


Journal ArticleDOI
TL;DR: This paper presents a comprehensive survey of state-of-the-art techniques aiming to address caching issues, with particular focus on reducing cache redundancy and improving the availability of cached content.

343 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: Through the analysis, light is shed on how modern hardware affects the implementation of data operators and the fastest implementation of radix join to date is provided, reaching close to 200 million tuples per second.
Abstract: The architectural changes introduced with multi-core CPUs have triggered a redesign of main-memory join algorithms. In the last few years, two diverging views have appeared. One approach advocates careful tailoring of the algorithm to the architectural parameters (cache sizes, TLB, and memory bandwidth). The other approach argues that modern hardware is good enough at hiding cache and TLB miss latencies and, consequently, the careful tailoring can be omitted without sacrificing performance. In this paper we demonstrate through experimental analysis of different algorithms and architectures that hardware still matters. Join algorithms that are hardware conscious perform better than hardware-oblivious approaches. The analysis and comparisons in the paper show that many of the claims regarding the behavior of join algorithms that have appeared in the literature are due to selection effects (relative table sizes, tuple sizes, the underlying architecture, using sorted data, etc.) and are not supported by experiments run under different parameter settings. Through the analysis, we shed light on how modern hardware affects the implementation of data operators and provide the fastest implementation of radix join to date, reaching close to 200 million tuples per second.
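
As a concrete illustration of the hardware-conscious side of this debate, the sketch below shows a toy single-pass radix partitioning step of the kind radix join builds on: the number of partitions is chosen so that each partition fits within cache/TLB reach. Real implementations use multiple passes, software-managed buffers, and architecture-specific tuning; the Python below only illustrates the idea.

```python
def radix_partition(keys, bits=10):
    """Toy single-pass radix partitioning: scatter integer keys into
    2**bits partitions using their low-order bits, so matching keys of
    both join relations land in the same (cache-sized) partition."""
    parts = [[] for _ in range(1 << bits)]
    mask = (1 << bits) - 1
    for k in keys:
        parts[k & mask].append(k)
    return parts
```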

265 citations


Proceedings ArticleDOI
03 Nov 2013
TL;DR: This paper instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses.
Abstract: This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers. We instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos. This permits us to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses. Our results (1) quantify the overall traffic percentages served by different layers: 65.5% browser cache, 20.0% Edge Cache, 4.6% Origin Cache, and 9.9% Backend storage, (2) reveal that a significant portion of photo requests are routed to remote PoPs and data centers as a consequence both of load-balancing and peering policy, (3) demonstrate the potential performance benefits of coordinating Edge Caches and adopting S4LRU eviction algorithms at both Edge and Origin layers, and (4) show that the popularity of photos is highly dependent on content age and conditionally dependent on the social-networking metrics we considered.
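
The abstract credits part of the potential gains to S4LRU eviction at the Edge and Origin layers. Below is a minimal Python sketch of a segmented LRU with four levels in the spirit of S4LRU; splitting capacity evenly across segments and the exact promotion/demotion bookkeeping are assumptions for illustration.

```python
from collections import OrderedDict

class S4LRU:
    """Segmented LRU with 4 levels: items enter at level 0, a hit promotes
    an item one level up (capped at the top level), and when a level
    overflows its least-recent item is demoted to the level below;
    overflow from level 0 is evicted from the cache."""

    def __init__(self, capacity, levels=4):
        self.per_level = max(1, capacity // levels)   # assumed even split
        self.segments = [OrderedDict() for _ in range(levels)]

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then admitted)."""
        for level, seg in enumerate(self.segments):
            if key in seg:
                seg.pop(key)
                self._insert(min(level + 1, len(self.segments) - 1), key)
                return True
        self._insert(0, key)
        return False

    def _insert(self, level, key):
        seg = self.segments[level]
        seg[key] = True                               # most-recently-used end
        while len(seg) > self.per_level:
            victim, _ = seg.popitem(last=False)       # least-recently-used end
            if level > 0:
                self._insert(level - 1, victim)       # demote one level down
            # a victim popped from level 0 is simply evicted
```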

225 citations


Patent
25 Jan 2013
TL;DR: In this paper, a de-duplication cache is configured to cache data for access by a plurality of different storage clients, such as virtual machines, and metadata pertaining to the contents of the cache may be persisted and/or transferred with respective storage clients.
Abstract: A de-duplication cache is configured to cache data for access by a plurality of different storage clients, such as virtual machines. A virtual machine may comprise a virtual machine de-duplication module configured to identify data for admission into the de-duplication cache. Data admitted into the de-duplication cache may be accessible by two or more storage clients. Metadata pertaining to the contents of the de-duplication cache may be persisted and/or transferred with respective storage clients such that the storage clients may access the contents of the de-duplication cache after rebooting, being power cycled, and/or being transferred between hosts.
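
A hypothetical sketch of the core idea, not the patent's design: blocks are keyed by a fingerprint of their contents so identical data admitted by different storage clients is stored once. The LRU eviction and the per-client mapping shown here are illustrative assumptions.

```python
import hashlib
from collections import OrderedDict

class DedupCache:
    """Content-addressed cache sketch: identical blocks written by different
    clients (e.g. VMs) share a single cache entry keyed by a content hash."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()        # fingerprint -> data (LRU order)
        self.client_maps = {}              # client_id -> {logical_addr: fingerprint}

    def admit(self, client_id, logical_addr, data):
        fp = hashlib.sha256(data).hexdigest()
        self.client_maps.setdefault(client_id, {})[logical_addr] = fp
        if fp in self.blocks:
            self.blocks.move_to_end(fp)    # already cached: deduplicated
            return
        self.blocks[fp] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def read(self, client_id, logical_addr):
        fp = self.client_maps.get(client_id, {}).get(logical_addr)
        if fp is not None and fp in self.blocks:
            self.blocks.move_to_end(fp)
            return self.blocks[fp]
        return None                        # miss: fetch from backing store
```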

223 citations


Proceedings ArticleDOI
09 Apr 2013
TL;DR: A complete framework to analyze and profile task memory access patterns and a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas are proposed.
Abstract: Multi-core architectures are shaking the fundamental assumption that in real-time systems the WCET, used to analyze the schedulability of the complete system, is calculated on individual tasks. This is not even true in an approximate sense in a modern multi-core chip, due to interference caused by hardware resource sharing. In this work we propose (1) a complete framework to analyze and profile task memory access patterns and (2) a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas. In this way, we provide a powerful tool to address one of the main sources of interference in a system where the last level of cache is shared among two or more CPUs. The technique has been implemented on commercial hardware and our evaluations show that it can be used to significantly improve the predictability of a given set of critical tasks.

207 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.
Abstract: Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip.This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
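
A minimal sketch of the footprint idea: while a page is resident, record which of its blocks are touched; on eviction, remember that set and use it to fetch only the predicted blocks the next time the page is allocated. Indexing the history by page address alone is a simplification of the paper's predictor.

```python
class FootprintPredictor:
    """Track per-page footprints (touched block offsets) across residencies."""

    def __init__(self, blocks_per_page=64):
        self.blocks_per_page = blocks_per_page
        self.resident = {}    # page -> set of block offsets touched so far
        self.history = {}     # page -> footprint observed last residency

    def allocate(self, page):
        """Start a new residency; return the blocks to fetch from memory."""
        self.resident[page] = set()
        predicted = self.history.get(page, set(range(self.blocks_per_page)))
        return sorted(predicted)

    def touch(self, page, block_offset):
        if page in self.resident:
            self.resident[page].add(block_offset)

    def evict(self, page):
        """End the residency and remember the observed footprint."""
        self.history[page] = self.resident.pop(page, set())
```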

207 citations


Proceedings Article
14 Aug 2013
TL;DR: The results obtained exhibit the influence of cache size, line size, associativity, replacement policy, and coding style on the security of the executables and include the first formal proofs of security for implementations with countermeasures such as preloading and data-independent memory access patterns.
Abstract: We present CacheAudit, a versatile framework for the automatic, static analysis of cache side channels. Cache-Audit takes as input a program binary and a cache configuration, and it derives formal, quantitative security guarantees for a comprehensive set of side-channel adversaries, namely those based on observing cache states, traces of hits and misses, and execution times. Our technical contributions include novel abstractions to efficiently compute precise overapproximations of the possible side-channel observations for each of these adversaries. These approximations then yield upper bounds on the information that is revealed. In case studies we apply CacheAudit to binary executables of algorithms for symmetric encryption and sorting, obtaining the first formal proofs of security for implementations with countermeasures such as preloading and data-independent memory access patterns.

199 citations


Journal ArticleDOI
TL;DR: This paper investigates the problem of how to cache a set of media files with optimal streaming rates, under HTTP adaptive bit rate streaming over wireless networks, and finds there is a fundamental phase change in the optimal solution as the number of cached files grows.
Abstract: In this paper, we investigate the problem of optimal content cache management for HTTP adaptive bit rate (ABR) streaming over wireless networks. Specifically, in the media cloud, each content is transcoded into a set of media files with diverse playback rates, and appropriate files will be dynamically chosen in response to channel conditions and screen forms. Our design objective is to maximize the quality of experience (QoE) of an individual content for the end users, under a limited storage budget. Deriving a logarithmic QoE model from our experimental results, we formulate the individual content cache management for HTTP ABR streaming over wireless network as a constrained convex optimization problem. We adopt a two-step process to solve the snapshot problem. First, using the Lagrange multiplier method, we obtain the numerical solution of the set of playback rates for a fixed number of cache copies and characterize the optimal solution analytically. Our investigation reveals a fundamental phase change in the optimal solution as the number of cached files increases. Second, we develop three alternative search algorithms to find the optimal number of cached files, and compare their scalability under average and worst complexity metrics. Our numerical results suggest that, under optimal cache schemes, the maximum QoE measurement, i.e., mean-opinion-score (MOS), is a concave function of the allowable storage size. Our cache management can provide high expected QoE with low complexity, shedding light on the design of HTTP ABR streaming services over wireless networks.
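
Loosely following the two-step process described above, the sketch below solves the snapshot problem numerically for a fixed number of cached copies and then scans over that number. The logarithmic QoE form, the weights, and the plain linear outer search are illustrative assumptions; the paper derives the Lagrangian solution analytically and compares several search algorithms.

```python
import numpy as np
from scipy.optimize import minimize

def best_rates(m, budget, duration, p, r_min=0.2):
    """For m cached copies of one content, choose playback rates maximizing an
    assumed log-QoE weighted by the probability p[i] that copy i is served,
    subject to the storage budget (rate * duration per copy)."""
    def neg_qoe(r):
        return -np.sum(p[:m] * np.log(r / r_min))
    cons = ({'type': 'ineq', 'fun': lambda r: budget - duration * np.sum(r)},)
    r0 = np.full(m, budget / (duration * m))
    res = minimize(neg_qoe, r0, bounds=[(r_min, None)] * m, constraints=cons)
    return res.x, -res.fun

def best_num_copies(budget, duration, p, max_m=6):
    """Outer search over the number of cached copies (linear scan for clarity)."""
    m = max(range(1, max_m + 1),
            key=lambda k: best_rates(k, budget, duration, p)[1])
    return m, best_rates(m, budget, duration, p)[0]

# Example with made-up weights: 10 units of storage, 1-hour content.
p = np.array([0.4, 0.25, 0.15, 0.1, 0.06, 0.04])
m, rates = best_num_copies(budget=10.0, duration=1.0, p=p)
```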

186 citations


Proceedings ArticleDOI
12 Feb 2013
TL;DR: A novel buffer cache architecture is presented that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM and shows that this scheme improves I/O performance by 76% on average and up to 240% compared to the existing Linux buffer cache with ext4 without any loss of reliability.
Abstract: Journaling techniques are widely used in modern file systems as they provide high reliability and fast recovery from system failures. However, journaling reduces the performance benefit of buffer caching, as it accounts for a bulk of the storage writes in real system environments. In this paper, we present a novel buffer cache architecture that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM. Specifically, our buffer cache supports what we call the in-place commit scheme. This scheme avoids logging, but still provides the same journaling effect by simply altering the state of the cached block to frozen. As a frozen block still performs the function of caching, we show that in-place commit does not degrade cache performance. We implement our scheme on Linux 2.6.38 and measure the throughput and execution time of the scheme with various file I/O benchmarks. The results show that our scheme improves I/O performance by 76% on average and up to 240% compared to the existing Linux buffer cache with ext4 without any loss of reliability.
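
A toy sketch of the in-place commit idea: committing a transaction flips dirty blocks to a frozen state instead of writing journal copies, and frozen blocks keep serving reads until the file system persists them. Keeping frozen versions in a separate dictionary is an illustrative simplification, not the paper's mechanism.

```python
class NVBufferCache:
    """Non-volatile buffer cache sketch where commit is a state change only."""

    def __init__(self):
        self.working = {}     # block_no -> latest (possibly dirty) data
        self.dirty = set()    # block numbers written since the last commit
        self.frozen = {}      # block_no -> committed data awaiting writeback

    def write(self, block_no, data):
        self.working[block_no] = data
        self.dirty.add(block_no)

    def commit(self):
        """In-place commit: no log writes, just mark current versions frozen."""
        for block_no in self.dirty:
            self.frozen[block_no] = self.working[block_no]
        self.dirty.clear()

    def read(self, block_no):
        return self.working.get(block_no, self.frozen.get(block_no))

    def writeback_done(self, block_no):
        """Called once the file system has persisted the committed version."""
        self.frozen.pop(block_no, None)
```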

171 citations


Proceedings ArticleDOI
12 Aug 2013
TL;DR: This paper designs five different hash-routing schemes which efficiently exploit in-network caches without requiring network routers to maintain per-content state information and shows that such schemes can increase cache hits by up to 31% in comparison to on-path caching, with minimal impact on the traffic dynamics of intra-domain links.
Abstract: Hash-routing has been proposed in the past as a mapping mechanism between object requests and cache clusters within enterprise networks. In this paper, we revisit hash-routing techniques and apply them to Information-Centric Networking (ICN) environments, where network routers have cache space readily available. In particular, we investigate whether hash-routing is a viable and efficient caching approach when applied outside enterprise networks, but within the boundaries of a domain. We design five different hash-routing schemes which efficiently exploit in-network caches without requiring network routers to maintain per-content state information. We evaluate the proposed hash-routing schemes using extensive simulations over real Internet domain topologies and compare them against various on-path caching mechanisms. We show that such schemes can increase cache hits by up to 31% in comparison to on-path caching, with minimal impact on the traffic dynamics of intra-domain links.
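
The simplest of these schemes maps each content name to one responsible in-domain cache by hashing, so routers need no per-content state. A minimal sketch (the hash function and modulo mapping are illustrative choices):

```python
import hashlib

def responsible_cache(content_name, cache_nodes):
    """Hash the content name to pick the one in-domain cache responsible for
    it; requests are forwarded there first, and only on a miss do they leave
    the domain toward the origin."""
    digest = hashlib.md5(content_name.encode()).hexdigest()
    return cache_nodes[int(digest, 16) % len(cache_nodes)]

caches = ["r1", "r3", "r7", "r9"]
print(responsible_cache("/videos/clip42/segment3", caches))
```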

142 citations


Journal ArticleDOI
TL;DR: This paper focuses on cache pollution attacks, where the adversary's goal is to disrupt cache locality to increase link utilization and cache misses for honest consumers, and illustrates that existing proactive countermeasures are ineffective against realistic adversaries.

Patent
17 Sep 2013
TL;DR: A hop-count-based content caching scheme is proposed in which a routing node first judges whether to cache a received content chunk based on its attributes and then caches it with probability 1/hop-count, reducing network traffic.
Abstract: Disclosed is hop-count-based content caching. The present invention implements hop-count-based content cache placement strategies that efficiently decrease network traffic: the routing node first judges whether to cache a content chunk by examining the attributes of the received chunk; it then makes a second, probabilistic judgment to cache the chunk with probability 1/hop-count; and when the chunk passes this second judgment, the content chunk and its hop-count information are stored in the routing node's cache memory.
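
A hypothetical sketch of the two-stage decision described in the claim; the attribute check and field names are stand-ins for illustration.

```python
import random

def maybe_cache(chunk, hop_count, cache):
    """Two-stage caching decision: an attribute-based primary judgment,
    then a probabilistic secondary judgment with probability 1/hop_count.
    The cached entry keeps the hop-count information alongside the data."""
    if not chunk.get("cacheable", True):              # primary judgment
        return False
    if random.random() < 1.0 / max(1, hop_count):     # secondary judgment
        cache[chunk["name"]] = {"data": chunk["data"], "hops": hop_count}
        return True
    return False
```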

Proceedings ArticleDOI
01 May 2013
TL;DR: This work focuses on the cache allocation problem, namely how to distribute cache capacity across routers under a constrained total storage budget for the network; it formulates this as a content placement problem and obtains the exact optimal solution by a two-step method.
Abstract: Content-Centric Networking (CCN) is a promising framework for evolving the current network architecture, advocating ubiquitous in-network caching to enhance content delivery. Consequently, in CCN, each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem: namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We formulate this problem as a content placement problem and obtain the exact optimal solution by a two-step method. Through simulations, we use this algorithm to investigate the factors that affect the optimal cache allocation in CCN, such as the network topology and the popularity of content. We find that a highly heterogeneous topology tends to put most of the capacity over a few central nodes. On the other hand, heterogeneous content popularity has the opposite effect, by spreading capacity across far more nodes. Using our findings, we make observations on how network operators could best deploy CCN cache capacity.

Proceedings ArticleDOI
20 May 2013
TL;DR: This work obtains the first communication-optimal algorithm for all dimensions of rectangular matrices by combining the dimension-splitting technique with the recursive BFS/DFS approach, and shows significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
Abstract: Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
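
The sketch below shows the dimension-splitting recursion in sequential form: always split the largest of the three matrix dimensions in half and recurse. The BFS/DFS parallel scheduling and the communication-cost analysis that make the actual algorithm communication-optimal are omitted.

```python
import numpy as np

def split_largest_multiply(A, B, C, cutoff=64):
    """Cache-oblivious-style recursive matmul: split the largest of (m, n, k)
    and recurse; below the cutoff, fall back to a plain product. C is
    accumulated in place (slices are numpy views)."""
    m, k = A.shape
    _, n = B.shape
    if max(m, n, k) <= cutoff:
        C += A @ B
        return
    if m >= n and m >= k:                 # split rows of A and C
        h = m // 2
        split_largest_multiply(A[:h], B, C[:h], cutoff)
        split_largest_multiply(A[h:], B, C[h:], cutoff)
    elif n >= k:                          # split columns of B and C
        h = n // 2
        split_largest_multiply(A, B[:, :h], C[:, :h], cutoff)
        split_largest_multiply(A, B[:, h:], C[:, h:], cutoff)
    else:                                 # split the shared dimension k
        h = k // 2
        split_largest_multiply(A[:, :h], B[:h], C, cutoff)
        split_largest_multiply(A[:, h:], B[h:], C, cutoff)

A, B = np.random.rand(200, 300), np.random.rand(300, 150)
C = np.zeros((200, 150))
split_largest_multiply(A, B, C)
assert np.allclose(C, A @ B)
```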

Proceedings ArticleDOI
23 Feb 2013
TL;DR: By adopting i2WAP, a new cache management policy that can reduce both inter- and intra-set write variations, this work can improve the lifetime of on-chip non-volatile caches by 75% on average and up to 224%.
Abstract: Modern computers require large on-chip caches, but the scalability of traditional SRAM and eDRAM caches is constrained by leakage and cell density. Emerging non-volatile memory (NVM) is a promising alternative to build large on-chip caches. However, limited write endurance is a common problem for non-volatile memory technologies. In addition, today's cache management might result in unbalanced write traffic to cache blocks causing heavily-written cache blocks to fail much earlier than others. Unfortunately, existing wear-leveling techniques for NVM-based main memories cannot be simply applied to NVM-based on-chip caches because cache writes have intra-set variations as well as inter-set variations. To solve this problem, we propose i2WAP, a new cache management policy that can reduce both inter- and intra-set write variations. i2WAP has two features: (1) Swap-Shift, an enhancement based on previous main memory wear-leveling to reduce cache inter-set write variations; (2) Probabilistic Set Line Flush, a novel technique to reduce cache intra-set write variations. Implementing i2WAP only needs two global counters and two global registers. By adopting i2WAP, we can improve the lifetime of on-chip non-volatile caches by 75% on average and up to 224%.

Proceedings ArticleDOI
09 Jul 2013
TL;DR: A practical OS-level cache management scheme for multi-core real-time systems that provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set is proposed.
Abstract: Many modern multi-core processors sport a large shared cache with the primary goal of enhancing the statistic performance of computing workloads. However, due to resulting cache interference among tasks, the uncontrolled use of such a shared cache can significantly hamper the predictability and analyzability of multi-core real-time systems. Software cache partitioning has been considered as an attractive approach to address this issue because it does not require any hardware support beyond that available on many modern processors. However, the state-of-the-art software cache partitioning techniques face two challenges: (1) the memory co-partitioning problem, which results in page swapping or waste of memory, and (2) the availability of a limited number of cache partitions, which causes degraded performance. These are major impediments to the practical adoption of software cache partitioning. In this paper, we propose a practical OS-level cache management scheme for multi-core real-time systems. Our scheme provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set. We have implemented and evaluated our scheme in Linux/RK running on the Intel Core i7 quad-core processor. Experimental results indicate that, compared to the traditional approaches, our scheme is up to 39% more memory space efficient and consumes up to 25% less cache partitions while maintaining cache predictability. Our scheme also yields a significant utilization benefit that increases with the number of tasks.
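
Software cache partitioning schemes like this one build on page coloring. The sketch below shows only the basic color arithmetic; the cache geometry is an assumed example, and the paper's contributions (avoiding memory co-partitioning and coping with few partitions) are not represented.

```python
def cache_color(phys_addr, line_size=64, num_sets=8192, page_size=4096):
    """Return the cache 'color' of a physical address: the bits of the set
    index that lie above the page offset. Pages of the same color compete
    for the same cache sets, so an OS allocator can confine each task (or
    partition) to a subset of colors."""
    set_index = (phys_addr // line_size) % num_sets
    sets_per_page = page_size // line_size
    return set_index // sets_per_page

# With 64B lines, 8192 sets and 4KiB pages there are 8192/64 = 128 colors.
print(cache_color(0x12345000))
```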

Journal ArticleDOI
TL;DR: This paper presents an autonomic cache management approach for ICNs, where distributed managers residing in cache-enabled nodes decide which information items to cache, and proposes four on-line intra-domain cache management algorithms with different levels of autonomicity.
Abstract: The main promise of current research efforts in the area of Information-Centric Networking (ICN) architectures is to optimize the dissemination of information within transient communication relationships of endpoints. Efficient caching of information is key to delivering on this promise. In this paper, we look into achieving this promise from the angle of managed replication of information. Management decisions are made in order to efficiently place replicas of information in dedicated storage devices attached to nodes of the network. In contrast to traditional off-line external management systems, we adopt a distributed autonomic management architecture where management intelligence is placed inside the network. Particularly, we present an autonomic cache management approach for ICNs, where distributed managers residing in cache-enabled nodes decide which information items to cache. We propose four on-line intra-domain cache management algorithms with different levels of autonomicity and compare them with respect to performance, complexity, execution time and message exchange overhead. Additionally, we derive a lower bound of the overall network traffic cost for a certain category of network topologies. Our extensive simulations, using realistic network topologies and synthetic workload generators, signify the importance of network-wide knowledge and cooperation.

Proceedings ArticleDOI
14 Apr 2013
TL;DR: This work demonstrates that certain cache networks are non-ergodic in that their steady-state characterization depends on the initial state of the system, and establishes several important properties of cache networks, in the form of three independently-sufficient conditions for a cache network to comprise a single ergodic component.
Abstract: Over the past few years Content-Centric Networking, a networking model in which host-to-content communication protocols are introduced, has been gaining much attention. A central component of such an architecture is a large-scale interconnected caching system. To date, the way these Cache Networks operate and perform is still poorly understood. In this work, we demonstrate that certain cache networks are non-ergodic in that their steady-state characterization depends on the initial state of the system. We then establish several important properties of cache networks, in the form of three independently-sufficient conditions for a cache network to comprise a single ergodic component. Each property targets a different aspect of the system - topology, admission control and cache replacement policies. Perhaps most importantly we demonstrate that cache replacement can be grouped into equivalence classes, such that the ergodicity (or lack-thereof) of one policy implies the same property holds for all policies in the class.

Proceedings ArticleDOI
06 May 2013
TL;DR: Lazy Adaptive Replacement Cache (LARC), a novel cache management algorithm for flash-based disk caches, filters out seldom-accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: The increasing popularity of flash memory has changed storage systems. Flash-based solid state drives (SSDs) are now widely deployed as caches for magnetic hard disk drives (HDDs) to speed up data-intensive applications. However, existing cache algorithms focus exclusively on performance improvements and ignore the write endurance of SSDs. In this paper, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC filters out seldom-accessed blocks and prevents them from entering the cache. This avoids cache pollution and keeps popular blocks in the cache for a longer period of time, leading to a higher hit rate. Meanwhile, LARC reduces the number of cache replacements and thus incurs less write traffic to the SSD, especially for read-dominant workloads. In this way, LARC improves performance and extends SSD lifetime at the same time. LARC is self-tuning and low-overhead. It has been extensively evaluated by both trace-driven simulations and a prototype implementation in flashcache. Our experiments show that LARC outperforms state-of-the-art algorithms and reduces write traffic to the SSD by up to 94.5% for read-dominant workloads and by 11.2-40.8% for write-dominant workloads.
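
A minimal sketch of LARC's admission idea: a block enters the SSD cache only on its second recent access, with first-time accesses recorded in a ghost queue of block IDs. The paper's self-tuning of the ghost-queue size is omitted here; a fixed fraction is assumed.

```python
from collections import OrderedDict

class LARC:
    """Lazy admission: one-shot blocks never pollute the cache or cost an
    SSD write, because admission requires a recent prior access recorded
    in the ghost queue (which stores IDs only, no data)."""

    def __init__(self, capacity, ghost_fraction=0.1):
        self.capacity = capacity
        self.ghost_capacity = max(1, int(capacity * ghost_fraction))
        self.cache = OrderedDict()   # block_id -> data (on SSD)
        self.ghost = OrderedDict()   # block_id -> None (admission candidates)

    def access(self, block_id, data):
        if block_id in self.cache:                 # hit: just update recency
            self.cache.move_to_end(block_id)
            return True
        if block_id in self.ghost:                 # second access: admit now
            self.ghost.pop(block_id)
            self.cache[block_id] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)     # evict LRU block
        else:                                      # first access: remember ID
            self.ghost[block_id] = None
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return False
```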

Proceedings ArticleDOI
18 Mar 2013
TL;DR: This work proposes DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy, and proposes pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM.
Abstract: Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach -- shift based write -- that offers a fast and energy-efficient alternative to performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, which are tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves 8.2X improvement in energy and 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around 1.6X improvement in both area and energy under iso-performance conditions.

Proceedings ArticleDOI
18 Mar 2013
TL;DR: A novel parametric random placement suitable for PTA is proposed that is proven to have low hardware complexity and energy consumption while providing comparable performance to that of conventional modulo placement.
Abstract: Caches provide significant performance improvements, though their use in real-time industry is low because current WCET analysis tools require detailed knowledge of program's cache accesses to provide tight WCET estimates. Probabilistic Timing Analysis (PTA) has emerged as a solution to reduce the amount of information needed to provide tight WCET estimates, although it imposes new requirements on hardware design. At cache level, so far only fully-associative random-replacement caches have been proven to fulfill the needs of PTA, but they are expensive in size and energy. In this paper we propose a cache design that allows set-associative and direct-mapped caches to be analysed with PTA techniques. In particular we propose a novel parametric random placement suitable for PTA that is proven to have low hardware complexity and energy consumption while providing comparable performance to that of conventional modulo placement.
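
The gist of parametric random placement is that the set index is a hash of the line address and a random seed (re-drawn, for example, at each program run), so which lines conflict becomes a random event that probabilistic timing analysis can reason about. The hash below is an arbitrary software stand-in for the paper's low-cost hardware function.

```python
def random_placement_index(addr, seed, num_sets, line_size=64):
    """Map a memory address to a cache set using an address/seed hash
    rather than the conventional modulo placement."""
    line_addr = addr // line_size
    h = (line_addr * 2654435761 ^ seed) & 0xFFFFFFFF   # cheap integer mix
    return h % num_sets

# Same address, different per-run seeds -> (probably) different sets.
print(random_placement_index(0x4000_1234, seed=17, num_sets=256))
print(random_placement_index(0x4000_1234, seed=99, num_sets=256))
```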

Patent
05 Dec 2013
TL;DR: In this article, a cache and/or storage module may be configured to reduce write amplification in a cache storage, which may occur due to an over-permissive admission policy, or it may arise due to the write-once properties of the storage medium.
Abstract: A cache and/or storage module may be configured to reduce write amplification in a cache storage. Cache layer write amplification (CLWA) may occur due to an over-permissive admission policy. The cache module may be configured to reduce CLWA by configuring admission policies to avoid unnecessary writes. Admission policies may be predicated on access and/or sequentiality metrics. Flash layer write amplification (FLWA) may arise due to the write-once properties of the storage medium. FLWA may be reduced by delegating cache eviction functionality to the underlying storage layer. The cache and storage layers may be configured to communicate coordination information, which may be leveraged to improve the performance of cache and/or storage operations.
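
One example of an admission policy predicated on a sequentiality metric, as mentioned above: long sequential runs are treated as streaming I/O and kept out of the flash cache, since caching them would mostly add write amplification. The detector and threshold are illustrative assumptions, not the patent's method.

```python
class SequentialFilter:
    """Admit a block into the flash cache only if it does not extend a long
    sequential run (a crude single-stream sequentiality detector)."""

    def __init__(self, max_sequential_blocks=64):
        self.max_seq = max_sequential_blocks
        self.last_block = None
        self.run_length = 0

    def admit(self, block_no):
        if self.last_block is not None and block_no == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_block = block_no
        return self.run_length <= self.max_seq   # reject streaming I/O
```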

Proceedings ArticleDOI
17 Jun 2013
TL;DR: An intuitive performance model for cache-coherent architectures is developed and used to develop several optimal and optimized algorithms for complex parallel data exchanges that beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries.
Abstract: Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, complicating the understanding of performance properties. We developed an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms that were developed with the model beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to satisfy the tradeoff between complexity of algorithm design and accuracy. We expect that our model can serve as a vehicle for advanced algorithm design.

Proceedings ArticleDOI
18 Nov 2013
TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
Abstract: Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or the compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that compared to cache-all and bypass-all solutions, our techniques can achieve considerable performance improvement.
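
A toy selection pass in the spirit of this framework: estimate per-load reuse from profile data and mark low-reuse loads for bypass so the compiler can emit them with a cache-bypassing load modifier. The metric and threshold are assumptions, not the paper's algorithm.

```python
def select_bypass_loads(load_profiles, reuse_threshold=1.0):
    """Given per-load profiles with estimated line reuses and lines touched,
    decide for each global load whether to cache or bypass."""
    decisions = {}
    for load_id, prof in load_profiles.items():
        reuses_per_line = prof["reuses"] / max(1, prof["lines_touched"])
        decisions[load_id] = "bypass" if reuses_per_line < reuse_threshold else "cache"
    return decisions

# Example with made-up profile numbers for two load instructions.
profiles = {"ld_12": {"reuses": 5, "lines_touched": 100},
            "ld_37": {"reuses": 300, "lines_touched": 80}}
print(select_bypass_loads(profiles))   # ld_12 -> bypass, ld_37 -> cache
```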

Journal ArticleDOI
TL;DR: Analytical models for characterising cache performance trends at storage cache nodes are presented and have potential for guiding efficient resource allocations during initial deployments of the storage cloud infrastructure and timely interventions during operation in order to achieve scalable and resilient service delivery.
Abstract: With the growing popularity of cloud-based data centres as the enterprise IT platform of choice, there is a need for effective management strategies capable of maintaining performance within SLA and QoS parameters when responding to dynamic conditions such as increasing demand. Since current management approaches in the cloud infrastructure, particularly for data-intensive applications, lack the ability to systematically quantify performance trends, static approaches are largely employed in the allocations of resources when dealing with volatile demand in the infrastructure. We present analytical models for characterising cache performance trends at storage cache nodes. Practical validations of cache performance for derived theoretical trends show close approximations between modelled characterisations and measurement results for user request patterns involving private datasets and publicly available datasets. The models are extended to encompass hybrid scenarios based on concurrent requests of both private and public content. Our models have potential for guiding (a) efficient resource allocations during initial deployments of the storage cloud infrastructure and (b) timely interventions during operation in order to achieve scalable and resilient service delivery.

Proceedings ArticleDOI
Tian Luo, Siyuan Ma, Rubao Lee, Xiaodong Zhang, Deng Liu, Li Zhou
07 Oct 2013
TL;DR: The design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices is presented.
Abstract: A unique challenge for SSD storage caching management in a virtual machine (VM) environment is to accomplish the dual objectives: maximizing utilization of shared SSD cache devices and ensuring performance isolation among VMs. In this paper, we present our design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices. Due to a hypervisor's unique position between VMs and hardware resources, S-CAVE does not require any modification to guest OSes, user applications, or the underlying storage system. A critical issue to address in S-CAVE is how to allocate limited and shared SSD cache space among multiple VMs to achieve the dual goals. This is accomplished in two steps. First, we propose an effective metric to determine the demand for SSD cache space of each VM. Next, by incorporating this cache demand information into a dynamic control mechanism, S-CAVE is able to efficiently provide a fair share of cache space to each VM while achieving the goal of best utilizing the shared SSD cache device. In accordance with the constraints of all the functionalities of a hypervisor, S-CAVE incurs minimum overhead in both memory space and computing time. We have implemented S-CAVE in vSphere ESX, a widely used commercial hypervisor from VMWare. Our extensive experiments have shown its strong effectiveness for various data-intensive applications.
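
An illustrative allocation step reflecting the dual goals described above: each VM keeps a small guaranteed share of the SSD cache for isolation, and the remainder is divided in proportion to a measured per-VM demand metric. The metric and the guarantees here are assumptions, not S-CAVE's exact mechanism.

```python
def allocate_cache_shares(total_blocks, demand, min_share=0.05):
    """Split a shared SSD cache among VMs: a guaranteed minimum per VM plus
    a demand-proportional share of the remaining blocks."""
    guaranteed = int(total_blocks * min_share)
    remaining = max(0, total_blocks - guaranteed * len(demand))
    total_demand = sum(demand.values()) or 1
    return {vm: guaranteed + int(remaining * d / total_demand)
            for vm, d in demand.items()}

# Example with made-up demand scores per VM.
print(allocate_cache_shares(10000, {"vm1": 3.0, "vm2": 1.0, "vm3": 0.5}))
```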

Proceedings ArticleDOI
07 Dec 2013
TL;DR: The Decoupled Compressed Cache (DCC) is proposed, which exploits spatial locality to improve both the performance and energy-efficiency of cache compression and nearly doubles the benefits of previous compressed caches with similar area overhead.
Abstract: In multicore processor systems, last-level caches (LLCs) play a crucial role in reducing system energy by i) filtering out expensive accesses to main memory and ii) reducing the time spent executing in high-power states. Cache compression can increase effective cache capacity and reduce misses, improve performance, and potentially reduce system energy. However, previous compressed cache designs have demonstrated only limited benefits due to internal fragmentation and limited tags. In this paper, we propose the Decoupled Compressed Cache (DCC), which exploits spatial locality to improve both the performance and energy-efficiency of cache compression. DCC uses decoupled super-blocks and non-contiguous sub-block allocation to decrease tag overhead without increasing internal fragmentation. Non-contiguous sub-blocks also eliminate the need for energy-expensive re-compaction when a block's size changes. Compared to earlier compressed caches, DCC increases normalized effective capacity to a maximum of 4 and an average of 2.2 for a wide range of workloads. A further optimized Co-DCC (Co-Compacted DCC) design improves the average normalized effective capacity to 2.6 by co-compacting the compressed blocks in a super-block. Our simulations show that DCC nearly doubles the benefits of previous compressed caches with similar area overhead. We also demonstrate a practical DCC design based on a recent commercial LLC design.

Journal ArticleDOI
TL;DR: The evaluation and analysis present the performance bounds of in-network caching on NDN in terms of the practical constraints, such as the link cost, link capacity, and cache size.

Proceedings ArticleDOI
07 Oct 2013
TL;DR: HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications and outperforms LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
Abstract: Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as GPU cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of threads supported. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For cache sensitive CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can often tolerate increased memory access latency in the presence of LLC misses when there is sufficient thread-level parallelism. In this work, we propose Heterogeneous LLC Management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications. GPU LLC access throttling is achieved by allowing GPU threads that can tolerate longer memory access latencies to bypass the LLC. The latency tolerance of a GPU application is determined by the availability of thread-level parallelism, which can be measured at runtime as the average number of threads that are available for issuing. Our heterogeneous LLC management scheme outperforms LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
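
The core HeLM decision can be summarized as: bypass the shared LLC for GPU loads whenever the issuing core has enough ready warps to hide the extra latency, leaving LLC capacity to cache-sensitive CPU applications. A one-function sketch with simplified inputs (the threshold and return values are assumptions):

```python
def gpu_load_path(ready_warps, llc_hit, tlp_threshold=16):
    """Decide whether a GPU load should bypass the shared LLC based on the
    thread-level parallelism currently available on its core."""
    if ready_warps >= tlp_threshold:
        return "bypass"                    # latency-tolerant: go to memory
    return "llc_hit" if llc_hit else "llc_fill"

print(gpu_load_path(ready_warps=24, llc_hit=False))   # -> "bypass"
print(gpu_load_path(ready_warps=4, llc_hit=True))     # -> "llc_hit"
```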

Journal ArticleDOI
TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.