
Showing papers on "Cache algorithms published in 2013"


Journal ArticleDOI
TL;DR: This paper presents a comprehensive survey of state-of-the-art techniques aiming to address caching issues, with particular focus on reducing cache redundancy and improving the availability of cached content.

343 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: Through the analysis, light is shed on how modern hardware affects the implementation of data operators and the fastest implementation of radix join to date is provided, reaching close to 200 million tuples per second.
Abstract: The architectural changes introduced with multi-core CPUs have triggered a redesign of main-memory join algorithms. In the last few years, two diverging views have appeared. One approach advocates careful tailoring of the algorithm to the architectural parameters (cache sizes, TLB, and memory bandwidth). The other approach argues that modern hardware is good enough at hiding cache and TLB miss latencies and, consequently, the careful tailoring can be omitted without sacrificing performance. In this paper we demonstrate through experimental analysis of different algorithms and architectures that hardware still matters. Join algorithms that are hardware conscious perform better than hardware-oblivious approaches. The analysis and comparisons in the paper show that many of the claims regarding the behavior of join algorithms that have appeared in the literature are due to selection effects (relative table sizes, tuple sizes, the underlying architecture, using sorted data, etc.) and are not supported by experiments run under different parameter settings. Through the analysis, we shed light on how modern hardware affects the implementation of data operators and provide the fastest implementation of radix join to date, reaching close to 200 million tuples per second.
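
As a concrete illustration of the hardware-conscious side of this debate, the sketch below shows a toy single-pass radix partitioning step of the kind radix join builds on: the number of partitions is chosen so that each partition fits within cache/TLB reach. Real implementations use multiple passes, software-managed buffers, and architecture-specific tuning; the Python below only illustrates the idea.

```python
def radix_partition(keys, bits=10):
    """Toy single-pass radix partitioning: scatter integer keys into
    2**bits partitions using their low-order bits, so matching keys of
    both join relations land in the same (cache-sized) partition."""
    parts = [[] for _ in range(1 << bits)]
    mask = (1 << bits) - 1
    for k in keys:
        parts[k & mask].append(k)
    return parts
```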

265 citations


Proceedings ArticleDOI
03 Nov 2013
TL;DR: This paper instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses.
Abstract: This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers. We instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos. This permits us to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses. Our results (1) quantify the overall traffic percentages served by different layers: 65.5% browser cache, 20.0% Edge Cache, 4.6% Origin Cache, and 9.9% Backend storage, (2) reveal that a significant portion of photo requests are routed to remote PoPs and data centers as a consequence both of load-balancing and peering policy, (3) demonstrate the potential performance benefits of coordinating Edge Caches and adopting S4LRU eviction algorithms at both Edge and Origin layers, and (4) show that the popularity of photos is highly dependent on content age and conditionally dependent on the social-networking metrics we considered.
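
The abstract credits part of the potential gains to S4LRU eviction at the Edge and Origin layers. Below is a minimal Python sketch of a segmented LRU with four levels in the spirit of S4LRU; splitting capacity evenly across segments and the exact promotion/demotion bookkeeping are assumptions for illustration.

```python
from collections import OrderedDict

class S4LRU:
    """Segmented LRU with 4 levels: items enter at level 0, a hit promotes
    an item one level up (capped at the top level), and when a level
    overflows its least-recent item is demoted to the level below;
    overflow from level 0 is evicted from the cache."""

    def __init__(self, capacity, levels=4):
        self.per_level = max(1, capacity // levels)   # assumed even split
        self.segments = [OrderedDict() for _ in range(levels)]

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then admitted)."""
        for level, seg in enumerate(self.segments):
            if key in seg:
                seg.pop(key)
                self._insert(min(level + 1, len(self.segments) - 1), key)
                return True
        self._insert(0, key)
        return False

    def _insert(self, level, key):
        seg = self.segments[level]
        seg[key] = True                               # most-recently-used end
        while len(seg) > self.per_level:
            victim, _ = seg.popitem(last=False)       # least-recently-used end
            if level > 0:
                self._insert(level - 1, victim)       # demote one level down
            # a victim popped from level 0 is simply evicted
```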

225 citations


Patent
25 Jan 2013
TL;DR: In this paper, a de-duplication cache is configured to cache data for access by a plurality of different storage clients, such as virtual machines, and metadata pertaining to the contents of the cache may be persisted and/or transferred with respective storage clients.
Abstract: A de-duplication cache is configured to cache data for access by a plurality of different storage clients, such as virtual machines. A virtual machine may comprise a virtual machine de-duplication module configured to identify data for admission into the de-duplication cache. Data admitted into the de-duplication cache may be accessible by two or more storage clients. Metadata pertaining to the contents of the de-duplication cache may be persisted and/or transferred with respective storage clients such that the storage clients may access the contents of the de-duplication cache after rebooting, being power cycled, and/or being transferred between hosts.
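
A hypothetical sketch of the core idea, not the patent's design: blocks are keyed by a fingerprint of their contents so identical data admitted by different storage clients is stored once. The LRU eviction and the per-client mapping shown here are illustrative assumptions.

```python
import hashlib
from collections import OrderedDict

class DedupCache:
    """Content-addressed cache sketch: identical blocks written by different
    clients (e.g. VMs) share a single cache entry keyed by a content hash."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()        # fingerprint -> data (LRU order)
        self.client_maps = {}              # client_id -> {logical_addr: fingerprint}

    def admit(self, client_id, logical_addr, data):
        fp = hashlib.sha256(data).hexdigest()
        self.client_maps.setdefault(client_id, {})[logical_addr] = fp
        if fp in self.blocks:
            self.blocks.move_to_end(fp)    # already cached: deduplicated
            return
        self.blocks[fp] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def read(self, client_id, logical_addr):
        fp = self.client_maps.get(client_id, {}).get(logical_addr)
        if fp is not None and fp in self.blocks:
            self.blocks.move_to_end(fp)
            return self.blocks[fp]
        return None                        # miss: fetch from backing store
```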

223 citations


Proceedings ArticleDOI
09 Apr 2013
TL;DR: A complete framework to analyze and profile task memory access patterns and a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas are proposed.
Abstract: Multi-core architectures are shaking the fundamental assumption that in real-time systems the WCET, used to analyze the schedulability of the complete system, is calculated on individual tasks. This is not even true in an approximate sense in a modern multi-core chip, due to interference caused by hardware resource sharing. In this work we propose (1) a complete framework to analyze and profile task memory access patterns and (2) a novel kernel-level cache management technique to enforce an efficient and deterministic cache allocation of the most frequently accessed memory areas. In this way, we provide a powerful tool to address one of the main sources of interference in a system where the last level of cache is shared among two or more CPUs. The technique has been implemented on commercial hardware and our evaluations show that it can be used to significantly improve the predictability of a given set of critical tasks.

207 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.
Abstract: Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip.This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
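
A minimal sketch of the footprint idea: while a page is resident, record which of its blocks are touched; on eviction, remember that set and use it to fetch only the predicted blocks the next time the page is allocated. Indexing the history by page address alone is a simplification of the paper's predictor.

```python
class FootprintPredictor:
    """Track per-page footprints (touched block offsets) across residencies."""

    def __init__(self, blocks_per_page=64):
        self.blocks_per_page = blocks_per_page
        self.resident = {}    # page -> set of block offsets touched so far
        self.history = {}     # page -> footprint observed last residency

    def allocate(self, page):
        """Start a new residency; return the blocks to fetch from memory."""
        self.resident[page] = set()
        predicted = self.history.get(page, set(range(self.blocks_per_page)))
        return sorted(predicted)

    def touch(self, page, block_offset):
        if page in self.resident:
            self.resident[page].add(block_offset)

    def evict(self, page):
        """End the residency and remember the observed footprint."""
        self.history[page] = self.resident.pop(page, set())
```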

207 citations


Proceedings Article
14 Aug 2013
TL;DR: The results obtained exhibit the influence of cache size, line size, associativity, replacement policy, and coding style on the security of the executables and include the first formal proofs of security for implementations with countermeasures such as preloading and data-independent memory access patterns.
Abstract: We present CacheAudit, a versatile framework for the automatic, static analysis of cache side channels. Cache-Audit takes as input a program binary and a cache configuration, and it derives formal, quantitative security guarantees for a comprehensive set of side-channel adversaries, namely those based on observing cache states, traces of hits and misses, and execution times. Our technical contributions include novel abstractions to efficiently compute precise overapproximations of the possible side-channel observations for each of these adversaries. These approximations then yield upper bounds on the information that is revealed. In case studies we apply CacheAudit to binary executables of algorithms for symmetric encryption and sorting, obtaining the first formal proofs of security for implementations with countermeasures such as preloading and data-independent memory access patterns.

199 citations


Journal ArticleDOI
TL;DR: This paper investigates the problem of how to cache a set of media files with optimal streaming rates, under HTTP adaptive bit rate streaming over wireless networks, and finds there is a fundamental phase change in the optimal solution as the number of cached files grows.
Abstract: In this paper, we investigate the problem of optimal content cache management for HTTP adaptive bit rate (ABR) streaming over wireless networks. Specifically, in the media cloud, each content is transcoded into a set of media files with diverse playback rates, and appropriate files will be dynamically chosen in response to channel conditions and screen forms. Our design objective is to maximize the quality of experience (QoE) of an individual content for the end users, under a limited storage budget. Deriving a logarithmic QoE model from our experimental results, we formulate the individual content cache management for HTTP ABR streaming over wireless network as a constrained convex optimization problem. We adopt a two-step process to solve the snapshot problem. First, using the Lagrange multiplier method, we obtain the numerical solution of the set of playback rates for a fixed number of cache copies and characterize the optimal solution analytically. Our investigation reveals a fundamental phase change in the optimal solution as the number of cached files increases. Second, we develop three alternative search algorithms to find the optimal number of cached files, and compare their scalability under average and worst complexity metrics. Our numerical results suggest that, under optimal cache schemes, the maximum QoE measurement, i.e., mean-opinion-score (MOS), is a concave function of the allowable storage size. Our cache management can provide high expected QoE with low complexity, shedding light on the design of HTTP ABR streaming services over wireless networks.
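
Loosely following the two-step process described above, the sketch below solves the snapshot problem numerically for a fixed number of cached copies and then scans over that number. The logarithmic QoE form, the weights, and the plain linear outer search are illustrative assumptions; the paper derives the Lagrangian solution analytically and compares several search algorithms.

```python
import numpy as np
from scipy.optimize import minimize

def best_rates(m, budget, duration, p, r_min=0.2):
    """For m cached copies of one content, choose playback rates maximizing an
    assumed log-QoE weighted by the probability p[i] that copy i is served,
    subject to the storage budget (rate * duration per copy)."""
    def neg_qoe(r):
        return -np.sum(p[:m] * np.log(r / r_min))
    cons = ({'type': 'ineq', 'fun': lambda r: budget - duration * np.sum(r)},)
    r0 = np.full(m, budget / (duration * m))
    res = minimize(neg_qoe, r0, bounds=[(r_min, None)] * m, constraints=cons)
    return res.x, -res.fun

def best_num_copies(budget, duration, p, max_m=6):
    """Outer search over the number of cached copies (linear scan for clarity)."""
    m = max(range(1, max_m + 1),
            key=lambda k: best_rates(k, budget, duration, p)[1])
    return m, best_rates(m, budget, duration, p)[0]

# Example with made-up weights: 10 units of storage, 1-hour content.
p = np.array([0.4, 0.25, 0.15, 0.1, 0.06, 0.04])
m, rates = best_num_copies(budget=10.0, duration=1.0, p=p)
```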

186 citations


Proceedings ArticleDOI
12 Feb 2013
TL;DR: A novel buffer cache architecture is presented that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM and shows that this scheme improves I/O performance by 76% on average and up to 240% compared to the existing Linux buffer cache with ext4 without any loss of reliability.
Abstract: Journaling techniques are widely used in modern file systems as they provide high reliability and fast recovery from system failures. However, journaling reduces the performance benefit of buffer caching, as it accounts for a bulk of the storage writes in real system environments. In this paper, we present a novel buffer cache architecture that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM. Specifically, our buffer cache supports what we call the in-place commit scheme. This scheme avoids logging, but still provides the same journaling effect by simply altering the state of the cached block to frozen. As a frozen block still performs the function of caching, we show that in-place commit does not degrade cache performance. We implement our scheme on Linux 2.6.38 and measure the throughput and execution time of the scheme with various file I/O benchmarks. The results show that our scheme improves I/O performance by 76% on average and up to 240% compared to the existing Linux buffer cache with ext4 without any loss of reliability.
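
A toy sketch of the in-place commit idea: committing a transaction flips dirty blocks to a frozen state instead of writing journal copies, and frozen blocks keep serving reads until the file system persists them. Keeping frozen versions in a separate dictionary is an illustrative simplification, not the paper's mechanism.

```python
class NVBufferCache:
    """Non-volatile buffer cache sketch where commit is a state change only."""

    def __init__(self):
        self.working = {}     # block_no -> latest (possibly dirty) data
        self.dirty = set()    # block numbers written since the last commit
        self.frozen = {}      # block_no -> committed data awaiting writeback

    def write(self, block_no, data):
        self.working[block_no] = data
        self.dirty.add(block_no)

    def commit(self):
        """In-place commit: no log writes, just mark current versions frozen."""
        for block_no in self.dirty:
            self.frozen[block_no] = self.working[block_no]
        self.dirty.clear()

    def read(self, block_no):
        return self.working.get(block_no, self.frozen.get(block_no))

    def writeback_done(self, block_no):
        """Called once the file system has persisted the committed version."""
        self.frozen.pop(block_no, None)
```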

171 citations


Proceedings ArticleDOI
12 Aug 2013
TL;DR: This paper designs five different hash-routing schemes which efficiently exploit in-network caches without requiring network routers to maintain per-content state information and shows that such schemes can increase cache hits by up to 31% in comparison to on-path caching, with minimal impact on the traffic dynamics of intra-domain links.
Abstract: Hash-routing has been proposed in the past as a mapping mechanism between object requests and cache clusters within enterprise networks. In this paper, we revisit hash-routing techniques and apply them to Information-Centric Networking (ICN) environments, where network routers have cache space readily available. In particular, we investigate whether hash-routing is a viable and efficient caching approach when applied outside enterprise networks, but within the boundaries of a domain. We design five different hash-routing schemes which efficiently exploit in-network caches without requiring network routers to maintain per-content state information. We evaluate the proposed hash-routing schemes using extensive simulations over real Internet domain topologies and compare them against various on-path caching mechanisms. We show that such schemes can increase cache hits by up to 31% in comparison to on-path caching, with minimal impact on the traffic dynamics of intra-domain links.
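
The simplest of these schemes maps each content name to one responsible in-domain cache by hashing, so routers need no per-content state. A minimal sketch (the hash function and modulo mapping are illustrative choices):

```python
import hashlib

def responsible_cache(content_name, cache_nodes):
    """Hash the content name to pick the one in-domain cache responsible for
    it; requests are forwarded there first, and only on a miss do they leave
    the domain toward the origin."""
    digest = hashlib.md5(content_name.encode()).hexdigest()
    return cache_nodes[int(digest, 16) % len(cache_nodes)]

caches = ["r1", "r3", "r7", "r9"]
print(responsible_cache("/videos/clip42/segment3", caches))
```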

142 citations


Journal ArticleDOI
TL;DR: This paper focuses on cache pollution attacks, where the adversary's goal is to disrupt cache locality to increase link utilization and cache misses for honest consumers, and illustrates that existing proactive countermeasures are ineffective against realistic adversaries.

Patent
17 Sep 2013
TL;DR: A hop-count-based content caching scheme is proposed in which a routing node first judges whether to cache a received content chunk based on its attributes and then caches it with probability 1/hop-count, reducing network traffic.
Abstract: Disclosed is hop-count-based content caching. The present invention implements hop-count-based content cache placement strategies that efficiently decrease network traffic: the routing node first judges whether to cache a content chunk by examining the attributes of the received chunk; it then makes a second, probabilistic judgment to cache the chunk with probability 1/hop-count; and when the chunk passes this second judgment, the content chunk and its hop-count information are stored in the routing node's cache memory.
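
A hypothetical sketch of the two-stage decision described in the claim; the attribute check and field names are stand-ins for illustration.

```python
import random

def maybe_cache(chunk, hop_count, cache):
    """Two-stage caching decision: an attribute-based primary judgment,
    then a probabilistic secondary judgment with probability 1/hop_count.
    The cached entry keeps the hop-count information alongside the data."""
    if not chunk.get("cacheable", True):              # primary judgment
        return False
    if random.random() < 1.0 / max(1, hop_count):     # secondary judgment
        cache[chunk["name"]] = {"data": chunk["data"], "hops": hop_count}
        return True
    return False
```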

Proceedings ArticleDOI
01 May 2013
TL;DR: This work focuses on the cache allocation problem, namely how to distribute cache capacity across routers under a constrained total storage budget for the network; it formulates this as a content placement problem and obtains the exact optimal solution by a two-step method.
Abstract: Content-Centric Networking (CCN) is a promising framework for evolving the current network architecture, advocating ubiquitous in-network caching to enhance content delivery. Consequently, in CCN, each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem: namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We formulate this problem as a content placement problem and obtain the exact optimal solution by a two-step method. Through simulations, we use this algorithm to investigate the factors that affect the optimal cache allocation in CCN, such as the network topology and the popularity of content. We find that a highly heterogeneous topology tends to put most of the capacity over a few central nodes. On the other hand, heterogeneous content popularity has the opposite effect, by spreading capacity across far more nodes. Using our findings, we make observations on how network operators could best deploy CCN cache capacity.

Proceedings ArticleDOI
20 May 2013
TL;DR: This work obtains the first communication-optimal algorithm for all dimensions of rectangular matrices by combining the dimension-splitting technique with the recursive BFS/DFS approach, and shows significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
Abstract: Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
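
The sketch below shows the dimension-splitting recursion in sequential form: always split the largest of the three matrix dimensions in half and recurse. The BFS/DFS parallel scheduling and the communication-cost analysis that make the actual algorithm communication-optimal are omitted.

```python
import numpy as np

def split_largest_multiply(A, B, C, cutoff=64):
    """Cache-oblivious-style recursive matmul: split the largest of (m, n, k)
    and recurse; below the cutoff, fall back to a plain product. C is
    accumulated in place (slices are numpy views)."""
    m, k = A.shape
    _, n = B.shape
    if max(m, n, k) <= cutoff:
        C += A @ B
        return
    if m >= n and m >= k:                 # split rows of A and C
        h = m // 2
        split_largest_multiply(A[:h], B, C[:h], cutoff)
        split_largest_multiply(A[h:], B, C[h:], cutoff)
    elif n >= k:                          # split columns of B and C
        h = n // 2
        split_largest_multiply(A, B[:, :h], C[:, :h], cutoff)
        split_largest_multiply(A, B[:, h:], C[:, h:], cutoff)
    else:                                 # split the shared dimension k
        h = k // 2
        split_largest_multiply(A[:, :h], B[:h], C, cutoff)
        split_largest_multiply(A[:, h:], B[h:], C, cutoff)

A, B = np.random.rand(200, 300), np.random.rand(300, 150)
C = np.zeros((200, 150))
split_largest_multiply(A, B, C)
assert np.allclose(C, A @ B)
```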

Proceedings ArticleDOI
23 Feb 2013
TL;DR: By adopting i2WAP, a new cache management policy that can reduce both inter- and intra-set write variations, this work can improve the lifetime of on-chip non-volatile caches by 75% on average and up to 224%.
Abstract: Modern computers require large on-chip caches, but the scalability of traditional SRAM and eDRAM caches is constrained by leakage and cell density. Emerging non-volatile memory (NVM) is a promising alternative to build large on-chip caches. However, limited write endurance is a common problem for non-volatile memory technologies. In addition, today's cache management might result in unbalanced write traffic to cache blocks causing heavily-written cache blocks to fail much earlier than others. Unfortunately, existing wear-leveling techniques for NVM-based main memories cannot be simply applied to NVM-based on-chip caches because cache writes have intra-set variations as well as inter-set variations. To solve this problem, we propose i2WAP, a new cache management policy that can reduce both inter- and intra-set write variations. i2WAP has two features: (1) Swap-Shift, an enhancement based on previous main memory wear-leveling to reduce cache inter-set write variations; (2) Probabilistic Set Line Flush, a novel technique to reduce cache intra-set write variations. Implementing i2WAP only needs two global counters and two global registers. By adopting i2WAP, we can improve the lifetime of on-chip non-volatile caches by 75% on average and up to 224%.

Proceedings ArticleDOI
09 Jul 2013
TL;DR: A practical OS-level cache management scheme for multi-core real-time systems that provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set is proposed.
Abstract: Many modern multi-core processors sport a large shared cache with the primary goal of enhancing the statistic performance of computing workloads. However, due to resulting cache interference among tasks, the uncontrolled use of such a shared cache can significantly hamper the predictability and analyzability of multi-core real-time systems. Software cache partitioning has been considered as an attractive approach to address this issue because it does not require any hardware support beyond that available on many modern processors. However, the state-of-the-art software cache partitioning techniques face two challenges: (1) the memory co-partitioning problem, which results in page swapping or waste of memory, and (2) the availability of a limited number of cache partitions, which causes degraded performance. These are major impediments to the practical adoption of software cache partitioning. In this paper, we propose a practical OS-level cache management scheme for multi-core real-time systems. Our scheme provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set. We have implemented and evaluated our scheme in Linux/RK running on the Intel Core i7 quad-core processor. Experimental results indicate that, compared to the traditional approaches, our scheme is up to 39% more memory space efficient and consumes up to 25% less cache partitions while maintaining cache predictability. Our scheme also yields a significant utilization benefit that increases with the number of tasks.
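
Software cache partitioning schemes like this one build on page coloring. The sketch below shows only the basic color arithmetic; the cache geometry is an assumed example, and the paper's contributions (avoiding memory co-partitioning and coping with few partitions) are not represented.

```python
def cache_color(phys_addr, line_size=64, num_sets=8192, page_size=4096):
    """Return the cache 'color' of a physical address: the bits of the set
    index that lie above the page offset. Pages of the same color compete
    for the same cache sets, so an OS allocator can confine each task (or
    partition) to a subset of colors."""
    set_index = (phys_addr // line_size) % num_sets
    sets_per_page = page_size // line_size
    return set_index // sets_per_page

# With 64B lines, 8192 sets and 4KiB pages there are 8192/64 = 128 colors.
print(cache_color(0x12345000))
```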

Journal ArticleDOI
TL;DR: This paper presents an autonomic cache management approach for ICNs, where distributed managers residing in cache-enabled nodes decide which information items to cache, and proposes four on-line intra-domain cache management algorithms with different levels of autonomicity.
Abstract: The main promise of current research efforts in the area of Information-Centric Networking (ICN) architectures is to optimize the dissemination of information within transient communication relationships of endpoints. Efficient caching of information is key to delivering on this promise. In this paper, we look into achieving this promise from the angle of managed replication of information. Management decisions are made in order to efficiently place replicas of information in dedicated storage devices attached to nodes of the network. In contrast to traditional off-line external management systems, we adopt a distributed autonomic management architecture where management intelligence is placed inside the network. Particularly, we present an autonomic cache management approach for ICNs, where distributed managers residing in cache-enabled nodes decide which information items to cache. We propose four on-line intra-domain cache management algorithms with different levels of autonomicity and compare them with respect to performance, complexity, execution time and message exchange overhead. Additionally, we derive a lower bound of the overall network traffic cost for a certain category of network topologies. Our extensive simulations, using realistic network topologies and synthetic workload generators, signify the importance of network-wide knowledge and cooperation.

Proceedings ArticleDOI
14 Apr 2013
TL;DR: This work demonstrates that certain cache networks are non-ergodic in that their steady-state characterization depends on the initial state of the system, and establishes several important properties of cache networks, in the form of three independently-sufficient conditions for a cache network to comprise a single ergodic component.
Abstract: Over the past few years Content-Centric Networking, a networking model in which host-to-content communication protocols are introduced, has been gaining much attention. A central component of such an architecture is a large-scale interconnected caching system. To date, the way these Cache Networks operate and perform is still poorly understood. In this work, we demonstrate that certain cache networks are non-ergodic in that their steady-state characterization depends on the initial state of the system. We then establish several important properties of cache networks, in the form of three independently-sufficient conditions for a cache network to comprise a single ergodic component. Each property targets a different aspect of the system - topology, admission control and cache replacement policies. Perhaps most importantly we demonstrate that cache replacement can be grouped into equivalence classes, such that the ergodicity (or lack-thereof) of one policy implies the same property holds for all policies in the class.

Proceedings ArticleDOI
06 May 2013
TL;DR: Lazy Adaptive Replacement Cache (LARC), a novel cache management algorithm for flash-based disk caches, filters out seldom-accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: The increasing popularity of flash memory has changed storage systems. Flash-based solid state drives (SSDs) are now widely deployed as caches for magnetic hard disk drives (HDDs) to speed up data-intensive applications. However, existing cache algorithms focus exclusively on performance improvements and ignore the write endurance of SSDs. In this paper, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC filters out seldom-accessed blocks and prevents them from entering the cache. This avoids cache pollution and keeps popular blocks in the cache for a longer period of time, leading to a higher hit rate. Meanwhile, LARC reduces the number of cache replacements and thus incurs less write traffic to the SSD, especially for read-dominant workloads. In this way, LARC improves performance and extends SSD lifetime at the same time. LARC is self-tuning and low-overhead. It has been extensively evaluated by both trace-driven simulations and a prototype implementation in flashcache. Our experiments show that LARC outperforms state-of-the-art algorithms and reduces write traffic to the SSD by up to 94.5% for read-dominant workloads and by 11.2-40.8% for write-dominant workloads.
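
A minimal sketch of LARC's admission idea: a block enters the SSD cache only on its second recent access, with first-time accesses recorded in a ghost queue of block IDs. The paper's self-tuning of the ghost-queue size is omitted here; a fixed fraction is assumed.

```python
from collections import OrderedDict

class LARC:
    """Lazy admission: one-shot blocks never pollute the cache or cost an
    SSD write, because admission requires a recent prior access recorded
    in the ghost queue (which stores IDs only, no data)."""

    def __init__(self, capacity, ghost_fraction=0.1):
        self.capacity = capacity
        self.ghost_capacity = max(1, int(capacity * ghost_fraction))
        self.cache = OrderedDict()   # block_id -> data (on SSD)
        self.ghost = OrderedDict()   # block_id -> None (admission candidates)

    def access(self, block_id, data):
        if block_id in self.cache:                 # hit: just update recency
            self.cache.move_to_end(block_id)
            return True
        if block_id in self.ghost:                 # second access: admit now
            self.ghost.pop(block_id)
            self.cache[block_id] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)     # evict LRU block
        else:                                      # first access: remember ID
            self.ghost[block_id] = None
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return False
```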

Proceedings ArticleDOI
18 Mar 2013
TL;DR: This work proposes DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy, and proposes pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM.
Abstract: Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach -- shift based write -- that offers a fast and energy-efficient alternative to performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, which are tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves 8.2X improvement in energy and 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around 1.6X improvement in both area and energy under iso-performance conditions.

Proceedings ArticleDOI
18 Mar 2013
TL;DR: A novel parametric random placement suitable for PTA is proposed that is proven to have low hardware complexity and energy consumption while providing comparable performance to that of conventional modulo placement.
Abstract: Caches provide significant performance improvements, though their use in real-time industry is low because current WCET analysis tools require detailed knowledge of program's cache accesses to provide tight WCET estimates. Probabilistic Timing Analysis (PTA) has emerged as a solution to reduce the amount of information needed to provide tight WCET estimates, although it imposes new requirements on hardware design. At cache level, so far only fully-associative random-replacement caches have been proven to fulfill the needs of PTA, but they are expensive in size and energy. In this paper we propose a cache design that allows set-associative and direct-mapped caches to be analysed with PTA techniques. In particular we propose a novel parametric random placement suitable for PTA that is proven to have low hardware complexity and energy consumption while providing comparable performance to that of conventional modulo placement.
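
The gist of parametric random placement is that the set index is a hash of the line address and a random seed (re-drawn, for example, at each program run), so which lines conflict becomes a random event that probabilistic timing analysis can reason about. The hash below is an arbitrary software stand-in for the paper's low-cost hardware function.

```python
def random_placement_index(addr, seed, num_sets, line_size=64):
    """Map a memory address to a cache set using an address/seed hash
    rather than the conventional modulo placement."""
    line_addr = addr // line_size
    h = (line_addr * 2654435761 ^ seed) & 0xFFFFFFFF   # cheap integer mix
    return h % num_sets

# Same address, different per-run seeds -> (probably) different sets.
print(random_placement_index(0x4000_1234, seed=17, num_sets=256))
print(random_placement_index(0x4000_1234, seed=99, num_sets=256))
```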

Patent
05 Dec 2013
TL;DR: In this article, a cache and/or storage module may be configured to reduce write amplification in a cache storage, which may occur due to an over-permissive admission policy, or it may arise due to the write-once properties of the storage medium.
Abstract: A cache and/or storage module may be configured to reduce write amplification in a cache storage. Cache layer write amplification (CLWA) may occur due to an over-permissive admission policy. The cache module may be configured to reduce CLWA by configuring admission policies to avoid unnecessary writes. Admission policies may be predicated on access and/or sequentiality metrics. Flash layer write amplification (FLWA) may arise due to the write-once properties of the storage medium. FLWA may be reduced by delegating cache eviction functionality to the underlying storage layer. The cache and storage layers may be configured to communicate coordination information, which may be leveraged to improve the performance of cache and/or storage operations.
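
One example of an admission policy predicated on a sequentiality metric, as mentioned above: long sequential runs are treated as streaming I/O and kept out of the flash cache, since caching them would mostly add write amplification. The detector and threshold are illustrative assumptions, not the patent's method.

```python
class SequentialFilter:
    """Admit a block into the flash cache only if it does not extend a long
    sequential run (a crude single-stream sequentiality detector)."""

    def __init__(self, max_sequential_blocks=64):
        self.max_seq = max_sequential_blocks
        self.last_block = None
        self.run_length = 0

    def admit(self, block_no):
        if self.last_block is not None and block_no == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1
        self.last_block = block_no
        return self.run_length <= self.max_seq   # reject streaming I/O
```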

Proceedings ArticleDOI
17 Jun 2013
TL;DR: An intuitive performance model for cache-coherent architectures is developed and used to develop several optimal and optimized algorithms for complex parallel data exchanges that beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries.
Abstract: Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, complicating the understanding of performance properties. We developed an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms that were developed with the model beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to satisfy the tradeoff between complexity of algorithm design and accuracy. We expect that our model can serve as a vehicle for advanced algorithm design.

Proceedings ArticleDOI
18 Nov 2013
TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
Abstract: Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or the compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that compared to cache-all and bypass-all solutions, our techniques can achieve considerable performance improvement.
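
A toy selection pass in the spirit of this framework: estimate per-load reuse from profile data and mark low-reuse loads for bypass so the compiler can emit them with a cache-bypassing load modifier. The metric and threshold are assumptions, not the paper's algorithm.

```python
def select_bypass_loads(load_profiles, reuse_threshold=1.0):
    """Given per-load profiles with estimated line reuses and lines touched,
    decide for each global load whether to cache or bypass."""
    decisions = {}
    for load_id, prof in load_profiles.items():
        reuses_per_line = prof["reuses"] / max(1, prof["lines_touched"])
        decisions[load_id] = "bypass" if reuses_per_line < reuse_threshold else "cache"
    return decisions

# Example with made-up profile numbers for two load instructions.
profiles = {"ld_12": {"reuses": 5, "lines_touched": 100},
            "ld_37": {"reuses": 300, "lines_touched": 80}}
print(select_bypass_loads(profiles))   # ld_12 -> bypass, ld_37 -> cache
```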

Journal ArticleDOI
TL;DR: Analytical models for characterising cache performance trends at storage cache nodes are presented and have potential for guiding efficient resource allocations during initial deployments of the storage cloud infrastructure and timely interventions during operation in order to achieve scalable and resilient service delivery.
Abstract: With the growing popularity of cloud-based data centres as the enterprise IT platform of choice, there is a need for effective management strategies capable of maintaining performance within SLA and QoS parameters when responding to dynamic conditions such as increasing demand. Since current management approaches in the cloud infrastructure, particularly for data-intensive applications, lack the ability to systematically quantify performance trends, static approaches are largely employed in the allocations of resources when dealing with volatile demand in the infrastructure. We present analytical models for characterising cache performance trends at storage cache nodes. Practical validations of cache performance for derived theoretical trends show close approximations between modelled characterisations and measurement results for user request patterns involving private datasets and publicly available datasets. The models are extended to encompass hybrid scenarios based on concurrent requests of both private and public content. Our models have potential for guiding (a) efficient resource allocations during initial deployments of the storage cloud infrastructure and (b) timely interventions during operation in order to achieve scalable and resilient service delivery.

Proceedings ArticleDOI
Tian Luo, Siyuan Ma, Rubao Lee, Xiaodong Zhang, Deng Liu, Li Zhou
07 Oct 2013
TL;DR: The design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices is presented.
Abstract: A unique challenge for SSD storage caching management in a virtual machine (VM) environment is to accomplish the dual objectives: maximizing utilization of shared SSD cache devices and ensuring performance isolation among VMs. In this paper, we present our design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices. Due to a hypervisor's unique position between VMs and hardware resources, S-CAVE does not require any modification to guest OSes, user applications, or the underlying storage system. A critical issue to address in S-CAVE is how to allocate limited and shared SSD cache space among multiple VMs to achieve the dual goals. This is accomplished in two steps. First, we propose an effective metric to determine the demand for SSD cache space of each VM. Next, by incorporating this cache demand information into a dynamic control mechanism, S-CAVE is able to efficiently provide a fair share of cache space to each VM while achieving the goal of best utilizing the shared SSD cache device. In accordance with the constraints of all the functionalities of a hypervisor, S-CAVE incurs minimum overhead in both memory space and computing time. We have implemented S-CAVE in vSphere ESX, a widely used commercial hypervisor from VMWare. Our extensive experiments have shown its strong effectiveness for various data-intensive applications.
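
An illustrative allocation step reflecting the dual goals described above: each VM keeps a small guaranteed share of the SSD cache for isolation, and the remainder is divided in proportion to a measured per-VM demand metric. The metric and the guarantees here are assumptions, not S-CAVE's exact mechanism.

```python
def allocate_cache_shares(total_blocks, demand, min_share=0.05):
    """Split a shared SSD cache among VMs: a guaranteed minimum per VM plus
    a demand-proportional share of the remaining blocks."""
    guaranteed = int(total_blocks * min_share)
    remaining = max(0, total_blocks - guaranteed * len(demand))
    total_demand = sum(demand.values()) or 1
    return {vm: guaranteed + int(remaining * d / total_demand)
            for vm, d in demand.items()}

# Example with made-up demand scores per VM.
print(allocate_cache_shares(10000, {"vm1": 3.0, "vm2": 1.0, "vm3": 0.5}))
```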

Proceedings ArticleDOI
07 Dec 2013
TL;DR: The Decoupled Compressed Cache (DCC) is proposed, which exploits spatial locality to improve both the performance and energy-efficiency of cache compression and nearly doubles the benefits of previous compressed caches with similar area overhead.
Abstract: In multicore processor systems, last-level caches (LLCs) play a crucial role in reducing system energy by i) filtering out expensive accesses to main memory and ii) reducing the time spent executing in high-power states. Cache compression can increase effective cache capacity and reduce misses, improve performance, and potentially reduce system energy. However, previous compressed cache designs have demonstrated only limited benefits due to internal fragmentation and limited tags. In this paper, we propose the Decoupled Compressed Cache (DCC), which exploits spatial locality to improve both the performance and energy-efficiency of cache compression. DCC uses decoupled super-blocks and non-contiguous sub-block allocation to decrease tag overhead without increasing internal fragmentation. Non-contiguous sub-blocks also eliminate the need for energy-expensive re-compaction when a block's size changes. Compared to earlier compressed caches, DCC increases normalized effective capacity to a maximum of 4 and an average of 2.2 for a wide range of workloads. A further optimized Co-DCC (Co-Compacted DCC) design improves the average normalized effective capacity to 2.6 by co-compacting the compressed blocks in a super-block. Our simulations show that DCC nearly doubles the benefits of previous compressed caches with similar area overhead. We also demonstrate a practical DCC design based on a recent commercial LLC design.

Journal ArticleDOI
TL;DR: The evaluation and analysis present the performance bounds of in-network caching on NDN in terms of the practical constraints, such as the link cost, link capacity, and cache size.

Proceedings ArticleDOI
07 Oct 2013
TL;DR: HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications and outperforms LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
Abstract: Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as GPU cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of threads supported. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For cache sensitive CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can often tolerate increased memory access latency in the presence of LLC misses when there is sufficient thread-level parallelism. In this work, we propose Heterogeneous LLC Management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications. GPU LLC access throttling is achieved by allowing GPU threads that can tolerate longer memory access latencies to bypass the LLC. The latency tolerance of a GPU application is determined by the availability of thread-level parallelism, which can be measured at runtime as the average number of threads that are available for issuing. Our heterogeneous LLC management scheme outperforms LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
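
The core HeLM decision can be summarized as: bypass the shared LLC for GPU loads whenever the issuing core has enough ready warps to hide the extra latency, leaving LLC capacity to cache-sensitive CPU applications. A one-function sketch with simplified inputs (the threshold and return values are assumptions):

```python
def gpu_load_path(ready_warps, llc_hit, tlp_threshold=16):
    """Decide whether a GPU load should bypass the shared LLC based on the
    thread-level parallelism currently available on its core."""
    if ready_warps >= tlp_threshold:
        return "bypass"                    # latency-tolerant: go to memory
    return "llc_hit" if llc_hit else "llc_fill"

print(gpu_load_path(ready_warps=24, llc_hit=False))   # -> "bypass"
print(gpu_load_path(ready_warps=4, llc_hit=True))     # -> "llc_hit"
```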

Journal ArticleDOI
TL;DR: This work introduces a multicore-oblivious (MO) approach to algorithms and schedulers for HM, and presents efficient MO algorithms for several fundamental problems including matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, and connected components.