Showing papers on "Cache published in 2012"

PDF

Open Access

Proceedings Article•DOI•

Workload analysis of a large-scale key-value store

[...]

Berk Atikoglu¹, Yuehai Xu², Eitan Frachtenberg³, Song Jiang², Mike Paleczny³ - Show less +1 more•Institutions (3)

Stanford University¹, Wayne State University², Facebook³

11 Jun 2012

TL;DR: This paper collects detailed traces from Facebook's Memcached deployment, arguably the world's largest, and analyzes the workloads from multiple angles, including: request composition, size, and rate; cache efficacy; temporal patterns; and application use cases.

...read moreread less

Abstract: Key-value stores are a vital component in many scale-out enterprises, including social networks, online retail, and risk analysis. Accordingly, they are receiving increased attention from the research community in an effort to improve their performance, scalability, reliability, cost, and power consumption. To be effective, such efforts require a detailed understanding of realistic key-value workloads. And yet little is known about these workloads outside of the companies that operate them. This paper aims to address this gap.To this end, we have collected detailed traces from Facebook's Memcached deployment, arguably the world's largest. The traces capture over 284 billion requests from five different Memcached use cases over several days. We analyze the workloads from multiple angles, including: request composition, size, and rate; cache efficacy; temporal patterns; and application use cases. We also propose a simple model of the most representative trace to enable the generation of more realistic synthetic workloads by the community.Our analysis details many characteristics of the caching workload. It also reveals a number of surprises: a GET/SET ratio of 30:1 that is higher than assumed in the literature; some applications of Memcached behave more like persistent storage than a cache; strong locality metrics, such as keys accessed many millions of times a day, do not always suffice for a high hit rate; and there is still room for efficiency and hit rate improvements in Memcached's implementation. Toward the last point, we make several suggestions that address the exposed deficiencies.

...read moreread less

880 citations

Proceedings Article•DOI•

Probabilistic in-network caching for information-centric networks

[...]

Ioannis Psaras¹, Wei Koong Chai¹, George Pavlou¹•Institutions (1)

University College London¹

17 Aug 2012

TL;DR: The results show reduction of up to 20% in server hits, and up to 10% in the number of hops required to hit cached contents, but, most importantly, reduction of cache-evictions by an order of magnitude in comparison to universal caching.

...read moreread less

Abstract: In-network caching necessitates the transformation of centralised operations of traditional, overlay caching techniques to a decentralised and uncoordinated environment. Given that caching capacity in routers is relatively small in comparison to the amount of forwarded content, a key aspect is balanced distribution of content among the available caches. In this paper, we are concerned with decentralised, real-time distribution of content in router caches. Our goal is to reduce caching redundancy and in turn, make more efficient utilisation of available cache resources along a delivery path.Our in-network caching scheme, called ProbCache, approximates the caching capability of a path and caches contents probabilistically in order to: i) leave caching space for other flows sharing (part of) the same path, and ii) fairly multiplex contents of different flows among caches of a shared path.We compare our algorithm against universal caching and against schemes proposed in the past for Web-Caching architectures, such as Leave Copy Down (LCD). Our results show reduction of up to 20% in server hits, and up to 10% in the number of hops required to hit cached contents, but, most importantly, reduction of cache-evictions by an order of magnitude in comparison to universal caching.

...read moreread less

615 citations

Posted Content•

Fundamental Limits of Caching

[...]

Mohammad Ali Maddah-Ali¹, Urs Niesen¹•Institutions (1)

Bell Labs¹

26 Sep 2012-arXiv: Information Theory

TL;DR: In this article, the authors proposed a coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared to previously known schemes, in particular the improvement can be on the order of the number of users in the network.

...read moreread less

Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content into memories at the end users. Conventionally, these memories are used to deliver requested content in part from a locally cached copy rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e, the memory available at each individual user). In this paper, we introduce and exploit a second, global, caching gain not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative memory available at all users), even though there is no cooperation among the users. To evaluate and isolate these two gains, we introduce an information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, we propose a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared to previously known schemes. In particular, the improvement can be on the order of the number of users in the network. Moreover, we argue that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.

...read moreread less

581 citations

Proceedings Article•DOI•

Cache craftiness for fast multicore key-value storage

[...]

Yandong Mao¹, Eddie Kohler², Robert Morris¹•Institutions (2)

Massachusetts Institute of Technology¹, Harvard University²

10 Apr 2012

TL;DR: This work presents Masstree, a fast key-value database designed for SMP machines, which is comparable to that of memcached, a non-persistent hash table server, and higher than that of VoltDB, MongoDB, and Redis.

...read moreread less

Abstract: We present Masstree, a fast key-value database designed for SMP machines. Masstree keeps all data in memory. Its main data structure is a trie-like concatenation of B+-trees, each of which handles a fixed-length slice of a variable-length key. This structure effectively handles arbitrary-length possiblybinary keys, including keys with long shared prefixes. +-tree fanout was chosen to minimize total DRAM delay when descending the tree and prefetching each tree node. Lookups use optimistic concurrency control, a read-copy-update-like technique, and do not write shared data structures; updates lock only affected nodes. Logging and checkpointing provide consistency and durability. Though some of these ideas appear elsewhere, Masstree is the first to combine them. We discuss design variants and their consequences.On a 16-core machine, with logging enabled and queries arriving over a network, Masstree executes more than six million simple queries per second. This performance is comparable to that of memcached, a non-persistent hash table server, and higher (often much higher) than that of VoltDB, MongoDB, and Redis.

...read moreread less

412 citations

Proceedings Article•DOI•

Cache-Conscious Wavefront Scheduling

[...]

Timothy G. Rogers¹, Mike O'Connor², Tor M. Aamodt¹•Institutions (2)

University of British Columbia¹, Advanced Micro Devices²

01 Dec 2012

TL;DR: This paper proposes Cache-Conscious Wave front Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wave front locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.

...read moreread less

Abstract: This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wave front Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wave front locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wave front locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average 25% fewer L1 data cache misses which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.

...read moreread less

408 citations

Book Chapter•DOI•

Cache less for more in information-centric networks

[...]

Wei Koong Chai¹, Diliang He¹, Ioannis Psaras¹, George Pavlou¹•Institutions (1)

University College London¹

21 May 2012

TL;DR: A centrality-based caching algorithm is proposed by exploiting the concept of (ego network) betweenness centrality to improve the caching gain and eliminate the uncertainty in the performance of the simplistic random caching strategy.

...read moreread less

Abstract: Ubiquitous in-network caching is one of the key aspects of information-centric networking (ICN) which has recently received widespread research interest. In one of the key relevant proposals known as Networking Named Content (NNC), the premise is that leveraging in-network caching to store content in every node it traverses along the delivery path can enhance content delivery. We question such indiscriminate universal caching strategy and investigate whether caching less can actually achieve more . Specifically, we investigate if caching only in a subset of node(s) along the content delivery path can achieve better performance in terms of cache and server hit rates. In this paper, we first study the behavior of NNC's ubiquitous caching and observe that even naive random caching at one intermediate node within the delivery path can achieve similar and, under certain conditions, even better caching gain. We propose a centrality-based caching algorithm by exploiting the concept of (ego network) betweenness centrality to improve the caching gain and eliminate the uncertainty in the performance of the simplistic random caching strategy. Our results suggest that our solution can consistently achieve better gain across both synthetic and real network topologies that have different structural properties.

...read moreread less

360 citations

Proceedings Article•DOI•

[...]

Kideok Cho¹, Munyoung Lee¹, Kunwoo Park¹, Ted Taekyoung Kwon¹, Yanghee Choi¹, Sangheon Pack² - Show less +2 more•Institutions (2)

Seoul National University¹, Korea University²

25 Mar 2012

TL;DR: This paper proposes a content caching scheme, WAVE, in which the number of chunks to be cached is adjusted based on the popularity of the content, which achieves higher cache hit ratio and fewer frequent cache replacements than other on-demand caching strategies.

...read moreread less

Abstract: In content-oriented networking, content files are typically cached in network nodes, and hence how to cache content files is crucial for the efficient content delivery and cache storage utilization. In this paper, we propose a content caching scheme, WAVE, in which the number of chunks to be cached is adjusted based on the popularity of the content. In WAVE, an upstream node recommends the number of chunks to be cached at its downstream node, which is exponentially increased as the request count increases. Simulation results reveal that the average hop count of content delivery of WAVE is lower than other schemes, and the inter-ISP traffic volume of WAVE is the second lowest (CDN is the lowest). Also, WAVE achieves higher cache hit ratio and fewer frequent cache replacements than other on-demand caching strategies.

...read moreread less

349 citations

Proceedings Article•DOI•

Base-delta-immediate compression: practical data compression for on-chip caches

[...]

Gennady Pekhimenko¹, Vivek Seshadri¹, Onur Mutlu¹, Michael Kozuch², Phillip B. Gibbons², Todd C. Mowry¹ - Show less +2 more•Institutions (2)

Carnegie Mellon University¹, Intel²

19 Sep 2012

TL;DR: There is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

...read moreread less

Abstract: Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

...read moreread less

348 citations

Proceedings Article•

STEALTHMEM: system-level protection against cache-based side channel attacks in the cloud

[...]

Taesoo Kim¹, Marcus Peinado², Gloria Mainar-Ruiz²•Institutions (2)

Massachusetts Institute of Technology¹, Microsoft²

08 Aug 2012

TL;DR: STEALTHMEM is presented, a system-level protection mechanism against cache-based side channel attacks in the cloud and a novel idea and prototype for isolating cache lines while fully utilizing memory by exploiting architectural properties of set-associative caches.

...read moreread less

Abstract: Cloud services are rapidly gaining adoption due to the promises of cost efficiency, availability, and on-demand scaling. To achieve these promises, cloud providers share physical resources to support multi-tenancy of cloud platforms. However, the possibility of sharing the same hardware with potential attackers makes users reluctant to offload sensitive data into the cloud. Worse yet, researchers have demonstrated side channel attacks via shared memory caches to break full encryption keys of AES, DES, and RSA. We present STEALTHMEM, a system-level protection mechanism against cache-based side channel attacks in the cloud. STEALTHMEM manages a set of locked cache lines per core, which are never evicted from the cache, and efficiently multiplexes them so that each VM can load its own sensitive data into the locked cache lines. Thus, any VM can hide memory access patterns on confidential data from other VMs. Unlike existing state-of-the-art mitigation methods, STEALTHMEM works with existing commodity hardware and does not require profound changes to application software. We also present a novel idea and prototype for isolating cache lines while fully utilizing memory by exploiting architectural properties of set-associative caches. STEALTHMEM imposes 5.9% of performance overhead on the SPEC 2006 CPU benchmark, and between 2% and 5% overhead on secured AES, DES and Blowfish, requiring only between 3 and 34 lines of code changes from the original implementations.

...read moreread less

336 citations

Proceedings Article•

PACMan: coordinated memory caching for parallel jobs

[...]

Ganesh Ananthanarayanan¹, Ali Ghodsi¹, Andrew Wang¹, Dhruba Borthakur², Srikanth Kandula³, Scott Shenker¹, Ion Stoica¹ - Show less +3 more•Institutions (3)

University of California, Berkeley¹, Facebook², Microsoft³

25 Apr 2012

TL;DR: PACMan, a caching service that coordinates access to the distributed caches that reduces average completion time of jobs, and improves efficiency of the cluster by 47% and 54%, respectively, on production workloads from Facebook and Microsoft Bing.

...read moreread less

Abstract: Data-intensive analytics on large clusters is important for modern Internet services. As machines in these clusters have large memories, in-memory caching of inputs is an effective way to speed up these analytics jobs. The key challenge, however, is that these jobs run multiple tasks in parallel and a job is sped up only when inputs of all such parallel tasks are cached. Indeed, a single task whose input is not cached can slow down the entire job. To meet this "all-or-nothing" property, we have built PACMan, a caching service that coordinates access to the distributed caches. This coordination is essential to improve job completion times and cluster efficiency. To this end, we have implemented two cache replacement policies on top of PACMan's coordinated infrastructure fb-LIFE that minimizes average completion time by evicting large incomplete inputs, and LFU-F that maximizes cluster efficiency by evicting less frequently accessed inputs. Evaluations on production workloads from Facebook and Microsoft Bing show that PACMan reduces average completion time of jobs by 53% and 51% (small interactive jobs improve by 77%), and improves efficiency of the cluster by 47% and 54%, respectively.

...read moreread less

331 citations

Proceedings Article•DOI•

A Performance Interference Model for Managing Consolidated Workloads in QoS-Aware Clouds

[...]

Qian Zhu¹, Teresa Tung¹•Institutions (1)

Accenture¹

24 Jun 2012

TL;DR: Using the proposed interference model to optimize the cloud provider's metric (here the number of successfully executed applications) to realize better workload placement decisions and thereby maintaining the user's application QoS.

...read moreread less

Abstract: Cloud computing offers users the ability to access large pools of computational and storage resources on-demand without the burden of managing and maintaining their own IT assets. Today's cloud providers charge users based upon the amount of resources used or reserved, with only minimal guarantees of the quality-of-service (QoS) experienced byte users applications. As virtualization technologies proliferate among cloud providers, consolidating multiple user applications onto multi-core servers increases revenue and improves resource utilization. However, consolidation introduces performance interference between co-located workloads, which significantly impacts application QoS. A critical requirement for effective consolidation is to be able to predict the impact of application performance in the presence of interference from on-chip resources, e.g., CPU and last-level cache (LLC)/memory bandwidth sharing, to storage devices and network bandwidth contention. In this work, we propose an interference model which predicts the application QoS metric. The key distinctive feature is the consideration of time-variant inter-dependency among different levels of resource interference. We use applications from a test suite and SPECWeb2005 to illustrate the effectiveness of our model and an average prediction error of less than 8% is achieved. Furthermore, we demonstrate using the proposed interference model to optimize the cloud provider's metric (here the number of successfully executed applications) to realize better workload placement decisions and thereby maintaining the user's application QoS.

...read moreread less

Journal Article•DOI•

Why on-chip cache coherence is here to stay

[...]

Milo M. K. Martin¹, Mark D. Hill², Daniel J. Sorin³•Institutions (3)

University of Pennsylvania¹, University of Wisconsin-Madison², Duke University³

01 Jul 2012-Communications of The ACM

TL;DR: On-chip hardware coherence can scale gracefully as the number of cores increases, and the value of these cores can increase with increasing number of processors.

...read moreread less

Abstract: Today’s multicore chips commonly implement shared memory with cache coherence as low-level support for operating systems and application software. Technology trends continue to enable the scaling of the number of (processor) cores per chip. Because conventional wisdom says that the coherence does not scale well to many cores, some prognosticators predict the end of coherence. This paper seeks to refute this conventional wisdom by showing one way to scale on-chip cache coherence with bounded, modest costs by combining known techniques such as: shared caches augmented to track cached copies, explicit cache eviction notifications, and hierarchical design. Based on this scalable proof-of-concept design, we predict that on-chip coherence and the programming convenience and compatibility it provides are here to stay.

...read moreread less

Proceedings Article•DOI•

Vivisecting YouTube: An active measurement study

[...]

Vijay Kumar Adhikari¹, S. Jain¹, Yingying Chen¹, Zhi-Li Zhang¹•Institutions (1)

University of Minnesota¹

25 Mar 2012

TL;DR: The design of YouTube video delivery system consists of a “flat” video id space, multiple DNS namespaces reflecting a multi-layered logical organization of video servers, and a 3-tier physical cache hierarchy.

...read moreread less

Abstract: We deduce key design features behind the YouTube video delivery system by building a distributed active measurement infrastructure, and collecting and analyzing a large volume of video playback logs, DNS mappings and latency data. We find that the design of YouTube video delivery system consists of three major components: a “flat” video id space, multiple DNS namespaces reflecting a multi-layered logical organization of video servers, and a 3-tier physical cache hierarchy. We also uncover that YouTube employs a set of sophisticated mechanisms to handle video delivery dynamics such as cache misses and load sharing among its distributed cache locations and data centers.

...read moreread less

Proceedings Article•DOI•

Impact of traffic mix on caching performance in a content-centric network

[...]

Christine Fricker¹, Philippe Robert¹, James Roberts¹, Nada Sbihi¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

25 Mar 2012

TL;DR: Results demonstrate that caching VoD in access routers offers a highly favorable bandwidth memory tradeoff but that the other types of content would likely be more efficiently handled in very large capacity storage devices in the core.

...read moreread less

Abstract: For a realistic traffic mix, we evaluate the hit rates attained in a two-layer cache hierarchy designed to reduce Internet bandwidth requirements. The model identifies four main types of content, web, file sharing, user generated content and video on demand, distinguished in terms of their traffic shares, their population and object sizes and their popularity distributions. Results demonstrate that caching VoD in access routers offers a highly favorable bandwidth memory tradeoff but that the other types of content would likely be more efficiently handled in very large capacity storage devices in the core. Evaluations are based on a simple approximation for LRU cache performance that proves highly accurate in relevant configurations.

...read moreread less

Proceedings Article•DOI•

iDedup: latency-aware, inline data deduplication for primary storage

[...]

Kiran Srinivasan¹, Tim Bisson¹, Garth R. Goodson¹, Kaladhar Voruganti¹•Institutions (1)

NetApp¹

14 Feb 2012

TL;DR: iDedup achieves 60-70% of the maximum deduplication with less than a 5% CPU overhead and a 2-4% latency impact, and allows us to tradeoff capacity savings for performance, as demonstrated in the evaluation with real-world workloads.

...read moreread less

Abstract: Deduplication technologies are increasingly being deployed to reduce cost and increase space-efficiency in corporate data centers. However, prior research has not applied deduplication techniques inline to the request path for latency sensitive, primary workloads. This is primarily due to the extra latency these techniques introduce. Inherently, deduplicating data on disk causes fragmentation that increases seeks for subsequent sequential reads of the same data, thus, increasing latency. In addition, deduplicating data requires extra disk IOs to access on-disk deduplication metadata. In this paper, we propose an inline deduplication solution, iDedup, for primary workloads, while minimizing extra IOs and seeks.Our algorithm is based on two key insights from real-world workloads: i) spatial locality exists in duplicated primary data; and ii) temporal locality exists in the access patterns of duplicated data. Using the first insight, we selectively deduplicate only sequences of disk blocks. This reduces fragmentation and amortizes the seeks caused by deduplication. The second insight allows us to replace the expensive, on-disk, deduplication metadata with a smaller, in-memory cache. These techniques enable us to tradeoff capacity savings for performance, as demonstrated in our evaluation with real-world workloads. Our evaluation shows that iDedup achieves 60-70% of the maximum deduplication with less than a 5% CPU overhead and a 2-4% latency impact.

...read moreread less

Proceedings Article•

Whispers in the hyper-space: high-speed covert channel attacks in the cloud

[...]

Zhenyu Wu¹, Zhang Xu¹, Haining Wang¹•Institutions (1)

College of William & Mary¹

08 Aug 2012

TL;DR: This paper presents a novel covert channel attack that is capable of high-bandwidth and reliable data transmission in the cloud, and designs and implements a robust communication protocol, and demonstrates realistic covert channel attacks on various virtualized ×86 systems.

...read moreread less

Abstract: Information security and privacy in general are major concerns that impede enterprise adaptation of shared or public cloud computing. Specifically, the concern of virtual machine (VM) physical co-residency stems from the threat that hostile tenants can leverage various forms of side channels (such as cache covert channels) to exfiltrate sensitive information of victims on the same physical system. However, on virtualized ×86 systems, covert channel attacks have not yet proven to be practical, and thus the threat is widely considered a "potential risk". In this paper, we present a novel covert channel attack that is capable of high-bandwidth and reliable data transmission in the cloud. We first study the application of existing cache channel techniques in a virtualized environment, and uncover their major insufficiency and difficulties. We then overcome these obstacles by (1) redesigning a pure timing-based data transmission scheme, and (2) exploiting the memory bus as a high-bandwidth covert channel medium. We further design and implement a robust communication protocol, and demonstrate realistic covert channel attacks on various virtualized ×86 systems. Our experiments show that covert channels do pose serious threats to information security in the cloud. Finally, we discuss our insights on covert channel mitigation in virtualized environments.

...read moreread less

Proceedings Article•DOI•

Cache revive: architecting volatile STT-RAM caches for enhanced performance in CMPs

[...]

Adwait Jog¹, Asit K. Mishra², Cong Xu¹, Yuan Xie¹, Vijaykrishnan Narayanan¹, Ravishankar Iyer², Chita R. Das¹ - Show less +3 more•Institutions (2)

Pennsylvania State University¹, Intel²

03 Jun 2012

TL;DR: This work forms the relationship between retention-time and write-latency, and finds optimal retention- time for architecting an efficient cache hierarchy using STT-RAM to overcome high write latency and energy problems.

...read moreread less

Abstract: High density, low leakage and non-volatility are the attractive features of Spin-Transfer-Torque-RAM (STT-RAM), which has made it a strong competitor against SRAM as a universal memory replacement in multi-core systems. However, STT-RAM suffers from high write latency and energy which has impeded its widespread adoption. To this end, we look at trading-off STT-RAM's non-volatility property (data-retention-time) to overcome these problems. We formulate the relationship between retention-time and write-latency, and find optimal retention-time for architecting an efficient cache hierarchy using STT-RAM. Our results show that, compared to SRAM-based design, our proposal can improve performance and energy consumption by 18% and 60%, respectively.

...read moreread less

Proceedings Article•DOI•

Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design

[...]

Moinuddin K. Qureshi¹, Gabe H. Loh²•Institutions (2)

Georgia Institute of Technology¹, Advanced Micro Devices²

01 Dec 2012

TL;DR: This paper proposes a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst, and proposes a simple and highly effective Memory Access Predictor.

...read moreread less

Abstract: This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, have organized DRAM caches similar to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for a cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill, as well as an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.

...read moreread less

Proceedings Article•DOI•

A versatile and accurate approximation for LRU cache performance

[...]

Christine Fricker¹, Philippe Robert¹, James Roberts¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

04 Sep 2012

TL;DR: The approximation is particularly useful in evaluating the performance of current proposals for an information centric network where other approaches fail due to the very large populations of cacheable objects to be taken into account and to their complex popularity law.

...read moreread less

Abstract: In a 2002 paper, Che and co-authors proposed a simple approach for estimating the hit rates of a cache operating the least recently used (LRU) replacement policy. The approximation proves remarkably accurate and is applicable to quite general distributions of object popularity. This paper provides a mathematical explanation for the success of the approximation, notably in configurations where the intuitive arguments of Che et al. clearly do not apply. The approximation is particularly useful in evaluating the performance of current proposals for an information centric network where other approaches fail due to the very large populations of cacheable objects to be taken into account and to their complex popularity law, resulting from the mix of different content types and the filtering effect induced by the lower layers in a cache hierarchy.

...read moreread less

Journal Article•DOI•

DAGuE: A generic distributed DAG engine for High Performance Computing

[...]

George Bosilca¹, Aurelien Bouteiller¹, Anthony Danalis¹, Thomas Herault¹, Pierre Lemarinier², Jack Dongarra³ - Show less +2 more•Institutions (3)

University of Tennessee¹, University of Rennes², Oak Ridge National Laboratory³

01 Jan 2012

TL;DR: DAGuE is presented, a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority.

...read moreread less

Abstract: The frenetic development of the current architectures places a strain on the current state-of-the-art programming environments. Harnessing the full potential of such architectures is a tremendous task for the whole scientific computing community. We present DAGuE a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider can be expressed as a Direct Acyclic Graph of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size independent format that can be queried on-demand to discover data dependencies, in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on cache awareness, data-locality and task priority. We demonstrate the efficiency of our approach, using several micro-benchmarks to analyze the performance of different components of the framework, and a linear algebra factorization as a use case.

...read moreread less

Proceedings Article•DOI•

Transactional Memory Architecture and Implementation for IBM System Z

[...]

Christian Jacobi¹, Timothy J. Slegel¹, Dan F. Greiner¹•Institutions (1)

IBM¹

01 Dec 2012

TL;DR: The implementation in the IBM zEnterprise EC12 (zEC12) microprocessor generation, focusing on how transactional memory can be embedded into the existing cache design and multiprocessor shared-memory infrastructure, is described.

...read moreread less

Abstract: We present the introduction of transactional memory into the next generation IBM System z CPU. We first describe the instruction-set architecture features, including requirements for enterprise-class software RAS. We then describe the implementation in the IBM zEnterprise EC12 (zEC12) microprocessor generation, focusing on how transactional memory can be embedded into the existing cache design and multiprocessor shared-memory infrastructure. We explain practical reasons behind our choices. The zEC12 system is available since September 2012.

...read moreread less

Patent•

Apparatus, system, and method for managing eviction of data

[...]

David Nellans¹, David Atkisson¹, Jim Peterson¹, Jeremy Garff¹, Michael Zappe¹ - Show less +1 more•Institutions (1)

Wilmington University¹

31 Jan 2012

TL;DR: In this paper, an apparatus, system, and method for managing eviction of data is described. But the storage operations are associated with storage operations between a host and a backing storage device.

...read moreread less

Abstract: An apparatus, system, and method are disclosed for managing eviction of data. A cache write module stores data on a non-volatile storage device sequentially using a log-based storage structure having a head region and a tail region. A direct cache module caches data on the non-volatile storage device using the log-based storage structure. The data is associated with storage operations between a host and a backing store storage device. An eviction module evicts data of at least one region in succession from the log-based storage structure starting with the tail region and progressing toward the head region.

...read moreread less

Journal Article•DOI•

Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks

[...]

Leonid Domnitser¹, Aamer Jaleel², Jason Loew¹, Nael Abu-Ghazaleh¹, Dmitry Ponomarev¹ - Show less +1 more•Institutions (2)

Binghamton University¹, Intel²

26 Jan 2012

TL;DR: A flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks, and can provide strong security guarantees for the AES and Blowfish encryption algorithms.

...read moreread less

Abstract: We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents other co-executing threads from evicting reserved lines. Unreserved lines remain available for dynamic sharing among threads. NoMo requires only simple modifications to the cache replacement logic, making it straightforward to adopt. It requires no software support enabling it to automatically protect pre-existing binaries. NoMo results in performance degradation of about 1p on average. We demonstrate that NoMo can provide strong security guarantees for the AES and Blowfish encryption algorithms.

...read moreread less

Posted Content•

A versatile and accurate approximation for LRU cache performance

[...]

Christine Fricker¹, Philippe Robert¹, James Roberts¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

17 Feb 2012-arXiv: Networking and Internet Architecture

TL;DR: Che et al. as discussed by the authors proposed a simple approach for estimating the hit rates of a cache operating the least recently used (LRU) replacement policy, which proves remarkably accurate and is applicable to quite general distributions of object popularity.

...read moreread less

Abstract: In a 2002 paper, Che and co-authors proposed a simple approach for estimating the hit rates of a cache operating the least recently used (LRU) replacement policy. The approximation proves remarkably accurate and is applicable to quite general distributions of object popularity. This paper provides a mathematical explanation for the success of the approximation, notably in configurations where the intuitive arguments of Che, et al clearly do not apply. The approximation is particularly useful in evaluating the performance of current proposals for an information centric network where other approaches fail due to the very large populations of cacheable objects to be taken into account and to their complex popularity law, resulting from the mix of different content types and the filtering effect induced by the lower layers in a cache hierarchy.

...read moreread less

Proceedings Article•DOI•

A Trace-Driven Analysis of Caching in Content-Centric Networks

[...]

Gareth Tyson¹, Sebastian Kaune, Simon Miles¹, Yehia Elkhatib², Andreas Mauthe², Adel Taweel¹ - Show less +2 more•Institutions (2)

King's College London¹, Lancaster University²

31 Aug 2012

TL;DR: It is found that larger cache sizes (10,000 packets) can create significant reductions in packet path lengths and extend significantly beyond that of edge caching by allowing transit ASes to also reduce traffic.

...read moreread less

Abstract: A content-centric network is one which supports host-to-content routing, rather than the host-to-host routing of the existing Internet. This paper investigates the potential of caching data at the router-level in content-centric networks. To achieve this, two measurement sets are combined to gain an understanding of the potential caching benefits of deploying content-centric protocols over the current Internet topology. The first set of measurements is a study of the BitTorrent network, which provides detailed traces of content request patterns. This is then combined with CAIDA's ITDK Internet traces to replay the content requests over a real-world topology. Using this data, simulations are performed to measure how effective content-centric networking would have been if it were available to these consumers/providers. We find that larger cache sizes (10,000 packets) can create significant reductions in packet path lengths. On average, 2.02 hops are saved through caching (a 20% reduction), whilst also allowing 11% of data requests to be maintained within the requester's AS. Importantly, we also show that these benefits extend significantly beyond that of edge caching by allowing transit ASes to also reduce traffic.

...read moreread less

Proceedings Article•DOI•

Row buffer locality aware caching policies for hybrid memories

[...]

HanBin Yoon¹, Justin Meza¹, Rachata Ausavarungnirun¹, Rachael A. Harding¹, Onur Mutlu¹ - Show less +1 more•Institutions (1)

Carnegie Mellon University¹

30 Sep 2012

TL;DR: A policy is devised that avoids accessing in PCM data that frequently causes row buffer misses because such accesses are costly in terms of both latency and energy.

...read moreread less

Abstract: Phase change memory (PCM) is a promising technology that can offer higher capacity than DRAM. Unfortunately, PCM's access latency and energy are higher than DRAM's and its endurance is lower. Many DRAM-PCM hybrid memory systems use DRAM as a cache to PCM, to achieve the low access latency and energy, and high endurance of DRAM, while taking advantage of PCM's large capacity. A key question is what data to cache in DRAM to best exploit the advantages of each technology while avoiding its disadvantages as much as possible. We propose a new caching policy that improves hybrid memory performance and energy efficiency. Our observation is that both DRAM and PCM banks employ row buffers that act as a cache for the most recently accessed memory row. Accesses that are row buffer hits incur similar latencies (and energy consumption) in DRAM and PCM, whereas accesses that are row buffer misses incur longer latencies (and higher energy consumption) in PCM. To exploit this, we devise a policy that avoids accessing in PCM data that frequently causes row buffer misses because such accesses are costly in terms of both latency and energy. Our policy tracks the row buffer miss counts of recently used rows in PCM, and caches in DRAM the rows that are predicted to incur frequent row buffer misses. Our proposed caching policy also takes into account the high write latencies of PCM, in addition to row buffer locality. Compared to a conventional DRAM-PCM hybrid memory system, our row buffer locality-aware caching policy improves system performance by 14% and energy efficiency by 10% on data-intensive server and cloud-type workloads. The proposed policy achieves 31% performance gain over an all-PCM memory system, and comes within 29% of the performance of an allDRAM memory system (not taking PCM's capacity benefit into account) on evaluated workloads.

...read moreread less

Proceedings Article•DOI•

Base-station assisted device-to-device communications for high-throughput wireless video networks

[...]

Negin Golrezaei¹, Andreas F. Molisch¹, Alexandros G. Dimakis¹•Institutions (1)

University of Southern California¹

10 Jun 2012

TL;DR: It is shown that an improvement of spectral efficiency of one to two orders of magnitude is possible, even if there is not very high redundancy in video requests, and what is the optimal collaboration distance is investigated.

...read moreread less

Abstract: We propose a new scheme for increasing the throughput of video files in cellular communications systems. This scheme exploits (i) the redundancy of user requests as well as (ii) the considerable storage capacity of smartphones and tablets. Users cache popular video files and — after receiving requests from other users — serve these requests via device-to-device localized transmissions. We investigate what is the optimal collaboration distance, trading off frequency reuse with the probability of finding a requested file within the collaboration distance. We show that an improvement of spectral efficiency of one to two orders of magnitude is possible, even if there is not very high redundancy in video requests.

...read moreread less

Journal Article•DOI•

Complete System Power Estimation Using Processor Performance Events

[...]

William L. Bircher¹, Lizy K. John²•Institutions (2)

Advanced Micro Devices¹, University of Texas at Austin²

01 Apr 2012-IEEE Transactions on Computers

TL;DR: Using measurement of actual systems running scientific, commercial and productivity workloads, power models for six subsystems on two platforms are developed and validated and it is possible to estimate system power consumption without the need for power sensing hardware.

...read moreread less

Abstract: This paper proposes the use of microprocessor performance counters for online measurement of complete system power consumption. The approach takes advantage of the "trickle-down” effect of performance events in microprocessors. While it has been known that CPU power consumption is correlated to processor performance, the use of well-known performance-related events within a microprocessor such as cache misses and DMA transactions to estimate power consumption in memory and disk and other subsystems outside of the microprocessor is new. Using measurement of actual systems running scientific, commercial and productivity workloads, power models for six subsystems (CPU, memory, chipset, I/O, disk, and GPU) on two platforms (server and desktop) are developed and validated. These models are shown to have an average error of less than nine percent per subsystem across the considered workloads. Through the use of these models and existing on-chip performance event counters, it is possible to estimate system power consumption without the need for power sensing hardware.

...read moreread less

Proceedings Article•DOI•

FlashTier: a lightweight, consistent and durable storage cache

[...]

Mohit Saxena¹, Michael M. Swift¹, Yiying Zhang¹•Institutions (1)

University of Wisconsin-Madison¹

10 Apr 2012

TL;DR: The FlashTier design addresses three limitations of using traditional SSDs for caching, a system architecture built upon solid-state cache, a flash device with an interface designed for caching that can recover from the crash of a 100GB cache in only 2.4 seconds.

...read moreread less

Abstract: The availability of high-speed solid-state storage has introduced a new tier into the storage hierarchy. Low-latency and high-IOPS solid-state drives (SSDs) cache data in front of high-capacity disks. However, most existing SSDs are designed to be a drop-in disk replacement, and hence are mismatched for use as a cache.This paper describes FlashTier, a system architecture built upon solid-state cache (SSC), a flash device with an interface designed for caching. Management software at the operating system block layer directs caching. The FlashTier design addresses three limitations of using traditional SSDs for caching. First, FlashTier provides a unified logical address space to reduce the cost of cache block management within both the OS and the SSD. Second, FlashTier provides cache consistency guarantees allowing the cached data to be used following a crash. Finally, FlashTier leverages cache behavior to silently evict data blocks during garbage collection to improve performance of the SSC.We have implemented an SSC simulator and a cache manager in Linux. In trace-based experiments, we show that FlashTier reduces address translation space by 60% and silent eviction improves performance by up to 167%. Furthermore, FlashTier can recover from the crash of a 100GB cache in only 2.4 seconds.

...read moreread less

Patent•

Apparatus, system, and method for destaging cached data

[...]

David Atkisson, David Flynn

23 Jan 2012

TL;DR: In this paper, an apparatus, system, and method are disclosed for destaging cached data in a nonvolatile solid-state storage device (NVS) with a cache controller.

...read moreread less

Abstract: An apparatus, system, and method are disclosed for destaging cached data. A cache controller (116) detects one or more write requests to store data in a backing store (118). The cache controller (116) sends the write requests to a storage controller (104) for a nonvolatile solid-state storage device (102). The storage controller (104) receives the write requests and caches the data in the storage device (102) by appending the data to a log (940) of the storage device (102). The log (940) includes a sequential, log-based structure preserved in the storage device (102). The cache controller (116) receives at least a portion of the data from the storage controller (104) in an order favoring operation of the storage device (102) and destages the data to the backing store (118) in that order, which is selected so that operation of the storage device (102) is more efficient in response to destaging.

...read moreread less

Collapse