
Showing papers on "Cache" published in 2014


Journal ArticleDOI
TL;DR: This paper proposes a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes, and argues that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content into memories at the end users. Conventionally, these memories are used to deliver requested content in part from a locally cached copy rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e., the memory available at each individual user). In this paper, we introduce and exploit a second, global, caching gain not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative memory available at all users), even though there is no cooperation among the users. To evaluate and isolate these two gains, we introduce an information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, we propose a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes. In particular, the improvement can be on the order of the number of users in the network. In addition, we argue that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
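
As a rough illustration of the local versus global gains discussed above, the sketch below compares an uncoded peak rate with a coded-caching-style rate. The notation (N files, K users, a per-user cache of M files) and the exact rate expressions are assumptions for illustration only, not quoted from this page.

```python
# Sketch: peak delivery rate (in units of one file) for conventional uncoded
# caching versus a coded caching scheme, assuming N files, K users, and a
# per-user cache of M files (notation and formulas assumed for illustration).

def uncoded_rate(K: int, N: int, M: float) -> float:
    # Local caching gain only: each user still needs the uncached fraction.
    return K * (1 - M / N)

def coded_rate(K: int, N: int, M: float) -> float:
    # Additional global gain ~ 1/(1 + K*M/N) from coded multicast delivery.
    return K * (1 - M / N) / (1 + K * M / N)

if __name__ == "__main__":
    K, N = 30, 30
    for M in (0, 3, 10, 20, 30):
        print(M, round(uncoded_rate(K, N, M), 2), round(coded_rate(K, N, M), 2))
```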

1,857 citations


Journal ArticleDOI
TL;DR: In this article, a proactive caching mechanism is proposed to reduce peak traffic demands by proactively serving predictable user demands via caching at base stations and users' devices, and the results show that important gains can be obtained for each case study, with backhaul savings and a higher ratio of satisfied users.
Abstract: This article explores one of the key enablers of beyond 4G wireless networks leveraging small cell network deployments, proactive caching. Endowed with predictive capabilities and harnessing recent developments in storage, context awareness, and social networks, peak traffic demands can be substantially reduced by proactively serving predictable user demands via caching at base stations and users' devices. In order to show the effectiveness of proactive caching, we examine two case studies that exploit the spatial and social structure of the network, where proactive caching plays a crucial role. First, in order to alleviate backhaul congestion, we propose a mechanism whereby files are proactively cached during off-peak periods based on file popularity and correlations among user and file patterns. Second, leveraging social networks and D2D communications, we propose a procedure that exploits the social structure of the network by predicting the set of influential users to (proactively) cache strategic contents and disseminate them to their social ties via D2D communications. Exploiting this proactive caching paradigm, numerical results show that important gains can be obtained for each case study, with backhaul savings and a higher ratio of satisfied users of up to 22 and 26 percent, respectively. Higher gains can be further obtained by increasing the storage capability at the network edge.
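
A minimal sketch of the popularity-driven proactive fill idea follows, assuming a Zipf popularity profile and a simple "cache the top files off-peak" rule; the article's actual mechanism, which estimates popularity from user and file correlation patterns, is not reproduced here.

```python
# Sketch: proactive cache fill during off-peak hours based on an assumed Zipf
# popularity profile (illustrative stand-in for the article's popularity estimates).
import random

def zipf_popularity(num_files: int, skew: float = 0.8) -> list[float]:
    weights = [1.0 / (rank ** skew) for rank in range(1, num_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def proactive_fill(popularity: list[float], cache_slots: int) -> set[int]:
    # Store the most popular files at the base station before the peak period.
    ranked = sorted(range(len(popularity)), key=lambda f: popularity[f], reverse=True)
    return set(ranked[:cache_slots])

def offloaded_fraction(popularity, cache, trials=100_000):
    # Fraction of simulated peak-hour requests that never touch the backhaul.
    files = range(len(popularity))
    sampled = random.choices(files, weights=popularity, k=trials)
    return sum(f in cache for f in sampled) / trials

if __name__ == "__main__":
    pop = zipf_popularity(1000)
    cache = proactive_fill(pop, cache_slots=100)
    print("backhaul requests avoided:", offloaded_fraction(pop, cache))
```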

1,157 citations


Journal ArticleDOI
TL;DR: A novel edge caching scheme based on the concept of content-centric networking or information-centric networking is proposed, and trace-driven simulations are used to evaluate its performance and validate the various advantages of caching content in 5G mobile networks.
Abstract: The demand for rich multimedia services over mobile networks has been soaring at a tremendous pace over recent years. However, due to the centralized architecture of current cellular networks, the wireless link capacity as well as the bandwidth of the radio access networks and the backhaul network cannot practically cope with the explosive growth in mobile traffic. Recently, we have observed the emergence of promising mobile content caching and delivery techniques, by which popular contents are cached in the intermediate servers (or middleboxes, gateways, or routers) so that demands from users for the same content can be accommodated easily without duplicate transmissions from remote servers; hence, redundant traffic can be significantly eliminated. In this article, we first study techniques related to caching in current mobile networks, and discuss potential techniques for caching in 5G mobile networks, including evolved packet core network caching and radio access network caching. A novel edge caching scheme based on the concept of content-centric networking or information-centric networking is proposed. Using trace-driven simulations, we evaluate the performance of the proposed scheme and validate the various advantages of the utilization of caching content in 5G mobile networks. Furthermore, we conclude the article by exploring new relevant opportunities and challenges.

1,098 citations


Proceedings Article
20 Aug 2014
TL;DR: This paper presents FLUSH+RELOAD, a cache side-channel attack technique that exploits a weakness in the Intel X86 processors to monitor access to memory lines in shared pages and recovers 96.7% of the bits of the secret key by observing a single signature or decryption round.
Abstract: Sharing memory pages between non-trusting processes is a common method of reducing the memory footprint of multi-tenanted systems. In this paper we demonstrate that, due to a weakness in the Intel X86 processors, page sharing exposes processes to information leaks. We present FLUSH+RELOAD, a cache side-channel attack technique that exploits this weakness to monitor access to memory lines in shared pages. Unlike previous cache side-channel attacks, FLUSH+RELOAD targets the Last-Level Cache (i.e., L3 on processors with three cache levels). Consequently, the attack program and the victim do not need to share the execution core. We demonstrate the efficacy of the FLUSH+RELOAD attack by using it to extract the private encryption keys from a victim program running GnuPG 1.4.13. We tested the attack both between two unrelated processes in a single operating system and between processes running in separate virtual machines. On average, the attack is able to recover 96.7% of the bits of the secret key by observing a single signature or decryption round.

1,001 citations


Posted Content
TL;DR: This article explores one of the key enablers of beyond 4G wireless networks leveraging small cell network deployments, proactive caching, and proposes a mechanism whereby files are proactively cached during off-peak periods based on file popularity and correlations among user and file patterns.
Abstract: This article explores one of the key enablers of beyond 4G wireless networks leveraging small cell network deployments, namely proactive caching. Endowed with predictive capabilities and harnessing recent developments in storage, context-awareness and social networks, peak traffic demands can be substantially reduced by proactively serving predictable user demands, via caching at base stations and users' devices. In order to show the effectiveness of proactive caching, we examine two case studies which exploit the spatial and social structure of the network, where proactive caching plays a crucial role. Firstly, in order to alleviate backhaul congestion, we propose a mechanism whereby files are proactively cached during off-peak periods based on file popularity and correlations among user and file patterns. Secondly, leveraging social networks and device-to-device (D2D) communications, we propose a procedure that exploits the social structure of the network by predicting the set of influential users to (proactively) cache strategic contents and disseminate them to their social ties via D2D communications. Exploiting this proactive caching paradigm, numerical results show that important gains can be obtained for each case study, with backhaul savings and a higher ratio of satisfied users of up to 22% and 26%, respectively. Higher gains can be further obtained by increasing the storage capability at the network edge.

841 citations


Journal ArticleDOI
TL;DR: It is shown that an improvement of spectral efficiency of one to two orders of magnitude is possible, even if there is not very high redundancy in video requests, and even a purely random caching scheme shows only a minor performance loss.
Abstract: We propose a new scheme for increasing the throughput of video files in cellular communications systems. This scheme exploits (1) the redundancy of user requests as well as (2) the considerable storage capacity of smartphones and tablets. Users cache popular video files and, after receiving requests from other users, serve these requests via device-to-device localized transmissions. The file placement is optimal when a central control knows a priori the locations of wireless devices when file requests occur. However, even a purely random caching scheme shows only a minor performance loss compared to such a “genie-aided” scheme. We then analyze the optimal collaboration distance, trading off frequency reuse with the probability of finding a requested file within the collaboration distance. We show that an improvement of spectral efficiency of one to two orders of magnitude is possible, even if there is not very high redundancy in video requests.

442 citations


Journal ArticleDOI
01 Aug 2014
TL;DR: The authors consider the problem of caching in next-generation mobile cellular networks where small base stations (SBSs) are able to store their users' content and serve them accordingly, and investigate the outage probability and average content delivery rate in such cache-enabled small cell networks.
Abstract: We consider the problem of caching in next generation mobile cellular networks where small base stations (SBSs) are able to store their users' content and serve them accordingly. The SBSs are stochastically distributed over the plane and serve their users either from the local cache or internet via limited backhaul, depending on the availability of requested content. We model and characterize the outage probability and average content delivery rate as a function of the signal-to-interference-ratio (SINR), base station intensity, target file bitrate, storage size and file popularity. Our results provide key insights into the problem of cache-enabled small cell networks.
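
For intuition about how storage size and popularity skew drive the fraction of requests served from the local cache, here is a toy calculation assuming Zipf popularity and an SBS that stores the S most popular files; it is a simplification for illustration, not the stochastic-geometry model analyzed in the paper.

```python
# Sketch: probability that a user's request is served from the local SBS cache
# rather than the backhaul, assuming Zipf file popularity and an SBS that
# stores the S most popular files (a simplification of the paper's model).

def zipf_pmf(num_files: int, skew: float) -> list[float]:
    w = [1.0 / (r ** skew) for r in range(1, num_files + 1)]
    s = sum(w)
    return [x / s for x in w]

def cache_hit_probability(num_files: int, storage: int, skew: float = 1.0) -> float:
    pmf = zipf_pmf(num_files, skew)
    return sum(pmf[:storage])

if __name__ == "__main__":
    for storage in (10, 50, 200):
        print(storage, round(cache_hit_probability(10_000, storage), 3))
```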

403 citations


Proceedings ArticleDOI
10 Jun 2014
TL;DR: In this article, the authors studied the optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity, where the cache content placement is optimized based on the demand history.
Abstract: Optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity is studied. The sBS has a large cache memory and provides content-level selective offloading by delivering high data rate contents to users in its coverage area. The goal of the sBS content controller (CC) is to store the most popular contents in the sBS cache memory such that the maximum amount of data can be fetched directly from the sBS, without relying on the limited backhaul resources during peak traffic periods. If the popularity profile is known in advance, the problem reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known by the CC, and it can only observe the instantaneous demand for the cached content. Hence, the cache content placement is optimized based on the demand history. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while exploiting the limited cache capacity in the best way possible. Three algorithms are studied for this cache content placement problem, leading to different exploitation-exploration trade-offs. We provide extensive numerical simulations in order to study the time-evolution of these algorithms, and the impact of the system parameters, such as the number of files, the number of users, the cache size, and the skewness of the popularity profile, on the performance. It is shown that the proposed algorithms quickly learn the popularity profile for a wide range of system parameters.
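
The exploration/exploitation trade-off can be pictured with a generic epsilon-greedy refresh rule like the sketch below; the three algorithms actually studied in the paper are not reproduced here, and this stand-in is purely illustrative.

```python
# Sketch: a generic epsilon-greedy cache refresh loop in the spirit of the
# exploration/exploitation trade-off described above (not the paper's algorithms).
import random

def refresh_cache(demand_counts: dict[int, int], cache_size: int,
                  num_files: int, epsilon: float = 0.1) -> set[int]:
    # Exploit: keep the files with the highest observed demand so far.
    ranked = sorted(range(num_files), key=lambda f: demand_counts.get(f, 0), reverse=True)
    cache = ranked[:cache_size]
    # Explore: occasionally swap a slot for a file with little demand history,
    # so its true popularity can be observed in the next interval.
    for i in range(cache_size):
        if random.random() < epsilon:
            cache[i] = random.randrange(num_files)
    return set(cache)   # duplicates (if any) simply leave a slot unused
```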

322 citations


Journal ArticleDOI
TL;DR: This paper proposes RAN-aware reactive and proactive caching policies that utilize User Preference Profiles (UPPs) of active users in a cell, and proposes video-aware backhaul and wireless channel scheduling techniques that maximize the number of concurrent video sessions that can be supported by the end-to-end network while satisfying their initial delay requirements and minimizing stalling.
Abstract: In this paper, we introduce distributed caching of videos at the base stations of the Radio Access Network (RAN) to significantly improve the video capacity and user experience of mobile networks. To ensure effectiveness of the massively distributed but relatively small-sized RAN caches, unlike Internet content delivery networks (CDNs) that can store millions of videos in a relatively few large-sized caches, we propose RAN-aware reactive and proactive caching policies that utilize User Preference Profiles (UPPs) of active users in a cell. Furthermore, we propose video-aware backhaul and wireless channel scheduling techniques that, in conjunction with edge caching, maximize the number of concurrent video sessions that can be supported by the end-to-end network while satisfying their initial delay requirements and minimizing stalling. To evaluate our proposed techniques, we developed a statistical simulation framework using MATLAB and performed extensive simulations under various cache sizes, video popularity and UPP distributions, user dynamics, and wireless channel conditions. Our simulation results show that RAN caches using UPP-based caching policies, together with video-aware backhaul scheduling, can improve capacity by 300% compared to having no RAN caches, and by more than 50% compared to RAN caches using conventional caching policies. The results also demonstrate that using UPP-based RAN caches can significantly improve the probability that video requests experience low initial delays. In networks where the wireless channel bandwidth may be constrained, application of our video-aware wireless channel scheduler results in significantly (up to 250%) higher video capacity with very low stalling probability.

272 citations


Proceedings ArticleDOI
24 Sep 2014
TL;DR: Caching content opportunistically only near its consumers is shown to outperform the traditional on-path caching approach assumed in most ICN architectures in an unstructured network with arbitrary topology represented as a random geometric graph.
Abstract: A formal framework is presented for the characterization of cache allocation models in Information-Centric Networks (ICN). The framework is used to compare the performance of optimal caching everywhere in an ICN with opportunistic caching of content only near its consumers. This comparison is made using the independent reference model adopted in all prior studies, as well as a new model that captures non-stationary reference locality in space and time. The results obtained analytically and from simulations show that optimal caching throughout an ICN and opportunistic caching at the edge routers of an ICN perform comparably. In addition, caching content opportunistically only near its consumers is shown to outperform the traditional on-path caching approach assumed in most ICN architectures in an unstructured network with arbitrary topology represented as a random geometric graph.

257 citations


Proceedings ArticleDOI
11 Nov 2014
TL;DR: A novel cache language model is introduced that consists of both an n-gram and an added "cache" component to exploit localness, building on the finding that human-written programs are localized: they have useful local regularities that can be captured and exploited.
Abstract: The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities ("naturalness") of source code, and help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regularities that can be captured and exploited. We introduce a novel cache language model that consists of both an n-gram and an added "cache" component to exploit localness. We show empirically that the additional cache component greatly improves the n-gram approach by capturing the localness of software, as measured by both cross-entropy and suggestion accuracy. Our model's suggestion accuracy is actually comparable to a state-of-the-art, semantically augmented language model; but it is simpler and easier to implement. Our cache language model requires nothing beyond lexicalization, and thus is applicable to all programming languages.
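
Conceptually, such a model mixes a global n-gram probability with a local "cache" probability estimated from recently seen tokens in the same file. The sketch below shows that interpolation in its simplest form; the window size and mixing weight are made-up parameters, not the paper's.

```python
# Sketch: interpolating a global n-gram estimate with a local "cache" estimate
# built from a sliding window of recently observed tokens (illustrative only).
from collections import Counter, deque

class CacheComponent:
    def __init__(self, window: int = 1000):
        self.recent = deque(maxlen=window)
        self.counts = Counter()

    def observe(self, token: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.counts[self.recent[0]] -= 1      # oldest token falls out of the window
        self.recent.append(token)
        self.counts[token] += 1

    def prob(self, token: str) -> float:
        return self.counts[token] / max(1, len(self.recent))

def mixed_prob(ngram_prob: float, cache: CacheComponent, token: str,
               lam: float = 0.8) -> float:
    # Linear interpolation of the global n-gram model and the local cache.
    return lam * ngram_prob + (1 - lam) * cache.prob(token)
```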

Book ChapterDOI
17 Sep 2014
TL;DR: In this article, a cache-based attack on OpenSSL AES implementation running on VMware VMs is presented, which takes only in the order of seconds to minutes to succeed in a cross-VM setting.
Abstract: In cloud computing, efficiencies are reaped by resource sharing such as co-location of computation and deduplication of data. This work exploits resource sharing in virtualization software to build a powerful cache-based attack on AES. We demonstrate the vulnerability by mounting Cross-VM Flush+Reload cache attacks in VMware VMs to recover the keys of an AES implementation of OpenSSL 1.0.1 running inside the victim VM. Furthermore, the attack works in a realistic setting where different VMs are located on separate cores. The modified Flush+Reload attack we present takes only on the order of seconds to minutes to succeed in a cross-VM setting. Therefore long-term co-location, as required by other fine-grain attacks in the literature, is not needed. The results of this study show that there is a great security risk to the OpenSSL AES implementation running on VMware cloud services when deduplication is not disabled.

Proceedings ArticleDOI
13 Dec 2014
TL;DR: A novel random fill cache architecture is proposed that replaces demand fetch with random cache fill within a configurable neighborhood window and shows that it provides information-theoretic security against reuse based attacks.
Abstract: Correctly functioning caches have been shown to leak critical secrets like encryption keys, through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side channel attacks: contention based attacks and reuse based attacks. Previous secure cache designs target only contention based attacks, and we show that they cannot defend against reuse based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse based attacks.

Posted Content
TL;DR: The results of this study show that there is a great security risk to the OpenSSL AES implementation running on VMware cloud services when deduplication is not disabled.
Abstract: In cloud computing, efficiencies are reaped by resource sharing such as co-location of computation and deduplication of data. This work exploits resource sharing in virtualization software to build a powerful cache-based attack on AES. We demonstrate the vulnerability by mounting Cross-VM Flush+Reload cache attacks in VMware VMs to recover the keys of an AES implementation of OpenSSL 1.0.1 running inside the victim VM. Furthermore, the attack works in a realistic setting where different VMs are located on separate cores. The modified Flush+Reload attack we present takes only on the order of seconds to minutes to succeed in a cross-VM setting. Therefore long-term co-location, as required by other fine-grain attacks in the literature, is not needed. The results of this study show that there is a great security risk to the OpenSSL AES implementation running on VMware cloud services when deduplication is not disabled.

Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors propose a unified methodology to analyse the performance of caches (both isolated and interconnected), by extending and generalizing a decoupling technique originally known as Che's approximation, which provides very accurate results at low computational cost.
Abstract: We propose a unified methodology to analyse the performance of caches (both isolated and interconnected), by extending and generalizing a decoupling technique originally known as Che's approximation, which provides very accurate results at low computational cost. We consider several caching policies, taking into account the effects of temporal locality. In the case of interconnected caches, our approach allows us to do better than the Poisson approximation commonly adopted in prior work. Our results, validated against simulations and trace-driven experiments, provide interesting insights into the performance of caching systems.
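
Che's approximation itself is compact enough to sketch: the characteristic time T_C of an LRU cache of size C under IRM traffic solves sum_i (1 - exp(-lambda_i * T_C)) = C, and h_i = 1 - exp(-lambda_i * T_C) approximates each item's hit probability. The code below solves this numerically with made-up Zipf request rates; the paper's extensions to other policies, temporal locality, and interconnected caches are not shown.

```python
# Sketch: Che's approximation for an isolated LRU cache under IRM traffic.
import math

def characteristic_time(rates: list[float], cache_size: int) -> float:
    assert cache_size < len(rates)
    def occupied(t: float) -> float:
        return sum(1 - math.exp(-lam * t) for lam in rates)
    lo, hi = 0.0, 1.0
    while occupied(hi) < cache_size:     # bracket the root
        hi *= 2
    for _ in range(100):                 # bisection on the fixed-point equation
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if occupied(mid) < cache_size else (lo, mid)
    return (lo + hi) / 2

def hit_probabilities(rates: list[float], cache_size: int) -> list[float]:
    t_c = characteristic_time(rates, cache_size)
    return [1 - math.exp(-lam * t_c) for lam in rates]

if __name__ == "__main__":
    rates = [1.0 / r for r in range(1, 1001)]        # Zipf(1) request rates
    hits = hit_probabilities(rates, cache_size=100)
    print("overall hit ratio:",
          sum(l * h for l, h in zip(rates, hits)) / sum(rates))
```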

Proceedings ArticleDOI
22 Aug 2014
TL;DR: This work defines a hardware-software hybrid switch design that relies on rule caching to provide large rule tables at low cost and "splices" long dependency chains to cache smaller groups of rules while preserving the semantics of the network policy.
Abstract: Software-Defined Networking (SDN) enables fine-grained policies for firewalls, load balancers, routers, traffic monitoring, and other functionality. While Ternary Content Addressable Memory (TCAM) enables OpenFlow switches to process packets at high speed based on multiple header fields, today's commodity switches support just thousands to tens of thousands of rules. To realize the potential of SDN on this hardware, we need efficient ways to support the abstraction of a switch with arbitrarily large rule tables. To do so, we define a hardware-software hybrid switch design that relies on rule caching to provide large rule tables at low cost. Unlike traditional caching solutions, we neither cache individual rules (to respect rule dependencies) nor compress rules (to preserve the per-rule traffic counts). Instead we "splice" long dependency chains to cache smaller groups of rules while preserving the semantics of the network policy. Our design satisfies four core criteria: (1) elasticity (combining the best of hardware and software switches), (2) transparency (faithfully supporting native OpenFlow semantics, including traffic counters), (3) fine-grained rule caching (placing popular rules in the TCAM, despite dependencies on less-popular rules), and (4) adaptability (to enable incremental changes to the rule caching as the policy changes).

Proceedings ArticleDOI
14 Apr 2014
TL;DR: This work develops multiple algorithms for caching in these CDNs, using anonymized actual data from a large-scale, global CDN to evaluate the algorithms and draw conclusions on their suitability for different settings.
Abstract: Planet-scale video Content Delivery Networks (CDNs) deliver a significant fraction of the entire Internet traffic. Effective caching at the edge is vital for the feasibility of these CDNs, which can otherwise incur significant monetary costs and resource overloads in the Internet. We analyze the challenges and requirements for video caching on these CDNs which cannot be addressed by standard solutions. We develop multiple algorithms for caching in these CDNs: (i) An LRU-based baseline solution to address the requirements, (ii) an intelligent ingress-efficient algorithm, (iii) an offline cache aware of future requests (greedy) to estimate the maximum caching efficiency we can expect from any online algorithm, and (iv) an optimal offline cache (for limited scales). We use anonymized actual data from a large-scale, global CDN to evaluate the algorithms and draw conclusions on their suitability for different settings.
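
As a reference point for what such a baseline looks like, here is a byte-aware LRU cache that records hits versus origin fetches ("ingress"); it is a generic sketch with placeholder sizes and trace, not the ingress-efficient or offline algorithms developed in the paper.

```python
# Sketch: a byte-aware LRU baseline that tracks hits vs. misses (origin ingress).
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.items: OrderedDict[str, int] = OrderedDict()    # video id -> size

    def request(self, video_id: str, size: int) -> bool:
        if video_id in self.items:
            self.items.move_to_end(video_id)                 # refresh recency
            return True                                       # hit
        while self.used + size > self.capacity and self.items:
            _, evicted = self.items.popitem(last=False)       # evict least recent
            self.used -= evicted
        self.items[video_id] = size
        self.used += size
        return False                                           # miss -> ingress

cache = LRUCache(capacity_bytes=10 * 2**30)                   # a 10 GiB edge cache
trace = [("vid1", 50 * 2**20), ("vid2", 80 * 2**20), ("vid1", 50 * 2**20)]
hits = sum(cache.request(v, s) for v, s in trace)
print(f"hit ratio: {hits / len(trace):.2f}")
```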

Proceedings ArticleDOI
24 Feb 2014
TL;DR: This work proposes Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications.
Abstract: Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.

Proceedings ArticleDOI
13 Dec 2014
TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities and employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads.
Abstract: Recent research advocates large die-stacked DRAM caches in many-core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.

Proceedings ArticleDOI
23 Oct 2014
TL;DR: This paper proposes a coded caching framework in which the sBSs learn the popularity profile of the files (based on their demand history) via a combinatorial multi-armed bandit framework; the framework is modeled as a linear program that takes the network connectivity into account and thereby jointly designs the caching strategies.
Abstract: Caching has emerged as a vital tool in modern communication systems for reducing peak data rates by allowing popular files to be pre-fetched and stored locally at end users' devices. With the shift in paradigm from homogeneous cellular networks to heterogeneous ones, the concept of data offloading to small cell base stations (sBS) has garnered significant attention. Caching at these small cell base stations has recently been proposed, where popular files are pre-fetched and stored locally in order to avoid bottlenecks in the limited-capacity backhaul connection link to the core network. In this paper, we study distributed caching strategies in such a heterogeneous small cell wireless network from a reinforcement learning perspective. Using state-of-the-art results, it can be shown that the optimal joint cache content placement in the sBSs turns out to be an NP-hard problem even when the sBSs are aware of the popularity profile of the files that are to be cached. To address this problem, we propose a coded caching framework, where the sBSs learn the popularity profile of the files (based on their demand history) via a combinatorial multi-armed bandit framework. The sBSs then pre-fetch segments of the Fountain-encoded versions of the popular files at regular intervals to serve users' requests. We show that the proposed coded caching framework can be modeled as a linear program that takes into account the network connectivity and thereby jointly designs the caching strategies. Numerical results are presented to show the benefits of the joint coded caching technique over naive decentralized cache placement strategies.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper proposes the memory request prioritization buffer (MRPB), a hardware structure that improves caching efficiency of massively parallel workloads by applying two prioritization methods (request reordering and cache bypassing) to memory requests before they access a cache.
Abstract: Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to provide smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves caching efficiency of massively parallel workloads by applying two prioritization methods (request reordering and cache bypassing) to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: It is proved that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request.
Abstract: We consider a basic content distribution scenario consisting of a single origin server connected through a shared bottleneck link to a number of users each equipped with a cache of finite memory. The users issue a sequence of content requests from a set of popular files, and the goal is to operate the caches as well as the server such that these requests are satisfied with the minimum number of bits sent over the shared link. Assuming a basic Markov model for renewing the set of popular files, we characterize approximately the optimal long-term average rate of the shared link. We further prove that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request. To support these theoretical results, we propose an online coded caching scheme termed coded least-recently sent (LRS) and simulate it for a demand time series derived from the dataset made available by Netflix for the Netflix Prize. For this time series, we show that the proposed coded LRS algorithm significantly outperforms the popular least-recently used (LRU) caching algorithm.

Proceedings Article
20 Aug 2014
TL;DR: A simple per-core CPU state cleansing mechanism is integrated into Xen that provides further protection against side-channel attacks at little cost when used in conjunction with an MRT guarantee, and it is found that the performance impact of MRT guarantees can be very low, particularly in multi-core settings.
Abstract: Public infrastructure-as-a-service clouds, such as Amazon EC2 and Microsoft Azure allow arbitrary clients to run virtual machines (VMs) on shared physical infrastructure. This practice of multi-tenancy brings economies of scale, but also introduces the threat of malicious VMs abusing the scheduling of shared resources. Recent works have shown how to mount cross-VM side-channel attacks to steal cryptographic secrets. The straightforward solution is hard isolation that dedicates hardware to each VM. However, this comes at the cost of reduced efficiency. We investigate the principle of soft isolation: reduce the risk of sharing through better scheduling. With experimental measurements, we show that a minimum run time (MRT) guarantee for VM virtual CPUs that limits the frequency of preemptions can effectively prevent existing Prime+Probe cache-based side-channel attacks. Through experimental measurements, we find that the performance impact of MRT guarantees can be very low, particularly in multi-core settings. Finally, we integrate a simple per-core CPU state cleansing mechanism, a form of hard isolation, into Xen. It provides further protection against side-channel attacks at little cost when used in conjunction with an MRT guarantee.

Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper proposes CAMEO, a hardware-based Cache-like Memory Organization that not only makes stacked DRAM visible as part of the memory address space but also exploits data locality on a fine-grained basis, along with a low-overhead Line Location Table (LLT) that tracks the physical location of all data lines.
Abstract: This paper analyzes the trade-offs in architecting stacked DRAM either as part of main memory or as a hardware-managed cache. Using stacked DRAM as part of main memory increases the effective capacity, but obtaining high performance from such a system requires Operating System (OS) support to migrate data at a page granularity. Using stacked DRAM as a hardware cache has the advantages of being transparent to the OS and performing data management at a line granularity, but suffers from reduced main memory capacity. This is because the stacked DRAM cache is not part of the memory address space. Ideally, we want the stacked DRAM to contribute towards the capacity of main memory, and still maintain the hardware-based fine granularity of a cache. We propose CAMEO, a hardware-based Cache-like Memory Organization that not only makes stacked DRAM visible as part of the memory address space but also exploits data locality on a fine-grained basis. CAMEO retains recently accessed data lines in stacked DRAM and swaps out the victim line to off-chip memory. Since CAMEO can change the physical location of a line dynamically, we propose a low-overhead Line Location Table (LLT) that tracks the physical location of all data lines. We also propose an accurate Line Location Predictor (LLP) to avoid the serialization of the LLT look-up and memory access. We evaluate a system that has 4GB stacked memory and 12GB off-chip memory. Using stacked DRAM as a cache improves performance by 50%, using it as part of main memory improves performance by 33%, whereas CAMEO improves performance by 78%. Our proposed design is very close to an idealized memory system that uses the 4GB stacked DRAM as a hardware-managed cache and also increases the main memory capacity by an additional 4GB.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper introduces a new mechanism, called Loose-Ordering Consistency (LOC), that satisfies the ordering requirements of persistent memory writes at significantly lower performance degradation than state-of-the-art mechanisms.
Abstract: Emerging non-volatile memory (NVM) technologies enable data persistence at the main memory level at access speeds close to DRAM. In such persistent memories, memory writes need to be performed in strict order to satisfy storage consistency requirements and enable correct recovery from system crashes. Unfortunately, adhering to a strict order for writes to persistent memory significantly degrades system performance, as it requires flushing dirty data blocks from CPU caches and waiting for their completion at the main memory in the order specified by the program. This paper introduces a new mechanism, called Loose-Ordering Consistency (LOC), that satisfies the ordering requirements of persistent memory writes at significantly lower performance degradation than state-of-the-art mechanisms. LOC consists of two key techniques. First, Eager Commit reduces the commit overhead for writes within a transaction by eliminating the need to perform a persistent commit record write at the end of a transaction. We do so by ensuring that we can determine the status of all committed transactions during recovery by storing necessary metadata information statically with blocks of data written to memory. Second, Speculative Persistence relaxes the ordering of writes between transactions by allowing writes to be speculatively written to persistent memory. A speculative write is made visible to software only after its associated transaction commits. To enable this, our mechanism requires the tracking of committed transaction IDs and support for multi-versioning in the CPU cache. Our evaluations show that LOC reduces the average performance overhead of strict write ordering from 66.9% to 34.9% on a variety of workloads.

Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache and show that caching only the most popular files can be highly suboptimal.
Abstract: We consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache. Knowing the popularity distribution of the files, the goal is to optimally populate the caches, such as to minimize the expected load of the shared link. For a single cache, it is well known that storing the most popular files is optimal in this setting. However, we show here that this is no longer the case for multiple caches. Indeed, caching only the most popular files can be highly suboptimal. Instead, a fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link. We propose such a coded caching scheme and prove that it is close to optimal.
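
The standard two-user, two-file textbook example of how cached side information enables coded delivery over the shared link can be worked through directly, as sketched below; the file contents and half-file split are illustrative, and this uniform-demand toy case is not the nonuniform-popularity scheme proposed in the paper.

```python
# Sketch: two users, two files. Each file is split in half; user 1 caches the
# first halves, user 2 the second halves. A single XOR'd transmission then
# serves both requests (contents below are made up for illustration).

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

A, B = b"AAAAAAAA", b"BBBBBBBB"
A1, A2 = A[:4], A[4:]
B1, B2 = B[:4], B[4:]

cache_user1 = {"A1": A1, "B1": B1}   # placement phase
cache_user2 = {"A2": A2, "B2": B2}

# Delivery phase: user 1 requests file A, user 2 requests file B.
broadcast = xor(A2, B1)              # one half-file coded transmission

assert cache_user1["A1"] + xor(broadcast, cache_user1["B1"]) == A   # user 1 decodes A
assert xor(broadcast, cache_user2["A2"]) + cache_user2["B2"] == B   # user 2 decodes B
print("both users decoded their requests from one coded transmission")
```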

Proceedings ArticleDOI
13 Dec 2014
TL;DR: A specialized cache management policy for GPGPUs is proposed that is coordinated with warp throttling to dynamically control the number of active warps, along with a simple predictor that dynamically estimates the optimal number of active warps that can take full advantage of the cache space and on-chip resources.
Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPUs cause cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.

Posted Content
TL;DR: In this article, the authors illustrate a vulnerability introduced to elliptic curve cryptographic protocols when implemented using a function of the OpenSSL cryptographic library, and demonstrate that the majority of the bits of a scalar k when kG is computed using the OpenSSL implementation of the Montgomery ladder can be recovered.
Abstract: We illustrate a vulnerability introduced to elliptic curve cryptographic protocols when implemented using a function of the OpenSSL cryptographic library. For the given implementation using an elliptic curve E over a binary field with a point G ∈ E, our attack recovers the majority of the bits of a scalar k when kG is computed using the OpenSSL implementation of the Montgomery ladder. For the Elliptic Curve Digital Signature Algorithm (ECDSA) the scalar k is intended to remain secret. Our attack recovers the scalar k and thus the secret key of the signer and would therefore allow unlimited forgeries. This is possible from snooping on only one signing process and requires computation of less than one second on a quad core desktop when the scalar k (and secret key) is around 571 bits.

Patent
14 Jul 2014
TL;DR: In this article, cache optimization techniques are employed to organize resources within caches such that the most requested content (e.g., the most popular content) is more readily available, and the resources propagate through a cache server hierarchy associated with the service provider.
Abstract: Resource management techniques, such as cache optimization, are employed to organize resources within caches such that the most requested content (e.g., the most popular content) is more readily available. A service provider utilizes content expiration data as indicative of resource popularity. As resources are requested, the resources propagate through a cache server hierarchy associated with the service provider. More frequently requested resources are maintained at edge cache servers based on shorter expiration data that is reset with each repeated request. Less frequently requested resources are maintained at higher levels of a cache server hierarchy based on longer expiration data associated with cache servers higher on the hierarchy.
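
A loose sketch of the expiration-driven idea follows: edge entries carry a short TTL that each repeated request resets, while a parent cache keeps entries longer, so popular resources naturally stay at the edge. The TTL values and two-level layout here are assumptions for illustration, not the specific claims of the patent.

```python
# Sketch: expiration-driven placement in a two-level cache hierarchy,
# with illustrative TTLs (short at the edge, long at the parent).
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.expiry: dict[str, float] = {}

    def get(self, key: str) -> bool:
        now = time.time()
        if self.expiry.get(key, 0) > now:
            self.expiry[key] = now + self.ttl      # repeated request resets the TTL
            return True
        return False

    def put(self, key: str) -> None:
        self.expiry[key] = time.time() + self.ttl

edge, parent = TTLCache(ttl_seconds=60), TTLCache(ttl_seconds=3600)

def request(key: str) -> str:
    if edge.get(key):
        return "edge hit"
    if parent.get(key):
        edge.put(key)                               # popular content propagates down
        return "parent hit"
    parent.put(key)                                 # fetched from origin, cached on the path
    edge.put(key)
    return "origin fetch"
```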

Proceedings ArticleDOI
24 Aug 2014
TL;DR: This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring, monitors the cache miss rates of running applications, and triggers re-partitioning of the cache to prevent miss rates from exceeding application-specific ranges.
Abstract: Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, requiring predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates from exceeding application-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS to color memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.
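
For readers unfamiliar with page coloring, the sketch below shows how a color is derived from the physical address bits shared between the page number and the cache set index, and how disjoint color sets give two applications conflict-free cache partitions. The cache geometry is a typical example chosen for illustration, not the configuration evaluated with COLORIS.

```python
# Sketch: deriving a page color from a physical address and statically
# partitioning colors between two applications (example cache geometry).
PAGE_SIZE = 4096
LINE_SIZE = 64
CACHE_SIZE = 2 * 1024 * 1024
WAYS = 16

SETS = CACHE_SIZE // (LINE_SIZE * WAYS)          # 2048 sets
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE           # 64 sets touched by one page
NUM_COLORS = SETS // SETS_PER_PAGE               # 32 colors

def page_color(phys_addr: int) -> int:
    # Color = the cache-set-index bits that lie above the page offset.
    return (phys_addr >> 12) & (NUM_COLORS - 1)  # 12 = log2(PAGE_SIZE)

# A static partition: app A gets colors 0-15, app B gets colors 16-31, so
# their pages can never map to the same cache sets in the shared cache.
colors_app_a = set(range(0, 16))
colors_app_b = set(range(16, 32))

print(NUM_COLORS, page_color(0x12345000) in colors_app_a)
```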