
Showing papers on "Cache" published in 2016


Proceedings Article
15 Feb 2016
TL;DR: Deep Compression as mentioned in this paper proposes a three-stage pipeline: pruning, quantization, and Huffman coding to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
Abstract: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman coding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.
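
The first two pipeline stages described above can be illustrated with a short sketch: magnitude pruning followed by k-means weight sharing into 2^5 = 32 centroids (the 5-bit setting mentioned in the abstract). This is a minimal NumPy illustration, not the authors' code; the sparsity level and iteration count are arbitrary, and the Huffman-coding stage (compressing the resulting index stream) is omitted.

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Zero out the smallest-magnitude weights (illustrative threshold pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_shared_weights(weights: np.ndarray, bits: int = 5, iters: int = 20):
    """Cluster the non-zero weights into 2**bits shared centroids (weight sharing).
    Returns the centroid table and an index matrix referencing it."""
    nz = weights[weights != 0]
    # Linear centroid initialization over the weight range.
    centroids = np.linspace(nz.min(), nz.max(), 2 ** bits)
    for _ in range(iters):  # a few Lloyd iterations
        assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for k in range(len(centroids)):
            if np.any(assign == k):
                centroids[k] = nz[assign == k].mean()
    # Map every weight position to its nearest centroid index.
    idx = np.argmin(np.abs(weights[..., None] - centroids), axis=-1)
    return centroids, idx

W = np.random.randn(256, 256).astype(np.float32)
W_pruned = prune_by_magnitude(W, sparsity=0.9)       # far fewer connections
codebook, codes = quantize_shared_weights(W_pruned)  # 5-bit indices instead of 32-bit floats
W_reconstructed = codebook[codes] * (W_pruned != 0)  # what inference would use
```

In the full method, the surviving connections and the centroids would then be fine-tuned by retraining before the index stream is Huffman coded.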

7,256 citations


Journal ArticleDOI
TL;DR: This paper presents a content-centric transmission design in a cloud radio access network by incorporating multicasting and caching, and reformulates an equivalent sparse multicast beamforming (SBF) problem, transformed into the difference of convex programs and effectively solved using the convex-concave procedure algorithms.
Abstract: This paper presents a content-centric transmission design in a cloud radio access network by incorporating multicasting and caching. Users requesting the same content form a multicast group and are served by the same cluster of base stations (BSs) cooperatively. Each BS has a local cache, and it acquires the requested contents either from its local cache or from the central processor via backhaul links. We investigate the dynamic content-centric BS clustering and multicast beamforming with respect to both channel condition and caching status. We first formulate a mixed-integer nonlinear programming problem of minimizing the weighted sum of backhaul cost and transmit power under the quality-of-service constraint for each multicast group. Theoretical analysis reveals that all the BSs caching a requested content can be included in the BS cluster of this content, regardless of the channel conditions. Then, we reformulate an equivalent sparse multicast beamforming (SBF) problem. By adopting smoothed $\ell_0$-norm approximation and other techniques, the SBF problem is transformed into a difference of convex programs and effectively solved using the convex-concave procedure algorithms. Simulation results demonstrate the significant advantage of the proposed content-centric transmission. The effects of heuristic caching strategies are also evaluated.

468 citations


Book ChapterDOI
07 Jul 2016
TL;DR: The Flush+Flush attack as mentioned in this paper relies only on the execution time of the flush instruction, which depends on whether data is cached or not; because it makes no memory accesses, it causes no cache misses and evades miss-based detection.
Abstract: Research on cache attacks has shown that CPU caches leak significant information. Proposed detection mechanisms assume that all cache attacks cause more cache hits and cache misses than benign applications and use hardware performance counters for detection. In this article, we show that this assumption does not hold by developing a novel attack technique: the Flush+Flush attack. The Flush+Flush attack only relies on the execution time of the flush instruction, which depends on whether data is cached or not. Flush+Flush does not make any memory accesses, contrary to any other cache attack. Thus, it causes no cache misses at all and the number of cache hits is reduced to a minimum due to the constant cache flushes. Therefore, Flush+Flush attacks are stealthy, i.e., the spy process cannot be detected based on cache hits and misses, nor by state-of-the-art detection mechanisms. The Flush+Flush attack runs at a higher frequency and thus is faster than any existing cache attack. With 496 KB/s in a cross-core covert channel it is 6.7 times faster than any previously published cache covert channel.
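
The attack itself requires native access to clflush and a cycle counter, so it cannot be expressed in a high-level sketch; the snippet below only models the receiver's decision rule in a Flush+Flush covert channel, classifying flush latencies against a calibrated threshold. The latency distributions are synthetic placeholders, not measurements from any CPU.

```python
import random

# Synthetic flush-latency model (cycles). On real hardware these values would be
# measured with a cycle counter around a clflush of a shared cache line; flushing
# a line that is currently cached takes longer than flushing an uncached one.
def flush_latency(line_is_cached: bool) -> float:
    return random.gauss(170, 6) if line_is_cached else random.gauss(140, 6)

def calibrate_threshold(samples: int = 10_000) -> float:
    cached = sorted(flush_latency(True) for _ in range(samples))
    uncached = sorted(flush_latency(False) for _ in range(samples))
    # Place the threshold between the tails of the two distributions.
    return (cached[samples // 100] + uncached[-samples // 100]) / 2

def receive_bits(latencies, threshold):
    # Sender transmits '1' by touching the shared line (cached -> slow flush),
    # '0' by leaving it uncached (fast flush). The receiver only ever flushes.
    return [1 if t > threshold else 0 for t in latencies]

thr = calibrate_threshold()
observed = [flush_latency(bit == 1) for bit in (1, 0, 1, 1, 0)]
print(receive_bits(observed, thr))   # -> [1, 0, 1, 1, 0] with high probability
```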

416 citations


Journal ArticleDOI
TL;DR: Preliminary efforts on developing and optimizing applications on the TaihuLight system are reported, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.
Abstract: The Sunway TaihuLight supercomputer is the world's first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratch pad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.

394 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: CATalyst, a pseudo-locking mechanism which uses CAT to partition the LLC into a hybrid hardware-software managed cache, is presented, and it is shown that LLC side channel attacks can be defeated.
Abstract: Cache side channel attacks are serious threats to multi-tenant public cloud platforms. Past work showed how secret information in one virtual machine (VM) can be extracted by another co-resident VM using such attacks. Recent research demonstrated the feasibility of high-bandwidth, low-noise side channel attacks on the last-level cache (LLC), which is shared by all the cores in the processor package, enabling attacks even when VMs are scheduled on different cores. This paper shows how such LLC side channel attacks can be defeated using a performance optimization feature recently introduced in commodity processors. Since most cloud servers use Intel processors, we show how the Intel Cache Allocation Technology (CAT) can be used to provide a system-level protection mechanism to defend from side channel attacks on the shared LLC. CAT is a way-based hardware cache-partitioning mechanism for enforcing quality-of-service with respect to LLC occupancy. However, it cannot be directly used to defeat cache side channel attacks due to the very limited number of partitions it provides. We present CATalyst, a pseudo-locking mechanism which uses CAT to partition the LLC into a hybrid hardware-software managed cache. We implement a proof-of-concept system using Xen and Linux running on a server with Intel processors, and show that LLC side channel attacks can be defeated. Furthermore, CATalyst only causes very small performance overhead when used for security, and has negligible impact on legacy applications.

360 citations


Book ChapterDOI
07 Jul 2016
TL;DR: This work shows that caches can be forced into fast cache eviction to trigger the Rowhammer bug with only regular memory accesses, and demonstrates a fully automated attack that requires nothing but a website with JavaScript to trigger faults on remote hardware.
Abstract: A fundamental assumption in software security is that a memory location can only be modified by processes that may write to this memory location. However, a recent study has shown that parasitic effects in DRAM can change the content of a memory cell without accessing it, but by accessing other memory locations in a high frequency. This so-called Rowhammer bug occurs in most of today's memory modules and has fatal consequences for the security of all affected systems, e.g., privilege escalation attacks. All studies and attacks related to Rowhammer so far rely on the availability of a cache flush instruction in order to cause accesses to DRAM modules at a sufficiently high frequency. We overcome this limitation by defeating complex cache replacement policies. We show that caches can be forced into fast cache eviction to trigger the Rowhammer bug with only regular memory accesses. This makes it possible to trigger the Rowhammer bug in highly restricted and even scripting environments. We demonstrate a fully automated attack that requires nothing but a website with JavaScript to trigger faults on remote hardware. Thereby, we can gain unrestricted access to the systems of website visitors. We show that the attack works on off-the-shelf systems. Existing countermeasures fail to protect against this new Rowhammer attack.

296 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed and analyzed cache-based content delivery in a three-tier heterogeneous network (HetNet), where base stations (BSs), relays, and device-to-device (D2D) pairs are included.
Abstract: Caching popular multimedia content is a promising way to unleash the ultimate potential of wireless networks. In this paper, we propose and analyze cache-based content delivery in a three-tier heterogeneous network (HetNet), where base stations (BSs), relays, and device-to-device (D2D) pairs are included. We advocate proactively caching popular content in the relays and parts of the users with caching ability when the network is off-peak. The cached content can be reused for frequent access to offload the cellular network traffic. The node locations are first modeled as mutually independent Poisson point processes (PPPs) and the corresponding content access protocol is developed. The average ergodic rate and outage probability in the downlink are then analyzed theoretically. We further derive the throughput and the delay based on the multiclass processor-sharing queue model and the continuous-time Markov process. According to the critical condition of the steady state in the HetNet, the maximum traffic load and the global throughput gain are investigated. Moreover, impacts of some key network characteristics, e.g., the heterogeneity of multimedia contents, node densities, and the limited caching capacities, on the system performance are elaborated on to provide valuable insight.
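
As context for the system model above, the sketch below samples homogeneous Poisson point processes for two of the node tiers inside a circular region; the densities and radius are arbitrary illustration values, not the paper's parameters.

```python
import numpy as np

def sample_ppp(density: float, radius: float, rng: np.random.Generator):
    """Draw node locations of a homogeneous PPP with the given density
    (nodes per unit area) inside a disk of the given radius."""
    area = np.pi * radius ** 2
    n = rng.poisson(density * area)            # node count is Poisson-distributed
    r = radius * np.sqrt(rng.uniform(size=n))  # sqrt gives uniform density over the disk
    theta = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

rng = np.random.default_rng(0)
bs = sample_ppp(density=1e-5, radius=1000.0, rng=rng)      # base stations
relays = sample_ppp(density=5e-5, radius=1000.0, rng=rng)  # cache-enabled relays
print(len(bs), len(relays))
```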

293 citations


Posted Content
TL;DR: This article proposes an extension to neural network language models to adapt their prediction to the recent history, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation.
Abstract: We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural networks and cache models used with count-based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.
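
The cache mechanism described above admits a compact sketch: store past hidden activations together with the words that followed them, score each stored slot by a dot product with the current hidden state, and interpolate the resulting distribution with the base model. The temperature `theta` and mixing weight `lam` below are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def cache_distribution(h_t, past_hiddens, past_words, vocab_size, theta=0.3):
    """Cache distribution: weight each stored (h_i, w_i) pair by
    exp(theta * h_t . h_i) and accumulate the mass on the word w_i."""
    scores = np.exp(theta * past_hiddens @ h_t)   # one score per memory slot
    p_cache = np.zeros(vocab_size)
    np.add.at(p_cache, past_words, scores)        # scatter-add onto word ids
    return p_cache / p_cache.sum()

def mix(p_model, p_cache, lam=0.2):
    """Interpolate the base LM distribution with the cache distribution."""
    return (1 - lam) * p_model + lam * p_cache

# Toy example: 4 memory slots, hidden size 8, vocabulary of 10 words.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))   # past hidden activations
w = np.array([3, 7, 3, 1])        # words that followed them
h_t = rng.standard_normal(8)      # current hidden activation
p_model = np.full(10, 0.1)        # uniform base LM for illustration
print(mix(p_model, cache_distribution(h_t, H, w, 10)))
```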

264 citations


Journal ArticleDOI
TL;DR: This work proposes an online coded caching scheme termed coded least-recently sent (LRS) and simulates it for a demand time series derived from the dataset made available by Netflix for the Netflix Prize, showing that the proposed coded LRS algorithm significantly outperforms the popular least-recently used caching algorithm.
Abstract: We consider a basic content distribution scenario consisting of a single origin server connected through a shared bottleneck link to a number of users each equipped with a cache of finite memory. The users issue a sequence of content requests from a set of popular files, and the goal is to operate the caches as well as the server such that these requests are satisfied with the minimum number of bits sent over the shared link. Assuming a basic Markov model for renewing the set of popular files, we characterize approximately the optimal long-term average rate of the shared link. We further prove that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request. To support these theoretical results, we propose an online coded caching scheme termed coded least-recently sent (LRS) and simulate it for a demand time series derived from the dataset made available by Netflix for the Netflix Prize. For this time series, we show that the proposed coded LRS algorithm significantly outperforms the popular least-recently used caching algorithm.

249 citations


Journal ArticleDOI
TL;DR: It is shown that the multicast-aware caching problem is NP-hard, and solutions with performance guarantees are developed using randomized-rounding techniques; trace-driven results show that, in the presence of massive demand for delay-tolerant content, combining caching and multicast can indeed reduce energy costs.
Abstract: The landscape toward 5G wireless communication is currently unclear, and, despite the efforts of academia and industry in evolving traditional cellular networks, the enabling technology for 5G is still obscure. This paper puts forward a network paradigm toward next-generation cellular networks, targeting to satisfy the explosive demand for mobile data while minimizing energy expenditures. The paradigm builds on two principles, namely caching and multicast. On one hand, caching policies disperse popular content files at the wireless edge, e.g., pico-cells and femto-cells, hence shortening the distance between content and requester. On the other hand, due to the broadcast nature of the wireless medium, requests for identical files occurring at nearby times are aggregated and served through a common multicast stream. To better exploit the available cache space, caching policies are optimized based on multicast transmissions. We show that the multicast-aware caching problem is NP-hard and develop solutions with performance guarantees using randomized-rounding techniques. Trace-driven numerical results show that in the presence of massive demand for delay-tolerant content, combining caching and multicast can indeed reduce energy costs. The gains over existing caching schemes are 19% when users tolerate a delay of three minutes, increasing further with the steepness of the content access pattern.

241 citations


Proceedings ArticleDOI
10 Aug 2016
TL;DR: This work demonstrates how to solve key challenges to perform the most powerful cross-core cache attacks Prime+Probe, Flush+Reload, Evict+Reload, and Flush+Flush on non-rooted ARM-based devices without any privileges.
Abstract: In the last 10 years, cache attacks on Intel x86 CPUs have gained increasing attention among the scientific community and powerful techniques to exploit cache side channels have been developed. However, modern smartphones use one or more multi-core ARM CPUs that have a different cache organization and instruction set than Intel x86 CPUs. So far, no cross-core cache attacks have been demonstrated on non-rooted Android smartphones. In this work, we demonstrate how to solve key challenges to perform the most powerful cross-core cache attacks Prime+Probe, Flush+Reload, Evict+Reload, and Flush+Flush on non-rooted ARM-based devices without any privileges. Based on our techniques, we demonstrate covert channels that outperform state-of-the-art covert channels on Android by several orders of magnitude. Moreover, we present attacks to monitor tap and swipe events as well as keystrokes, and even derive the lengths of words entered on the touchscreen. Eventually, we are the first to attack cryptographic primitives implemented in Java. Our attacks work across CPUs and can even monitor cache activity in the ARM TrustZone from the normal world. The techniques we present can be used to attack hundreds of millions of Android devices.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: The In-Memory PoInter Chasing Accelerator (IMPICA) leverages the logic layer within 3D-stacked memory for linked data structure traversal, addressing the key challenges of achieving high parallelism in the presence of serial accesses in pointer chasing and of effectively performing virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU's memory management unit.
Abstract: Pointer chasing is a fundamental operation, used by many important data-intensive applications (e.g., databases, key-value stores, graph processing workloads) to traverse linked data structures. This operation is both memory bound and latency sensitive, as it (1) exhibits irregular access patterns that cause frequent cache and TLB misses, and (2) requires the data from every memory access to be sent back to the CPU to determine the next pointer to access. Our goal is to accelerate pointer chasing by performing it inside main memory, thereby avoiding inefficient and high-latency data transfers between main memory and the CPU. To this end, we propose the In-Memory PoInter Chasing Accelerator (IMPICA), which leverages the logic layer within 3D-stacked memory for linked data structure traversal.

Book ChapterDOI
19 Sep 2016
TL;DR: This work presents CloudRadar, a system to detect, and hence mitigate, cache-based side-channel attacks in multi-tenant cloud systems, designed as a lightweight patch to existing cloud systems that requires no new hardware support and no hypervisor, operating system, or application modifications.
Abstract: We present CloudRadar, a system to detect, and hence mitigate, cache-based side-channel attacks in multi-tenant cloud systems. CloudRadar operates by correlating two events: first, it exploits signature-based detection to identify when the protected virtual machine (VM) executes a cryptographic application; at the same time, it uses anomaly-based detection techniques to monitor the co-located VMs to identify abnormal cache behaviors that are typical during cache-based side-channel attacks. We show that correlation in the occurrence of these two events offers strong evidence of side-channel attacks. Compared to other work on side-channel defenses, CloudRadar has the following advantages: first, CloudRadar focuses on the root causes of cache-based side-channel attacks and hence is hard to evade using metamorphic attack code, while maintaining a low false positive rate. Second, CloudRadar is designed as a lightweight patch to existing cloud systems, which does not require new hardware support or any hypervisor, operating system, or application modifications. Third, CloudRadar provides real-time protection and can detect side-channel attacks within the order of milliseconds. We demonstrate a prototype implementation of CloudRadar in the OpenStack cloud framework. Our evaluation suggests CloudRadar achieves negligible performance overhead with high detection accuracy.

Journal ArticleDOI
TL;DR: Tractable expressions for both effective capacity and energy efficiency performance are derived and show that the proposed cluster content caching structure can improve QoS guarantees with a lower cost of local storage.
Abstract: In cloud radio access networks (C-RANs), a substantial amount of data must be exchanged in both backhaul and fronthaul links, which causes high power consumption and poor quality of service (QoS) experience for real-time services. To solve this problem, a cluster content caching structure is proposed in this paper, which takes full advantage of distributed caching and centralized signal processing. In particular, redundant traffic on the backhaul can be reduced because the cluster content cache provides a part of the required content objects for remote radio heads (RRHs) connected to a common edge cloud. Tractable expressions for both effective capacity and energy efficiency performance are derived, which show that the proposed structure can improve QoS guarantees with a lower cost of local storage. Furthermore, to fully explore the potential of the proposed cluster content caching structure, the joint design of resource allocation and RRH association is optimized, and two distributed algorithms are accordingly proposed. Simulation results verify the accuracy of the analytical results and show the performance gains achieved by cluster content caching in C-RANs.

Proceedings Article
01 Jan 2016
TL;DR: In this article, the undocumented mapping of memory addresses to DRAM channels, ranks, and banks is reverse engineered and used to build DRAMA attacks, a new class of attacks that exploit the DRAM row buffer shared even across processors.
Abstract: In cloud computing environments, multiple tenants are often co-located on the same multi-processor system. Thus, preventing information leakage between tenants is crucial. While the hypervisor enforces software isolation, shared hardware, such as the CPU cache or memory bus, can leak sensitive information. For security reasons, shared memory between tenants is typically disabled. Furthermore, tenants often do not share a physical CPU. In this setting, cache attacks do not work and only a slow cross-CPU covert channel over the memory bus is known. In contrast, we demonstrate a high-speed covert channel as well as the first side-channel attack working across processors and without any shared memory. To build these attacks, we use the undocumented DRAM address mappings. We present two methods to reverse engineer the mapping of memory addresses to DRAM channels, ranks, and banks. One uses physical probing of the memory bus, the other runs entirely in software and is fully automated. Using this mapping, we introduce DRAMA attacks, a novel class of attacks that exploit the DRAM row buffer that is shared, even in multi-processor systems. Thus, our attacks work in the most restrictive environments. First, we build a covert channel with a capacity of up to 2Mbps, which is three to four orders of magnitude faster than memory-bus-based channels. Second, we build a side-channel template attack that can automatically locate and monitor memory accesses. Third, we show how using the DRAM mappings improves existing attacks and in particular enables practical Rowhammer attacks on DDR4.
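
The mapping functions recovered in this line of work are typically XORs of physical-address bits; the sketch below evaluates such functions for a hypothetical mapping. The bit masks shown are made up for illustration and do not correspond to any specific memory controller or DIMM.

```python
# Hypothetical DRAM addressing functions: each output bit (channel/rank/bank bit)
# is the XOR (parity) of a set of physical-address bits selected by a mask.
HYPOTHETICAL_FUNCS = {
    "channel": [0x0000_2000],                           # XOR of bit 13 (example only)
    "rank":    [0x0004_0000],                           # XOR of bit 18 (example only)
    "bank":    [0x0002_4000, 0x0004_8000, 0x0009_0000], # three example bank bits
}

def parity(x: int) -> int:
    return bin(x).count("1") & 1

def dram_coordinates(phys_addr: int, funcs=HYPOTHETICAL_FUNCS):
    """Map a physical address to (channel, rank, bank) values under the given functions."""
    coords = {}
    for name, masks in funcs.items():
        bits = [parity(phys_addr & m) for m in masks]
        coords[name] = sum(b << i for i, b in enumerate(bits))
    return coords

# Two addresses share a row buffer only if all their channel/rank/bank values
# agree -- the property that DRAMA covert and side channels exploit.
a, b = 0x3F8D2000, 0x3F8D6000
print(dram_coordinates(a), dram_coordinates(b))
```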

Journal ArticleDOI
01 Dec 2016
TL;DR: This paper analyzes three methods to detect cache-based side-channel attacks in real time, preventing or limiting the amount of leaked information, and how the detection systems behave with a modified version of one of the spy processes.
Abstract: Highlights: Three methods for detecting a class of cache-based side-channel attacks are proposed. A new tool (quickhpc) for probing hardware performance counters at a higher temporal resolution than the existing tools is presented. The first method is based on correlation; the other two use machine learning techniques and reach a minimum F-score of 0.93. A smarter attack is devised that is capable of circumventing the first method. In this paper we analyze three methods to detect cache-based side-channel attacks in real time, preventing or limiting the amount of leaked information. Two of the three methods are based on machine learning techniques, and all three of them can successfully detect an attack in about one fifth of the time required to complete it. We did not observe any false positives in our test environment, and the overhead caused by the detection systems is negligible. We also analyze how the detection systems behave with a modified version of one of the spy processes. With some optimization, we are confident these systems can be used in real-world scenarios.
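
A minimal sketch of the correlation idea behind the first method: sample per-process cache-miss counts in short time slices (here supplied as precomputed lists rather than read from hardware counters) and flag a process whose miss trace tracks the victim's activity trace too closely. The threshold and the toy traces are invented for illustration.

```python
import numpy as np

def correlation_alarm(victim_activity, suspect_misses, threshold=0.8):
    """Return (alarm, r): alarm is True if the suspect's cache-miss time series
    correlates strongly with the victim's activity, suggesting a spy process."""
    v = np.asarray(victim_activity, dtype=float)
    s = np.asarray(suspect_misses, dtype=float)
    r = np.corrcoef(v, s)[0, 1]          # Pearson correlation coefficient
    return r > threshold, r

# Toy traces: the spy probes the cache exactly when the victim encrypts.
victim = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
spy    = [5, 90, 85, 8, 92, 6, 4, 88, 95, 91, 7, 5]
benign = [40, 42, 39, 41, 38, 40, 43, 39, 41, 40, 42, 38]
print(correlation_alarm(victim, spy))     # (True, r close to 1)
print(correlation_alarm(victim, benign))  # (False, r well below the threshold)
```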

Proceedings ArticleDOI
TL;DR: In this article, the authors considered a cache-aided wireless network with a library of files and showed that the sum degrees-of-freedom (sum-DoF) of the network is within a factor of 2 of the optimum under one-shot linear schemes.
Abstract: We consider a system comprising a library of $N$ files (e.g., movies) and a wireless network with $K_T$ transmitters, each equipped with a local cache of size of $M_T$ files, and $K_R$ receivers, each equipped with a local cache of size of $M_R$ files. Each receiver will ask for one of the $N$ files in the library, which needs to be delivered. The objective is to design the cache placement (without prior knowledge of receivers' future requests) and the communication scheme to maximize the throughput of the delivery. In this setting, we show that the sum degrees-of-freedom (sum-DoF) of $\min\left\{\frac{K_T M_T+K_R M_R}{N},K_R\right\}$ is achievable, and this is within a factor of 2 of the optimum, under one-shot linear schemes. This result shows that (i) the one-shot sum-DoF scales linearly with the aggregate cache size in the network (i.e., the cumulative memory available at all nodes), (ii) the transmitters' and receivers' caches contribute equally in the one-shot sum-DoF, and (iii) caching can offer a throughput gain that scales linearly with the size of the network. To prove the result, we propose an achievable scheme that exploits the redundancy of the content at transmitters' caches to cooperatively zero-force some outgoing interference and availability of the unintended content at receivers' caches to cancel (subtract) some of the incoming interference. We develop a particular pattern for cache placement that maximizes the overall gains of cache-aided transmit and receive interference cancellations. For the converse, we present an integer optimization problem which minimizes the number of communication blocks needed to deliver any set of requested files to the receivers. We then provide a lower bound on the value of this optimization problem, hence leading to an upper bound on the linear one-shot sum-DoF of the network, which is within a factor of 2 of the achievable sum-DoF.
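
To make the achievable sum-DoF expression above concrete, here is a small worked instance with arbitrary parameter values:

$$\min\left\{\frac{K_T M_T + K_R M_R}{N},\, K_R\right\} = \min\left\{\frac{4\cdot 50 + 8\cdot 25}{100},\, 8\right\} = \min\{4,\, 8\} = 4$$

for $N = 100$, $K_T = 4$, $M_T = 50$, $K_R = 8$, $M_R = 25$. Doubling every cache size ($M_T = 100$, $M_R = 50$) lifts the value to $\min\{8, 8\} = 8$, illustrating point (i): the one-shot sum-DoF grows linearly with the aggregate cache size until the $K_R$ ceiling binds.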

Proceedings ArticleDOI
11 Sep 2016
TL;DR: In this article, when the cache contents and the user demands are fixed, the authors connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache) and the number of users is no more than the number of files.
Abstract: Caching is an effective way to reduce peak-hour network traffic congestion by storing some contents at the user's local cache. Maddah-Ali and Niesen (MAN) initiated a fundamental study of caching systems by proposing a scheme (with uncoded cache placement and linear network coding delivery) that is provably optimal to within a factor of 4.7. In this paper, when the cache contents and the user demands are fixed, we connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that (i) the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only be copied into the user's cache), and (ii) the number of users is no more than the number of files. As a consequence, further improvements to the MAN scheme are only possible through the use of coded cache placement.
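
For reference (this is the standard MAN result with uncoded placement, quoted here for context rather than taken from this paper): with $K$ users, $N \geq K$ files, per-user cache size $M$ files, and $t = KM/N$ an integer, the MAN scheme delivers any demand at rate

$$R_{\mathrm{MAN}}(M) = \frac{K\,(1 - M/N)}{1 + KM/N}$$

files over the shared link. For example, $K = 4$ users, $N = 4$ files, and $M = 2$ give $R = 4 \cdot \tfrac{1}{2} / (1 + 2) = 2/3$, compared with $K(1 - M/N) = 2$ for conventional uncoded delivery with the same caches.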

Journal ArticleDOI
TL;DR: A review of the caching problem in ICN, with a focus on on-path caching, is provided and a detailed analysis of the existing caching policies and forwarding mechanisms that complement these policies is given.
Abstract: Information-centric networking (ICN), an alternative to the host-centric model of the current Internet infrastructure, focuses on the distribution and retrieval of content instead of the transfer of information between specific endpoints. In order to achieve this, ICN is based on the paradigm of publish-subscribe and the concepts of naming and in-network caching. Current approaches to ICN employ caches within networks to minimize the latency of information retrieval. Content may be distributed either in caches along the delivery path(s), on-path caching, or in any cache within a network, off-path caching. While approaches to off-path caching are comparable to traditional approaches for content replication and Web caching, approaches to on-path caching are specific to the ICN area. The purpose of this paper is to provide a review of the caching problem in ICN, with a focus on on-path caching. To this end, a detailed analysis of the existing caching policies and forwarding mechanisms that complement these policies is given. A number of criteria, such as the caching model and level of operation and the evaluation parameters used in the evaluation of the existing caching policies, are employed to derive a taxonomy for on-path caching and highlight the trends and evaluation issues in this area. A discussion driven by the advantages and disadvantages of the existing caching policies and the challenges and open questions in on-path caching is finally held.
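
As a concrete instance of on-path caching, the sketch below implements the common Leave Copy Everywhere (LCE) baseline on top of small LRU content stores: an object fetched from upstream is cached at every router on the delivery path. LCE is chosen here purely for illustration; the survey covers many alternative placement policies.

```python
from collections import OrderedDict

class Router:
    """An ICN router with a small LRU content store."""
    def __init__(self, name: str, capacity: int = 4):
        self.name = name
        self.store = OrderedDict()          # content name -> object, LRU order
        self.capacity = capacity

    def lookup(self, content: str):
        if content in self.store:
            self.store.move_to_end(content) # refresh LRU position
            return self.store[content]
        return None

    def insert(self, content: str, obj):
        self.store[content] = obj
        self.store.move_to_end(content)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least-recently used

def fetch(path, producer, content):
    """Forward a request along `path` toward `producer`; on the way back,
    cache the object at every on-path router (Leave Copy Everywhere)."""
    for i, router in enumerate(path):
        obj = router.lookup(content)
        if obj is not None:
            hit_at = router.name
            break
    else:
        obj, hit_at, i = producer[content], "producer", len(path)
    for router in path[:i]:                 # routers between requester and the hit
        router.insert(content, obj)
    return obj, hit_at

path = [Router("edge"), Router("aggregation"), Router("core")]
producer = {"/videos/a": b"...", "/videos/b": b"..."}
print(fetch(path, producer, "/videos/a")[1])  # 'producer' on the first request
print(fetch(path, producer, "/videos/a")[1])  # 'edge' afterwards: an on-path hit
```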

Proceedings ArticleDOI
14 Mar 2016
TL;DR: This paper shows how to give applications the illusion of high-speed forwarding, large rule tables, and fast updates by combining the best of hardware and software processing.
Abstract: Software-Defined Networking (SDN) allows control applications to install fine-grained forwarding policies in the underlying switches. While Ternary Content Addressable Memory (TCAM) enables fast lookups in hardware switches with flexible wildcard rule patterns, the cost and power requirements limit the number of rules the switches can support. To make matters worse, these hardware switches cannot sustain a high rate of updates to the rule table. In this paper, we show how to give applications the illusion of high-speed forwarding, large rule tables, and fast updates by combining the best of hardware and software processing. Our CacheFlow system "caches" the most popular rules in the small TCAM, while relying on software to handle the small amount of "cache miss" traffic. However, we cannot blindly apply existing cache-replacement algorithms, because of dependencies between rules with overlapping patterns. Rather than cache large chains of dependent rules, we "splice" long dependency chains to cache smaller groups of rules while preserving the semantics of the policy. Experiments with our CacheFlow prototype---on both real and synthetic workloads and policies---demonstrate that rule splicing makes effective use of limited TCAM space, while adapting quickly to changes in the policy and the traffic demands.
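
The core difficulty named above, rule dependency, can be made concrete with a small sketch: a rule cannot be cached in the TCAM unless every higher-priority rule whose match pattern overlaps it is accounted for. CacheFlow's splicing replaces those rules with narrow cover rules; the version below only computes the naive dependent set, and the rule format is a simplified single-field wildcard match rather than real OpenFlow rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    priority: int      # higher number = matched first
    pattern: str       # e.g. "10**" -- '*' is a wildcard bit

def overlaps(a: str, b: str) -> bool:
    """Two wildcard patterns overlap if some packet header matches both."""
    return all(x == y or x == "*" or y == "*" for x, y in zip(a, b))

def dependents(rule: Rule, table: list[Rule]) -> set[Rule]:
    """Higher-priority rules that must be considered before caching `rule`;
    otherwise packets belonging to them would wrongly hit `rule` in the TCAM."""
    return {r for r in table
            if r.priority > rule.priority and overlaps(r.pattern, rule.pattern)}

table = [
    Rule(priority=3, pattern="101*"),
    Rule(priority=2, pattern="10**"),
    Rule(priority=1, pattern="1***"),   # popular low-priority catch-all rule
]
# Caching the popular rule naively would drag in its whole dependency chain:
print(dependents(table[2], table))      # -> both higher-priority overlapping rules
```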

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This work develops a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips, based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster.
Abstract: DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.
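
A sketch of the bookkeeping described above, kept deliberately abstract: the memory controller remembers recently-accessed row addresses with a timestamp, serves later requests to those rows with reduced timing parameters, and expires entries after a fixed caching duration. The capacity, duration, and timestamps below are placeholders, not the paper's parameters.

```python
from collections import OrderedDict

class HighlyChargedRowTable:
    """Tracks rows accessed within the last `duration_ns` nanoseconds."""
    def __init__(self, capacity: int = 128, duration_ns: int = 1_000_000):
        self.entries = OrderedDict()        # row address -> last access time
        self.capacity = capacity
        self.duration_ns = duration_ns

    def access(self, row: int, now_ns: int) -> str:
        # Drop entries whose charge may have leaked too much by now.
        for r, t in list(self.entries.items()):
            if now_ns - t > self.duration_ns:
                del self.entries[r]
        hit = row in self.entries
        self.entries[row] = now_ns
        self.entries.move_to_end(row)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        # A hit means the row was activated recently, so it is highly charged
        # and could be accessed with lowered timing parameters.
        return "use lowered timings" if hit else "use nominal timings"

table = HighlyChargedRowTable()
print(table.access(row=0x1A2B, now_ns=0))           # nominal
print(table.access(row=0x1A2B, now_ns=200_000))     # lowered: recently accessed
print(table.access(row=0x1A2B, now_ns=5_000_000))   # nominal again: entry expired
```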

Proceedings ArticleDOI
25 Mar 2016
TL;DR: A software-based defense, ANVIL, is developed, which thwarts all known rowhammer attacks on existing systems and is shown to be low-cost and robust, and experiments indicate that it is an effective approach for protecting existing and future systems from even advanced rowhammer attacks.
Abstract: Ensuring the integrity and security of the memory system is critical. Recent studies have shown serious security concerns due to "rowhammer" attacks, where repeated accesses to a row of memory cause bit flips in adjacent rows. Recent work by Google's Project Zero has shown how to leverage rowhammer-induced bit-flips as the basis for security exploits that include malicious code injection and memory privilege escalation. Being an important security concern, industry has attempted to defend against rowhammer attacks. Deployed defenses employ two strategies: (1) doubling the system DRAM refresh rate and (2) restricting access to the CLFLUSH instruction that attackers use to bypass the cache to increase memory access frequency (i.e., the rate of rowhammering). We demonstrate that such defenses are inadequate: we implement rowhammer attacks that both avoid using the CLFLUSH instruction and cause bit flips with a doubled refresh rate. Our next-generation CLFLUSH-free rowhammer attack bypasses the cache by manipulating cache replacement state to allow frequent misses out of the last-level cache to DRAM rows of our choosing. To protect existing systems from more advanced rowhammer attacks, we develop a software-based defense, ANVIL, which thwarts all known rowhammer attacks on existing systems. ANVIL detects rowhammer attacks by tracking the locality of DRAM accesses using existing hardware performance counters. Our detector identifies the rows being frequently accessed (i.e., the aggressors), then selectively refreshes the nearby victim rows to prevent hammering. Experiments running on real hardware with the SPEC2006 benchmarks show that ANVIL has less than a 1% false positive rate and an average slowdown of 1%. ANVIL is low-cost and robust, and our experiments indicate that it is an effective approach for protecting existing and future systems from even advanced rowhammer attacks.
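
A schematic of the detection loop described above, operating on a stream of sampled DRAM row accesses. In the real system the samples come from hardware performance counters; here they are just a list, and the interval size, threshold, and adjacency model are invented for illustration.

```python
from collections import Counter

def anvil_like_detector(sampled_rows, interval_size=1000, hammer_threshold=300):
    """Scan samples in fixed-size intervals; rows accessed suspiciously often
    are treated as aggressors and their neighboring rows are refreshed."""
    refreshed = []
    for start in range(0, len(sampled_rows), interval_size):
        counts = Counter(sampled_rows[start:start + interval_size])
        for row, hits in counts.items():
            if hits >= hammer_threshold:       # likely rowhammer aggressor
                # Simplified adjacency: real adjacency depends on the DRAM mapping.
                victims = (row - 1, row + 1)
                refreshed.extend(victims)      # issue selective refreshes
    return refreshed

# Toy trace: row 500 is hammered, interleaved with ordinary traffic.
trace = [r for i in range(400) for r in (500, 501 + i % 7, 500)]
print(sorted(set(anvil_like_detector(trace))))  # [499, 501]
```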

Journal ArticleDOI
TL;DR: In this article, the authors explore the energy efficiency (EE) potential of cache-enabled wireless access networks, identify the key factors that contribute most to the EE gain from caching, and derive a closed-form expression for the approximated EE.
Abstract: Caching popular contents at base stations (BSs) can reduce the backhaul cost and improve the network throughput. Yet whether locally caching at the BSs can improve the energy efficiency (EE), a major goal for fifth generation cellular networks, remains unclear. Due to the entangled impact of various factors on EE such as interference level, backhaul capacity, BS density, power consumption parameters, BS sleeping, content popularity, and cache capacity, another important question is what are the key factors that contribute more to the EE gain from caching. In this paper, we attempt to explore the potential of EE of the cache-enabled wireless access networks and identify the key factors. By deriving closed-form expression of the approximated EE, we provide the condition when the EE can benefit from caching, find the optimal cache capacity that maximizes the network EE, and analyze the maximal EE gain brought by caching. We show that caching at the BSs can improve the network EE when power efficient cache hardware is used. When local caching has EE gain over not caching, caching more contents at the BSs may not provide higher EE. Numerical and simulation results show that the caching EE gain is large when the backhaul capacity is stringent, interference level is low, content popularity is skewed, and when caching at pico BSs instead of macro BSs.

Book ChapterDOI
17 Aug 2016
TL;DR: This paper argues that shared resources such as the CPU, memory, and even the network adapter provide subtle side-channels to malicious parties, and that these side-channels indeed leak fine-grained, sensitive information and enable key recovery attacks on the cloud.
Abstract: Cloud services keep gaining popularity despite the security concerns. While non-sensitive data is easily trusted to the cloud, security-critical data and applications are not. The main concern with the cloud is the shared resources like the CPU, memory and even the network adapter that provide subtle side-channels to malicious parties. We argue that these side-channels indeed leak fine-grained, sensitive information and enable key recovery attacks on the cloud. Even further, as a quick scan in one of the Amazon EC2 regions shows, a high percentage – 55% – of users run outdated, leakage-prone libraries, leaving them vulnerable to mass surveillance.

Journal ArticleDOI
14 Jul 2016
TL;DR: Glimpse is a continuous, real-time object recognition system for camera-equipped mobile devices that captures full-motion video, locates objects of interest, recognizes and labels them, and tracks them from frame to frame for the user.
Abstract: Excerpted from "Glimpse: Continuous, Real-Time Object Recognition on Mobile Devices" from Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems with permission. http://dx.doi.org/10.1145/2809695.2809711 © ACM 2015. Glimpse is a continuous, real-time object recognition system for camera-equipped mobile devices. Glimpse captures full-motion video, locates objects of interest, recognizes and labels them, and tracks them from frame to frame for the user. Because the algorithms for object recognition entail significant computation, Glimpse runs them on server machines. When the latency between the server and mobile device is higher than a frame-time, this approach lowers object recognition accuracy. To regain accuracy, Glimpse uses an active cache of video frames on the mobile device. A subset of the frames in the active cache are used to track objects on the mobile, using (stale) hints about objects that arrive from the server from time to time. To reduce network bandwidth usage, Glimpse computes trigger frames to send to the server for recognition.

Proceedings ArticleDOI
10 Apr 2016
TL;DR: It is proved that the learning regret of PopCaching is sublinear in the number of content requests; therefore, it converges fast and asymptotically achieves the optimal cache hit rate.
Abstract: This paper presents a novel cache replacement method — Popularity-Driven Content Caching (PopCaching). PopCaching learns the popularity of content and uses it to determine which content it should store and which it should evict from the cache. Popularity is learned in an online fashion, requires no training phase and hence, it is more responsive to continuously changing trends of content popularity. We prove that the learning regret of PopCaching (i.e., the gap between the hit rate achieved by PopCaching and that by the optimal caching policy with hindsight) is sublinear in the number of content requests. Therefore, PopCaching converges fast and asymptotically achieves the optimal cache hit rate. We further demonstrate the effectiveness of PopCaching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. In addition, PopCaching has low complexity.
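
PopCaching's learning component is more elaborate than this, but the cache-decision skeleton it plugs into can be sketched as follows: each content gets an online popularity score (here a simple exponentially decayed request counter, standing in for the learned predictor), and on a miss the least-popular stored item is evicted only if the new item scores higher. The half-life and capacity are illustrative values.

```python
import math

class PopularityCache:
    def __init__(self, capacity: int, half_life_s: float = 3600.0):
        self.capacity = capacity
        self.decay = math.log(2) / half_life_s
        self.score = {}        # content -> (decayed popularity score, last update time)
        self.store = set()     # contents currently cached

    def _bump(self, content, now):
        s, t = self.score.get(content, (0.0, now))
        self.score[content] = (s * math.exp(-self.decay * (now - t)) + 1.0, now)

    def request(self, content, now) -> str:
        self._bump(content, now)
        if content in self.store:
            return "hit"
        if len(self.store) >= self.capacity:
            coldest = min(self.store, key=lambda c: self.score[c][0])
            if self.score[coldest][0] >= self.score[content][0]:
                return "miss (not cached)"   # newcomer not popular enough yet
            self.store.discard(coldest)      # evict the least popular item
        self.store.add(content)
        return "miss (cached)"

cache = PopularityCache(capacity=2)
for c in ["a", "b", "a", "a", "c", "a", "c", "c", "b"]:
    print(c, cache.request(c, now=0.0))
```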

Proceedings ArticleDOI
10 Apr 2016
TL;DR: This paper proposes an Age-Based Threshold (ABT) policy which caches all contents requested more times than a threshold N(τ), and shows that ABT is asymptotically hit-rate optimal in the many-contents regime, which allows the first characterization of the optimal performance of a caching system in a dynamic context.
Abstract: This paper addresses a fundamental limitation for the adoption of caching for wireless access networks due to small population sizes. This shortcoming is due to two main challenges: making timely estimates of varying content popularity and inferring popular content from small samples. We propose a framework which alleviates such limitations. To timely estimate varying popularity in the context of a single cache, we propose an Age-Based Threshold (ABT) policy which caches all contents requested more times than a threshold N(τ), where τ is the content age. We show that ABT is asymptotically hit-rate optimal in the many-contents regime, which allows us to obtain the first characterization of the optimal performance of a caching system in a dynamic context. We then address small sample sizes, focusing on L local caches and one global cache. On the one hand, we show that the global cache learns L times faster by aggregating all requests from local caches, which improves hit rates. On the other hand, aggregation washes out local characteristics of correlated traffic, which penalizes hit rate. This motivates coordination mechanisms which combine global learning of popularity scores in clusters and a Least-Recently-Used (LRU) policy with prefetching.
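
The single-cache policy described above can be written down almost directly: track each content's age and request count, and admit it once its count exceeds an age-dependent threshold N(τ). The threshold function and capacity handling below are placeholders; the paper derives the asymptotically optimal threshold.

```python
class AgeBasedThreshold:
    def __init__(self, capacity: int, threshold=lambda age: 2 + age // 100):
        self.capacity = capacity
        self.threshold = threshold   # N(tau): requests needed at content age tau
        self.first_seen = {}         # content -> time of its first request
        self.count = {}              # content -> number of requests so far
        self.cache = set()

    def request(self, content, now) -> bool:
        self.first_seen.setdefault(content, now)
        self.count[content] = self.count.get(content, 0) + 1
        if content in self.cache:
            return True              # hit
        age = now - self.first_seen[content]
        if self.count[content] > self.threshold(age) and len(self.cache) < self.capacity:
            self.cache.add(content)  # admit: requested more than N(age) times
        return False                 # miss

abt = AgeBasedThreshold(capacity=100)
hits = sum(abt.request(c, now=t) for t, c in enumerate(["x", "y", "x", "x", "x", "z", "x"]))
print(hits)  # the popular content "x" is admitted once it crosses the threshold
```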

Proceedings ArticleDOI
10 Apr 2016
TL;DR: This paper proposes utility-driven caching, which associates with each content a utility that is a function of the corresponding content hit probability, and develops online algorithms that can be used by service providers to implement various caching policies based on arbitrary utility functions.
Abstract: In any caching system, the admission and eviction policies determine which contents are added and removed from a cache when a miss occurs. Usually, these policies are devised so as to mitigate staleness and increase the hit probability. Nonetheless, the utility of having a high hit probability can vary across contents. This occurs, for instance, when service level agreements must be met, or if certain contents are more difficult to obtain than others. In this paper, we propose utility-driven caching, where we associate with each content a utility, which is a function of the corresponding content hit probability. We formulate optimization problems where the objectives are to maximize the sum of utilities over all contents. These problems differ according to the stringency of the cache capacity constraint. Our framework enables us to reverse engineer classical replacement policies such as LRU and FIFO, by computing the utility functions that they maximize. We also develop online algorithms that can be used by service providers to implement various caching policies based on arbitrary utility functions.
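
Schematically, and paraphrasing rather than reproducing the paper's notation, the optimization problems mentioned above take the form

$$\max_{h_1,\dots,h_N} \sum_{i=1}^{N} U_i(h_i) \quad \text{s.t.} \quad \sum_{i=1}^{N} h_i \le B, \qquad 0 \le h_i \le 1,$$

where $h_i$ is the hit probability of content $i$, $U_i(\cdot)$ its utility function, and $B$ a cache-capacity budget (the stringency of this constraint is what distinguishes the problem variants). Reverse engineering a policy such as LRU or FIFO then amounts to finding the utility functions under which that policy's hit probabilities solve this program.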

Journal ArticleDOI
18 Jun 2016
TL;DR: This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions, and shows that the implementation is surprisingly efficient.
Abstract: Belady's algorithm is optimal but infeasible because it requires knowledge of the future. This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions. We show that the implementation is surprisingly efficient, as we introduce a new method of efficiently simulating Belady's behavior, and we use known sampling techniques to compactly represent the long history information that is needed for high accuracy. For a 2MB LLC, our solution uses a 16KB hardware budget (excluding replacement state in the tag array). When applied to a memory-intensive subset of the SPEC 2006 CPU benchmarks, our solution improves performance over LRU by 8.4%, as opposed to 6.2% for the previous state-of-the-art. For a 4-core system with a shared 8MB LLC, our solution improves performance by 15.0%, compared to 12.0% for the previous state-of-the-art.
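
For contrast with the hardware mechanism described above, Belady's MIN policy itself is easy to simulate offline when the whole trace is known: on a miss, evict the line whose next reference lies farthest in the future. The sketch below is the textbook algorithm (the oracle the predictor learns to imitate), not the paper's sampled, hardware-friendly reconstruction.

```python
from collections import defaultdict
from bisect import bisect_right

def belady_hits(trace, capacity):
    """Simulate Belady's optimal (MIN) replacement on a fully known access trace."""
    next_uses = defaultdict(list)           # address -> sorted positions in trace
    for pos, addr in enumerate(trace):
        next_uses[addr].append(pos)

    def next_use(addr, pos):
        uses = next_uses[addr]
        i = bisect_right(uses, pos)
        return uses[i] if i < len(uses) else float("inf")

    cache, hits = set(), 0
    for pos, addr in enumerate(trace):
        if addr in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            # Evict the cached line that is re-referenced farthest in the future.
            victim = max(cache, key=lambda a: next_use(a, pos))
            cache.remove(victim)
        cache.add(addr)
    return hits

trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
print(belady_hits(trace, capacity=2))  # 3 hits; MIN is optimal, so no policy does better here
```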

Book ChapterDOI
TL;DR: The Flush+Flush attack has a performance close to state-of-the-art side channels in existing cache attack scenarios while reducing cache misses significantly below the border of detectability; this is the first work discussing the stealthiness of cache attacks from both the attacker and the defender perspective.
Abstract: Research on cache attacks has shown that CPU caches leak significant information. Recent attacks either use the Flush+Reload technique on read-only shared memory or the Prime+Probe technique without shared memory, to derive encryption keys or eavesdrop on user input. Efficient countermeasures against these powerful attacks that do not cause a loss of performance are a challenge. In this paper, we use hardware performance counters as a means to detect access-based cache attacks. Indeed, existing attacks cause numerous cache references and cache misses and can subsequently be detected. We propose a new criterion that uses these events for ad-hoc detection. These findings motivate the development of a novel attack technique: the Flush+Flush attack. The Flush+Flush attack only relies on the execution time of the flush instruction, which depends on whether the data is cached or not. Like Flush+Reload, it monitors when a process loads read-only shared memory into the CPU cache. However, Flush+Flush does not have a reload step, thus causing no cache misses compared to typical Flush+Reload and Prime+Probe attacks. We show that the significantly lower impact on the hardware performance counters therefore evades detection mechanisms. The Flush+Flush attack has a performance close to state-of-the-art side channels in existing cache attack scenarios, while reducing cache misses significantly below the border of detectability. Our Flush+Flush covert channel achieves a transmission rate of 496 KB/s, which is 6.7 times faster than any previously published cache covert channel. To the best of our knowledge, this is the first work discussing the stealthiness of cache attacks both from the attacker and the defender perspective.