
Showing papers on "Smart Cache published in 2013"


Journal ArticleDOI
TL;DR: This paper presents a comprehensive survey of state-of-the-art techniques aiming to address caching issues, with particular focus on reducing cache redundancy and improving the availability of cached content.

343 citations


Journal ArticleDOI
TL;DR: This work studies the problem of en-route caching and investigates whether caching in only a subset of nodes along the delivery path can achieve better performance in terms of cache and server hit rates, and proposes a centrality-based caching algorithm that consistently achieves better gain across both synthetic and real network topologies with different structural properties.

235 citations


Proceedings ArticleDOI
03 Nov 2013
TL;DR: This paper instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses.
Abstract: This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers. We instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos. This permits us to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses. Our results (1) quantify the overall traffic percentages served by different layers: 65.5% browser cache, 20.0% Edge Cache, 4.6% Origin Cache, and 9.9% Backend storage, (2) reveal that a significant portion of photo requests are routed to remote PoPs and data centers as a consequence both of load-balancing and peering policy, (3) demonstrate the potential performance benefits of coordinating Edge Caches and adopting S4LRU eviction algorithms at both Edge and Origin layers, and (4) show that the popularity of photos is highly dependent on content age and conditionally dependent on the social-networking metrics we considered.
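The S4LRU eviction policy mentioned in point (3) is a segmented LRU with four levels. As a rough illustration, here is a minimal Python sketch of quadruply-segmented LRU under the common description of the policy (misses enter the lowest segment, hits promote one level, overflow demotes); the equal per-segment budget and the OrderedDict bookkeeping are simplifying assumptions, not Facebook's implementation.

```python
from collections import OrderedDict

class S4LRU:
    """Minimal sketch of quadruply-segmented LRU (S4LRU): four LRU segments;
    misses enter the lowest segment, hits promote a key one segment up, and
    overflow demotes tail items one segment down (out of the cache at level 0)."""

    def __init__(self, capacity, levels=4):
        self.seg_cap = max(1, capacity // levels)           # equal per-segment budget (assumption)
        self.segs = [OrderedDict() for _ in range(levels)]  # segs[0] is the lowest level

    def _find(self, key):
        for lvl, seg in enumerate(self.segs):
            if key in seg:
                return lvl
        return None

    def _insert(self, lvl, key, value):
        seg = self.segs[lvl]
        seg[key] = value
        seg.move_to_end(key)                                 # MRU position of this segment
        while len(seg) > self.seg_cap:
            old_key, old_val = seg.popitem(last=False)       # evict this segment's LRU item
            if lvl > 0:
                self._insert(lvl - 1, old_key, old_val)      # demote to the next lower segment
            # items pushed out of segment 0 leave the cache entirely

    def get(self, key):
        lvl = self._find(key)
        if lvl is None:
            return None
        value = self.segs[lvl].pop(key)
        self._insert(min(lvl + 1, len(self.segs) - 1), key, value)  # promote on hit
        return value

    def put(self, key, value):
        lvl = self._find(key)
        if lvl is not None:
            self.segs[lvl].pop(key)
            self._insert(min(lvl + 1, len(self.segs) - 1), key, value)
        else:
            self._insert(0, key, value)                      # misses start in the lowest segment
```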

225 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.
Abstract: Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.

207 citations


Proceedings ArticleDOI
09 Jun 2013
TL;DR: By caching only popular content, MPC is able to cache less content while, at the same time, achieving a higher cache hit ratio and outperforming the existing default caching strategy in CCN.
Abstract: Content Centric Networking (CCN) has recently emerged as a promising architecture to deliver content at large scale. It is based on named data, where a packet address names the content and not its location. The premise is then to cache content on the network nodes along the delivery path. An important feature of CCN is therefore to manage the cache of the nodes. In this paper, we present Most Popular Content (MPC), a new caching strategy adapted to CCN networks. By caching only popular content, we show through extensive simulation experiments that MPC is able to cache less content while, at the same time, achieving a higher cache hit ratio and outperforming the existing default caching strategy in CCN.
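To make the MPC idea concrete, the sketch below shows a popularity-threshold admission decision at a single CCN node: content is admitted to the cache only once its local request count crosses a threshold. The threshold value, the counter handling, and the LRU replacement are illustrative assumptions rather than the paper's exact mechanism, which also involves additional coordination between nodes that is omitted here.

```python
from collections import defaultdict, OrderedDict

class MPCNode:
    """Sketch of a popularity-threshold caching decision in the spirit of MPC."""

    def __init__(self, capacity, threshold=3):
        self.capacity = capacity
        self.threshold = threshold            # illustrative popularity threshold
        self.popularity = defaultdict(int)    # request counts per content name
        self.store = OrderedDict()            # cached content, LRU-ordered

    def on_interest(self, name):
        """Called for every incoming request (Interest) for `name`."""
        self.popularity[name] += 1
        if name in self.store:
            self.store.move_to_end(name)
            return self.store[name]           # cache hit
        return None                           # miss: forward upstream

    def on_data(self, name, data):
        """Called when the Data packet for `name` flows back through the node."""
        if self.popularity[name] >= self.threshold:   # admit only popular content
            self.store[name] = data
            self.store.move_to_end(name)
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)        # evict LRU entry
```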

197 citations


Proceedings ArticleDOI
12 Feb 2013
TL;DR: A novel buffer cache architecture is presented that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM and shows that this scheme improves I/O performance by 76% on average and up to 240% compared to the existing Linux buffer cache with ext4 without any loss of reliability.
Abstract: Journaling techniques are widely used in modern file systems as they provide high reliability and fast recovery from system failures. However, journaling reduces the performance benefit of buffer caching, as it accounts for the bulk of the storage writes in real system environments. In this paper, we present a novel buffer cache architecture that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM. Specifically, our buffer cache supports what we call the in-place commit scheme. This scheme avoids logging, but still provides the same journaling effect by simply altering the state of the cached block to frozen. As a frozen block still performs the function of caching, we show that in-place commit does not degrade cache performance. We implement our scheme on Linux 2.6.38 and measure the throughput and execution time of the scheme with various file I/O benchmarks. The results show that our scheme improves I/O performance by 76% on average and up to 240% compared to the existing Linux buffer cache with ext4 without any loss of reliability.
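The in-place commit idea can be sketched as a small state machine over cached blocks: committing flips dirty blocks to frozen instead of writing a log, and frozen versions are only written back at checkpoint time. The sketch below is a minimal model under that description; the state names, the separate committed map, and the storage interface are assumptions for illustration, not the paper's Linux implementation.

```python
# Minimal sketch of an "in-place commit" NVRAM buffer cache: commit freezes
# cached blocks instead of journaling them, and a write to a frozen block
# updates a fresh working copy so the committed version survives until it is
# checkpointed to the file system.

NORMAL, DIRTY, FROZEN = "normal", "dirty", "frozen"

class NVBufferCache:
    def __init__(self):
        self.state = {}        # block_no -> NORMAL | DIRTY | FROZEN
        self.working = {}      # block_no -> latest data (serves reads and writes)
        self.committed = {}    # block_no -> frozen version awaiting checkpoint

    def write(self, block_no, data):
        # If the block is frozen, its committed version already lives in
        # self.committed, so updating the working copy never disturbs it.
        self.working[block_no] = data
        self.state[block_no] = DIRTY

    def commit(self):
        # In-place commit: no journal writes; dirty blocks simply become frozen.
        for block_no, st in self.state.items():
            if st == DIRTY:
                self.committed[block_no] = self.working[block_no]
                self.state[block_no] = FROZEN

    def checkpoint(self, storage):
        # Write committed versions to their home locations, then thaw the blocks
        # (blocks re-dirtied since the commit simply stay dirty).
        for block_no, data in self.committed.items():
            storage.write_block(block_no, data)   # hypothetical storage interface
            if self.state.get(block_no) == FROZEN:
                self.state[block_no] = NORMAL
        self.committed.clear()
```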

171 citations


Proceedings ArticleDOI
12 Aug 2013
TL;DR: This paper designs five different hash-routing schemes which efficiently exploit in-network caches without requiring network routers to maintain per-content state information and shows that such schemes can increase cache hits by up to 31% in comparison to on-path caching, with minimal impact on the traffic dynamics of intra-domain links.
Abstract: Hash-routing has been proposed in the past as a mapping mechanism between object requests and cache clusters within enterprise networks. In this paper, we revisit hash-routing techniques and apply them to Information-Centric Networking (ICN) environments, where network routers have cache space readily available. In particular, we investigate whether hash-routing is a viable and efficient caching approach when applied outside enterprise networks, but within the boundaries of a domain. We design five different hash-routing schemes which efficiently exploit in-network caches without requiring network routers to maintain per-content state information. We evaluate the proposed hash-routing schemes using extensive simulations over real Internet domain topologies and compare them against various on-path caching mechanisms. We show that such schemes can increase cache hits by up to 31% in comparison to on-path caching, with minimal impact on the traffic dynamics of intra-domain links.
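The core of hash-routing is that every router in the domain derives the same responsible cache node from the content name alone, so no per-content state is needed. Below is a minimal sketch of a symmetric variant; the SHA-1 hash, modulo mapping over a fixed router list, and dictionary caches are illustrative assumptions, and the paper's five schemes differ in how request and response paths are deflected.

```python
import hashlib

def responsible_cache(content_name, routers):
    """Map a content name to exactly one in-domain cache node, statelessly."""
    digest = hashlib.sha1(content_name.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(routers)
    return routers[index]

def handle_request(content_name, routers, caches, origin_fetch):
    """Deflect the request to the responsible node; cache the response there."""
    node = responsible_cache(content_name, routers)
    cache = caches[node]
    if content_name in cache:              # off-path cache hit inside the domain
        return cache[content_name]
    data = origin_fetch(content_name)      # miss: fetch toward the origin server
    cache[content_name] = data             # response is routed back via the same node
    return data
```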

142 citations


Journal ArticleDOI
TL;DR: This paper focuses on cache pollution attacks, where the adversary's goal is to disrupt cache locality to increase link utilization and cache misses for honest consumers, and illustrates that existing proactive countermeasures are ineffective against realistic adversaries.

139 citations


Proceedings ArticleDOI
01 May 2013
TL;DR: This work focuses on the cache allocation problem, namely how to distribute the cache capacity across routers under a constrained total storage budget for the network, formulates it as a content placement problem, and obtains the exact optimal solution by a two-step method.
Abstract: Content-Centric Networking (CCN) is a promising framework for evolving the current network architecture, advocating ubiquitous in-network caching to enhance content delivery. Consequently, in CCN, each router has storage space to cache frequently requested content. In this work, we focus on the cache allocation problem: namely, how to distribute the cache capacity across routers under a constrained total storage budget for the network. We formulate this problem as a content placement problem and obtain the exact optimal solution by a two-step method. Through simulations, we use this algorithm to investigate the factors that affect the optimal cache allocation in CCN, such as the network topology and the popularity of content. We find that a highly heterogeneous topology tends to put most of the capacity over a few central nodes. On the other hand, heterogeneous content popularity has the opposite effect, by spreading capacity across far more nodes. Using our findings, we make observations on how network operators could best deploy CCN cache capacity.

133 citations


Journal ArticleDOI
TL;DR: In this article, the authors considered the secure caching problem with the additional goal of minimizing information leakage to an external wiretapper and showed that security can be introduced at a negligible cost, particularly for a large number of files and users.
Abstract: Caching is emerging as a vital tool for alleviating the severe capacity crunch in modern content-centric wireless networks. The main idea behind caching is to store parts of popular content in end-users' memory and leverage the locally stored content to reduce peak data rates. By jointly designing content placement and delivery mechanisms, recent works have shown order-wise reduction in transmission rates in contrast to traditional methods. In this work, we consider the secure caching problem with the additional goal of minimizing information leakage to an external wiretapper. The fundamental cache memory vs. transmission rate trade-off for the secure caching problem is characterized. Rather surprisingly, these results show that security can be introduced at a negligible cost, particularly for a large number of files and users. It is also shown that the rate achieved by the proposed caching scheme with secure delivery is within a constant multiplicative factor from the information-theoretic optimal rate for almost all parameter values of practical interest.

125 citations


Proceedings ArticleDOI
09 Jul 2013
TL;DR: A practical OS-level cache management scheme for multi-core real-time systems that provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set is proposed.
Abstract: Many modern multi-core processors sport a large shared cache with the primary goal of enhancing the statistical performance of computing workloads. However, due to resulting cache interference among tasks, the uncontrolled use of such a shared cache can significantly hamper the predictability and analyzability of multi-core real-time systems. Software cache partitioning has been considered as an attractive approach to address this issue because it does not require any hardware support beyond that available on many modern processors. However, the state-of-the-art software cache partitioning techniques face two challenges: (1) the memory co-partitioning problem, which results in page swapping or waste of memory, and (2) the availability of a limited number of cache partitions, which causes degraded performance. These are major impediments to the practical adoption of software cache partitioning. In this paper, we propose a practical OS-level cache management scheme for multi-core real-time systems. Our scheme provides predictable cache performance, addresses the aforementioned problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given task set. We have implemented and evaluated our scheme in Linux/RK running on the Intel Core i7 quad-core processor. Experimental results indicate that, compared to the traditional approaches, our scheme is up to 39% more memory-space efficient and uses up to 25% fewer cache partitions while maintaining cache predictability. Our scheme also yields a significant utilization benefit that increases with the number of tasks.
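Software cache partitioning of the kind discussed here is typically built on page coloring: the OS controls which physical pages a task gets, and a page's color is determined by the physical-address bits that select the cache set above the page offset. The sketch below illustrates that mapping; the cache geometry and the allocation helper are illustrative assumptions, not the paper's Linux/RK implementation.

```python
# Sketch of OS-level page coloring for a shared last-level cache. Pages of
# different colors index disjoint cache sets and therefore cannot conflict.
# Example geometry (assumed): an 8 MB, 16-way LLC with 4 KB pages -> 128 colors.

CACHE_SIZE = 8 * 1024 * 1024
ASSOCIATIVITY = 16
PAGE_SIZE = 4 * 1024

NUM_COLORS = CACHE_SIZE // (ASSOCIATIVITY * PAGE_SIZE)   # 128 cache colors

def cache_color(phys_addr):
    """Color = low-order bits of the physical page frame number."""
    page_frame = phys_addr // PAGE_SIZE
    return page_frame % NUM_COLORS

def allocate_page(free_pages_by_color, task_colors):
    """Give a task a free page only from its assigned colors, so tasks with
    disjoint color sets cannot evict each other's cache lines."""
    for color in task_colors:
        if free_pages_by_color[color]:
            return free_pages_by_color[color].pop()
    raise MemoryError("no free page in the task's cache partition")
```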

Proceedings ArticleDOI
12 Aug 2013
TL;DR: It is shown via trace-driven simulation, that intra-AS cache cooperation improves the system caching performance and reduces considerably the traffic load on the AS gateway links, which is very appealing from an ISP's perspective.
Abstract: The default caching scheme in CCN results in a high redundancy along the symmetric request-response path, and makes the caching system inefficient. Since it was first proposed, much work has been done to improve the general caching performance of CCN. Most new caching schemes attempt to reduce the on-path redundancy by passing information on content redundancy and popularity between nodes. In this paper, we tackle the problem from a different perspective. Instead of curbing the redundancy through special caching decisions in the beginning, we take an orthogonal approach by pro-actively eliminating redundancy via an independent intra-AS procedure. We propose an intra-AS cache cooperation scheme, to effectively control the redundancy level within the AS and allow neighbour nodes in an AS to collaborate in serving each other's requests. We show via trace-driven simulation that intra-AS cache cooperation improves the system caching performance and reduces considerably the traffic load on the AS gateway links, which is very appealing from an ISP's perspective.

Proceedings ArticleDOI
06 May 2013
TL;DR: A novel cache management algorithm for flash-based disk cache, named Lazy Adaptive Replacement Cache (LARC), which filters out seldom-accessed blocks and prevents them from entering the cache, improving performance and extending SSD lifetime at the same time.
Abstract: The increasing popularity of flash memory has changed storage systems. Flash-based solid state drives (SSD) are now widely deployed as caches for magnetic hard disk drives (HDD) to speed up data intensive applications. However, existing cache algorithms focus exclusively on performance improvements and ignore the write endurance of SSD. In this paper, we propose a novel cache management algorithm for flash-based disk cache, named Lazy Adaptive Replacement Cache (LARC). LARC can filter out seldom accessed blocks and prevent them from entering the cache. This avoids cache pollution and keeps popular blocks in cache for a longer period of time, leading to a higher hit rate. Meanwhile, LARC reduces the number of cache replacements and thus incurs less write traffic to SSD, especially for read-dominant workloads. In this way, LARC improves performance and extends SSD lifetime at the same time. LARC is self-tuning and has low overhead. It has been extensively evaluated by both trace-driven simulations and a prototype implementation in flashcache. Our experiments show that LARC outperforms state-of-the-art algorithms and reduces write traffic to SSD by up to 94.5% for read-dominant workloads and by 11.2-40.8% for write-dominant workloads.
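The key mechanism in LARC is lazy admission through a ghost queue of block IDs: a block is written into the SSD cache only when it is accessed again while its ID is still in the ghost queue, so one-shot blocks never cause SSD writes. The sketch below captures that admission path; the fixed ghost-queue size and plain LRU eviction are simplifications (the paper tunes the ghost size adaptively).

```python
from collections import OrderedDict

class LARC:
    """Sketch of Lazy Adaptive Replacement Cache admission: blocks enter the
    SSD cache only on a second recent access; first accesses only record the
    block ID in a ghost LRU queue."""

    def __init__(self, capacity, ghost_capacity=None):
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity or capacity // 2   # fixed size (assumption)
        self.cache = OrderedDict()   # block_id -> data (resident on SSD)
        self.ghost = OrderedDict()   # block_id -> None (IDs only, no data)

    def access(self, block_id, fetch_from_hdd):
        if block_id in self.cache:                     # SSD hit
            self.cache.move_to_end(block_id)
            return self.cache[block_id]

        data = fetch_from_hdd(block_id)                # miss is served from the HDD
        if block_id in self.ghost:                     # second access: admit to SSD
            del self.ghost[block_id]
            self.cache[block_id] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)         # LRU eviction from the SSD cache
        else:                                          # first access: remember the ID only
            self.ghost[block_id] = None
            self.ghost.move_to_end(block_id)
            if len(self.ghost) > self.ghost_capacity:
                self.ghost.popitem(last=False)
        return data
```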

Patent
05 Dec 2013
TL;DR: In this article, a cache and/or storage module may be configured to reduce write amplification in a cache storage, which may occur due to an over-permissive admission policy, or it may arise due to the write-once properties of the storage medium.
Abstract: A cache and/or storage module may be configured to reduce write amplification in a cache storage. Cache layer write amplification (CLWA) may occur due to an over-permissive admission policy. The cache module may be configured to reduce CLWA by configuring admission policies to avoid unnecessary writes. Admission policies may be predicated on access and/or sequentiality metrics. Flash layer write amplification (FLWA) may arise due to the write-once properties of the storage medium. FLWA may be reduced by delegating cache eviction functionality to the underlying storage layer. The cache and storage layers may be configured to communicate coordination information, which may be leveraged to improve the performance of cache and/or storage operations.

Proceedings ArticleDOI
17 Jun 2013
TL;DR: An intuitive performance model for cache-coherent architectures is developed and used to develop several optimal and optimized algorithms for complex parallel data exchanges that beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries.
Abstract: Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, complicating the understanding of performance properties. We developed an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms that were developed with the model beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to satisfy the tradeoff between complexity of algorithm design and accuracy. We expect that our model can serve as a vehicle for advanced algorithm design.

Proceedings ArticleDOI
18 Nov 2013
TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
Abstract: Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in recent generations of GPUs. The caches on GPUs are highly configurable: the programmer or the compiler can explicitly control cache access or bypass for global load instructions. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate cache reuse and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that, compared to cache-all and bypass-all solutions, our techniques achieve considerable performance improvement.

Proceedings ArticleDOI
Tian Luo1, Siyuan Ma1, Rubao Lee1, Xiaodong Zhang1, Deng Liu2, Li Zhou3 
07 Oct 2013
TL;DR: The design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices is presented.
Abstract: A unique challenge for SSD storage caching management in a virtual machine (VM) environment is to accomplish the dual objectives: maximizing utilization of shared SSD cache devices and ensuring performance isolation among VMs. In this paper, we present our design and implementation of S-CAVE, a hypervisor-based SSD caching facility, which effectively manages a storage cache in a Multi-VM environment by collecting and exploiting runtime information from both VMs and storage devices. Due to a hypervisor's unique position between VMs and hardware resources, S-CAVE does not require any modification to guest OSes, user applications, or the underlying storage system. A critical issue to address in S-CAVE is how to allocate limited and shared SSD cache space among multiple VMs to achieve the dual goals. This is accomplished in two steps. First, we propose an effective metric to determine the demand for SSD cache space of each VM. Next, by incorporating this cache demand information into a dynamic control mechanism, S-CAVE is able to efficiently provide a fair share of cache space to each VM while achieving the goal of best utilizing the shared SSD cache device. In accordance with the constraints of all the functionalities of a hypervisor, S-CAVE incurs minimum overhead in both memory space and computing time. We have implemented S-CAVE in vSphere ESX, a widely used commercial hypervisor from VMWare. Our extensive experiments have shown its strong effectiveness for various data-intensive applications.

Proceedings ArticleDOI
07 Dec 2013
TL;DR: The Decoupled Compressed Cache (DCC) is proposed, which exploits spatial locality to improve both the performance and energy-efficiency of cache compression and nearly doubles the benefits of previous compressed caches with similar area overhead.
Abstract: In multicore processor systems, last-level caches (LLCs) play a crucial role in reducing system energy by i) filtering out expensive accesses to main memory and ii) reducing the time spent executing in high-power states. Cache compression can increase effective cache capacity and reduce misses, improve performance, and potentially reduce system energy. However, previous compressed cache designs have demonstrated only limited benefits due to internal fragmentation and limited tags. In this paper, we propose the Decoupled Compressed Cache (DCC), which exploits spatial locality to improve both the performance and energy-efficiency of cache compression. DCC uses decoupled super-blocks and non-contiguous sub-block allocation to decrease tag overhead without increasing internal fragmentation. Non-contiguous sub-blocks also eliminate the need for energy-expensive re-compaction when a block's size changes. Compared to earlier compressed caches, DCC increases normalized effective capacity to a maximum of 4 and an average of 2.2 for a wide range of workloads. A further optimized Co-DCC (Co-Compacted DCC) design improves the average normalized effective capacity to 2.6 by co-compacting the compressed blocks in a super-block. Our simulations show that DCC nearly doubles the benefits of previous compressed caches with similar area overhead. We also demonstrate a practical DCC design based on a recent commercial LLC design.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: An analytical model is proposed to evaluate the performance of different caching decision policies in terms of the server-hit rate and expected round-trip time, and it is shown that PopCache yields the lowest expected round-trip time compared with three benchmark caching decision policies.
Abstract: Due to a mismatch between downloading and caching content, the network may not gain significant benefit from the sophisticated in-network caching of information-centric networking (ICN) architectures when using a basic caching mechanism. This paper aims to seek an effective caching decision policy to improve content dissemination in ICN. We propose PopCache, a caching decision policy with respect to content popularity, that allows an individual ICN router to cache content more or less in accordance with the popularity characteristic of the content. We propose an analytical model to evaluate the performance of different caching decision policies in terms of the server-hit rate and expected round-trip time. The analysis, confirmed by simulation results, shows that PopCache yields the lowest expected round-trip time compared with three benchmark caching decision policies (always, fixed probability, and path-capacity-based probability), and that PopCache provides a server-hit rate comparable to the lowest of these.

Proceedings ArticleDOI
14 Apr 2013
TL;DR: In this paper, the authors proposed a distributed and uncoordinated off-path caching architecture to overcome the problem of uncooperative caches in information-centric networks (ICN).
Abstract: Information-centric network (ICN), which is one of the prominent Internet re-design architectures, relies on in-network caching for its fundamental operation. However, previous works argue that the performance of in-network caching is highly degraded with the current cache-along-default-path design, which causes popular objects to be cached redundantly in many places. Thus, it would be beneficial to have a distributed and uncoordinated design. Although cooperative caches could be an answer to this, previous research showed that they are generally infeasible due to excessive signaling burden, protocol complexity, and a need for fault tolerance. In this work we illustrate the ICN caching problem, and propose a novel architecture to overcome the problem of uncooperative caches. Our design possesses the cooperation property intrinsically. We utilize controlled off-path caching to achieve an almost 9-fold increase in cache efficiency, and around a 20% increase in server load reduction when compared to the classic on-path caching used in ICN proposals.

Proceedings ArticleDOI
03 Dec 2013
TL;DR: A coordinated cache and bank coloring scheme that is designed to prevent cache and bank interference simultaneously is presented and implemented in the Linux kernel.
Abstract: In commercial-off-the-shelf (COTS) multi-core systems, the execution times of tasks become hard to predict because of contention on shared resources in the memory hierarchy. In particular, a task running in one processor core can delay the execution of another task running in another processor core. This is due to the fact that tasks can access data in the same cache set shared among processor cores or in the same memory bank in the DRAM memory (or both). Such cache and bank interference effects have motivated the need to create isolation mechanisms for resources accessed by more than one task. One popular isolation mechanism is cache coloring that divides the cache into multiple partitions. With cache coloring, each task can be assigned exclusive cache partitions, thereby preventing cache interference from other tasks. Similarly, bank coloring allows assigning exclusive bank partitions to tasks. While cache coloring and some bank coloring mechanisms have been studied separately, interactions between the two schemes have not been studied. Specifically, while memory accesses to two different bank colors do not interfere with each other at the bank level, they may interact at the cache level. Similarly, two different cache colors avoid cache interference but may not prevent bank interference. Therefore it is necessary to coordinate cache and bank coloring approaches. In this paper, we present a coordinated cache and bank coloring scheme that is designed to prevent cache and bank interference simultaneously. We also developed color allocation algorithms for configuring a virtual memory system to support our scheme which has been implemented in the Linux kernel. In our experiments, we observed that the execution time can increase by 60% due to inter-task interference when we use only cache coloring. Our coordinated approach can reduce this figure down to 12% (an 80% reduction).
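The coordination problem arises because cache color and bank color are derived from different physical-address bits, so an allocator has to satisfy both constraints at once. The sketch below illustrates one way to express that; the bit layout and color counts are illustrative assumptions, since real DRAM bank-address mappings are platform specific and differ from this simplified model.

```python
# Sketch of coordinated cache-and-bank coloring: each physical page carries a
# (cache_color, bank_color) pair, and the allocator only hands a task pages
# matching both of its assigned color sets.

PAGE_SIZE = 4 * 1024
NUM_CACHE_COLORS = 32      # from LLC set-index bits above the page offset (assumed)
NUM_BANK_COLORS = 16       # from DRAM bank-address bits above the page offset (assumed)

def page_colors(phys_addr):
    pfn = phys_addr // PAGE_SIZE
    cache_color = pfn % NUM_CACHE_COLORS
    bank_color = (pfn // NUM_CACHE_COLORS) % NUM_BANK_COLORS
    return cache_color, bank_color

def allocate_for_task(free_pages, task_cache_colors, task_bank_colors):
    """Pick a free page that avoids both cache-set and bank sharing with other tasks."""
    for addr in free_pages:
        c, b = page_colors(addr)
        if c in task_cache_colors and b in task_bank_colors:
            free_pages.remove(addr)
            return addr
    raise MemoryError("no page satisfies both color constraints")
```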

Patent
26 Jun 2013
TL;DR: In this article, the authors present a computer implemented method, system, and computer program product for cache management comprising recording metadata of IO sent from the server to a storage array, calculating a distribution of a server cache based on metadata, receiving an IO directed to the storage array and revising an allocation of the server cache to the plurality of storage mediums based on the calculated distribution and the IO.
Abstract: A computer implemented method, system, and computer program product for cache management comprising recording metadata of IO sent from the server to a storage array, calculating a distribution of a server cache based on the metadata, receiving an IO directed to the storage array, and revising an allocation of the server cache to the plurality of storage mediums based on the calculated distribution and the IO.

Proceedings ArticleDOI
07 Jul 2013
TL;DR: This paper proposes a novel caching approach that can achieve a significantly larger reduction in peak rate compared to previously known caching schemes, and argues that the performance of the proposed scheme is within a constant factor from the information-theoretic optimum for all values of the problem parameters.
Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content in memories at the end users. This paper proposes a novel caching approach that can achieve a significantly larger reduction in peak rate compared to previously known caching schemes. In particular, the improvement can be on the order of the number of end users in the network. Conventionally, cache memories are exploited by delivering requested contents in part locally rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e., the cache available at each individual user). In this paper, we introduce and exploit a second, global, caching gain, which is not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative cache available at all users), even though there is no cooperation among the caches. To evaluate and isolate these two gains, we introduce a new, information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, the proposed scheme exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared to previously known schemes. Moreover, we argue that the performance of the proposed scheme is within a constant factor from the information-theoretic optimum for all values of the problem parameters.
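For reference, the peak delivery rate achieved by this style of coded caching, for N files and K users each caching M files' worth of content, is commonly written as the product of the uncoded rate with the local caching gain and a global caching gain (stated here from the coded-caching literature; see the paper for the precise formulation and the constant-factor optimality argument):

```latex
R(M) \;=\; \underbrace{K\left(1-\frac{M}{N}\right)}_{\text{uncoded rate}\,\times\,\text{local gain}}
\cdot
\underbrace{\frac{1}{1+\frac{KM}{N}}}_{\text{global caching gain}},
\qquad M \in \left\{0,\ \tfrac{N}{K},\ \tfrac{2N}{K},\ \dots,\ N\right\}.
```

The global gain factor shrinks with the aggregate cache size KM, which is why the improvement over conventional schemes can be on the order of the number of users.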

Proceedings ArticleDOI
01 Sep 2013
TL;DR: Key microarchitectural features of mobile computing platforms that are crucial to the performance of smart phone applications are explored to guide the design of future smart phone platforms toward lower power consumption through simpler architectures while achieving high performance.
Abstract: In this paper, we explore key microarchitectural features of mobile computing platforms that are crucial to the performance of smart phone applications. We create and use a selection of representative smart phone applications, which we call MobileBench, to aid in this analysis. We also evaluate the effectiveness of the current memory subsystem on mobile platforms. Furthermore, by instrumenting the Android framework, we perform energy characterization for MobileBench on an existing Samsung Galaxy S III smart phone. Based on our energy analysis, we find that application cores on modern smart phones consume a significant amount of energy. This motivates our detailed performance analysis centered on the application cores. Based on our detailed performance studies, we reach several key findings. (i) Using a more sophisticated tournament branch predictor can improve branch prediction accuracy, but this does not translate into observable performance gain. (ii) Smart phone applications show distinct TLB capacity needs. Larger TLBs can improve performance by an avg. of 14%. (iii) The current L2 cache on most smart phone platforms experiences poor utilization because of the fast-changing memory requirements of smart phone applications. Using a more effective cache management scheme improves L2 cache utilization by as much as 29.3% and by an avg. of 12%. (iv) Smart phone applications are prefetching-friendly. Using a simple stride prefetcher can improve performance across MobileBench applications by an avg. of 14%. (v) Lastly, the memory bandwidth requirements of MobileBench applications are moderate and well under the current smart phone memory bandwidth capacity of 8.3 GB/s. With these insights into smart phone application characteristics, we hope to guide the design of future smart phone platforms toward lower power consumption through simpler architectures while achieving high performance.

Proceedings ArticleDOI
Xiaoyan Zhu1, Haotian Chi1, Ben Niu1, Weidong Zhang1, Zan Li1, Hui Li1 
01 Dec 2013
TL;DR: A novel collaborative system, MobiCache, which combines k-anonymity with caching to protect users' location privacy while improving the cache hit ratio, and an enhanced-DSA to further improve user privacy as well as the cache hit ratio.
Abstract: Location-Based Services (LBSs) are becoming increasingly popular in our daily life. In some scenarios, multiple users may seek data of the same interest from an LBS server simultaneously or one by one, and they may need to provide their exact locations to the untrusted LBS server in order to enjoy such a location-based service. Unfortunately, this breaches users' location privacy and security. To address this problem, we propose a novel collaborative system, MobiCache, which combines k-anonymity with caching to protect users' location privacy while improving the cache hit ratio. Different from traditional k-anonymity, our Dummy Selection Algorithm (DSA) chooses dummy locations which have not been queried before to increase the cache hit ratio. We also propose an enhanced-DSA to further improve the user's privacy as well as the cache hit ratio by assigning dummy locations which can make more contributions to the cache hit ratio. Evaluation results show that the proposed DSA can increase the cache hit ratio and the enhanced-DSA can further improve the cache hit ratio as well as the user's privacy.
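The dummy-selection idea can be sketched as a simple preference rule: when building a k-anonymous query, prefer dummy locations whose answers are not yet cached, so that the responses returned for the dummies populate the cache for later users. The snippet below is a minimal sketch of that rule; the candidate-cell model and the uniform random choice are assumptions for illustration, not the paper's DSA or enhanced-DSA.

```python
import random

def select_dummies(real_cell, candidate_cells, cached_cells, k):
    """Build a k-anonymous query that favors previously unqueried (uncached) cells."""
    uncached = [c for c in candidate_cells if c not in cached_cells and c != real_cell]
    cached = [c for c in candidate_cells if c in cached_cells and c != real_cell]
    random.shuffle(uncached)
    random.shuffle(cached)
    dummies = (uncached + cached)[: k - 1]   # fall back to cached cells if needed
    query = dummies + [real_cell]
    random.shuffle(query)                    # hide which of the k cells is the real one
    return query

# Example: select_dummies("cell_17", [f"cell_{i}" for i in range(100)], {"cell_3"}, k=5)
```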

Proceedings Article
26 Jun 2013
TL;DR: It is found that the chief benefit of the flash cache is its size, not its persistence, and for some workloads a large flash cache allows using minuscule amounts of RAM for file caching, leaving more memory available for application use.
Abstract: Flash memory has recently become popular as a caching medium. Most uses to date are on the storage server side. We investigate a different structure: flash as a cache on the client side of a networked storage environment. We use trace-driven simulation to explore the design space. We consider a wide range of configurations and policies to determine the potential that client-side caches might offer and how best to arrange them. Our results show that the flash cache writeback policy does not significantly affect performance. Write-through is sufficient; this greatly simplifies cache consistency handling. We also find that the chief benefit of the flash cache is its size, not its persistence. Cache persistence offers additional performance benefits at system restart at essentially no runtime cost. Finally, for some workloads a large flash cache allows using minuscule amounts of RAM for file caching (e.g., 256 KB), leaving more memory available for application use.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: In-memory object caches, such as memcached, are critical to the success of popular web sites, by reducing database load and improving scalability, but unfortunately cache configuration is poorly understood.
Abstract: Large-scale in-memory object caches such as memcached are widely used to accelerate popular web sites and to reduce burden on backend databases. Yet current cache systems give cache operators limited information on what resources are required to optimally accommodate the present workload. This paper focuses on a key question for cache operators: how much total memory should be allocated to the in-memory cache tier to achieve desired performance? We present our Mimir system: a lightweight online profiler that hooks into the replacement policy of each cache server and produces graphs of the overall cache hit rate as a function of memory size. The profiler enables cache operators to dynamically project the cost and performance impact from adding or removing memory resources within a distributed in-memory cache, allowing "what-if" questions about cache performance to be answered without laborious offline tuning. Internally, Mimir uses a novel lock-free algorithm and lookup filters for quickly and dynamically estimating the hit rate of LRU caches. Running Mimir as a profiler requires minimal changes to the cache server, thanks to a lean API. Our experiments show that Mimir produces dynamic hit rate curves with over 98% accuracy and 2-5% overhead on request latency and throughput when Mimir is run in tandem with memcached, suggesting online cache profiling can be a practical tool for improving provisioning of large caches.
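Hit-rate-versus-size curves for LRU caches of the kind Mimir produces can, in principle, be derived from stack (reuse) distances: a reference hits in an LRU cache of size S exactly when its stack distance is at most S. The sketch below shows the classical exact computation for illustration only; Mimir itself replaces this kind of per-reference list scan with a lock-free, bucketed estimator as described above.

```python
from collections import Counter

def hit_rate_curve(trace, sizes):
    """Exact LRU hit-rate curve from stack distances (illustrative, not Mimir's algorithm)."""
    stack = []                       # LRU stack: most recently used key at the end
    dist_hist = Counter()            # stack-distance histogram
    total = 0
    for key in trace:
        total += 1
        if key in stack:
            depth = len(stack) - stack.index(key)   # 1 = most recently used
            dist_hist[depth] += 1
            stack.remove(key)
        stack.append(key)
    # An LRU cache holding `size` items hits every reference whose stack
    # distance is at most `size`.
    return {size: sum(cnt for d, cnt in dist_hist.items() if d <= size) / total
            for size in sizes}

# Example: hit_rate_curve(["a", "b", "a", "c", "b", "a"], sizes=[1, 2, 3])
```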

Proceedings ArticleDOI
30 Jun 2013
TL;DR: This paper analyzes the added write pressures that cache workloads place on flash devices and proposes optimizations at both the cache and flash management layers to improve endurance while maintaining or increasing cache hit rate.
Abstract: Flash memory is widely used for its fast random I/O access performance in a gamut of enterprise storage applications. However, due to the limited endurance and asymmetric write performance of flash memory, minimizing writes to a flash device is critical for both performance and endurance. Previous studies have focused on flash memory as a candidate for primary storage devices; little is known about its behavior as a Solid State Cache (SSC) device. In this paper, we propose HEC, a High Endurance Cache that aims to improve overall device endurance via reduced media writes and erases while maximizing cache hit rate performance. We analyze the added write pressures that cache workloads place on flash devices and propose optimizations at both the cache and flash management layers to improve endurance while maintaining or increasing cache hit rate. We demonstrate the individual and cumulative contributions of cache admission policy, cache eviction policy, flash garbage collection policy, and flash device configuration on a) hit rate, b) overall writes, and c) erases as seen by the SSC device. Through our improved cache and flash optimizations, 83% of the analyzed workload ensembles achieved increased or maintained hit rate with write reductions up to 20x, and erase count reductions up to 6x.

Proceedings ArticleDOI
28 Jun 2013
TL;DR: This paper investigates the current state of side-channel vulnerabilities involving the CPU cache, and identifies the shortcomings of traditional defenses in a Cloud environment, and develops a mitigation technique applicable for Cloud security.
Abstract: As Cloud services become more commonplace, recent work has uncovered vulnerabilities unique to Cloud systems. Specifically, the paradigm promotes a risk of information leakage across virtual machine isolation via side-channels. In this paper, we investigate the current state of side-channel vulnerabilities involving the CPU cache, and identify the shortcomings of traditional defenses in a Cloud environment. We explore why solutions to non-Cloud cache-based side-channels cease to work in Cloud environments, and develop a mitigation technique applicable for Cloud security. Applying this solution to a canonical Cloud environment, we demonstrate the validity of this Cloud-specific, cache-based side-channel mitigation technique. Furthermore, we show that it can be implemented as a server-side approach to improve security without inconveniencing the client. Finally, we conduct a comparison of our solution to the current state-of-the-art.

Proceedings ArticleDOI
17 Jun 2013
TL;DR: This framework unifies existing cache miss rate prediction techniques such as Smith's associativity model, Poisson variants, and hardware way-counter based schemes and shows how to adapt LRU way-counters to work when the number of sets in the cache changes.
Abstract: We develop a reuse distance/stack distance based analytical modeling framework for efficient, online prediction of cache performance for a range of cache configurations and replacement policies (LRU, PLRU, RANDOM, NMRU). Our framework unifies existing cache miss rate prediction techniques such as Smith's associativity model, Poisson variants, and hardware way-counter based schemes. We also show how to adapt LRU way-counters to work when the number of sets in the cache changes. As an example application, we demonstrate how results from our models can be used to select, based on workload access characteristics, last-level cache configurations that aim to minimize energy-delay product.