
Showing papers on "Smart Cache published in 2018"


Proceedings ArticleDOI
01 Feb 2018
TL;DR: KPart is presented, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems, and achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.
Abstract: Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.
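
The abstract does not spell out KPart's clustering or allocation steps, so the sketch below is only a rough illustration of the cluster-then-partition idea: clusters are merged greedily using a crude shared-miss estimate over per-application miss curves, then cache ways are split among clusters in proportion to access rate. The shared-miss estimate, the merge rule, and the proportional way split are assumptions, not the paper's algorithm.

```python
def shared_misses(group, curves, accesses, ways):
    """Approximate misses of a group sharing `ways` ways by assuming each
    member occupies ways in proportion to its access rate (a crude stand-in
    for KPart's estimate of the loss from sharing a partition)."""
    total = sum(accesses[a] for a in group)
    return sum(curves[a][max(1, round(ways * accesses[a] / total))]
               for a in group)

def cluster_and_partition(curves, accesses, total_ways, max_clusters):
    """Merge the pair of clusters that adds the fewest estimated misses until
    at most `max_clusters` remain (way-partitioning hardware offers only a few
    partitions), then split the ways among clusters proportionally.
    Rounding may leave a way unassigned; this is only a sketch."""
    clusters = [[a] for a in curves]

    def ways_of(cluster):                      # proportional share of ways
        share = sum(accesses[a] for a in cluster) / sum(accesses.values())
        return max(1, round(total_ways * share))

    while len(clusters) > max_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                cost = shared_misses(merged, curves, accesses, ways_of(merged))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [(c, ways_of(c)) for c in clusters]

# Toy miss curves (misses per kilo-instruction, indexed by ways 0..8).
curves = {"A": [90, 60, 40, 30, 25, 22, 20, 19, 18],
          "B": [80, 70, 65, 62, 60, 59, 58, 57, 56],
          "C": [50, 20, 10, 6, 5, 4, 4, 4, 4]}
accesses = {"A": 120, "B": 40, "C": 90}
print(cluster_and_partition(curves, accesses, total_ways=8, max_clusters=2))
```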

96 citations


Journal ArticleDOI
TL;DR: This paper proposes a new system architecture that combines two recent ideas: distributed caching of content in small cells (FemtoCaching) and cooperative transmissions from nearby base stations (Coordinated Multi-Point).
Abstract: The demand for higher and higher wireless data rates is driven by the popularity of mobile video content delivery through wireless devices such as tablets and smartphones. To achieve unprecedented mobile content delivery speeds while reducing backhaul cost and delay, in this paper we propose a new system architecture that combines two recent ideas: distributed caching of content in small cells (FemtoCaching) and cooperative transmissions from nearby base stations (Coordinated Multi-Point). A key characteristic of the proposed architecture is the interdependence between the caching strategy and the physical-layer coordination. Specifically, the caching strategy may cache different content in nearby base stations (BSs) to maximize the cache hit ratio, or cache the same content in multiple nearby BSs so that the corresponding BSs can transmit concurrently, e.g., to multiple users using zero-forcing beamforming, and achieve multiplexing gains. Such interdependency allows a joint cross-layer optimization. Given the popularity distribution of the content, the available cache size, and the network topology, we devise near-optimal caching strategies such that the system throughput is maximized or the system delay is minimized. Under realistic scenarios and assumptions, our analytical and simulation results show that our system yields significantly faster content delivery, which can be one order of magnitude faster than that of legacy systems.
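
To make the cache-hit-versus-multiplexing trade-off concrete, here is a toy greedy placement under a Zipf popularity model: each content item is either replicated at every base station (so nearby BSs can transmit it jointly) or stored as a single copy (maximizing the number of distinct cached items). The popularity model, the gain constants, and the benefit-per-slot greedy rule are illustrative assumptions, not the paper's near-optimal optimization.

```python
def zipf_popularity(n_items, alpha=0.8):
    """Zipf-like request probabilities over the content catalogue."""
    w = [1.0 / (i + 1) ** alpha for i in range(n_items)]
    s = sum(w)
    return [x / s for x in w]

def greedy_placement(n_items, n_bs, slots_per_bs, hit_gain=1.0, comp_gain=3.5):
    """Greedy benefit-per-slot placement: 'replicate' puts an item in every
    BS (enabling joint CoMP transmission, modelled by comp_gain), while
    'single' stores one copy (maximizing distinct cached items, hit_gain).
    Both gain constants are illustrative assumptions."""
    pop = zipf_popularity(n_items)
    budget = n_bs * slots_per_bs
    candidates = []
    for f in range(n_items):
        candidates.append(("replicate", f, pop[f] * comp_gain, n_bs))
        candidates.append(("single", f, pop[f] * hit_gain, 1))
    candidates.sort(key=lambda c: c[2] / c[3], reverse=True)

    placement, used = {}, 0
    for mode, f, _benefit, cost in candidates:
        if f in placement or used + cost > budget:
            continue
        placement[f] = mode
        used += cost
    return placement

# Strong multiplexing gain -> replicate the most popular items everywhere;
# weak gain -> spread distinct items across the caches instead.
print(greedy_placement(20, n_bs=3, slots_per_bs=4, comp_gain=3.5))
print(greedy_placement(20, n_bs=3, slots_per_bs=4, comp_gain=1.5))
```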

36 citations


Proceedings ArticleDOI
TL;DR: In this paper, the authors define four important characteristics of a suitable eviction policy for information centric networks (ICN) and propose a new eviction scheme that is well suited to ICN-style cache networks.
Abstract: Information centric networks (ICN) can be viewed as networks of caches. However, ICN-style cache networks have distinctive features, e.g., content popularity, the usability time of content, and other factors, that impose diverse requirements on cache eviction policies. In this paper we define four important characteristics of a suitable eviction policy for ICN. We analyse well-known eviction policies with respect to these characteristics and, based on this analysis, propose a new eviction scheme that is well suited to ICN-style cache networks.
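
The four characteristics and the proposed eviction rule are not reproduced in the abstract, so the snippet below only illustrates the general flavor of an ICN-aware eviction score that weighs content popularity against the content's remaining useful lifetime; the class, its score, and the expiry handling are assumptions rather than the paper's scheme.

```python
import time

class IcnCache:
    """Toy content store whose eviction score mixes popularity (hit count)
    with the content's remaining useful lifetime. The score and its weighting
    are illustrative assumptions, not the scheme proposed in the paper."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}          # name -> (data, expiry_time, hit_count)

    def _score(self, name):
        _, expiry, hits = self.store[name]
        remaining = max(0.0, expiry - time.time())   # usability time left
        return hits * remaining                      # low score = evict first

    def get(self, name):
        entry = self.store.get(name)
        if entry is None or entry[1] < time.time():
            self.store.pop(name, None)               # expired content is useless
            return None
        data, expiry, hits = entry
        self.store[name] = (data, expiry, hits + 1)
        return data

    def put(self, name, data, lifetime_s):
        if len(self.store) >= self.capacity and name not in self.store:
            victim = min(self.store, key=self._score)
            del self.store[victim]
        self.store[name] = (data, time.time() + lifetime_s, 0)
```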

20 citations


Journal ArticleDOI
TL;DR: This paper gives an analytical method to find the miss rate of L2 cache for various configurations from the RD profile with respect to L1 cache and considers all three types of cache inclusion policies namely (i) Strictly Inclusive, (ii) Mutually Exclusive and (iii) Non-Inclusive Non-Exclusive.
Abstract: Reuse distance is an important metric for analytical estimation of cache miss rate. To find the miss rate of a particular cache, the reuse distance profile has to be measured for that particular level and configuration of the cache. A significant amount of simulation time and overhead can be saved if we can find the miss rate of a higher-level cache such as L2 from the RD profile with respect to a lower-level cache (i.e., a cache that is closer to the processor) such as L1. The objective of this paper is to give an analytical method to find the miss rate of the L2 cache for various configurations from the RD profile with respect to the L1 cache. We consider all three types of cache inclusion policies, namely (i) Strictly Inclusive, (ii) Mutually Exclusive, and (iii) Non-Inclusive Non-Exclusive. We first prove some general results relating the RD profile of the L1 cache to that of the L2 cache. We use probabilistic analysis for our derivations. We validate our model against simulations, using the multi-core simulator Sniper with the PARSEC and SPLASH benchmark suites.
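
As a minimal worked example of the underlying principle (not the paper's probabilistic, multi-policy derivation): for a fully associative LRU cache of C blocks, an access misses exactly when its reuse distance is at least C, so miss rates at several sizes fall straight out of one RD histogram, and the local L2 miss rate of an idealized inclusive LRU hierarchy is the ratio of the two global miss rates. The histogram values below are made up.

```python
def miss_rate_from_rd(rd_hist, cache_blocks):
    """Fraction of accesses whose LRU reuse distance is >= cache_blocks,
    i.e., misses in a fully associative LRU cache of that many blocks.
    rd_hist[d] = number of accesses with reuse distance d; cold misses
    (first-time accesses) are stored under the key 'cold'."""
    total = sum(rd_hist.values())
    misses = sum(n for d, n in rd_hist.items()
                 if d == 'cold' or d >= cache_blocks)
    return misses / total if total else 0.0

# Illustrative RD histogram measured against L1 (values are made up).
rd = {0: 500, 1: 200, 2: 100, 4: 80, 8: 60, 16: 40, 'cold': 20}
l1_blocks, l2_blocks = 8, 32
global_l1 = miss_rate_from_rd(rd, l1_blocks)
global_l2 = miss_rate_from_rd(rd, l2_blocks)
# Local L2 miss rate: L2 misses divided by L2 accesses (= L1 misses),
# valid for an idealized inclusive, fully associative LRU hierarchy.
print("L1 miss rate:", global_l1)
print("L2 local miss rate:", global_l2 / global_l1)
```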

17 citations


Journal ArticleDOI
TL;DR: The results show that the one-side layout achieves the best performance and the lowest power consumption with the considered hw–sw optimizations, and that software-based, profile-driven optimization allows the system to achieve the lowest usage of network resources.

14 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the server block-level cache optimization can effectively reduce the amount of server disk I/O, improving the server's ability to handle concurrent requests.
Abstract: In mobile transparent computing, a large number of concurrent data requests from heterogeneous clients via the network need to be processed in a timely fashion, and servers have to repeatedly fetch (search and read) the data from storage, which may incur numerous I/O costs. Generally, disk access is much slower than memory access; therefore, massive I/O operations at servers may become a bottleneck for the system, and the network transport delay caused by limited wireless bandwidth and stability may lead to a poorer user experience. Hence, caching methods play a significant role in the performance improvement of transparent computing systems. In this paper, we propose a block-level caching optimization method for the server and client by analyzing the system bottleneck in mobile transparent computing. We first analyze the storage format of the data file and the three-layer structure in the server according to the characteristics of requesting data from the client to the server, and propose a block-level cache based on access time and access frequency for the server. Second, considering the restrictions on bandwidth and stability of the wireless network, we analyze the network boot process from the client’s startup and propose a client block-level cache optimization combined with local storage access technology. Finally, experimental results demonstrate that the server block-level cache optimization can effectively reduce the amount of server disk I/O, improving the server's ability to handle concurrent requests. In addition, the client block-level cache can significantly increase the startup speed of the client, reduce network traffic, and improve the user experience.
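
The paper's server cache ranks blocks by access time and access frequency; the sketch below combines the two with a simple exponential-decay score, which is an assumed stand-in for the paper's actual ranking (the class name, half-life, and fetch interface are all illustrative).

```python
import time

class BlockCache:
    """Server-side block cache that evicts by a combined recency/frequency
    score, in the spirit of the access-time + access-frequency ranking
    described above. The exact scoring formula is an assumption."""

    def __init__(self, capacity_blocks, half_life_s=60.0):
        self.capacity = capacity_blocks
        self.half_life = half_life_s
        self.blocks = {}           # block_id -> (data, last_access, freq)

    def _score(self, block_id, now):
        _, last, freq = self.blocks[block_id]
        age = now - last
        # Frequency decayed by how long ago the block was last touched.
        return freq * 0.5 ** (age / self.half_life)

    def read(self, block_id, fetch_from_disk):
        now = time.time()
        if block_id in self.blocks:
            data, _, freq = self.blocks[block_id]
            self.blocks[block_id] = (data, now, freq + 1)
            return data                      # cache hit: no disk I/O
        data = fetch_from_disk(block_id)     # cache miss: one disk read
        if len(self.blocks) >= self.capacity:
            victim = min(self.blocks, key=lambda b: self._score(b, now))
            del self.blocks[victim]
        self.blocks[block_id] = (data, now, 1)
        return data

cache = BlockCache(capacity_blocks=1024)
data = cache.read(42, fetch_from_disk=lambda b: bytes(512))  # stand-in disk read
```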

8 citations


Journal ArticleDOI
TL;DR: This article provides key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches), and designs efficient heuristics for Amdahl applications.
Abstract: Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? In this paper, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while remaining ones only use their smaller private cache). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
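
To make the scheduling question concrete, the sketch below models each application's completion time with an Amdahl-style formula whose parallel phase speeds up with a larger cache fraction, and brute-forces processor and cache assignments for a tiny instance. The speed model and the discretized cache fractions are assumptions; the paper's NP-completeness result and heuristics are not reproduced here.

```python
from itertools import product

def completion_time(seq, par, procs, cache_frac, base=0.6, gain=0.4):
    """Amdahl-style completion time: sequential part plus parallel part
    divided by the processor count, scaled by a cache-dependent speed
    factor. The speed model base + gain * cache_frac is an assumption."""
    speed = base + gain * cache_frac
    return seq + par / (procs * speed)

def best_schedule(apps, total_procs, cache_steps=4):
    """Brute force (tiny instances only): try every processor split and every
    discrete cache-fraction assignment, minimizing the makespan."""
    n = len(apps)
    best = (float('inf'), None)
    fracs = [i / cache_steps for i in range(cache_steps + 1)]
    for procs in product(range(1, total_procs + 1), repeat=n):
        if sum(procs) != total_procs:
            continue
        for cf in product(fracs, repeat=n):
            if sum(cf) > 1.0 + 1e-9:          # cache shares cannot exceed the LLC
                continue
            mk = max(completion_time(s, p, pr, c)
                     for (s, p), pr, c in zip(apps, procs, cf))
            if mk < best[0]:
                best = (mk, (procs, cf))
    return best

# Two applications (sequential work, parallel work) on 4 cores sharing one LLC.
print(best_schedule([(1.0, 10.0), (2.0, 6.0)], total_procs=4))
```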

6 citations


Journal ArticleDOI
TL;DR: The simulation results show that the proposed CCSP scheme significantly improves cache effectiveness and network performance by improving data availability and reducing both the overall network load and the latencies perceived by end users.
Abstract: In wireless mobile Ad Hoc networks, cooperative cache management is considered an efficient technique to increase data availability and reduce access latency. This technique is based on the coordination and sharing of cached data between nodes belonging to the same area. In this paper, we study cooperative cache management strategies and propose a collaborative cache management scheme for mobile Ad Hoc networks based on service cache providers (SCP), called cooperative caching based on service providers (CCSP). The proposed scheme elects some mobile nodes as SCPs, which receive cache summaries from neighboring nodes. Thus, nodes belonging to the same zone can easily locate documents cached in that area. The election mechanism used in this approach is executed periodically to ensure load balancing. We further provide an evaluation of the proposed solution in terms of request hit rate, byte hit rate, and time gains. Compared with other cache management schemes, the simulation results show that the proposed CCSP scheme significantly improves cache effectiveness and network performance. This is achieved by improving data availability and reducing both the overall network load and the latencies perceived by end users.
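
As a minimal sketch of the summary-based lookup described above, the code keeps per-neighbour cache summaries at an elected provider and answers "who in my zone has this document?" locally. The election criterion (largest cache) and the plain-set summaries are illustrative assumptions; the abstract does not specify these details.

```python
class ServiceCacheProvider:
    """Elected node that keeps summaries of what each neighbour caches,
    so zone members can locate cached documents without leaving the zone."""

    def __init__(self):
        self.summaries = {}               # node_id -> set of cached doc ids

    def update_summary(self, node_id, cached_docs):
        self.summaries[node_id] = set(cached_docs)

    def locate(self, doc_id):
        # Return any neighbour known to hold the document, else None.
        for node_id, docs in self.summaries.items():
            if doc_id in docs:
                return node_id
        return None

def elect_scp(zone_nodes):
    """Pick the node with the most cache space as SCP (an assumption);
    the paper re-runs its election periodically for load balancing."""
    return max(zone_nodes, key=lambda n: n["cache_size"])["id"]

# Example: two nodes in a zone publish their summaries to the SCP.
scp = ServiceCacheProvider()
scp.update_summary("nodeA", {"doc1", "doc7"})
scp.update_summary("nodeB", {"doc3"})
print(elect_scp([{"id": "nodeA", "cache_size": 64},
                 {"id": "nodeB", "cache_size": 128}]))
print(scp.locate("doc3"))   # -> "nodeB"
```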

5 citations


Journal ArticleDOI
TL;DR: PhLock leverages an application's varying runtime characteristics to dynamically select the locked memory contents and optimize cache energy consumption; cache locking is a popular cache optimization that loads and retains/locks selected memory contents from an executing application into the cache to increase the cache's predictability.
Abstract: Caches are commonly used to bridge the processor-memory performance gap in embedded systems. Since embedded systems typically have stringent design constraints imposed by physical size, battery capacity, and real-time deadlines, much research focuses on cache optimizations, such as improved performance and/or reduced energy consumption. Cache locking is a popular cache optimization that loads and retains/locks selected memory contents from an executing application into the cache to increase the cache’s predictability. Previous work has shown that cache locking also has the potential to improve cache energy consumption. In this paper, we introduce phase-based cache locking, PhLock, which leverages an application’s varying runtime characteristics to dynamically select the locked memory contents to optimize cache energy consumption. Using a variety of applications from the SPEC2006 and MiBench benchmark suites, experimental results show that PhLock is promising for reducing both the instruction and data caches’ energy consumption. As compared to a nonlocking cache, PhLock reduced the instruction and data cache energy consumption by an average of 5% and 39%, respectively, for SPEC2006 applications, and by 75% and 14%, respectively, for MiBench benchmarks.
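
The selection step of phase-based locking can be pictured as picking, for each detected phase, the most frequently accessed blocks that fit in the lockable ways. The sketch below shows only that selection from an assumed per-phase profile; PhLock's phase detection, energy model, and the hardware locking interface are not modelled.

```python
def select_locked_blocks(phase_profiles, lockable_ways, blocks_per_way):
    """For each program phase, choose the hottest blocks (by profiled access
    count) to lock into the cache. This reproduces only the idea of
    phase-based selection, not PhLock's actual policy."""
    budget = lockable_ways * blocks_per_way
    plan = {}
    for phase, counts in phase_profiles.items():
        hottest = sorted(counts, key=counts.get, reverse=True)[:budget]
        plan[phase] = hottest
    return plan

# Illustrative per-phase access counts (block address -> accesses).
profiles = {
    "phase0": {0x100: 900, 0x140: 850, 0x180: 20, 0x1c0: 5},
    "phase1": {0x180: 700, 0x1c0: 650, 0x100: 10},
}
print(select_locked_blocks(profiles, lockable_ways=1, blocks_per_way=2))
```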

5 citations


Journal ArticleDOI
TL;DR: A replacement-policy-adaptable miss curve estimation (RME) is proposed that estimates dynamic workload patterns under any arbitrary replacement policy and for given applications with low overhead; experimental results support the efficiency of RME and show that RME-based cache partitioning combined with high-performance replacement policies can successfully minimize both inter- and intra-application interference.
Abstract: Cache replacement policies and cache partitioning are well-known cache management techniques which aim to eliminate inter- and intra-application contention caused by co-running applications, respectively. Since replacement policies can change applications’ behavior on a shared last-level cache, they have a massive impact on cache partitioning. Furthermore, cache partitioning determines the capacity allocated to each application, affecting the incorporated replacement policy. However, their interoperability has not been thoroughly explored. Since existing cache partitioning methods are tailored to specific replacement policies to reduce the overhead of characterizing applications’ behavior, they may lead to suboptimal partitioning results when combined with up-to-date replacement policies. In cache partitioning, miss curve estimation is a key component for relaxing this restriction, since it can reflect the dependency between the replacement policy and cache partitioning in the partitioning decision. To tackle this issue, we propose a replacement-policy-adaptable miss curve estimation (RME), which estimates dynamic workload patterns under any arbitrary replacement policy and for given applications with low overhead. In addition, RME considers the asymmetry of miss latency by miss type, so the impact of the miss curve on cache partitioning can be reflected more accurately. The experimental results support the efficiency of RME and show that RME-based cache partitioning combined with high-performance replacement policies can successfully minimize both inter- and intra-application interference.
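
RME's internals are not given in the abstract. A generic way to obtain a miss curve that respects an arbitrary replacement policy is to replay (possibly sampled) accesses through tag-only shadow caches of several sizes, each driven by the policy under study; the sketch below does exactly that with a pluggable policy object. This is an illustrative estimator, not RME's mechanism, and it ignores set-associativity and miss-type asymmetry.

```python
class ShadowCache:
    """Tag-only cache of a given size that delegates victim selection to a
    pluggable replacement policy, so the resulting miss counts reflect that
    policy rather than assuming LRU."""

    def __init__(self, capacity, policy):
        self.capacity, self.policy = capacity, policy
        self.tags = []            # resident tags; ordering is policy-defined
        self.misses = 0

    def access(self, tag):
        if tag in self.tags:
            self.policy.on_hit(self.tags, tag)
            return
        self.misses += 1
        if len(self.tags) >= self.capacity:
            victim = self.policy.choose_victim(self.tags)
            self.tags.remove(victim)
        self.tags.append(tag)

class LruPolicy:
    def on_hit(self, tags, tag):
        tags.remove(tag); tags.append(tag)     # move to MRU position
    def choose_victim(self, tags):
        return tags[0]                         # LRU position

def estimate_miss_curve(trace, sizes, policy_cls):
    """Miss counts for each candidate cache size under the given policy —
    a generic sampling-style estimator, not RME's actual mechanism."""
    shadows = {s: ShadowCache(s, policy_cls()) for s in sizes}
    for tag in trace:
        for shadow in shadows.values():
            shadow.access(tag)
    return {s: shadows[s].misses for s in sizes}

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4] * 3
print(estimate_miss_curve(trace, sizes=[2, 3, 4], policy_cls=LruPolicy))
```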

4 citations


Journal ArticleDOI
TL;DR: A novel cache model is proposed that captures the dependency among the segments in the cache server for adaptive HTTP streaming and can significantly improve the cache hit-ratio and QoE of HTTP streaming as compared to previous approaches.
Abstract: There has been significant interest in the use of HTTP adaptive streaming for live or on-demand video over the Internet in recent years. To mitigate the streaming transmission delay and reduce the networking overhead, an effective and critical approach is to utilize cache services between the origin servers and the heterogeneous clients. As the underlying protocol for web transactions, HTTP has great potential to exploit the caching resources within state-of-the-art CDNs; yet distinct challenges arise in the HTTP adaptive streaming context. After examining a long-term, large-scale adaptive streaming dataset and performing statistical analysis, we demonstrate that switching requests among different quality levels emerge frequently and constitute a significant portion of each day's requests. Consequently, they substantially affect the performance of cache servers and the Quality-of-Experience (QoE) of viewers. In this paper, we propose a novel cache model that captures the dependency among the segments in the cache server for adaptive HTTP streaming. Our work does not assume any specific selection algorithm on the client’s side and hence can be easily incorporated into existing streaming cache systems. Its centralized nature is also well accommodated by the latest DASH specification. Moreover, we extend our work to the multi-server caching context and present a similarity-aware allocation mechanism to enhance the caching efficiency. The performance evaluation shows our dependency- and similarity-aware strategy can significantly improve the cache hit-ratio and QoE of HTTP streaming as compared to previous approaches.
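
One way to picture the segment dependency the paper exploits: because viewers frequently switch bitrates, the value of caching a (segment, quality) pair should also reflect demand for neighbouring qualities of the same segment. The toy scoring below follows that intuition; the weight and the selection rule are assumptions, not the paper's cache model.

```python
def segment_value(requests, segment, quality, neighbor_weight=0.5):
    """Caching value of (segment, quality): its own request count plus a
    discounted share of requests for adjacent quality levels of the same
    segment, reflecting likely bitrate switches. The weight is an
    illustrative assumption."""
    own = requests.get((segment, quality), 0)
    neighbors = requests.get((segment, quality - 1), 0) + \
                requests.get((segment, quality + 1), 0)
    return own + neighbor_weight * neighbors

def choose_cached_segments(requests, capacity):
    """Keep the `capacity` (segment, quality) pairs with the highest
    dependency-aware value."""
    scored = sorted(requests, key=lambda sq: segment_value(requests, *sq),
                    reverse=True)
    return scored[:capacity]

# (segment index, quality level) -> observed request count (made-up numbers).
reqs = {(0, 1): 50, (0, 2): 40, (1, 2): 30, (2, 0): 5}
print(choose_cached_segments(reqs, capacity=2))
```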

Journal ArticleDOI
TL;DR: This work proposes new adaptive multi-level exclusive caching policies that can dynamically adjust replacement and placement decisions in response to changing access patterns, achieving multi-level exclusive caching with significant cache performance improvement.

Journal ArticleDOI
TL;DR: This paper evaluates compressibility of L1 data caches and L2 cache in general-purpose graphics processing units (GPGPUs) and proposes using cache compression to increase effective cache capacity, improve performance, and reduce power consumption.
Abstract: In this paper, we evaluate compressibility of L1 data caches and L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme is geared toward improving performance and power of GPGPUs through cache compression. GPGPUs are throughput-oriented devices which execute thousands of threads simultaneously. To handle working set of this massive number of threads, modern GPGPUs exploit several levels of caches. GPGPU design trend shows that the size of caches continues to grow to support even more thread level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference of two successive values within a cache block is small. To reduce data redundancy in L1 data caches and L2 cache, we use low-cost and implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas where the combined size of the base and deltas is less than the original cache block. We also study locality of fields in integer and floating-point numbers. We found that entropy of fields varies across different data types. Based on entropy, we offer different BDI compression schemes for integer and floating-point numbers. We augment a simple, yet effective, predictor that determines type of values dynamically in hardware and without the help of a compiler or a programmer. Evaluation results show that on average, cache compression improves performance by 8% and saves energy of caches by 9%.
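
The core of base-delta-immediate compression is easy to show: if every value in a block lies close to a chosen base, the block can be stored as that base plus narrow deltas. The sketch below implements one such variant (a single base with fixed-width signed deltas); the paper's per-data-type field splitting and the hardware type predictor are not modelled.

```python
def bdi_compress(block, delta_bytes=1):
    """Try to encode a cache block (a list of equal-width integers) as one
    base value plus narrow signed deltas. Returns (base, deltas) if every
    delta fits in `delta_bytes` bytes, otherwise None (block stays
    uncompressed). This models only the basic base+delta variant of BDI."""
    base = block[0]
    bound = 1 << (8 * delta_bytes - 1)          # signed range for each delta
    deltas = [v - base for v in block]
    if all(-bound <= d < bound for d in deltas):
        return base, deltas
    return None

def bdi_decompress(base, deltas):
    return [base + d for d in deltas]

# A block of eight 8-byte values that lie close together compresses well:
block = [0x7fff_0000 + i * 4 for i in range(8)]
packed = bdi_compress(block)
assert packed is not None and bdi_decompress(*packed) == block
# Compressed size: one 8-byte base + eight 1-byte deltas = 16 bytes vs 64.
print(packed)
```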

Journal ArticleDOI
TL;DR: The results suggest that a variable cache line size can result in better performance and can also conserve power; the paper presents runtime cache utilization as well as conventional performance metrics that illustrate a holistic understanding of cache behavior.
Abstract: Cache has long been used to minimize the latency of main memory accesses by storing frequently used data near the processor. Processor performance depends on the underlying cache performance. Therefore, significant research has been done to identify the most crucial metrics of cache performance. Although the majority of research focuses on measuring cache hit rates and data movement as the primary cache performance metrics, cache utilization is significantly important. We investigate the application’s locality using cache utilization metrics. Furthermore, we present cache utilization and traditional cache performance metrics as the program progresses, providing detailed insights into dynamic application behavior for parallel applications from four benchmark suites running on multiple cores. We explore cache utilization for APEX, Mantevo, NAS, and PARSEC, which are mostly scientific benchmark suites. Our results indicate that 40% of the data bytes in a cache line are accessed at least once before line eviction. Also, on average a byte is accessed two times before the cache line is evicted for these applications. Moreover, we present runtime cache utilization as well as conventional performance metrics that illustrate a holistic understanding of cache behavior. To facilitate this research, we build a memory simulator incorporated into the Structural Simulation Toolkit (Rodrigues et al. in SIGMETRICS Perform Eval Rev 38(4):37–42, 2011). Our results suggest that a variable cache line size can result in better performance and can also conserve power.
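
The utilization metric itself is simple to instrument: track which bytes of each resident line are touched and summarize the count at eviction. The toy direct-mapped tracker below shows that bookkeeping; it is a sketch under assumed parameters, not the SST-based simulator used in the paper.

```python
class UtilizationCache:
    """Direct-mapped toy cache that records which bytes of each line were
    touched before eviction, to compute line utilization as in the study
    above (the real work uses a full memory simulator; this is a sketch)."""

    def __init__(self, num_lines, line_size=64):
        self.num_lines, self.line_size = num_lines, line_size
        self.lines = {}                      # index -> (tag, set of byte offsets)
        self.evicted_used_bytes = []

    def access(self, address, size=4):
        line_addr = address // self.line_size
        index, tag = line_addr % self.num_lines, line_addr // self.num_lines
        offset = address % self.line_size
        touched = set(range(offset, min(offset + size, self.line_size)))
        if index in self.lines and self.lines[index][0] == tag:
            self.lines[index][1].update(touched)         # hit: mark bytes used
            return
        if index in self.lines:                          # eviction: record utilization
            self.evicted_used_bytes.append(len(self.lines[index][1]))
        self.lines[index] = (tag, touched)

    def mean_utilization(self):
        used = self.evicted_used_bytes or [len(b) for _, b in self.lines.values()]
        return sum(used) / (len(used) * self.line_size)

cache = UtilizationCache(num_lines=4)
for addr in [0, 8, 64, 1024, 0, 2048]:      # made-up access stream
    cache.access(addr)
print(f"average line utilization: {cache.mean_utilization():.0%}")
```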

Proceedings Article
08 Sep 2018
TL;DR: Simulation results show that the performance of multimedia systems and applications can be enhanced by optimizing the level-2 cache; this work develops a VisualSim model and C++ code to run the simulations.

Book ChapterDOI
01 Jan 2018
TL;DR: The proposed cache reuse replacement policy manages cache blocks by separating reused cache blocks and thrashing cache blocks, which can increase IPC by up to 4.4% compared to the conventional GPU architecture.
Abstract: The performance of computing systems has been improved significantly for several decades. However, increasing the throughput of recent CPUs (Central Processing Units) is restricted by power consumption and thermal issues. GPUs (Graphics Processing Units) are recognized as an efficient computing platform with powerful hardware resources that supports CPUs in computing systems. Unlike CPUs, GPUs contain a large number of CUDA (Compute Unified Device Architecture) cores; hence, some cache blocks are referenced repeatedly many times. If those cache blocks reside in the cache for a long time, hit rates can be improved. On the other hand, many cache blocks are referenced only once and never referenced again in the cache. These blocks waste cache memory space, resulting in reduced GPU performance. The conventional LRU replacement policy cannot address the problems caused by non-reused cache blocks and frequently reused cache blocks. In this paper, a new cache replacement policy based on the reuse pattern of cache blocks is proposed. The proposed cache replacement policy manages cache blocks by separating reused cache blocks from thrashing cache blocks. According to simulation results, the proposed cache reuse replacement policy can increase IPC by up to 4.4% compared to the conventional GPU architecture.
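
A common way to realize "separate reused blocks from thrashing blocks" is to insert new blocks into a small probationary segment and promote them to a protected segment only on a second reference, so single-use blocks are evicted quickly. The sketch below is such a generic reuse-aware policy, offered as an illustration in the spirit of the proposal rather than its exact design (which is not detailed in the abstract).

```python
from collections import OrderedDict

class ReuseAwareCache:
    """Set-less sketch of a reuse-aware replacement policy: new blocks enter
    a small probationary segment and move to the protected segment only when
    re-referenced, so never-reused (thrashing) blocks are evicted quickly."""

    def __init__(self, capacity, probation_frac=0.25):
        self.capacity = capacity
        self.probation_cap = max(1, int(capacity * probation_frac))
        self.probation = OrderedDict()       # blocks seen once
        self.protected = OrderedDict()       # blocks with demonstrated reuse

    def access(self, block):
        if block in self.protected:
            self.protected.move_to_end(block)            # keep reused blocks
            return "hit"
        if block in self.probation:
            del self.probation[block]                    # promote on reuse
            self._insert(self.protected, block,
                         self.capacity - self.probation_cap)
            return "hit"
        self._insert(self.probation, block, self.probation_cap)
        return "miss"

    @staticmethod
    def _insert(segment, block, cap):
        if len(segment) >= cap:
            segment.popitem(last=False)                  # evict oldest
        segment[block] = True

cache = ReuseAwareCache(capacity=8)
stream = [1, 2, 1, 3, 4, 5, 6, 1]            # block 1 is reused, 3..6 stream by
print([cache.access(b) for b in stream])
```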

Journal ArticleDOI
TL;DR: This paper proposes CLOCK-Pro Using Switching Hash-tables (CUSH) as a suitable policy for network caching and evaluates it numerically, showing that the proposal achieves cache hits on traffic traces for which simple conventional algorithms produce hardly any hits.
Abstract: Information-centric networking (ICN) has received increasing attention from all over the world. The novel aspects of ICN (e.g., the combination of caching, multicasting, and aggregating requests) are based on names that act as addresses for content. Communication by name has the potential to cope with increasingly large and complex Internet technology, for example, the Internet of Things, cloud computing, and a smart society. To realize ICN, router hardware must implement an innovative cache replacement algorithm that offers performance far superior to a simple policy-based algorithm while still operating with feasible computational and memory overhead. However, most previous studies on cache replacement policies in ICN have proposed policies that are too blunt to achieve significant performance improvement, such as first-in first-out (FIFO) and random policies, or policies that are impractical in a resource-restricted environment, such as least recently used (LRU). Thus, we propose CLOCK-Pro Using Switching Hash-tables (CUSH) as a suitable policy for network caching. CUSH can identify and keep popular content worth caching in a network environment. CUSH also employs CLOCK and hash-tables, which are low-overhead data structures, to satisfy the cost requirement. We numerically evaluate our proposed approach, showing that it achieves cache hits on traffic traces for which simple conventional algorithms produce hardly any hits.
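
CUSH itself layers CLOCK-Pro on switching hash tables, which the abstract does not detail; as background, the snippet below shows the basic low-overhead building blocks it starts from: a CLOCK (second-chance) replacement loop paired with a hash table for constant-time lookup. It is not an implementation of CUSH.

```python
class ClockCache:
    """Basic CLOCK replacement: a circular buffer of entries with reference
    bits plus a hash table for O(1) lookup. This is the low-overhead building
    block the paper builds on, not the full CUSH / CLOCK-Pro scheme."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity       # each slot: [name, ref_bit]
        self.index = {}                      # name -> slot position (hash table)
        self.hand = 0

    def access(self, name):
        pos = self.index.get(name)
        if pos is not None:
            self.slots[pos][1] = 1           # hit: set the reference bit
            return True
        # Miss: advance the clock hand until a slot with ref_bit == 0 is found.
        while self.slots[self.hand] is not None and self.slots[self.hand][1]:
            self.slots[self.hand][1] = 0     # give referenced entries a second chance
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim[0]]
        self.slots[self.hand] = [name, 0]
        self.index[name] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False

cache = ClockCache(capacity=3)
hits = [cache.access(n) for n in ["a", "b", "a", "c", "d", "a"]]
print(hits)   # hits only for the repeated, recently referenced names
```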

Patent
Jensen Claus T, Arnold Joseph, Pierce Jr John A, Robert Samuel, Ganesan Sriram
01 Mar 2018
TL;DR: In this paper, the authors propose a method of integrating data across multiple data stores in a smart cache in order to provide data to one or more recipient systems; the method includes automatically ingesting diverse data from multiple data sources, automatically reconciling the ingested diverse data by updating semantic models based on that data, storing the ingested diverse data according to one or more classifications of the data sources under the semantic models, and automatically generating scalable service endpoints that are semantically consistent with the classification of the data sources.
Abstract: An embodiment of the disclosure provides a method of integrating data across multiple data stores in a smart cache in order to provide data to one or more recipient systems. The method includes automatically ingesting diverse data from multiple data sources, automatically reconciling the ingested diverse data by updating semantic models based on the ingested diverse data, storing the ingested diverse data based on one or more classification of the data sources according to the semantic models, automatically generating scalable service endpoints which are semantically consistent according to the classification of the data sources, and responding to a call from the one or more recipient systems by providing data in the classification of the data sources.