
Showing papers on "Smart Cache published in 2023"


Journal ArticleDOI
TL;DR: In this article, the authors propose an efficient cooperative caching (FDDL) framework to address the issues in mobile edge networks, which extracts a broader set of attributes from massive requests to improve cache efficiency.
Abstract: Edge caching has been regarded as a promising technique for low-latency, high-rate data delivery in future networks, and there is increasing interest in leveraging Machine Learning (ML) for better content placement instead of traditional optimization-based methods due to its self-adaptive ability under complex environments. Despite many efforts on ML-based cooperative caching, there are still several key issues that need to be addressed, especially to reduce computation complexity and communication costs under the optimization of cache efficiency. To this end, in this paper, we propose an efficient cooperative caching (FDDL) framework to address the issues in mobile edge networks. Particularly, we propose a DRL-CA algorithm for cache admission, which extracts a broader set of attributes from massive requests to improve the cache efficiency. Then, we present a lightweight eviction algorithm for fine-grained replacements of unpopular contents. Moreover, we present a Federated Learning-based parameter sharing mechanism to reduce the signaling overheads in collaborations. We implement an emulation system and evaluate the caching performance of the proposed FDDL. Emulation results show that the proposed FDDL can achieve a higher cache hit ratio and traffic offloading rate than several conventional caching policies and DRL-based caching algorithms, and effectively reduce communication costs and training time.

2 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a hybrid caching strategy called time segmentation-based hybrid caching (TSBC), based on the 5G-ICN bearer network infrastructure.
Abstract: The fifth-generation communication technology (5G) and information-centric networks (ICNs) are attracting more and more attention. Cache plays a significant part in the 5G-ICN architecture that the industry has suggested. 5G mobile terminals switch between base stations quickly, creating a significant amount of traffic and network latency, which poses great challenges to the 5G-ICN mobile cache and makes it urgent to improve the cache placement strategy. This paper proposes a hybrid caching strategy called time segmentation-based hybrid caching (TSBC), based on the 5G-ICN bearer network infrastructure. A base station’s access frequency can change throughout the course of the day due to the “tidal phenomena” of mobile networks. To distinguish access frequencies, we split each day into periods of high and low liquidity. During periods of high liquidity, we replace the path’s least-used cache copy to maintain the diversity of cache copies. During periods of low liquidity, we determine the cache value of each node in the path and make caching decisions so that users can quickly access the content they are most interested in. The simulation results demonstrate that the proposed strategy has a positive impact on both latency and the cache hit ratio.
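As a rough illustration of the time-segmented policy switch described above (not the authors' implementation), the sketch below applies a least-used-copy rule during high-liquidity hours and a cache-value rule during low-liquidity hours; the hour boundaries, the `Copy`/`PathCache` names, and the value scores are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical hour ranges for the "tidal" high-liquidity periods (an assumption,
# not taken from the paper).
HIGH_LIQUIDITY_HOURS = set(range(7, 10)) | set(range(17, 21))

@dataclass
class Copy:
    content_id: str
    uses: int = 0            # how often this copy was hit on the path
    node_value: float = 0.0  # assumed per-node "cache value" score

@dataclass
class PathCache:
    capacity: int
    copies: dict = field(default_factory=dict)

    def _evict(self, hour: int) -> None:
        if hour in HIGH_LIQUIDITY_HOURS:
            # High liquidity: drop the least-used copy to keep copies diverse.
            victim = min(self.copies.values(), key=lambda c: c.uses)
        else:
            # Low liquidity: drop the copy with the lowest cache value.
            victim = min(self.copies.values(), key=lambda c: c.node_value)
        del self.copies[victim.content_id]

    def admit(self, content_id: str, node_value: float, hour: int) -> None:
        if content_id in self.copies:
            self.copies[content_id].uses += 1
            return
        if len(self.copies) >= self.capacity:
            self._evict(hour)
        self.copies[content_id] = Copy(content_id, uses=1, node_value=node_value)
```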

2 citations


Proceedings ArticleDOI
08 May 2023
TL;DR: BigKV as mentioned in this paper is a key-value cache specifically designed for caching large objects in an all-flash array (AFA), which is centered around the unique property of a cache: since it contains a copy of the data, exact bookkeeping of what is in the cache is not critical for correctness.
Abstract: We present BigKV, a key-value cache specifically designed for caching large objects in an all-flash array (AFA). The design of BigKV is centered around the unique property of a cache: since it contains a copy of the data, exact bookkeeping of what is in the cache is not critical for correctness. By ignoring hash collisions, approximating metadata information, and allowing data loss from failures, BigKV significantly increases the cache hit ratio and keeps more useful objects in the system. Experiments on a real AFA show that our design increases the throughput by 3.1× on average and reduces the average and tail latency by 57% and 81%, respectively.
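As a toy illustration of the "approximate bookkeeping is fine for a cache" idea (a sketch under my own assumptions, not BigKV's data structures), the index below keeps only a short hash fingerprint per key; a rare fingerprint collision is caught by the key check on the read path and simply looks like a miss.

```python
import hashlib

def fingerprint(key: str, bits: int = 16) -> int:
    """Short hash fingerprint; collisions are tolerated because this is a cache."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ApproxCache:
    """Tiny in-memory index of fingerprints -> slot; the full (key, value) lives in
    the (simulated) flash slot. A fingerprint collision just looks like a miss
    after the key check, so correctness never depends on exact bookkeeping."""

    def __init__(self, num_slots: int):
        self.index = {}                    # fingerprint -> slot id
        self.slots = [None] * num_slots    # simulated AFA: (key, value) per slot
        self.next_slot = 0                 # FIFO-style reuse of slots

    def put(self, key: str, value: bytes) -> None:
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % len(self.slots)
        self.slots[slot] = (key, value)
        self.index[fingerprint(key)] = slot   # may silently overwrite a collider

    def get(self, key: str):
        slot = self.index.get(fingerprint(key))
        if slot is None or self.slots[slot] is None:
            return None                       # miss
        stored_key, value = self.slots[slot]
        return value if stored_key == key else None  # collision -> treated as a miss
```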

1 citation


Proceedings ArticleDOI
07 Jun 2023
TL;DR: In this paper, the authors consider both preemptive and non-preemptive scheduling policies on single-processor systems, formulate the problem as an integer quadratically constrained program, and propose an efficient heuristic achieving near-optimal solutions.
Abstract: Cache partitioning is a technique to reduce interference among tasks accessing the shared caches. To make this technique effective, cache segments must be given to the tasks that can benefit most from having their data and instructions cached for faster execution. The existing partitioning schemes for real-time systems divide the available cache among the tasks to guarantee their schedulability which is the sole optimization criterion. However, it is also preferable, especially in systems with power constraints or mixed criticalities, to reduce the total cache usage for real-time tasks. In this paper, we develop optimization algorithms for cache partitioning that, besides ensuring schedulability, also minimize cache usage. We consider both preemptive and non-preemptive scheduling policies on single-processor systems. For preemptive scheduling, we formulate the problem as an integer quadratically constrained program and propose an efficient heuristic achieving near-optimal solutions. For non-preemptive scheduling, we combine linear and binary search techniques with different schedulability tests. Our experiments based on synthetic task sets with parameters from real-world embedded applications show that the proposed heuristic: (i) achieves an average optimality gap of 0.79% within 0.1x run time of a mathematical programming solver and (ii) reduces average cache usage by 39.15% compared to existing cache partitioning approaches. Besides, we find that for large task sets with high utilization, non-preemptive scheduling can use less cache than preemptive to guarantee schedulability.
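For the non-preemptive side, which the abstract says combines search techniques with schedulability tests, a generic sketch of that pattern is a binary search for the smallest cache budget that keeps the task set schedulable; the `is_schedulable` callback and the monotonicity assumption are placeholders, not the paper's exact tests.

```python
from typing import Callable, Sequence

def min_cache_budget(
    tasks: Sequence[dict],
    total_cache_segments: int,
    is_schedulable: Callable[[Sequence[dict], int], bool],
) -> int | None:
    """Binary search for the smallest cache budget (in segments) under which the
    task set passes the supplied schedulability test. The test itself (e.g. a
    non-preemptive response-time analysis whose WCETs shrink as the budget grows)
    is a placeholder assumption, not the paper's exact formulation. The search
    assumes schedulability is monotone in the budget (more cache never hurts)."""
    lo, hi = 0, total_cache_segments
    if not is_schedulable(tasks, hi):
        return None                    # not schedulable even with the whole cache
    while lo < hi:
        mid = (lo + hi) // 2
        if is_schedulable(tasks, mid):
            hi = mid                   # feasible: try a smaller budget
        else:
            lo = mid + 1               # infeasible: need more cache
    return lo
```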

1 citation


Proceedings ArticleDOI
08 May 2023
TL;DR: FrozenHot as discussed by the authors partitions the cache space into two parts: a frozen cache, which serves requests for hot objects with minimal latency by eliminating promotion and locking, and a dynamic cache, which leverages the existing cache design to achieve workload adaptivity.
Abstract: Caching is crucial for accelerating data access and is employed ubiquitously across many parts of modern computer systems. With increasing core counts and a shrinking latency gap between cache and modern storage devices, hit-path scalability becomes increasingly critical. However, existing production in-memory caches often use list-based management with promotion on each cache hit, which requires extensive locking and poses a significant overhead for scaling beyond a few cores. Moreover, existing techniques for improving scalability either (1) only focus on the indexing structure and do not improve cache management scalability, or (2) sacrifice efficiency or miss-path scalability. Inspired by highly skewed data popularity and short-term hotspot stability in cache workloads, we propose FrozenHot, a generic approach to improve the scalability of list-based caches. FrozenHot partitions the cache space into two parts: a frozen cache and a dynamic cache. The frozen cache serves requests for hot objects with minimal latency by eliminating promotion and locking, while the dynamic cache leverages the existing cache design to achieve workload adaptivity. We built FrozenHot as a library that can be easily integrated into existing systems, and we demonstrate its performance by enabling FrozenHot in two production systems, HHVM and RocksDB, using under 100 lines of code. Evaluated using production traces from MSR and Twitter, FrozenHot improves the throughput of three baseline cache algorithms by up to 551%. Compared to stock RocksDB, FrozenHot-enhanced RocksDB shows higher throughput on all YCSB workloads, with up to a 90% increase, as well as reduced tail latency.
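A minimal sketch of the frozen/dynamic split (assumptions: a plain dict stands in for the lock-free frozen region, an OrderedDict LRU for the dynamic region, and a periodic rebuild promotes the currently hottest keys; none of this is FrozenHot's real implementation):

```python
from collections import Counter, OrderedDict

class TwoTierCache:
    def __init__(self, frozen_size: int, dynamic_size: int):
        self.frozen = {}                # hot region: no promotion on hits, so no list locking
        self.dynamic = OrderedDict()    # classic LRU for everything else
        self.frozen_size = frozen_size
        self.dynamic_size = dynamic_size
        self.hits = Counter()           # hit counts used at the next rebuild

    def get(self, key):
        if key in self.frozen:          # hit path: no list manipulation at all
            self.hits[key] += 1
            return self.frozen[key]
        if key in self.dynamic:         # hit path with LRU promotion
            self.hits[key] += 1
            self.dynamic.move_to_end(key)
            return self.dynamic[key]
        return None

    def put(self, key, value):
        if key in self.frozen:
            return                      # frozen region is not mutated between rebuilds
        self.dynamic[key] = value
        self.dynamic.move_to_end(key)
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)

    def rebuild_frozen(self):
        """Periodically refreeze the hottest keys (short-term hotspot stability assumed)."""
        hottest = [k for k, _ in self.hits.most_common(self.frozen_size)]
        pool = {**self.dynamic, **self.frozen}
        self.frozen = {k: pool[k] for k in hottest if k in pool}
        for k in self.frozen:
            self.dynamic.pop(k, None)
        self.hits.clear()
```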

1 citation


Proceedings ArticleDOI
07 Jun 2023
TL;DR: In this article, a cache-aware allocation for parallel jobs that are organized as directed acyclic graphs (DAGs) is proposed, which operates at a higher abstraction level, allocating jobs to cores based on the guidance of a predictive model that approximates the execution time of jobs with caching effects taken into account.
Abstract: Scheduling of tasks on multi- and many-cores benefits significantly from the efficient use of caches. Most previous approaches use the static analysis of software in the context of the processing hardware to derive fixed allocations of software to the cache. However, there are many issues with this approach in terms of pessimism, scalability, analysis complexity, maintenance cost, etc. Furthermore, with ever more complex functionalities being implemented in the system, it becomes nearly impracticable to use static analysis for deriving cache-aware scheduling methods. This paper focuses on a dynamic approach to maximise the throughput of multi-core systems by benefiting from the cache based on empirical assessments. The principal contribution is a novel cache-aware allocation for parallel jobs that are organised as directed acyclic graphs (DAGs). Instead of allocating instruction and data blocks to caches, the proposed allocation operates at a higher abstraction level that allocates jobs to cores, based on the guidance of a predictive model that approximates the execution time of jobs with caching effects taken into account. An implementation of the predictive model is constructed to demonstrate that the execution time approximations can be effectively obtained. The experimental results, including a real-world case study, prove the concept of the proposed cache-aware allocation approach and demonstrate its effectiveness over the state-of-the-art.

1 citation


Proceedings ArticleDOI
17 May 2023
TL;DR: In this paper, the authors present a Guard Cache that obfuscates cache timing, making it more difficult for cache timing attacks to succeed: false cache hits are created by using the Guard Cache as a victim cache, and false cache misses by randomly evicting cache lines.
Abstract: Cache side-channel attacks have exposed serious security vulnerabilities in modern architectures. These attacks rely on measuring cache access times to determine if an access to an address is a hit or a miss in the cache. Such information can be used to identify which addresses were accessed by the victim, which in turn can be used to reveal or at least guess the information accessed by the victim. Mitigating the attacks while preserving the performance has been a challenge. The hardware mitigation techniques used in the literature include complex cache indexing mechanisms, partitioning cache memories, and hiding or undoing the effects of speculation. In this paper, we present a Guard Cache to obfuscate cache timing, making it more difficult for cache timing attacks to succeed. We create false cache hits by using the Guard Cache as a Victim Cache, and false cache misses by randomly evicting cache lines. Our obfuscations can be turned on and off on demand to protect critical sections, or applied randomly to further obfuscate cache access times. We show that our false hits cause very minimal performance penalties, ranging from −0.2% to 3.0% performance loss, while false misses can cause higher performance losses. We also show that our approach produces a different number of cache hits and misses, and different miss addresses, compared to traditional caches, demonstrating that common side-channel attacks such as Prime & Probe, Flush & Reload, or Evict & Time are likely to misinterpret victims’ memory accesses. We use very small Guard Caches (1 KiB-2 KiB at L1 or 2 KiB-4 KiB at L2) requiring very minimal additional hardware. The hardware needed for random evictions is also minimal.
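The sketch below illustrates the two obfuscation mechanisms at a functional level only (a direct-mapped main cache, a tiny fully associative guard/victim buffer, and a random-eviction knob are all simplifying assumptions of mine, not the paper's hardware design):

```python
import random
from collections import OrderedDict

class GuardedCache:
    """Functional model: a direct-mapped cache plus a small victim buffer that
    turns some would-be misses into 'false hits', and a random-eviction knob
    that injects 'false misses'. Timing itself is not modeled."""

    def __init__(self, num_sets: int, guard_entries: int, false_miss_rate: float = 0.05):
        self.sets = [None] * num_sets    # address stored per set
        self.guard = OrderedDict()       # small fully associative victim buffer
        self.guard_entries = guard_entries
        self.false_miss_rate = false_miss_rate

    def access(self, addr: int) -> bool:
        """Return True on a (possibly false) hit, False on a (possibly false) miss."""
        idx = addr % len(self.sets)
        if self.sets[idx] == addr and random.random() < self.false_miss_rate:
            self._to_guard(addr)         # randomly evict the line -> false miss for attackers
            self.sets[idx] = None
            return False
        if self.sets[idx] == addr:
            return True
        if addr in self.guard:           # false hit from the attacker's viewpoint
            self.guard.move_to_end(addr)
            return True
        victim = self.sets[idx]
        if victim is not None:
            self._to_guard(victim)       # evicted line parked in the guard cache
        self.sets[idx] = addr
        return False

    def _to_guard(self, addr: int) -> None:
        self.guard[addr] = True
        self.guard.move_to_end(addr)
        if len(self.guard) > self.guard_entries:
            self.guard.popitem(last=False)
```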

Book ChapterDOI
TL;DR: In this article, the cache is divided into two parts: one part is used to store looping reference data, and the rest of the cache is managed by an ML-based algorithm.
Abstract: Cache is used to reduce performance differences between storage layers. It is widely used in databases, operating systems, network systems, and applications. The loop reference pattern, where blocks are referenced repeatedly at regular intervals, is a common phenomenon during data referencing, and good management of looping reference blocks can effectively help improve the performance of cache management. In this work, we propose a loop assistant cache replacement (LOACR) policy. We divide the cache into two parts: one part is used to store looping reference data, and the rest of the cache is managed by an ML-based algorithm. We regularly identify the looping reference pattern and the specific information of the loop at the end of every window. At the same time, we place the looping reference data that may appear in the next window into the cache in advance to improve the hit rate of the cache. In the remaining cache space, a parameter-free machine learning approach drives the LRU and LFU specialists for cache replacement. Finally, we evaluate LOACR across a broad range of experiments using multiple sets of cache configurations across multiple data sets.
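A minimal sketch of the loop-detection-and-prefetch half of this idea (the window length, the crude periodicity heuristic, and all names are my assumptions, not the LOACR algorithm; the ML-managed region is omitted):

```python
def detect_loop(window: list[int], min_repeats: int = 2) -> list[int] | None:
    """If the access window is the same block sequence repeated at a regular
    interval, return that sequence, else None (a crude periodicity heuristic)."""
    n = len(window)
    for period in range(1, n // min_repeats + 1):
        if n % period == 0 and all(window[i] == window[i % period] for i in range(n)):
            return window[:period]
    return None

class LoopAssistedCache:
    def __init__(self, loop_slots: int, window_len: int = 32):
        self.loop_slots = loop_slots       # cache region reserved for looping blocks
        self.loop_region: set[int] = set()
        self.window: list[int] = []
        self.window_len = window_len

    def on_access(self, block: int) -> bool:
        hit = block in self.loop_region
        self.window.append(block)
        if len(self.window) == self.window_len:
            loop = detect_loop(self.window)
            if loop:
                # Prefetch the blocks expected to recur in the next window.
                self.loop_region = set(loop[: self.loop_slots])
            self.window.clear()
        return hit
```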

Journal ArticleDOI
TL;DR: In this article, a fine-grained in-memory database benchmark is developed to evaluate the performance of each operator on different CPUs and explore how CPU hardware architectures influence performance; the authors find that micro cache architectures play an important role, in contrast to core count and cache size.
Abstract: Modern CPUs keep integrating more cores and larger caches, which benefits in-memory databases by improving parallel processing power and cache locality. Since state-of-the-art CPUs have diverse architectures and roadmaps, such as large core count and large cache size (AMD x86), moderate core count and cache size (Intel x86), and large core count with moderate cache size (ARM), exploring in-memory database performance characteristics on different CPU architectures is important for in-memory database design and optimization. In this paper, we develop a fine-grained in-memory database benchmark to evaluate the performance of each operator on different CPUs and explore how CPU hardware architectures influence performance. Contrary to the well-known conclusion that more cores and larger caches yield higher performance, we find that the micro cache architectures play an important role, in contrast to core count and cache size: a shared monolithic L3 cache of moderate size beats a large disaggregated L3 cache. The experiments also show that predicting operator performance on different CPUs is difficult given the diverse CPU architectures and micro cache architectures, and that different implementations of each operator are neither uniformly fast nor slow, exhibiting interleaved strong and weak performance regions influenced by CPU hardware architectures. Intel x86 CPUs represent a cache-centric processor design, while AMD x86 and ARM CPUs represent a computing-centric processor design; the OLAP benchmark experiments on SSB show that OmniSciDB and OLAP Accelerator, with a vector-wise processing model, perform well on Intel x86 CPUs compared to AMD x86 CPUs, while the JIT-compilation-based Hyper prefers AMD x86 CPUs to Intel x86 CPUs. The CPU roadmaps of increasing core count or improving cache locality should be considered for in-memory database algorithm design and platform selection.

Proceedings ArticleDOI
26 Apr 2023
TL;DR: In this paper, a real-time proactive edge content caching scheme using deep reinforcement learning is proposed, which trains cache servers to perform cache prefetch in real time on each request, thus combining cache prefetch and cache replacement without requiring users' privacy information.
Abstract: Edge content cache (ECC) is considered a key technology in Intelligent Transportation Systems (ITS). It can meet the growing demand for computing-intensive and latency-sensitive vehicular applications while improving communication efficiency. Existing ECC schemes either achieve cache prefetch only by predicting content demands in a future time period, or merely implement cache replacement based on current requests while ignoring the future. These schemes still have the potential to improve performance if current and future request context information are combined. We propose a real-time proactive edge content caching scheme using deep reinforcement learning. This new scheme trains cache servers to perform cache prefetch in real time on each request, thus combining cache prefetch and cache replacement without requiring users’ privacy information. In addition, it enables collaboration between multiple location-related cache servers. This new solution can improve the efficiency of ECC and reduce the overall communication cost while protecting user privacy. Simulation results indicate that this scheme outperforms other baseline schemes in communication cost and cache hit ratio.

Journal ArticleDOI
TL;DR: In this paper, the authors present a performance comparison simulation of seven cache replacement algorithms on various internet traffic extracted from the public IRcache dataset and show that hit ratio performance is strongly influenced by cache size and by cacheable and unique requests.
Abstract: Internet users tend to skip and look for alternative websites if response times are slow. For cloud network managers, implementing a caching strategy on the edge network can help lighten the workload of databases and application servers. The caching strategy is carried out by storing frequently accessed data objects in cache memory, so that repeated access to the same data becomes faster. Cache replacement is the main mechanism of the caching strategy. There are seven cache replacement algorithms with good performance that can be used, namely LRU, LFU, LFUDA, GDS, GDSF, SIZE, and FIFO. Each algorithm is developed for the particular internet traffic patterns it targets; therefore, no single cache replacement algorithm is superior to the others in all conditions. This paper presents a performance comparison simulation of the seven cache replacement algorithms on various internet traffic extracted from the public IRcache dataset. The results of this study indicate that hit ratio performance is strongly influenced by cache size and by the numbers of cacheable and unique requests. The fewer unique requests that occur, the higher the hit ratio obtained. The LRU algorithm shows excellent hit ratio performance for cache replacement under normal internet conditions. However, when the access impulse phenomenon occurs, the GDSF algorithm is superior in obtaining hit ratios with limited cache memory capacity. The simulation results show that GDSF reaches a 50.75% hit ratio while LRU reaches only 49.17% when access anomalies occur.
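For reference, the GDSF family ranks objects by a size- and frequency-aware priority; a common textbook formulation (sketched below under my own simplifications, not necessarily the exact variant simulated in the paper) sets priority = L + frequency × cost / size, where L is a global aging value, and evicts the lowest-priority object.

```python
import heapq

class GDSFCache:
    """Greedy-Dual-Size-Frequency sketch: priority = L + freq * cost / size,
    where L is a global aging value raised to each evicted object's priority.
    (Objects larger than the cache are not special-cased in this sketch.)"""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.L = 0.0
        self.meta = {}    # key -> (priority, freq, size)
        self.heap = []    # (priority, key); stale entries are skipped lazily

    def _priority(self, freq: int, size: int, cost: float = 1.0) -> float:
        return self.L + freq * cost / size

    def access(self, key: str, size: int) -> bool:
        if key in self.meta:
            _, freq, size = self.meta[key]
            prio = self._priority(freq + 1, size)
            self.meta[key] = (prio, freq + 1, size)
            heapq.heappush(self.heap, (prio, key))
            return True
        while self.used + size > self.capacity and self.meta:
            prio, victim = heapq.heappop(self.heap)
            if victim in self.meta and self.meta[victim][0] == prio:  # skip stale entries
                self.L = prio                                          # aging: inflate future priorities
                self.used -= self.meta[victim][2]
                del self.meta[victim]
        prio = self._priority(1, size)
        self.meta[key] = (prio, 1, size)
        self.used += size
        heapq.heappush(self.heap, (prio, key))
        return False
```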

Journal ArticleDOI
TL;DR: In this article, a new type of programmable cache called the lease cache is proposed, in which software exerts the primary control over when and how long data stays in the cache.
Abstract: Cache management is important in exploiting locality and reducing data movement. This paper studies a new type of programmable cache called the lease cache. By assigning leases, software exerts the primary control on when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache. This paper develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA with the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 times to 3,000 times the cache size, and the techniques in this paper realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization.
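A toy model of lease-based management (leases measured in logical accesses, a single scalar lease per block, and all names are assumptions made for illustration; the paper's optimal and practical lease-assignment techniques are not reproduced):

```python
class LeaseCache:
    """Each cached block carries a software-assigned lease, counted here in
    logical accesses; a block whose lease has expired is evicted lazily."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.expiry = {}      # block -> logical time at which its lease ends
        self.clock = 0

    def access(self, block: int, lease: int) -> bool:
        """Reference `block`, renewing its lease for `lease` future accesses."""
        self.clock += 1
        self._expire()
        hit = block in self.expiry
        self.expiry[block] = self.clock + lease
        if not hit and len(self.expiry) > self.capacity:
            # Capacity pressure: drop the block whose lease ends soonest.
            soonest = min(self.expiry, key=self.expiry.get)
            del self.expiry[soonest]
        return hit

    def _expire(self) -> None:
        for blk in [b for b, t in self.expiry.items() if t <= self.clock]:
            del self.expiry[blk]
```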

Proceedings ArticleDOI
05 Jun 2023
TL;DR: Zhang et al. as mentioned in this paper proposed using the buffer of the fixed graphics pipeline as a victim cache to enhance the GPU cache, and optimized the victim cache management strategy by monitoring load reusability.
Abstract: The limited GPU cache size in general-purpose computing hinders the execution efficiency of thousands of concurrent threads. Several techniques have been proposed to increase the cache size per thread, such as repurposing shared memory and register files as a cache to reduce contention. However, these studies only focus on improving the general-purpose computing hardware structure, ignoring the fixed graphics pipeline hardware structure. To solve this issue, we propose repurposing the buffer of the fixed graphics pipeline as a victim cache to enhance the GPU cache. This strategy utilizes the idle fixed graphics pipeline buffer as a victim cache for general-purpose computing tasks. The victim cache returns data to the L1 data cache when a load hits in the victim cache. We also optimize the victim cache management strategy by monitoring load reusability and only allocating cache lines to loads with higher reusability. This optimization improves the victim cache efficiency. Our experimental results show that the RBGC achieves a 39.1% performance improvement with minimal hardware overhead compared to the baseline GPU architecture.

Journal ArticleDOI
30 Apr 2023
TL;DR: In this article, a hybrid algorithm for web caching using semantic similarity is developed: the GDFS algorithm is extended with NGD (Normalized Google Distance) to determine the semantic similarity between cache objects, which results in better performance in comparison to other algorithms.
Abstract: Recently, most companies have been conducting their business through the cloud network, so increasing the speed of data access via the Internet has become very important, and cache algorithms matter precisely because they increase the speed of data access. There are cache replacement algorithms that work well when applied to the processor cache, but they hardly work when applied to caching for web purposes, since they were not designed for the complexity of this type of storage. Considering the challenges of cache usage in terms of the large variation in file size and the heterogeneous model of user access to data in a web environment, especially for pages with dynamic content, there is a real need to develop dedicated web cache algorithms. In this paper, a hybrid algorithm for web caching using semantic similarity is developed: the GDFS algorithm is extended with NGD (Normalized Google Distance) to determine the semantic similarity between cache objects, which results in better performance in comparison to other algorithms. The results showed that the hybrid web cache algorithm using semantic similarity increased the hit rate compared to the basic algorithms by up to 80.10%. The proposed hybrid algorithm was also able to overcome the problem of the low byte hit rate in the GDFS algorithm, with the improvement in byte hit rate reaching 65%. Finally, the results showed that the hybrid algorithm reduced page load time compared to algorithms that do not use semantic similarity or caching, from 4.17 seconds using GDFS-Line to 2.16 seconds using GDFS-NGD.
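For context, Normalized Google Distance between two terms is commonly defined from search-result counts; the standard formula is recalled below as a small Python function (how the paper wires it into GDFS priorities is not detailed here and the example counts are made up):

```python
from math import log

def ngd(f_x: float, f_y: float, f_xy: float, n_pages: float) -> float:
    """Normalized Google Distance from hit counts:
    NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                / (log N - min(log f(x), log f(y)))
    Smaller values mean the two terms (here: keywords of cached objects) are
    semantically closer; a GDFS-style policy could then boost the priority of
    objects similar to recently requested ones (that wiring is an assumption)."""
    lx, ly, lxy, ln = log(f_x), log(f_y), log(f_xy), log(n_pages)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Example with made-up counts: two terms that co-occur often have a small distance.
print(ngd(f_x=1e6, f_y=8e5, f_xy=5e5, n_pages=5e10))
```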

Proceedings ArticleDOI
21 Jun 2023
TL;DR: In this article, a new metric, called cache miss distribution, which describes cache miss behavior over cache sets, is proposed for predicting cache Miss Ratio Curves (MRCs) on commodity systems.
Abstract: The cache Miss Ratio Curve (MRC) serves a variety of purposes such as cache partitioning, application profiling and code tuning. In this work, we propose a new metric, called cache miss distribution, that describes cache miss behavior over cache sets, for predicting cache MRCs. Based on this metric, we present FLORIA, a software-based, online approach that approximates cache MRCs on commodity systems. By polluting a tunable number of cache lines in some selected cache sets using our designed microbenchmark, the cache miss distribution for the target workload is obtained via hardware performance counters with the support of precise event based sampling (PEBS). A model is developed to predict the MRC of the target workload based on its cache miss distribution. We evaluate FLORIA for systems consisting of a single application as well as a wide range of different workload mixes. Compared with the state-of-the-art approaches in predicting online MRCs, FLORIA achieves the highest average accuracy of 97.29% with negligible overhead. It also allows fast and accurate estimation of online MRC within 5ms, 20X faster than the state-of-the-art approaches. We also demonstrate that FLORIA can be applied to guiding cache partitioning for multiprogrammed workloads, helping to improve overall system performance.

Journal ArticleDOI
TL;DR: NCache, as mentioned in this paper, is a machine-learning-based caching scheme that optimizes both hit ratio and SSD performance and is orthogonal to the existing caching schemes within the flash translation layer.
Abstract: Inside a solid-state disk (SSD), the cache stores frequently accessed data to shorten the user-I/O response time and reduce the number of read/write operations in flash memory, thereby improving SSD performance and lifetime. Most existing cache schemes are anchored in the spatiotemporal locality of I/O requests in workloads. When facing long-running workloads, these caching schemes often lose both performance and hit rate. Flash-memory-aware caching schemes trade hit ratio for a longer SSD lifetime. In this article, we advocate a machine-learning-based caching scheme named NCache to optimize both hit ratio and SSD performance. In NCache, we construct a machine learning (i.e., ML) model to predict whether data are reaccessed before being evicted from the cache. The cache replacement scheme preferentially evicts data that would not be accessed again while in the cache, so that the cache space is conserved for valid data that are likely to be repeatedly accessed. A pipelined scheme is implemented to accelerate the ML model, alleviating the time cost of NCache, and a double-linked list boosts data addressing and the cache replacement process. NCache is orthogonal to the existing caching schemes within the flash translation layer. We validate NCache using a handful of real-world enterprise traces. Taking prn_0 as an example, NCache reduces the response time of LRU, clean-first LRU (CFLRU), GCaR_LRU, GCaR_CFLRU, and LCR by up to 15%, with an average of 6.4%. The erase count is reduced by up to 16%. Importantly, NCache also reduces write amplification by up to 15.9%.
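A bare-bones sketch of the "predict reuse, evict the non-reused first" idea (the predictor here is an arbitrary callable; the real NCache model, its features, its pipelining, and the double-linked list are not reproduced):

```python
from collections import OrderedDict
from typing import Callable

class ReusePredictedCache:
    """LRU-ordered cache that, on eviction, prefers blocks the supplied model
    predicts will NOT be reaccessed before leaving the cache; falls back to
    plain LRU if every cached block is predicted to be reused."""

    def __init__(self, capacity: int, will_be_reused: Callable[[int], bool]):
        self.capacity = capacity
        self.data = OrderedDict()             # block -> payload, LRU order
        self.will_be_reused = will_be_reused  # stand-in for the ML model

    def access(self, block: int, payload=None) -> bool:
        hit = block in self.data
        if hit:
            self.data.move_to_end(block)
        else:
            if len(self.data) >= self.capacity:
                self._evict()
            self.data[block] = payload
        return hit

    def _evict(self) -> None:
        for blk in self.data:                 # oldest first
            if not self.will_be_reused(blk):
                del self.data[blk]
                return
        self.data.popitem(last=False)         # fallback: plain LRU victim

# Usage with a trivial stand-in predictor (even-numbered blocks "will be reused"):
cache = ReusePredictedCache(capacity=2, will_be_reused=lambda b: b % 2 == 0)
for b in [1, 2, 3, 2, 1]:
    cache.access(b)
```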

Proceedings ArticleDOI
01 Feb 2023
TL;DR: In this article, the cache is partitioned into two regions, H-cache and L-cache, which store samples of high importance and low importance, respectively; sample substitutability and dynamic packaging are used to improve the cache hit ratio and reduce the number of random I/Os.
Abstract: Fetching a large amount of DNN training data from storage systems incurs long I/O latency and fetch stalls of GPUs. Importance sampling in DNN training can reduce the amount of data computed on GPUs while maintaining a similar model accuracy. However, existing DNN training frameworks do not have a cache layer that reduces the number of data fetches and manages cached items according to sample importance, resulting in unnecessary data fetches, poor cache hit ratios, and random I/Os when importance sampling is used. In this paper, we design a new importance-sampling-informed cache, namely iCache, to accelerate I/O-bound DNN training jobs. iCache only fetches parts of the samples instead of all samples in the dataset. The cache is partitioned into two regions: H-cache and L-cache, which store samples of high importance and low importance, respectively. Rather than using recency or frequency, we manage data items in H-cache according to their corresponding sample importance. When there is a cache miss in L-cache, we use sample substitutability and dynamic packaging to improve the cache hit ratio and reduce the number of random I/Os. When multiple concurrent jobs access the same datasets in H-cache, we design a model to assign relative importance values to cached samples to avoid cache thrashing, which may happen when there is no coordination among the concurrent training jobs. Our experimental results show that iCache has a negligible impact on training accuracy and speeds up the DNN training time by up to 2.0× compared to the state-of-the-art caching systems.
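A very rough sketch of the two-region split (the importance scores, region sizes, and the substitution rule on an L-cache miss are all placeholders of mine; the real iCache also handles dynamic packaging and multi-job coordination):

```python
import heapq
import random

class ImportancePartitionedCache:
    """H-cache keeps the highest-importance samples (evicting the least important),
    L-cache is a small FIFO; an L-cache miss may be served by substituting a
    cached sample of comparable (low) importance instead of issuing a random I/O."""

    def __init__(self, h_capacity: int, l_capacity: int):
        self.h_capacity, self.l_capacity = h_capacity, l_capacity
        self.h_heap = []        # min-heap of (importance, sample_id)
        self.h_set = set()
        self.l_fifo = []        # low-importance samples, FIFO order

    def admit(self, sample_id: int, importance: float) -> None:
        if len(self.h_heap) < self.h_capacity or importance > self.h_heap[0][0]:
            heapq.heappush(self.h_heap, (importance, sample_id))
            self.h_set.add(sample_id)
            if len(self.h_heap) > self.h_capacity:
                _, evicted = heapq.heappop(self.h_heap)   # drop the least-important sample
                self.h_set.discard(evicted)
        else:
            self.l_fifo.append(sample_id)
            if len(self.l_fifo) > self.l_capacity:
                self.l_fifo.pop(0)

    def fetch(self, sample_id: int, importance: float):
        if sample_id in self.h_set or sample_id in self.l_fifo:
            return sample_id                               # cache hit
        if importance < 0.5 and self.l_fifo:               # assumed substitutability threshold
            return random.choice(self.l_fifo)              # substitute a similar low-importance sample
        return None                                        # genuine miss -> storage I/O
```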

Posted ContentDOI
29 Jun 2023
TL;DR: AdaCache, as mentioned in this paper, uses an adaptive cache block allocation scheme that allocates cache blocks based on request size to achieve both good cache performance and low memory overhead for diverse cloud workloads with vastly different I/O patterns.
Abstract: NVMe SSD caching has demonstrated impressive capabilities in solving cloud block storage's I/O bottleneck and enhancing application performance in public, private, and hybrid cloud environments. However, traditional host-side caching solutions have several serious limitations. First, the cache cannot be shared across hosts, leading to low cache utilization. Second, the commonly used fixed-size cache block allocation mechanism is unable to provide good cache performance with low memory overhead for diverse cloud workloads with vastly different I/O patterns. This paper presents AdaCache, a novel userspace disaggregated cache system that utilizes adaptive cache block allocation for cloud block storage. First, AdaCache proposes an innovative adaptive cache block allocation scheme that allocates cache blocks based on the request size to achieve both good cache performance and low memory overhead. Second, AdaCache proposes a group-based cache organization that stores cache blocks in groups to solve the fragmentation problem brought by variable-sized cache blocks. Third, AdaCache designs a two-level cache replacement policy that replaces cache blocks both as single blocks and as groups to improve the hit ratio. Experimental results with real-world traces show that AdaCache can substantially improve I/O performance and reduce storage accesses caused by cache misses with much lower memory usage compared to traditional fixed-size cache systems.
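A simplified sketch of request-size-based block allocation with group organization (the size classes, the group-eviction rule, and the two-level LRU policy below are assumptions for illustration, not AdaCache's actual parameters):

```python
from collections import OrderedDict

# Assumed size classes (bytes): a request is rounded up to the nearest class
# instead of always using one fixed block size.
SIZE_CLASSES = [4 << 10, 16 << 10, 64 << 10, 256 << 10]

def block_size_for(request_bytes: int) -> int:
    for cls in SIZE_CLASSES:
        if request_bytes <= cls:
            return cls
    return SIZE_CLASSES[-1]

class GroupedAdaptiveCache:
    """Variable-size blocks are stored in per-size-class groups to limit
    fragmentation; replacement is two-level: LRU inside a group, plus a simple
    class-level rule (largest non-empty class first) to pick which group to
    evict from, a crude stand-in for group-level replacement."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.groups = {cls: OrderedDict() for cls in SIZE_CLASSES}  # cls -> {key: block_size}

    def put(self, key: str, request_bytes: int) -> None:
        cls = block_size_for(request_bytes)
        group = self.groups[cls]
        if key in group:
            group.move_to_end(key)
            return
        while self.used + cls > self.capacity and self.used > 0:
            self._evict()
        group[key] = cls
        self.used += cls

    def _evict(self) -> None:
        for cls in reversed(SIZE_CLASSES):          # pick a group to shrink
            group = self.groups[cls]
            if group:
                _, size = group.popitem(last=False)  # LRU victim inside the group
                self.used -= size
                return

    def get(self, key: str, request_bytes: int) -> bool:
        group = self.groups[block_size_for(request_bytes)]
        if key in group:
            group.move_to_end(key)
            return True
        return False
```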

Journal ArticleDOI
TL;DR: In this article, a secure-aware partitioning guide architecture is proposed to improve performance and write endurance by removing the necessity of cache flushing: when the cache partitioning status changes, write counts are considered for the new status and no cache lines are evicted.
Abstract: Attackers of modern computer architectures have found that the cache access latency difference between a cache hit and a cache miss is a point through which secure data can be leaked. To prevent such data leakage, defenders utilize the cache partitioning technique to isolate cache hits. Although this approach is effective in increasing resistance against cache timing attacks, it is not suitable for emerging memory systems based on non-volatile memories, because it overlooks the weaknesses of their write operations. This paper proposes a secure-aware partitioning guide architecture to improve performance and write endurance by removing the necessity of cache flushing. When the cache partitioning status changes, the write counts are considered for the new status and no cache lines are evicted. As a result, the lifetime is extended by 1.77 times and the 7.8% penalty of cache flushing is saved.

Book ChapterDOI
Pedro C. Cavadas
01 Jan 2023
TL;DR: In this article, a state-of-the-art adaptive dynamic cache line replacement strategy, LWIRR, is proposed based on the access patterns of memory workloads from complex applications in the CPU-2017 and CPU-2006 benchmarks, and is simulated with a trace-driven multi-core simulator using memory address traces for shared L3 cache configurations.
Abstract: Multi-core processors from different processor design companies such as Intel and AMD introduce shared cache memory architectures to improve performance and resource utilization. Most modern processors from Intel and AMD use traditional replacement techniques such as LRU (least recently used) and pseudo-LRU for evicting a cache line from the low-level shared L3 cache. The shared last-level L3 cache varies in capacity and associativity across state-of-the-art processors, and an adaptive dynamic replacement technique is required for better utilization of the shared cache memory, as the old replacement techniques lead to poor performance at the shared last-level cache under memory-intensive workloads with different access patterns. In this manuscript we propose a state-of-the-art replacement strategy, LWIRR, based on the access patterns of memory workloads from complex applications in the CPU-2017 and CPU-2006 benchmarks, and simulate it with our trace-driven multi-core simulator using a number of memory address traces for shared L3 cache configurations. We show that our proposed replacement algorithm outperforms LRU and that LWIRR achieves better performance in terms of cache hit rate, instructions per cycle, and execution time.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a dynamic active and collaborative cache management (DAC) scheme, where the cache is composed of a cold cache, a hot cache, a ghost cold cache, and a ghost hot cache.

Journal ArticleDOI
TL;DR: In this article, an RL-based hybrid caching strategy is proposed in which the routers work in a distributed manner and learn to pick the most suitable policy for caching content, which decouples the caching policy selection from the admission logic used by the selected policy.

Proceedings ArticleDOI
20 Feb 2023
TL;DR: In this paper, the authors examine the Southern California Petabyte Scale Cache for a high-energy physics experiment and find that the cache removed 67.6% of file requests from the wide-area network and reduced the traffic volume on the wide-area network by 12.3 TB (35.4%) on an average day.
Abstract: Large scientific collaborations often have multiple scientists accessing the same set of files while doing different analyses, which creates repeated accesses to large amounts of shared data located far away. These data accesses have long latency due to distance and occupy the limited bandwidth available over the wide-area network. To reduce the wide-area network traffic and the data access latency, regional data storage caches have been installed as a new networking service. To study the effectiveness of such a cache system in scientific applications, we examine the Southern California Petabyte Scale Cache for a high-energy physics experiment. By examining about 3TB of operational logs, we show that this cache removed 67.6% of file requests from the wide-area network and reduced the traffic volume on the wide-area network by 12.3 TB (or 35.4%) on an average day. The reduction in the traffic volume (35.4%) is less than the reduction in file counts (67.6%) because the larger files are less likely to be reused. Due to this difference in data access patterns, the cache system has implemented a policy to avoid evicting smaller files when processing larger files. We also build a machine learning model to study the predictability of the cache behavior. Tests show that this model is able to accurately predict the cache accesses, cache misses, and network throughput, making the model useful for future studies on resource provisioning and planning.

Journal ArticleDOI
TL;DR: In this article, an aging-based Least Frequently Used (LFU) algorithm is proposed to fill the cache memory with the most frequently used data; the cache block with the lowest age count and priority is eliminated first.
Abstract: In today's world of Business Intelligence (BI), fast and efficient access to data from Data Warehouses (DW) is crucial. With the increasing amount of Big Data, caching has become one of the most effective techniques for improving data access performance. DWs are widely used by organizations for managing and using data in Decision Support Systems (DSS). To optimize the performance of fetching data from DWs, various methods have been employed, and one of them is the Query Cache method. Our proposed work focuses on a cache-based mechanism that improves the performance of DWs in two ways. First, it reduces execution time by directly accessing records from cache memory, and second, it saves cache memory space by eliminating non-frequently used data. Our goal is to fill the cache memory with the most frequently used data. To achieve this objective, we utilize an aging-based Least Frequently Used (LFU) algorithm that considers the size and frequency of data simultaneously. This algorithm manages the priority and expiry age of data in cache memory by taking into account both the size and frequency of the data. LFU assigns priorities and counts the age of data placed in the cache memory. The cache block entry with the lowest age count and priority is eliminated first. Ultimately, our proposed cache mechanism efficiently utilizes cache memory and significantly improves the performance of data access between the main DW and business user queries.
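One plausible reading of an aging-based LFU that weighs size and frequency together is sketched below; the priority formula, the aging step, and the class name are assumptions, not the paper's exact definitions.

```python
class AgingLFUCache:
    """LFU variant: each entry tracks frequency, size, and an age counter;
    priority grows with frequency and shrinks with size and age, and the
    lowest-priority entry is evicted first."""

    def __init__(self, capacity: int, age_step: int = 100):
        self.capacity = capacity
        self.age_step = age_step      # age every entry after this many accesses
        self.accesses = 0
        self.entries = {}             # key -> [freq, size, age]

    def _priority(self, freq: int, size: int, age: int) -> float:
        return freq / (size * (1 + age))   # assumed weighting of frequency vs size/age

    def access(self, key: str, size: int) -> bool:
        self.accesses += 1
        if self.accesses % self.age_step == 0:
            for entry in self.entries.values():
                entry[2] += 1         # periodic aging of every cached entry
        if key in self.entries:
            self.entries[key][0] += 1
            return True
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self._priority(*self.entries[k]))
            del self.entries[victim]
        self.entries[key] = [1, size, 0]
        return False
```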

Journal ArticleDOI
TL;DR: In this article, the authors propose mobile coded caching schemes to reduce network traffic in mobility scenarios, which achieve a lower cost of caching information uploading by first constructing caching patterns and then assigning the caching patterns to users according to the graph coloring method and the four color theorem.
Abstract: In coded caching, users cache pieces of files under a specific arrangement so that the server can satisfy their requests simultaneously in the broadcast scenario via the eXclusive OR (XOR) operation and therefore reduce the amount of transmitted data. However, when users' locations change, the uploading of caching information is so frequent and extensive that the traffic increase outweighs the traffic reduction that traditional coded caching achieves. In this paper, we propose mobile coded caching schemes to reduce network traffic in mobility scenarios, which achieve a lower cost of caching information uploading. In the cache placement phase, the proposed scheme first constructs caching patterns, and then assigns the caching patterns to users according to the graph coloring method and the four color theorem in our centralized cache placement algorithm, or randomly in our decentralized cache placement algorithm. Then users are divided into groups based on their caching patterns. As a benefit, when user movements occur, only the types of caching pattern, rather than the complete caching information about which file pieces are cached, are uploaded. In the content delivery phase, XOR coded caching messages are reconstructed. Transmission data volume is derived to measure the performance of the proposed schemes. Numerical results show that the proposed schemes achieve great improvement in traffic offloading.
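A tiny worked example of the XOR delivery step that coded caching relies on (two users, two files split into halves; the labels and payloads are mine): if user 1 caches halves A1 and B1 and requests file A, while user 2 caches halves A2 and B2 and requests file B, the server can broadcast A2 ⊕ B1 once and each user recovers its missing half by XOR-ing with what it already caches.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

# Files split into two equal halves (toy payloads).
A1, A2 = b"AAAA", b"aaaa"
B1, B2 = b"BBBB", b"bbbb"

# Placement phase: user 1 caches {A1, B1}, user 2 caches {A2, B2}.
cache_u1 = {"A1": A1, "B1": B1}
cache_u2 = {"A2": A2, "B2": B2}

# Delivery phase: user 1 requests file A (missing A2), user 2 requests file B (missing B1).
broadcast = xor_bytes(A2, B1)          # a single coded transmission serves both users

# Each user XORs the broadcast with the half it already caches.
recovered_A2 = xor_bytes(broadcast, cache_u1["B1"])
recovered_B1 = xor_bytes(broadcast, cache_u2["A2"])
assert recovered_A2 == A2 and recovered_B1 == B1
```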

Journal ArticleDOI
TL;DR: In this article, a multi-antenna coded caching scheme is proposed for the shared cache setting with more diverse user-to-cache associations, where the number of antennas is not less than the number of users connected to the least occupied cache.
Abstract: We consider the multi-antenna shared cache setup where the server is equipped with multiple transmit antennas and is connected to a set of users that are assisted with a smaller number of helper nodes. The helper nodes serve as caches that are shared among the users. Each user gets connected to exactly one helper cache. For this setting, under uncoded cache placement, an optimal multi-antenna coded caching scheme supporting all types of user-to-cache association is known in the literature which is applicable only when the number of antennas is at most the number of users connected to the least occupied cache. The other known multi-antenna shared caching schemes considered only uniform association of users to helper caches. In this work, we propose a multi-antenna coded caching scheme for the shared cache setting with more diverse user-to-cache associations where the number of antennas is not less than the number of users connected to the least occupied cache. Under certain scenarios, our scheme achieves the same performance exhibited by some optimal schemes.

Journal ArticleDOI
TL;DR: In this article, an enhanced caching strategy named the Priority-based Content Popularity-Aware (PCPA) caching strategy is proposed and evaluated by comparing its performance with some of the novel NDN-based IoT caching strategies.
Abstract: Named Data Networking (NDN) is considered the future of Internet architecture, providing a realistic solution for data delivery using a caching module in an Internet of Things (IoT) based environment. However, a major challenge of the caching module is data redundancy, which decreases the overall caching performance by caching similar data at numerous locations in an NDN-based IoT scenario. Moreover, latency and stretch increase due to highly redundant caching operations. Several attempts have been made by the research community to provide an enhanced solution to overcome such issues. However, the caching module still requires efficient enhancement. This study provides critical insights into earlier caching strategies. To solve the problems of these caching strategies, an enhanced caching strategy is proposed, named the Priority-based Content Popularity-Aware (PCPA) Caching Strategy, which is evaluated by comparing its performance with some of the novel NDN-based IoT caching strategies. The proposed caching strategy outperforms the compared strategies in terms of latency, hop count, cache hit ratio, and energy consumption.

Book ChapterDOI
01 Jan 2023
TL;DR: In this paper, a learning automata-based cache update policy is designed to determine the appropriate content to be cached in RSUs, which can significantly improve the average cache hit ratio, minimize latency, and enhance the quality of experience.
Abstract: In vehicular ad hoc networks (VANETs), caching is a very promising technique to reduce the transmission burden and to improve the users' Quality of Experience (QoE) in terms of latency. Increasing the cache hit ratio is very important for delay-sensitive applications. In this paper, the average cache hit ratio maximization problem is proposed and formulated while taking into account the time-varying topology of the network, erratic vehicular (user) mobility, the varying requests and preferences of multiple users, and the limited cache capacity of the Road Side Units (RSUs). A learning automata-based cache update policy has been designed in order to determine appropriate content to be cached in RSUs. The performance of the learning scheme-based caching policy has been evaluated using simulations and analysed in comparison with three other caching policies. Simulation results indicate that the learning-based caching policy can significantly improve the average cache hit ratio, minimize latency, and thus enhance the Quality of Experience.

Posted ContentDOI
21 Feb 2023
TL;DR: In this article, a GPU shared L1 cache architecture with an aggregated tag array is proposed to reduce resource contention and take full advantage of inter-core locality by decoupling and aggregating the tag arrays of multiple L1 caches.
Abstract: GPU shared L1 cache is a promising architecture while still suffering from high resource contentions. We present a GPU shared L1 cache architecture with an aggregated tag array that minimizes the L1 cache contentions and takes full advantage of inter-core locality. The key idea is to decouple and aggregate the tag arrays of multiple L1 caches so that the cache requests can be compared with all tag arrays in parallel to probe the replicated data in other caches. The GPU caches are only accessed by other GPU cores when replicated data exists, filtering out unnecessary cache accesses that cause high resource contentions. The experimental results show that GPU IPC can be improved by 12% on average for applications with a high inter-core locality.

Proceedings ArticleDOI
07 Jun 2023
TL;DR: In this article, the authors propose an analysis approach for shared caches using the least recently used (LRU) replacement policy, which leverages timing information to produce tight bounds on the worst-case interference.
Abstract: Caches are used to bridge the gap between main memory and the significantly faster processor cores. In multi-core architectures, the last-level cache is often shared between cores. However, sharing a cache causes inter-core interference to emerge. Concurrently running tasks will experience additional cache misses as the competing tasks issue interfering accesses and trigger the eviction of data contained in the shared cache. Thus, to compute a task’s worst-case execution time (WCET), a safe bound on the effects of inter-core cache interference has to be determined. In this paper, we propose a novel analysis approach for shared caches using the least recently used (LRU) replacement policy. The presented analysis leverages timing information to produce tight bounds on the worst-case interference. We describe how inter-core cache interference may be expressed as a function of time using event-arrival curves. Thus, by determining the maximal duration between subsequent accesses to a cache block, it is possible to bound the inter-core interference. This enables us to classify accesses as cache hits or potential misses. We implemented the analysis in a WCET analyzer and evaluated its performance for multi-core systems containing 2, 4, and 8 cores using shared caches from 4 KB to 32 KB. The analysis achieves significant improvements compared to a standard interference analysis with WCET reductions of up to 60%. The average WCET reduction is 9% for dual-core, 15% for quad-core, and 11% for octa-core systems. The analysis runtime overhead ranges from a factor of 4× to 7× compared to the baseline analysis.