
Showing papers on "Cache coloring published in 2018"


Journal ArticleDOI
01 Feb 2018
TL;DR: It is shown that minimizing the retrieval cost corresponds to solving an online knapsack problem, and new dynamic policies inspired by simulated annealing are proposed, including DynqLRU, a variant of qLRU that significantly outperforms state-of-the-art policies.
Abstract: Cache policies to minimize the content retrieval cost have been studied through competitive analysis when the miss costs are additive and the sequence of content requests is arbitrary. More recently, a cache utility maximization problem has been introduced, where contents have stationary popularities and utilities are strictly concave in the hit rates. This paper bridges the two formulations, considering linear costs and content popularities. We show that minimizing the retrieval cost corresponds to solving an online knapsack problem, and we propose new dynamic policies inspired by simulated annealing, including DynqLRU, a variant of qLRU. We prove that DynqLRU asymptotically converges to the optimum under the characteristic time approximation. In a real scenario, popularities vary over time and their estimation is very difficult. DynqLRU does not require popularity estimation, and our realistic, trace-driven evaluation shows that it significantly outperforms state-of-the-art policies, with up to 45% cost reduction.
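For intuition only, the following is a minimal sketch of a cost-aware q-LRU admission rule in the spirit described above: on a miss, the requested content is admitted to the cache with a probability that grows with its retrieval cost, so expensive contents are cached more aggressively. The slot count, the BASE_Q constant and the request trace are invented for illustration, and the snippet is not the DynqLRU policy itself (which, among other things, anneals its parameter over time).

/* Hypothetical sketch of a cost-aware q-LRU policy: on a miss, the
 * requested item is inserted only with a probability that grows with
 * its retrieval cost, so expensive items are cached more aggressively.
 * Illustration of the general idea only, not the DynqLRU policy. */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_SLOTS 4
#define BASE_Q      0.2   /* insertion probability for unit-cost items */

typedef struct { int id; double cost; int valid; } entry_t;

static entry_t cache[CACHE_SLOTS];   /* index 0 = most recently used */

/* Move slot i to the MRU position. */
static void promote(int i) {
    entry_t e = cache[i];
    for (int j = i; j > 0; j--) cache[j] = cache[j - 1];
    cache[0] = e;
}

/* Returns the cost paid to serve the request (0 on a hit). */
static double request(int id, double cost) {
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].id == id) { promote(i); return 0.0; }

    /* Miss: pay the retrieval cost, then insert probabilistically. */
    double q = BASE_Q * cost;            /* cost-weighted admission */
    if (q > 1.0) q = 1.0;
    if ((double)rand() / RAND_MAX < q) {
        cache[CACHE_SLOTS - 1] = (entry_t){ id, cost, 1 };  /* evict LRU */
        promote(CACHE_SLOTS - 1);
    }
    return cost;
}

int main(void) {
    srand(42);
    int ids[]      = { 1, 2, 3, 1, 4, 1, 5, 2, 1, 3 };
    double costs[] = { 5, 1, 1, 5, 1, 5, 1, 1, 5, 1 };
    double total = 0;
    for (int i = 0; i < 10; i++) total += request(ids[i], costs[i]);
    printf("total retrieval cost: %.1f\n", total);
    return 0;
}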

48 citations


Proceedings ArticleDOI
01 Feb 2018
TL;DR: This paper addresses the problem of providing spatial and temporal isolation between execution domains in a hypervisor running on an ARM multicore platform by carefully managing the two primary shared hardware resources: the last-level cache (LLC) and the DRAM memory controller.
Abstract: This paper addresses the problem of providing spatial and temporal isolation between execution domains in a hypervisor running on an ARM multicore platform. Isolation is achieved by carefully managing the two primary shared hardware resources of today's multicore platforms: the last-level cache (LLC) and the DRAM memory controller. The XVISOR open-source hypervisor and the ARM Cortex A7 platform have been used as reference systems for the purpose of this work. Spatial partitioning on the LLC has been implemented by means of cache coloring, which has been tightly integrated with the ARM virtualization extensions (ARM-VE) to deal with the memory virtualization capabilities offered by a two-stage memory management unit (MMU). Temporal isolation on the DRAM controller has been implemented by realizing a memory bandwidth reservation mechanism, which has been combined with the scheduling logic of the hypervisor. An extensive experimental evaluation has been performed on the popular Raspberry Pi 2 board, showing the effectiveness of the implemented solutions on a case-study composed of multiple Linux domains running state-of-the-art benchmarks.
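As background for the coloring mechanism mentioned above, the short program below shows the usual page-color arithmetic: the color of a physical page is the slice of the cache set index that lies above the page-offset bits, so pages of different colors can never map to the same LLC sets. The cache geometry constants are illustrative assumptions, not values taken from the paper or the Cortex-A7.

/* Minimal sketch of the page-coloring arithmetic behind cache coloring:
 * pages of different colors map to disjoint cache sets, so giving each
 * domain a disjoint set of colors partitions the LLC.  The cache
 * geometry below is assumed for illustration. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE  64u           /* bytes per cache line   (assumed) */
#define CACHE_SIZE (512u*1024u)  /* total LLC size         (assumed) */
#define WAYS       8u            /* associativity          (assumed) */
#define PAGE_SIZE  4096u         /* 4 KiB pages */

int main(void) {
    unsigned sets      = CACHE_SIZE / (LINE_SIZE * WAYS);   /* 1024 */
    unsigned set_bytes = sets * LINE_SIZE;   /* bytes covered by the index */
    unsigned colors    = set_bytes / PAGE_SIZE;              /* 16 here */

    printf("sets=%u colors=%u\n", sets, colors);

    /* Color of a physical address: page frame number modulo #colors. */
    uint64_t pa = 0x1234F000ull;
    unsigned color = (unsigned)((pa / PAGE_SIZE) % colors);
    printf("phys 0x%llx -> color %u\n", (unsigned long long)pa, color);
    return 0;
}

In the hypervisor setting described above, the same arithmetic would be applied when building the stage-2 translations, so that each domain's guest-physical pages are backed only by host-physical pages of its assigned colors.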

31 citations


Journal ArticleDOI
TL;DR: In order to minimize the cache miss cost in the hybrid main memory, a cost aware cache replacement policy (CACRP) is proposed that reduces the number of cache misses from NVM and improves the cache performance for a hybrid memory system.
Abstract: Fog computing requires a large main memory capacity to decrease latency and increase the Quality of Service (QoS). However, dynamic random access memory (DRAM), the commonly used random access memory, cannot be included in a fog computing system due to its high power consumption. In recent years, non-volatile memories (NVM) such as Phase-Change Memory (PCM) and Spin-Transfer Torque RAM (STT-RAM), with their low power consumption, have emerged to replace DRAM. Moreover, the currently proposed hybrid main memory, consisting of both DRAM and NVM, has shown promising advantages in terms of scalability and power consumption. However, drawbacks of NVM, such as long read/write latency, give rise to asymmetric cache misses in the hybrid main memory. Current last-level cache (LLC) policies are based on a unified miss cost, which results in poor LLC performance and adds to the cost of using NVM. In order to minimize the cache miss cost in the hybrid main memory, we propose a cost aware cache replacement policy (CACRP) that reduces the number of cache misses from NVM and improves cache performance for the hybrid memory system.
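To make the notion of asymmetric miss costs concrete, here is a hedged sketch of a victim-selection rule for a DRAM+NVM hybrid memory: among the lines of a set, evict the least recently used DRAM-backed line first, since re-fetching it is cheaper than re-fetching an NVM-backed line. This illustrates the general idea only and is not the CACRP policy proposed in the paper.

/* Hedged sketch of cost-aware LLC victim selection for a DRAM+NVM
 * hybrid main memory: prefer evicting the LRU DRAM-backed line, since
 * a later miss on it is cheaper than a miss on an NVM-backed line.
 * Not the CACRP algorithm from the paper. */
#include <stdio.h>

#define WAYS 4

typedef struct {
    int valid;
    int in_nvm;      /* 1 if the backing frame is in NVM, 0 if DRAM */
    unsigned age;    /* larger = less recently used */
} line_t;

/* Pick a victim way: LRU among DRAM-backed lines if any exist,
 * otherwise plain LRU over the whole set. */
static int pick_victim(const line_t set[WAYS]) {
    int victim = -1;
    for (int pass = 0; pass < 2 && victim < 0; pass++) {
        unsigned best_age = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) return w;              /* free way */
            if (pass == 0 && set[w].in_nvm) continue; /* 1st pass: DRAM only */
            if (set[w].age >= best_age) { best_age = set[w].age; victim = w; }
        }
    }
    return victim;
}

int main(void) {
    line_t set[WAYS] = {
        { 1, 1, 9 },   /* NVM-backed, oldest          */
        { 1, 0, 7 },   /* DRAM-backed                 */
        { 1, 0, 2 },   /* DRAM-backed, recently used  */
        { 1, 1, 5 },   /* NVM-backed                  */
    };
    printf("victim way = %d (expect 1: oldest DRAM-backed line)\n",
           pick_victim(set));
    return 0;
}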

25 citations


Proceedings ArticleDOI
TL;DR: In this paper, the authors defined four important characteristics of a suitable eviction policy for information centric networks (ICN) and proposed a new eviction scheme that is well suited to ICN-type cache networks.
Abstract: Information centric networks (ICN) can be viewed as networks of caches. At the same time, ICN-style cache networks have distinctive features, e.g., content popularity, usability time of content and other factors, that impose diverse requirements on cache eviction policies. In this paper we define four important characteristics of a suitable eviction policy for ICN. We analyse well-known eviction policies in view of the defined characteristics. Based upon this analysis, we propose a new eviction scheme that is well suited to ICN-type cache networks.

20 citations


Journal ArticleDOI
TL;DR: This paper gives an analytical method to find the miss rate of L2 cache for various configurations from the RD profile with respect to L1 cache and considers all three types of cache inclusion policies namely (i) Strictly Inclusive, (ii) Mutually Exclusive and (iii) Non-Inclusive Non-Exclusive.
Abstract: Reuse distance (RD) is an important metric for analytical estimation of cache miss rate. To find the miss rate of a particular cache, the reuse distance profile has to be measured for that particular level and configuration of the cache. A significant amount of simulation time and overhead can be saved if we can find the miss rate of a higher-level cache, such as L2, from the RD profile with respect to a lower-level cache (i.e., a cache that is closer to the processor) such as L1. The objective of this paper is to give an analytical method to find the miss rate of the L2 cache for various configurations from the RD profile with respect to the L1 cache. We consider all three types of cache inclusion policies, namely (i) Strictly Inclusive, (ii) Mutually Exclusive and (iii) Non-Inclusive Non-Exclusive. We first prove some general results relating the RD profile of the L1 cache to that of the L2 cache. We use probabilistic analysis for our derivations. We validate our model against simulations, using the multi-core simulator Sniper with the PARSEC and SPLASH benchmark suites.
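For readers unfamiliar with reuse distance, the snippet below shows the basic relation the paper builds on: for a fully associative LRU cache with C blocks, a reference with reuse distance d hits iff d < C, so the miss rate is the tail of the RD histogram at and beyond C. The histogram values are invented, and the paper's actual contribution (deriving the L2 miss rate from the L1 RD profile under different inclusion policies) is not reproduced here.

/* Background sketch of the basic reuse-distance model: for a fully
 * associative LRU cache with `cache_blocks` blocks, a reference with
 * reuse distance d hits iff d < cache_blocks.  Histogram values are
 * made up for illustration. */
#include <stdio.h>

#define MAX_RD 8

/* rd_hist[d] = number of references whose reuse distance is d;
 * cold_misses = references with infinite reuse distance (first touch). */
static double miss_rate(const unsigned rd_hist[MAX_RD],
                        unsigned cold_misses, unsigned cache_blocks) {
    unsigned total = cold_misses, misses = cold_misses;
    for (unsigned d = 0; d < MAX_RD; d++) {
        total += rd_hist[d];
        if (d >= cache_blocks) misses += rd_hist[d];
    }
    return total ? (double)misses / total : 0.0;
}

int main(void) {
    unsigned hist[MAX_RD] = { 40, 25, 15, 8, 5, 3, 2, 2 }; /* illustrative */
    for (unsigned c = 1; c <= MAX_RD; c *= 2)
        printf("C=%u blocks -> miss rate %.3f\n", c, miss_rate(hist, 10, c));
    return 0;
}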

17 citations


Journal ArticleDOI
TL;DR: This article provides key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches), and designs efficient heuristics for Amdahl applications.
Abstract: Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? In this paper, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
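The toy program below only illustrates the scheduling question being asked: given a core count and an LLC share for each Amdahl application, compute each completion time and take the maximum as the makespan. The cost model (Amdahl's law in the core count times a placeholder miss penalty that shrinks with the cache share) and all the numbers are assumptions for illustration, not the model analysed in the paper.

/* Illustrative makespan computation for co-scheduled Amdahl
 * applications under a deliberately simple placeholder cost model. */
#include <stdio.h>

typedef struct {
    double work;     /* total sequential work              */
    double seq_frac; /* Amdahl sequential fraction         */
    int    cores;    /* cores given to this application    */
    double cache;    /* fraction of the LLC given to it    */
} app_t;

static double completion_time(const app_t *a) {
    double amdahl      = a->seq_frac + (1.0 - a->seq_frac) / a->cores;
    double mem_penalty = 1.0 + 0.5 * (1.0 - a->cache);  /* placeholder */
    return a->work * amdahl * mem_penalty;
}

int main(void) {
    app_t apps[] = {
        { 100.0, 0.10, 6, 0.50 },
        {  80.0, 0.30, 2, 0.25 },
        {  60.0, 0.05, 8, 0.25 },
    };
    double makespan = 0.0;
    for (int i = 0; i < 3; i++) {
        double t = completion_time(&apps[i]);
        if (t > makespan) makespan = t;
        printf("app %d: completion time %.1f\n", i, t);
    }
    printf("makespan: %.1f\n", makespan);
    return 0;
}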

6 citations


Journal ArticleDOI
TL;DR: This work is motivated by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details, and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is requested.
Abstract: One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, and thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the co-running workloads. To the best of our knowledge, this is the first time that a methodology provides (1) a theoretical foundation for the above-mentioned cache management mechanisms and (2) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast reduction in the required search space.
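As a point of reference for the loop-tiling half of the problem, here is a standard tiled matrix multiplication, where in the paper's setting the tile size would be chosen jointly with the cache partition, the cache associativity and the array layout. The snippet shows only the tiling transformation, with an arbitrary TILE value.

/* A minimal loop-tiling example: the i/k/j tiles are sized so that the
 * working set of one tile fits in the cache share reserved for this
 * task.  TILE is arbitrary here; choosing it jointly with the cache
 * partition and layout is the coupled problem the paper addresses. */
#include <stdio.h>

#define N    256
#define TILE 32          /* would be derived from the cache share */

static double A[N][N], B[N][N], C[N][N];

static void matmul_tiled(void) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                /* one TILE x TILE block of A, B and C is live here */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
    matmul_tiled();
    printf("C[0][0] = %.1f\n", C[0][0]);   /* expect 512.0 */
    return 0;
}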

5 citations


Proceedings ArticleDOI
10 Oct 2018
TL;DR: This work shows how one can model intra- and inter-task cache interference in a way that allows balancing their respective contributions to tasks' worst-case response times, and proposes a technique based on cache coloring to improve task set schedulability.
Abstract: Caches help reduce the average execution time of tasks due to their fast operational speeds. However, caches may also severely degrade the timing predictability of the system due to intra- and inter-task cache interference. Intra-task cache interference occurs if the memory footprint of a task is larger than the allocated cache space or when two memory entries of that task are mapped to the same space in cache. Inter-task cache interference occurs when memory entries of two or more distinct tasks use the same cache space. State-of-the-art analyses that bound cache interference, or reduce it by means of partitioning and by optimizing task layout in memory, focus either on intra- or on inter-task cache interference and do not exploit the fact that the two can be interrelated. In this work, we show how one can model intra- and inter-task cache interference in a way that allows balancing their respective contributions to tasks' worst-case response times. Since the placement of tasks in memory and their respective cache footprints determine the intra- and inter-task interference that tasks may suffer, we propose a technique based on cache coloring to improve task set schedulability. Experimental evaluations performed using the Mälardalen benchmarks show that our approach results in up to 13% higher task set schedulability than state-of-the-art approaches.
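For context, the sketch below shows one standard way cache interference enters schedulability analysis: the fixed-priority response-time recurrence extended with a cache-related preemption delay (CRPD) charged per preemption. The task parameters and the simple per-preemption charge are assumptions for illustration; the paper's own interference model and its coloring-based optimization are not reproduced.

/* Background sketch: fixed-priority response-time analysis with a
 * simple cache-related preemption delay (CRPD) charged per preemption
 * by a higher-priority task.  Numbers are illustrative only. */
#include <stdio.h>
#include <math.h>

typedef struct { double C, T, crpd; } task_t;  /* WCET, period, CRPD */

/* Response time of task i assuming tasks 0..i-1 have higher priority. */
static double response_time(const task_t *ts, int i) {
    double R = ts[i].C, prev = 0.0;
    while (fabs(R - prev) > 1e-9 && R <= ts[i].T) {
        prev = R;
        R = ts[i].C;
        for (int j = 0; j < i; j++)
            R += ceil(prev / ts[j].T) * (ts[j].C + ts[j].crpd);
    }
    return R;
}

int main(void) {
    task_t ts[] = { {1.0, 5.0, 0.2}, {2.0, 12.0, 0.3}, {3.0, 30.0, 0.4} };
    for (int i = 0; i < 3; i++) {
        double R = response_time(ts, i);
        printf("task %d: R=%.2f %s\n", i, R,
               R <= ts[i].T ? "(schedulable)" : "(deadline miss)");
    }
    return 0;
}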

5 citations


Journal ArticleDOI
TL;DR: The simulation results show that the proposed CCSP scheme significantly improves cache effectiveness and network performance, by improving data availability and reducing both the overall network load and the latencies perceived by end users.
Abstract: In wireless mobile ad hoc networks, cooperative cache management is considered an efficient technique to increase data availability and improve access latency. This technique is based on coordination and sharing of cached data between nodes belonging to the same area. In this paper, we study cooperative cache management strategies, which leads us to propose a collaborative cache management scheme for mobile ad hoc networks based on service cache providers (SCP), called cooperative caching based on service providers (CCSP). The proposed scheme elects some mobile nodes as SCPs, which receive cache summaries from neighboring nodes; nodes belonging to the same zone can thus easily locate cached documents in that area. The election mechanism used in this approach is executed periodically to ensure load balancing. We further provide an evaluation of the proposed solution in terms of request hit rate, byte hit rate and time gains. Compared with other caching management schemes, the simulation results show that the proposed CCSP scheme significantly improves cache effectiveness and network performance. This is achieved by improving data availability and reducing both the overall network load and the latencies perceived by end users.

5 citations


Journal ArticleDOI
TL;DR: PhLock, as described in this paper, leverages an application's varying runtime characteristics to dynamically select the locked memory contents and optimize cache energy consumption; cache locking is a popular cache optimization that loads and retains/locks selected memory contents from an executing application in the cache to increase the cache's predictability.
Abstract: Caches are commonly used to bridge the processor-memory performance gap in embedded systems. Since embedded systems typically have stringent design constraints imposed by physical size, battery capacity, and real-time deadlines, much research focuses on cache optimizations, such as improved performance and/or reduced energy consumption. Cache locking is a popular cache optimization that loads and retains/locks selected memory contents from an executing application into the cache to increase the cache's predictability. Previous work has shown that cache locking also has the potential to improve cache energy consumption. In this paper, we introduce phase-based cache locking, PhLock, which leverages an application's varying runtime characteristics to dynamically select the locked memory contents to optimize cache energy consumption. Using a variety of applications from the SPEC2006 and MiBench benchmark suites, experimental results show that PhLock is promising for reducing both the instruction and data caches' energy consumption. As compared to a nonlocking cache, PhLock reduced the instruction and data cache energy consumption by an average of 5% and 39%, respectively, for SPEC2006 applications, and by 75% and 14%, respectively, for MiBench benchmarks.

5 citations


Journal ArticleDOI
TL;DR: It is proved that any non-redundant cache placement strategy can be transformed, with no additional cost, to a strategy in which at every node, each file is either cached completely or not cached at all.
Abstract: Considering cache-enabled networks, optimal content placement minimizing the total cost of communication in such networks is studied, leading to a surprising fundamental 0–1 law for non-redundant cache placement strategies, i.e., strategies where the total cache size associated with each file does not exceed the file size. In other words, for such strategies, we prove that any non-redundant cache placement strategy can be transformed, with no additional cost, into a strategy in which, at every node, each file is either cached completely or not cached at all. Moreover, we obtain a sufficient condition under which the optimal cache placement strategy is in fact non-redundant. This result, together with the 0–1 law, reveals that situations exist where optimal content placement is achieved just by uncoded placement of whole files in caches.

Journal ArticleDOI
TL;DR: A replacement-policy-adaptable miss curve estimation (RME) is proposed, which estimates dynamic workload patterns according to any arbitrary replacement policy and to the given applications with low overhead; the experimental results support the efficiency of RME and show that RME-based cache partitioning, combined with high-performance replacement policies, can successfully minimize both inter- and intra-application interference.
Abstract: Cache replacement policies and cache partitioning are well-known cache management techniques which aim to eliminate inter- and intra-application contention caused by co-running applications, respectively. Since replacement policies can change applications' behavior on a shared last-level cache, they have a massive impact on cache partitioning. Furthermore, cache partitioning determines the capacity allocated to each application, affecting the incorporated replacement policy. However, their interoperability has not been thoroughly explored. Since existing cache partitioning methods are tailored to specific replacement policies to reduce the overhead of characterizing applications' behavior, they may lead to suboptimal partitioning results when incorporated with up-to-date replacement policies. In cache partitioning, miss curve estimation is a key component for relaxing this restriction, since it can reflect the dependency between the replacement policy and cache partitioning in the partitioning decision. To tackle this issue, we propose a replacement-policy-adaptable miss curve estimation (RME) which estimates dynamic workload patterns according to any arbitrary replacement policy and to the given applications with low overhead. In addition, RME considers the asymmetry of miss latency by miss type, so that the impact of the miss curve on cache partitioning can be reflected more accurately. The experimental results support the efficiency of RME and show that RME-based cache partitioning, combined with high-performance replacement policies, can successfully minimize both inter- and intra-application interference.
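As background on how a miss curve is consumed once it has been estimated, the snippet below shows the classic greedy utility-based allocation: cache ways are handed out one at a time to whichever application gains the largest marginal miss reduction. The miss curves are invented, and this greedy scheme is shown for context only; it is not the RME estimator proposed in the paper.

/* Context sketch: greedy way allocation driven by per-application miss
 * curves (misses as a function of allocated ways).  Curves are made up. */
#include <stdio.h>

#define APPS  2
#define WAYS  8

/* miss_curve[a][w] = misses of app a when given w ways (w = 0..WAYS). */
static const double miss_curve[APPS][WAYS + 1] = {
    { 100, 60, 40, 30, 25, 22, 20, 19, 18 },   /* cache-friendly app */
    { 100, 95, 91, 88, 86, 85, 84, 83, 82 },   /* streaming-like app */
};

int main(void) {
    int alloc[APPS] = { 0, 0 };
    for (int given = 0; given < WAYS; given++) {
        int best = 0;
        double best_gain = -1.0;
        for (int a = 0; a < APPS; a++) {
            /* marginal miss reduction of giving app a one more way */
            double gain = miss_curve[a][alloc[a]] - miss_curve[a][alloc[a] + 1];
            if (gain > best_gain) { best_gain = gain; best = a; }
        }
        alloc[best]++;
    }
    printf("way allocation: app0=%d app1=%d\n", alloc[0], alloc[1]);
    return 0;
}

The point of RME, as described in the abstract, is to make the input to such an allocator (the miss curves) accurate regardless of which replacement policy the cache actually uses.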

Book ChapterDOI
07 Jun 2018
TL;DR: The need for accurate memory-centric scheduling mechanisms for guaranteeing prioritized memory accesses to real-time, safety-related components of the system is highlighted, and it is shown that isolation at the timing and spatial level can be achieved by managing the lines that can be evicted in the cache.
Abstract: With the increasing use of multi-core platforms in safety-related domains, aircraft system integrators and authorities exhibit a concern about the impact of concurrent access to shared resources on the Worst-Case Execution Time (WCET). This paper highlights the need for accurate memory-centric scheduling mechanisms for guaranteeing prioritized memory accesses to real-time, safety-related components of the system. We implemented a software technique called cache coloring that demonstrates that isolation at the timing and spatial level can be achieved by managing the lines that can be evicted in the cache. In order to show the effectiveness of this technique, the timing properties of a real application are considered as a use case; this application is made of parallel tasks that show different trade-offs between computation and memory load.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed self-adaptive LLC scheduling scheme can improve the multi-programmed workload performance by up to 30% compared with the state-of-the-art methods.
Abstract: With the emergence of 3D-stacking technology, dynamic random-access memory (DRAM) can be stacked on chips to architect the DRAM last level cache (LLC). Compared with static random-access memory (SRAM), DRAM is larger but slower. In the existing research papers, a lot of work has been devoted to improving workload performance using SRAM and stacked DRAM together, ranging from SRAM structure improvement to optimizing cache tag and data access. Instead, little attention has been paid to designing an LLC scheduling scheme for multi-programmed workloads with different memory footprints. Motivated by this, we propose a self-adaptive LLC scheduling scheme, which allows us to utilize SRAM and 3D-stacked DRAM efficiently, achieving better workload performance. This scheduling scheme employs (1) an evaluation unit, which is used to probe and evaluate the cache information during the process of programs being executed; and (2) an implementation unit, which is used to self-adaptively choose SRAM or DRAM. To make the scheduling scheme work correctly, we develop a data migration policy. We conduct extensive experiments to evaluate the performance of our proposed scheme. Experimental results show that our method can improve the multi-programmed workload performance by up to 30% compared with the state-of-the-art methods.

Patent
13 Mar 2018
TL;DR: In this article, a cache-coloring memory allocation method and device for search trees is presented: when memory is allocated for each layer of nodes according to the colors corresponding to each layer in the second search tree, only memory that can be mapped to the corresponding coloring colors is allocated, which guarantees that different layers of nodes do not compete with each other for the cache and thus improves search performance.
Abstract: The embodiment of the invention provides a cache-coloring memory allocation method and device for search trees. The method includes: constructing the first search tree; determining that the first search tree has N layers of nodes; acquiring the number of nodes in each of the N layers; allocating corresponding colors from the coloring colors of a memory to each layer of nodes according to the number of nodes in each of the N layers, wherein the colors allocated to the different layers of nodes are mutually different; and generating the second search tree after cache coloring according to the colors corresponding to each layer of nodes and the memory corresponding to those colors. When memory is allocated for each layer of nodes according to the colors corresponding to each layer of nodes in the second search tree, only memory that can be mapped to the corresponding coloring colors is allocated; since mapping relationships exist between the memory and the cache, different layers of nodes are guaranteed not to compete with each other for the cache, and search performance is thus improved.
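A hedged, user-space illustration of the idea: nodes of different tree levels are served from pools of different page colors, so traversing one level cannot evict the cached nodes of another. Real cache coloring requires control over physical page allocation (e.g., in the kernel or a color-aware allocator); here the per-color pools are simply simulated with ordinary heap memory and all sizes are illustrative.

/* User-space illustration only: one single-page pool per color, and
 * each tree level allocates its nodes from the pool of its color. */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE  4096
#define NUM_COLORS 16
#define NODE_SIZE  64

typedef struct {
    unsigned char *base;  /* simulated page frame */
    size_t used;
} pool_t;

static pool_t pools[NUM_COLORS];   /* one single-page pool per color */

/* Allocate one node for tree level `level` from the pool whose color
 * is level % NUM_COLORS. */
static void *alloc_node(unsigned level) {
    pool_t *p = &pools[level % NUM_COLORS];
    if (p->base == NULL) {
        /* In a real allocator this page would be requested with the
         * matching physical color; here it is just ordinary memory. */
        p->base = malloc(PAGE_SIZE);
        p->used = 0;
    }
    if (p->base == NULL || p->used + NODE_SIZE > PAGE_SIZE) return NULL;
    void *node = p->base + p->used;
    p->used += NODE_SIZE;
    return node;
}

int main(void) {
    for (unsigned level = 0; level < 4; level++) {
        void *n = alloc_node(level);
        printf("level %u -> color %u, node at %p\n",
               level, level % NUM_COLORS, n);
    }
    for (int c = 0; c < NUM_COLORS; c++) free(pools[c].base);
    return 0;
}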

Journal ArticleDOI
TL;DR: This work addresses the optimization of the whole memory subsystem with three approaches integrated as a single methodology, and simplifies the design and evaluation process of general-purpose and customized dynamic memory managers in the main memory.

Journal ArticleDOI
TL;DR: The results suggest that a variable cache line size can result in better performance and can also conserve power; the paper also presents runtime cache utilization, as well as conventional performance metrics, to provide a holistic understanding of cache behavior.
Abstract: Caches have long been used to minimize the latency of main memory accesses by storing frequently used data near the processor. Processor performance depends on the underlying cache performance. Therefore, significant research has been done to identify the most crucial metrics of cache performance. Although the majority of research focuses on measuring cache hit rates and data movement as the primary cache performance metrics, cache utilization is also significantly important. We investigate an application's locality using cache utilization metrics. Furthermore, we present cache utilization and traditional cache performance metrics as the program progresses, providing detailed insights into dynamic application behavior for parallel applications from four benchmark suites running on multiple cores. We explore cache utilization for APEX, Mantevo, NAS, and PARSEC, mostly scientific benchmark suites. Our results indicate that 40% of the data bytes in a cache line are accessed at least once before line eviction. Also, on average a byte is accessed two times before the cache line is evicted for these applications. Moreover, we present runtime cache utilization, as well as conventional performance metrics, to illustrate a holistic understanding of cache behavior. To facilitate this research, we build a memory simulator incorporated into the Structural Simulation Toolkit (Rodrigues et al. in SIGMETRICS Perform Eval Rev 38(4):37–42, 2011). Our results suggest that a variable cache line size can result in better performance and can also conserve power.

Journal ArticleDOI
TL;DR: An efficient LLC management scheme is proposed that forms two groups of containers at runtime without using any offline profiling data; experiments suggest that the performance of a normal container can be improved by up to 40% when the proposed scheme is used.
Abstract: In contrast to the hypervisor-based virtualization method, the container-based scheme does not incur the overhead required by virtual machines, since it requires neither a fully abstracted hardware stack nor separate guest operating systems (OSes). In this virtualization method, the host OS controls the accesses of the containers to hardware resources. One container can thus be provided with resources such as CPU, memory and network that are expected to be isolated from the others. However, due to the lack of architectural support, the last-level cache (LLC) is not utilized in an isolated manner, and thus it is shared by all containers in the same cloud infrastructure. If the workload of a container leads to cache pollution, it negatively affects the performance of other workloads. To address this problem, we propose an efficient LLC management scheme. By monitoring the memory access pattern, the indirect LLC usage pattern of a container can be inferred. Then, our proposed scheme forms two groups at runtime without using any offline profiling data on containers. The first group is made up of cache-thrashing containers, which fill up the LLC without any temporal locality of data, and the second one consists of normal containers. For isolation, the two separate groups use different partitions of the LLC via the OS-based page coloring method. Our experimental study suggests that the performance of a normal container can be improved by up to 40% when our proposed scheme is used.

Book ChapterDOI
01 Jan 2018
TL;DR: The proposed cache reuse replacement policy manages cache blocks by separating reused cache blocks and thrashing cache blocks, which can increase IPC by up to 4.4% compared to the conventional GPU architecture.
Abstract: The performance of computing systems has improved significantly for several decades. However, increasing the throughput of recent CPUs (Central Processing Units) is restricted by power consumption and thermal issues. GPUs (Graphics Processing Units) are recognized as an efficient computing platform with powerful hardware resources to support CPUs in computing systems. Unlike CPUs, there is a large number of CUDA (Compute Unified Device Architecture) cores in GPUs; hence, some cache blocks are referenced many times repeatedly. If those cache blocks reside in the cache for a long time, hit rates can be improved. On the other hand, many cache blocks are referenced only once and never referenced again in the cache. These blocks waste cache memory space, resulting in reduced GPU performance. The conventional LRU replacement policy cannot account for the problems caused by non-reused cache blocks and frequently reused cache blocks. In this paper, a new cache replacement policy based on the reuse pattern of cache blocks is proposed. The proposed cache replacement policy manages cache blocks by separating reused cache blocks from thrashing cache blocks. According to simulation results, the proposed cache reuse replacement policy can increase IPC by up to 4.4% compared to the conventional GPU architecture.
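To make the reuse-pattern idea concrete, here is a generic sketch of a reuse-aware replacement rule: each line carries a "reused" bit set on its first hit after insertion, and never-reused lines are evicted first so that use-once (streaming) blocks cannot push out hot blocks. This is an illustration of the general principle, not the exact policy proposed for GPUs in the paper.

/* Generic reuse-aware replacement sketch: evict never-reused lines
 * before reused ones, falling back to plain LRU.  Trace is invented. */
#include <stdio.h>

#define WAYS 4

typedef struct { int valid, tag, reused; unsigned age; } line_t;

static int pick_victim(const line_t set[WAYS]) {
    int victim = 0;
    unsigned best = 0;
    /* First pass: oldest never-reused line. */
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (!set[w].reused && set[w].age >= best) { best = set[w].age; victim = w; }
    }
    if (best > 0 || !set[victim].reused) return victim;
    /* Fallback: plain LRU over the whole set. */
    for (int w = 0; w < WAYS; w++)
        if (set[w].age > set[victim].age) victim = w;
    return victim;
}

static void access_line(line_t set[WAYS], int tag) {
    for (int w = 0; w < WAYS; w++) set[w].age++;          /* age everything */
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == tag) {          /* hit */
            set[w].reused = 1;
            set[w].age = 0;
            return;
        }
    int v = pick_victim(set);                             /* miss: insert */
    set[v] = (line_t){ 1, tag, 0, 0 };
}

int main(void) {
    line_t set[WAYS] = { { 0, 0, 0, 0 } };
    int trace[] = { 1, 2, 1, 3, 4, 5, 6, 1 };  /* 3..6 are use-once blocks */
    for (int i = 0; i < 8; i++) access_line(set, trace[i]);
    for (int w = 0; w < WAYS; w++)
        printf("way %d: tag %d reused=%d\n", w, set[w].tag, set[w].reused);
    return 0;
}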

Proceedings ArticleDOI
TL;DR: In this paper, CPU cache memory optimization techniques, such as blocking, loop fusion, array merging and array transposition, are adapted to the GPGPU's cache memory to identify performance improvement techniques that go beyond GPGPU best practices.
Abstract: General Purpose Graphics Processing Units (GPGPUs) are widely used for achieving high performance or high throughput in parallel programming. This capability of GPGPUs is very popular in the current era and is mostly used for scientific computing, which requires more processing power than normal personal computers. Therefore, many programmers, researchers and industrial users adopt this concept for their work. However, achieving high performance or high throughput using GPGPUs is not an easy task compared with conventional programming concepts on the CPU side. In this research, CPU cache memory optimization techniques have been adapted to the GPGPU's cache memory to identify performance improvement techniques that go beyond GPGPU best practices. The cache optimization techniques of blocking, loop fusion, array merging and array transposition were tested on GPGPUs to determine their suitability. Finally, we identified that some of the CPU cache optimization techniques work well with the cache memory system of the GPGPU and show performance improvements, while others show the opposite effect on GPGPUs compared with CPUs.

Patent
12 Jan 2018
Abstract: The invention provides a method for dividing a last-level shared cache. The method comprises the following steps: the optimal cache for each processor core during execution is determined; based on the optimal cache, the page coloring number to be allocated to each processor core is determined; based on the optimal cache and the page coloring number, the cache line number to be allocated to each processor core is calculated; and in descending order of the page coloring numbers and the cache line numbers corresponding to the processor cores, the last-level shared cache is divided. Two-dimensional division of the last-level cache is achieved based on the page coloring numbers and the cache line numbers, so that the division granularity is refined and extensibility is improved. The invention further provides a system for dividing the last-level shared cache, and the system has the above advantages.