
Showing papers on "Cache coloring published in 2018"


Journal ArticleDOI
01 Feb 2018
TL;DR: It is shown that minimizing the retrieval cost corresponds to solving an online knapsack problem, and new dynamic policies inspired by simulated annealing are proposed, including DynqLRU, a variant of qLRU that significantly outperforms state-of-the-art policies.
Abstract: Cache policies to minimize the content retrieval cost have been studied through competitive analysis when the miss costs are additive and the sequence of content requests is arbitrary. More recently, a cache utility maximization problem has been introduced, where contents have stationary popularities and utilities are strictly concave in the hit rates. This paper bridges the two formulations, considering linear costs and content popularities. We show that minimizing the retrieval cost corresponds to solving an online knapsack problem, and we propose new dynamic policies inspired by simulated annealing, including DynqLRU, a variant of qLRU. We prove that DynqLRU asymptotically converges to the optimum under the characteristic time approximation. In a real scenario, popularities vary over time and their estimation is very difficult. DynqLRU does not require popularity estimation, and our realistic, trace-driven evaluation shows that it significantly outperforms state-of-the-art policies, with up to 45% cost reduction.
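For intuition only, the following is a minimal sketch of a cost-aware q-LRU admission rule in the spirit described above: on a miss, the requested content is admitted to the cache with a probability that grows with its retrieval cost, so expensive contents are cached more aggressively. The slot count, the BASE_Q constant and the request trace are invented for illustration, and the snippet is not the DynqLRU policy itself (which, among other things, anneals its parameter over time).

/* Hypothetical sketch of a cost-aware q-LRU policy: on a miss, the
 * requested item is inserted only with a probability that grows with
 * its retrieval cost, so expensive items are cached more aggressively.
 * Illustration of the general idea only, not the DynqLRU policy. */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_SLOTS 4
#define BASE_Q      0.2   /* insertion probability for unit-cost items */

typedef struct { int id; double cost; int valid; } entry_t;

static entry_t cache[CACHE_SLOTS];   /* index 0 = most recently used */

/* Move slot i to the MRU position. */
static void promote(int i) {
    entry_t e = cache[i];
    for (int j = i; j > 0; j--) cache[j] = cache[j - 1];
    cache[0] = e;
}

/* Returns the cost paid to serve the request (0 on a hit). */
static double request(int id, double cost) {
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].id == id) { promote(i); return 0.0; }

    /* Miss: pay the retrieval cost, then insert probabilistically. */
    double q = BASE_Q * cost;            /* cost-weighted admission */
    if (q > 1.0) q = 1.0;
    if ((double)rand() / RAND_MAX < q) {
        cache[CACHE_SLOTS - 1] = (entry_t){ id, cost, 1 };  /* evict LRU */
        promote(CACHE_SLOTS - 1);
    }
    return cost;
}

int main(void) {
    srand(42);
    int ids[]      = { 1, 2, 3, 1, 4, 1, 5, 2, 1, 3 };
    double costs[] = { 5, 1, 1, 5, 1, 5, 1, 1, 5, 1 };
    double total = 0;
    for (int i = 0; i < 10; i++) total += request(ids[i], costs[i]);
    printf("total retrieval cost: %.1f\n", total);
    return 0;
}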

48 citations


Proceedings ArticleDOI
01 Feb 2018
TL;DR: This paper addresses the problem of providing spatial and temporal isolation between execution domains in a hypervisor running on an ARM multicore platform by carefully managing the two primary shared hardware resources: the last-level cache (LLC) and the DRAM memory controller.
Abstract: This paper addresses the problem of providing spatial and temporal isolation between execution domains in a hypervisor running on an ARM multicore platform. Isolation is achieved by carefully managing the two primary shared hardware resources of today's multicore platforms: the last-level cache (LLC) and the DRAM memory controller. The XVISOR open-source hypervisor and the ARM Cortex A7 platform have been used as reference systems for the purpose of this work. Spatial partitioning on the LLC has been implemented by means of cache coloring, which has been tightly integrated with the ARM virtualization extensions (ARM-VE) to deal with the memory virtualization capabilities offered by a two-stage memory management unit (MMU). Temporal isolation on the DRAM controller has been implemented by realizing a memory bandwidth reservation mechanism, which has been combined with the scheduling logic of the hypervisor. An extensive experimental evaluation has been performed on the popular Raspberry Pi 2 board, showing the effectiveness of the implemented solutions on a case-study composed of multiple Linux domains running state-of-the-art benchmarks.
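As background for the coloring mechanism mentioned above, the short program below shows the usual page-color arithmetic: the color of a physical page is the slice of the cache set index that lies above the page-offset bits, so pages of different colors can never map to the same LLC sets. The cache geometry constants are illustrative assumptions, not values taken from the paper or the Cortex-A7.

/* Minimal sketch of the page-coloring arithmetic behind cache coloring:
 * pages of different colors map to disjoint cache sets, so giving each
 * domain a disjoint set of colors partitions the LLC.  The cache
 * geometry below is assumed for illustration. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE  64u           /* bytes per cache line   (assumed) */
#define CACHE_SIZE (512u*1024u)  /* total LLC size         (assumed) */
#define WAYS       8u            /* associativity          (assumed) */
#define PAGE_SIZE  4096u         /* 4 KiB pages */

int main(void) {
    unsigned sets      = CACHE_SIZE / (LINE_SIZE * WAYS);   /* 1024 */
    unsigned set_bytes = sets * LINE_SIZE;   /* bytes covered by the index */
    unsigned colors    = set_bytes / PAGE_SIZE;              /* 16 here */

    printf("sets=%u colors=%u\n", sets, colors);

    /* Color of a physical address: page frame number modulo #colors. */
    uint64_t pa = 0x1234F000ull;
    unsigned color = (unsigned)((pa / PAGE_SIZE) % colors);
    printf("phys 0x%llx -> color %u\n", (unsigned long long)pa, color);
    return 0;
}

In the hypervisor setting described above, the same arithmetic would be applied when building the stage-2 translations, so that each domain's guest-physical pages are backed only by host-physical pages of its assigned colors.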

31 citations


Journal ArticleDOI
TL;DR: In order to minimize the cache miss cost in the hybrid main memory, a cost aware cache replacement policy (CACRP) is proposed that reduces the number of cache misses from NVM and improves the cache performance for a hybrid memory system.
Abstract: Fog computing requires a large main memory capacity to decrease latency and increase the Quality of Service (QoS). However, dynamic random access memory (DRAM), the commonly used random access memory, cannot be included in a fog computing system due to its high power consumption. In recent years, non-volatile memories (NVM) such as Phase-Change Memory (PCM) and Spin-Transfer Torque RAM (STT-RAM), with their low power consumption, have emerged to replace DRAM. Moreover, the currently proposed hybrid main memory, consisting of both DRAM and NVM, has shown promising advantages in terms of scalability and power consumption. However, drawbacks of NVM, such as long read/write latency, give rise to asymmetric cache misses in the hybrid main memory. Current last-level cache (LLC) policies are based on a unified miss cost, which results in poor LLC performance and adds to the cost of using NVM. In order to minimize the cache miss cost in the hybrid main memory, we propose a cost aware cache replacement policy (CACRP) that reduces the number of cache misses from NVM and improves cache performance for the hybrid memory system.
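To make the notion of asymmetric miss costs concrete, here is a hedged sketch of a victim-selection rule for a DRAM+NVM hybrid memory: among the lines of a set, evict the least recently used DRAM-backed line first, since re-fetching it is cheaper than re-fetching an NVM-backed line. This illustrates the general idea only and is not the CACRP policy proposed in the paper.

/* Hedged sketch of cost-aware LLC victim selection for a DRAM+NVM
 * hybrid main memory: prefer evicting the LRU DRAM-backed line, since
 * a later miss on it is cheaper than a miss on an NVM-backed line.
 * Not the CACRP algorithm from the paper. */
#include <stdio.h>

#define WAYS 4

typedef struct {
    int valid;
    int in_nvm;      /* 1 if the backing frame is in NVM, 0 if DRAM */
    unsigned age;    /* larger = less recently used */
} line_t;

/* Pick a victim way: LRU among DRAM-backed lines if any exist,
 * otherwise plain LRU over the whole set. */
static int pick_victim(const line_t set[WAYS]) {
    int victim = -1;
    for (int pass = 0; pass < 2 && victim < 0; pass++) {
        unsigned best_age = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) return w;              /* free way */
            if (pass == 0 && set[w].in_nvm) continue; /* 1st pass: DRAM only */
            if (set[w].age >= best_age) { best_age = set[w].age; victim = w; }
        }
    }
    return victim;
}

int main(void) {
    line_t set[WAYS] = {
        { 1, 1, 9 },   /* NVM-backed, oldest          */
        { 1, 0, 7 },   /* DRAM-backed                 */
        { 1, 0, 2 },   /* DRAM-backed, recently used  */
        { 1, 1, 5 },   /* NVM-backed                  */
    };
    printf("victim way = %d (expect 1: oldest DRAM-backed line)\n",
           pick_victim(set));
    return 0;
}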

25 citations


Proceedings ArticleDOI
TL;DR: In this paper, the authors defined four important characteristics of a suitable eviction policy for information centric networks (ICN) and proposed a new eviction scheme that is well suited to ICN-type cache networks.
Abstract: Information centric networks (ICN) can be viewed as networks of caches. At the same time, ICN-style cache networks have distinctive features, e.g., content popularity, usability time of content and other factors, that impose diverse requirements on cache eviction policies. In this paper we define four important characteristics of a suitable eviction policy for ICN. We analyse well-known eviction policies in view of the defined characteristics. Based upon this analysis, we propose a new eviction scheme that is well suited to ICN-type cache networks.

20 citations


Journal ArticleDOI
TL;DR: This paper gives an analytical method to find the miss rate of L2 cache for various configurations from the RD profile with respect to L1 cache and considers all three types of cache inclusion policies namely (i) Strictly Inclusive, (ii) Mutually Exclusive and (iii) Non-Inclusive Non-Exclusive.
Abstract: Reuse distance (RD) is an important metric for analytical estimation of cache miss rate. To find the miss rate of a particular cache, the reuse distance profile has to be measured for that particular level and configuration of the cache. A significant amount of simulation time and overhead can be saved if we can find the miss rate of a higher-level cache, such as L2, from the RD profile with respect to a lower-level cache (i.e., a cache that is closer to the processor) such as L1. The objective of this paper is to give an analytical method to find the miss rate of the L2 cache for various configurations from the RD profile with respect to the L1 cache. We consider all three types of cache inclusion policies, namely (i) Strictly Inclusive, (ii) Mutually Exclusive and (iii) Non-Inclusive Non-Exclusive. We first prove some general results relating the RD profile of the L1 cache to that of the L2 cache. We use probabilistic analysis for our derivations. We validate our model against simulations, using the multi-core simulator Sniper with the PARSEC and SPLASH benchmark suites.
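For readers unfamiliar with reuse distance, the snippet below shows the basic relation the paper builds on: for a fully associative LRU cache with C blocks, a reference with reuse distance d hits iff d < C, so the miss rate is the tail of the RD histogram at and beyond C. The histogram values are invented, and the paper's actual contribution (deriving the L2 miss rate from the L1 RD profile under different inclusion policies) is not reproduced here.

/* Background sketch of the basic reuse-distance model: for a fully
 * associative LRU cache with `cache_blocks` blocks, a reference with
 * reuse distance d hits iff d < cache_blocks.  Histogram values are
 * made up for illustration. */
#include <stdio.h>

#define MAX_RD 8

/* rd_hist[d] = number of references whose reuse distance is d;
 * cold_misses = references with infinite reuse distance (first touch). */
static double miss_rate(const unsigned rd_hist[MAX_RD],
                        unsigned cold_misses, unsigned cache_blocks) {
    unsigned total = cold_misses, misses = cold_misses;
    for (unsigned d = 0; d < MAX_RD; d++) {
        total += rd_hist[d];
        if (d >= cache_blocks) misses += rd_hist[d];
    }
    return total ? (double)misses / total : 0.0;
}

int main(void) {
    unsigned hist[MAX_RD] = { 40, 25, 15, 8, 5, 3, 2, 2 }; /* illustrative */
    for (unsigned c = 1; c <= MAX_RD; c *= 2)
        printf("C=%u blocks -> miss rate %.3f\n", c, miss_rate(hist, 10, c));
    return 0;
}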

17 citations


Journal ArticleDOI
TL;DR: This article provides key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches), and designs efficient heuristics for Amdahl applications.
Abstract: Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? In this paper, we provide answers to (i) and (ii) for Amdahl applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
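The toy program below only illustrates the scheduling question being asked: given a core count and an LLC share for each Amdahl application, compute each completion time and take the maximum as the makespan. The cost model (Amdahl's law in the core count times a placeholder miss penalty that shrinks with the cache share) and all the numbers are assumptions for illustration, not the model analysed in the paper.

/* Illustrative makespan computation for co-scheduled Amdahl
 * applications under a deliberately simple placeholder cost model. */
#include <stdio.h>

typedef struct {
    double work;     /* total sequential work              */
    double seq_frac; /* Amdahl sequential fraction         */
    int    cores;    /* cores given to this application    */
    double cache;    /* fraction of the LLC given to it    */
} app_t;

static double completion_time(const app_t *a) {
    double amdahl      = a->seq_frac + (1.0 - a->seq_frac) / a->cores;
    double mem_penalty = 1.0 + 0.5 * (1.0 - a->cache);  /* placeholder */
    return a->work * amdahl * mem_penalty;
}

int main(void) {
    app_t apps[] = {
        { 100.0, 0.10, 6, 0.50 },
        {  80.0, 0.30, 2, 0.25 },
        {  60.0, 0.05, 8, 0.25 },
    };
    double makespan = 0.0;
    for (int i = 0; i < 3; i++) {
        double t = completion_time(&apps[i]);
        if (t > makespan) makespan = t;
        printf("app %d: completion time %.1f\n", i, t);
    }
    printf("makespan: %.1f\n", makespan);
    return 0;
}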

6 citations


Journal ArticleDOI
TL;DR: This work is motivated by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details, and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is requested.
Abstract: One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, and thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the co-running workloads. To the best of our knowledge, this is the first time that a methodology provides (1) a theoretical foundation for the above-mentioned cache management mechanisms and (2) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast reduction in the required search space.
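As a point of reference for the loop-tiling half of the problem, here is a standard tiled matrix multiplication, where in the paper's setting the tile size would be chosen jointly with the cache partition, the cache associativity and the array layout. The snippet shows only the tiling transformation, with an arbitrary TILE value.

/* A minimal loop-tiling example: the i/k/j tiles are sized so that the
 * working set of one tile fits in the cache share reserved for this
 * task.  TILE is arbitrary here; choosing it jointly with the cache
 * partition and layout is the coupled problem the paper addresses. */
#include <stdio.h>

#define N    256
#define TILE 32          /* would be derived from the cache share */

static double A[N][N], B[N][N], C[N][N];

static void matmul_tiled(void) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                /* one TILE x TILE block of A, B and C is live here */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
    matmul_tiled();
    printf("C[0][0] = %.1f\n", C[0][0]);   /* expect 512.0 */
    return 0;
}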

5 citations


Proceedings ArticleDOI
10 Oct 2018
TL;DR: This work shows how one can model intra- and inter-task cache interference in a way that allows balancing their respective contributions to tasks' worst-case response times, and proposes a technique based on cache coloring to improve task set schedulability.
Abstract: Caches help reduce the average execution time of tasks due to their fast operational speeds. However, caches may also severely degrade the timing predictability of the system due to intra- and inter-task cache interference. Intra-task cache interference occurs if the memory footprint of a task is larger than the allocated cache space or when two memory entries of that task are mapped to the same space in cache. Inter-task cache interference occurs when memory entries of two or more distinct tasks use the same cache space. State-of-the-art analyses that bound cache interference, or reduce it by means of partitioning and by optimizing task layout in memory, focus either on intra- or on inter-task cache interference and do not exploit the fact that the two can be interrelated. In this work, we show how one can model intra- and inter-task cache interference in a way that allows balancing their respective contributions to tasks' worst-case response times. Since the placement of tasks in memory and their respective cache footprints determine the intra- and inter-task interference that tasks may suffer, we propose a technique based on cache coloring to improve task set schedulability. Experimental evaluations performed using the Mälardalen benchmarks show that our approach results in up to 13% higher task set schedulability than state-of-the-art approaches.
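For context, the sketch below shows one standard way cache interference enters schedulability analysis: the fixed-priority response-time recurrence extended with a cache-related preemption delay (CRPD) charged per preemption. The task parameters and the simple per-preemption charge are assumptions for illustration; the paper's own interference model and its coloring-based optimization are not reproduced.

/* Background sketch: fixed-priority response-time analysis with a
 * simple cache-related preemption delay (CRPD) charged per preemption
 * by a higher-priority task.  Numbers are illustrative only. */
#include <stdio.h>
#include <math.h>

typedef struct { double C, T, crpd; } task_t;  /* WCET, period, CRPD */

/* Response time of task i assuming tasks 0..i-1 have higher priority. */
static double response_time(const task_t *ts, int i) {
    double R = ts[i].C, prev = 0.0;
    while (fabs(R - prev) > 1e-9 && R <= ts[i].T) {
        prev = R;
        R = ts[i].C;
        for (int j = 0; j < i; j++)
            R += ceil(prev / ts[j].T) * (ts[j].C + ts[j].crpd);
    }
    return R;
}

int main(void) {
    task_t ts[] = { {1.0, 5.0, 0.2}, {2.0, 12.0, 0.3}, {3.0, 30.0, 0.4} };
    for (int i = 0; i < 3; i++) {
        double R = response_time(ts, i);
        printf("task %d: R=%.2f %s\n", i, R,
               R <= ts[i].T ? "(schedulable)" : "(deadline miss)");
    }
    return 0;
}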

5 citations


Journal ArticleDOI
TL;DR: The simulation results show that the proposed CCSP scheme significantly improves cache effectiveness and network performance, by improving data availability and reducing both the overall network load and the latencies perceived by end users.
Abstract: In wireless mobile ad hoc networks, cooperative cache management is considered an efficient technique to increase data availability and improve access latency. This technique is based on coordination and sharing of cached data between nodes belonging to the same area. In this paper, we study cooperative cache management strategies, which leads us to propose a collaborative cache management scheme for mobile ad hoc networks based on service cache providers (SCP), called cooperative caching based on service providers (CCSP). The proposed scheme elects some mobile nodes as SCPs, which receive cache summaries from neighboring nodes; nodes belonging to the same zone can thus easily locate cached documents in that area. The election mechanism used in this approach is executed periodically to ensure load balancing. We further provide an evaluation of the proposed solution in terms of request hit rate, byte hit rate and time gains. Compared with other caching management schemes, the simulation results show that the proposed CCSP scheme significantly improves cache effectiveness and network performance. This is achieved by improving data availability and reducing both the overall network load and the latencies perceived by end users.

5 citations


Journal ArticleDOI
TL;DR: PhLock, as described in this paper, leverages an application's varying runtime characteristics to dynamically select the locked memory contents and optimize cache energy consumption; cache locking is a popular cache optimization that loads and retains/locks selected memory contents from an executing application in the cache to increase the cache's predictability.
Abstract: Caches are commonly used to bridge the processor-memory performance gap in embedded systems. Since embedded systems typically have stringent design constraints imposed by physical size, battery capacity, and real-time deadlines, much research focuses on cache optimizations, such as improved performance and/or reduced energy consumption. Cache locking is a popular cache optimization that loads and retains/locks selected memory contents from an executing application into the cache to increase the cache's predictability. Previous work has shown that cache locking also has the potential to improve cache energy consumption. In this paper, we introduce phase-based cache locking, PhLock, which leverages an application's varying runtime characteristics to dynamically select the locked memory contents to optimize cache energy consumption. Using a variety of applications from the SPEC2006 and MiBench benchmark suites, experimental results show that PhLock is promising for reducing both the instruction and data caches' energy consumption. As compared to a nonlocking cache, PhLock reduced the instruction and data cache energy consumption by an average of 5% and 39%, respectively, for SPEC2006 applications, and by 75% and 14%, respectively, for MiBench benchmarks.

5 citations


Journal ArticleDOI
TL;DR: It is proved that any non-redundant cache placement strategy can be transformed, with no additional cost, to a strategy in which at every node, each file is either cached completely or not cached at all.
Abstract: Considering cache-enabled networks, optimal content placement minimizing the total cost of communication in such networks is studied, leading to a surprising fundamental 0–1 law for non-redundant cache placement strategies, i.e., strategies where the total cache size associated with each file does not exceed the file size. In other words, for such strategies, we prove that any non-redundant cache placement strategy can be transformed, with no additional cost, into a strategy in which, at every node, each file is either cached completely or not cached at all. Moreover, we obtain a sufficient condition under which the optimal cache placement strategy is in fact non-redundant. This result, together with the 0–1 law, reveals that situations exist where optimal content placement is achieved just by uncoded placement of whole files in caches.

Journal ArticleDOI
TL;DR: A replacement-policy-adaptable miss curve estimation (RME) is proposed, which estimates dynamic workload patterns according to any arbitrary replacement policy and to the given applications with low overhead; the experimental results support the efficiency of RME and show that RME-based cache partitioning, combined with high-performance replacement policies, can successfully minimize both inter- and intra-application interference.
Abstract: Cache replacement policies and cache partitioning are well-known cache management techniques which aim to eliminate inter- and intra-application contention caused by co-running applications, respectively. Since replacement policies can change applications' behavior on a shared last-level cache, they have a massive impact on cache partitioning. Furthermore, cache partitioning determines the capacity allocated to each application, affecting the incorporated replacement policy. However, their interoperability has not been thoroughly explored. Since existing cache partitioning methods are tailored to specific replacement policies to reduce the overhead of characterizing applications' behavior, they may lead to suboptimal partitioning results when incorporated with up-to-date replacement policies. In cache partitioning, miss curve estimation is a key component for relaxing this restriction, since it can reflect the dependency between the replacement policy and cache partitioning in the partitioning decision. To tackle this issue, we propose a replacement-policy-adaptable miss curve estimation (RME) which estimates dynamic workload patterns according to any arbitrary replacement policy and to the given applications with low overhead. In addition, RME considers the asymmetry of miss latency by miss type, so that the impact of the miss curve on cache partitioning can be reflected more accurately. The experimental results support the efficiency of RME and show that RME-based cache partitioning, combined with high-performance replacement policies, can successfully minimize both inter- and intra-application interference.
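As background on how a miss curve is consumed once it has been estimated, the snippet below shows the classic greedy utility-based allocation: cache ways are handed out one at a time to whichever application gains the largest marginal miss reduction. The miss curves are invented, and this greedy scheme is shown for context only; it is not the RME estimator proposed in the paper.

/* Context sketch: greedy way allocation driven by per-application miss
 * curves (misses as a function of allocated ways).  Curves are made up. */
#include <stdio.h>

#define APPS  2
#define WAYS  8

/* miss_curve[a][w] = misses of app a when given w ways (w = 0..WAYS). */
static const double miss_curve[APPS][WAYS + 1] = {
    { 100, 60, 40, 30, 25, 22, 20, 19, 18 },   /* cache-friendly app */
    { 100, 95, 91, 88, 86, 85, 84, 83, 82 },   /* streaming-like app */
};

int main(void) {
    int alloc[APPS] = { 0, 0 };
    for (int given = 0; given < WAYS; given++) {
        int best = 0;
        double best_gain = -1.0;
        for (int a = 0; a < APPS; a++) {
            /* marginal miss reduction of giving app a one more way */
            double gain = miss_curve[a][alloc[a]] - miss_curve[a][alloc[a] + 1];
            if (gain > best_gain) { best_gain = gain; best = a; }
        }
        alloc[best]++;
    }
    printf("way allocation: app0=%d app1=%d\n", alloc[0], alloc[1]);
    return 0;
}

The point of RME, as described in the abstract, is to make the input to such an allocator (the miss curves) accurate regardless of which replacement policy the cache actually uses.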

Book ChapterDOI
07 Jun 2018
TL;DR: The need for accurate memory-centric scheduling mechanisms for guaranteeing prioritized memory accesses to real-time, safety-related components of the system is highlighted, and it is shown that isolation at the timing and spatial level can be achieved by managing the lines that can be evicted in the cache.
Abstract: With the increasing use of multi-core platforms in safety-related domains, aircraft system integrators and authorities exhibit a concern about the impact of concurrent access to shared resources on the Worst-Case Execution Time (WCET). This paper highlights the need for accurate memory-centric scheduling mechanisms for guaranteeing prioritized memory accesses to real-time, safety-related components of the system. We implemented a software technique called cache coloring that demonstrates that isolation at the timing and spatial level can be achieved by managing the lines that can be evicted in the cache. In order to show the effectiveness of this technique, the timing properties of a real application are considered as a use case; this application is made of parallel tasks that show different trade-offs between computation and memory load.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed self-adaptive LLC scheduling scheme can improve the multi-programmed workload performance by up to 30% compared with the state-of-the-art methods.
Abstract: With the emergence of 3D-stacking technology, dynamic random-access memory (DRAM) can be stacked on chips to architect the DRAM last level cache (LLC). Compared with static random-access memory (SRAM), DRAM is larger but slower. In the existing research papers, a lot of work has been devoted to improving workload performance using SRAM and stacked DRAM together, ranging from SRAM structure improvement to optimizing cache tag and data access. Instead, little attention has been paid to designing an LLC scheduling scheme for multi-programmed workloads with different memory footprints. Motivated by this, we propose a self-adaptive LLC scheduling scheme, which allows us to utilize SRAM and 3D-stacked DRAM efficiently, achieving better workload performance. This scheduling scheme employs (1) an evaluation unit, which is used to probe and evaluate the cache information during the process of programs being executed; and (2) an implementation unit, which is used to self-adaptively choose SRAM or DRAM. To make the scheduling scheme work correctly, we develop a data migration policy. We conduct extensive experiments to evaluate the performance of our proposed scheme. Experimental results show that our method can improve the multi-programmed workload performance by up to 30% compared with the state-of-the-art methods.

Patent
13 Mar 2018
TL;DR: In this article, a cache-coloring memory allocation method and device for search trees is presented: when memory is allocated for each layer of nodes according to the colors corresponding to each layer in the second search tree, only memory that can be mapped to the corresponding coloring colors is allocated, which guarantees that different layers of nodes do not compete with each other for the cache and thus improves search performance.
Abstract: The embodiment of the invention provides a cache-coloring memory allocation method and device for search trees. The method includes: constructing the first search tree; determining that the first search tree has N layers of nodes; acquiring the number of nodes in each of the N layers; allocating corresponding colors from the coloring colors of a memory to each layer of nodes according to the number of nodes in each of the N layers, wherein the colors allocated to the different layers of nodes are mutually different; and generating the second search tree after cache coloring according to the colors corresponding to each layer of nodes and the memory corresponding to those colors. When memory is allocated for each layer of nodes according to the colors corresponding to each layer of nodes in the second search tree, only memory that can be mapped to the corresponding coloring colors is allocated; since mapping relationships exist between the memory and the cache, different layers of nodes are guaranteed not to compete with each other for the cache, and search performance is thus improved.
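A hedged, user-space illustration of the idea: nodes of different tree levels are served from pools of different page colors, so traversing one level cannot evict the cached nodes of another. Real cache coloring requires control over physical page allocation (e.g., in the kernel or a color-aware allocator); here the per-color pools are simply simulated with ordinary heap memory and all sizes are illustrative.

/* User-space illustration only: one single-page pool per color, and
 * each tree level allocates its nodes from the pool of its color. */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE  4096
#define NUM_COLORS 16
#define NODE_SIZE  64

typedef struct {
    unsigned char *base;  /* simulated page frame */
    size_t used;
} pool_t;

static pool_t pools[NUM_COLORS];   /* one single-page pool per color */

/* Allocate one node for tree level `level` from the pool whose color
 * is level % NUM_COLORS. */
static void *alloc_node(unsigned level) {
    pool_t *p = &pools[level % NUM_COLORS];
    if (p->base == NULL) {
        /* In a real allocator this page would be requested with the
         * matching physical color; here it is just ordinary memory. */
        p->base = malloc(PAGE_SIZE);
        p->used = 0;
    }
    if (p->base == NULL || p->used + NODE_SIZE > PAGE_SIZE) return NULL;
    void *node = p->base + p->used;
    p->used += NODE_SIZE;
    return node;
}

int main(void) {
    for (unsigned level = 0; level < 4; level++) {
        void *n = alloc_node(level);
        printf("level %u -> color %u, node at %p\n",
               level, level % NUM_COLORS, n);
    }
    for (int c = 0; c < NUM_COLORS; c++) free(pools[c].base);
    return 0;
}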

Journal ArticleDOI
TL;DR: This work addresses the optimization of the whole memory subsystem with three approaches integrated as a single methodology, and simplifies the design and evaluation process of general-purpose and customized dynamic memory managers in the main memory.

Journal ArticleDOI
TL;DR: The results suggest that a variable cache line size can result in better performance and can also conserve power; the paper also presents runtime cache utilization, as well as conventional performance metrics, to provide a holistic understanding of cache behavior.
Abstract: Caches have long been used to minimize the latency of main memory accesses by storing frequently used data near the processor. Processor performance depends on the underlying cache performance. Therefore, significant research has been done to identify the most crucial metrics of cache performance. Although the majority of research focuses on measuring cache hit rates and data movement as the primary cache performance metrics, cache utilization is also significantly important. We investigate an application's locality using cache utilization metrics. Furthermore, we present cache utilization and traditional cache performance metrics as the program progresses, providing detailed insights into dynamic application behavior for parallel applications from four benchmark suites running on multiple cores. We explore cache utilization for APEX, Mantevo, NAS, and PARSEC, mostly scientific benchmark suites. Our results indicate that 40% of the data bytes in a cache line are accessed at least once before line eviction. Also, on average a byte is accessed two times before the cache line is evicted for these applications. Moreover, we present runtime cache utilization, as well as conventional performance metrics, to illustrate a holistic understanding of cache behavior. To facilitate this research, we build a memory simulator incorporated into the Structural Simulation Toolkit (Rodrigues et al. in SIGMETRICS Perform Eval Rev 38(4):37–42, 2011). Our results suggest that a variable cache line size can result in better performance and can also conserve power.

Journal ArticleDOI
TL;DR: An efficient LLC management scheme is proposed that forms two groups of containers at runtime without using any offline profiling data; experiments suggest that the performance of a normal container can be improved by up to 40% when the proposed scheme is used.
Abstract: In contrast to the hypervisor-based virtualization method, the container-based scheme does not incur the overhead required by virtual machines, since it requires neither a fully abstracted hardware stack nor separate guest operating systems (OSes). In this virtualization method, the host OS controls the accesses of the containers to hardware resources. One container can thus be provided with resources such as CPU, memory and network that are expected to be isolated from the others. However, due to the lack of architectural support, the last-level cache (LLC) is not utilized in an isolated manner, and thus it is shared by all containers in the same cloud infrastructure. If the workload of a container leads to cache pollution, it negatively affects the performance of other workloads. To address this problem, we propose an efficient LLC management scheme. By monitoring the memory access pattern, the indirect LLC usage pattern of a container can be inferred. Then, our proposed scheme forms two groups at runtime without using any offline profiling data on containers. The first group is made up of cache-thrashing containers, which fill up the LLC without any temporal locality of data, and the second one consists of normal containers. For isolation, the two separate groups use different partitions of the LLC via the OS-based page coloring method. Our experimental study suggests that the performance of a normal container can be improved by up to 40% when our proposed scheme is used.

Book ChapterDOI
01 Jan 2018
TL;DR: The proposed cache reuse replacement policy manages cache blocks by separating reused cache blocks and thrashing cache blocks, which can increase IPC by up to 4.4% compared to the conventional GPU architecture.
Abstract: The performance of computing systems has improved significantly for several decades. However, increasing the throughput of recent CPUs (Central Processing Units) is restricted by power consumption and thermal issues. GPUs (Graphics Processing Units) are recognized as an efficient computing platform with powerful hardware resources to support CPUs in computing systems. Unlike CPUs, there is a large number of CUDA (Compute Unified Device Architecture) cores in GPUs; hence, some cache blocks are referenced many times repeatedly. If those cache blocks reside in the cache for a long time, hit rates can be improved. On the other hand, many cache blocks are referenced only once and never referenced again in the cache. These blocks waste cache memory space, resulting in reduced GPU performance. The conventional LRU replacement policy cannot account for the problems caused by non-reused cache blocks and frequently reused cache blocks. In this paper, a new cache replacement policy based on the reuse pattern of cache blocks is proposed. The proposed cache replacement policy manages cache blocks by separating reused cache blocks from thrashing cache blocks. According to simulation results, the proposed cache reuse replacement policy can increase IPC by up to 4.4% compared to the conventional GPU architecture.
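To make the reuse-pattern idea concrete, here is a generic sketch of a reuse-aware replacement rule: each line carries a "reused" bit set on its first hit after insertion, and never-reused lines are evicted first so that use-once (streaming) blocks cannot push out hot blocks. This is an illustration of the general principle, not the exact policy proposed for GPUs in the paper.

/* Generic reuse-aware replacement sketch: evict never-reused lines
 * before reused ones, falling back to plain LRU.  Trace is invented. */
#include <stdio.h>

#define WAYS 4

typedef struct { int valid, tag, reused; unsigned age; } line_t;

static int pick_victim(const line_t set[WAYS]) {
    int victim = 0;
    unsigned best = 0;
    /* First pass: oldest never-reused line. */
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;
        if (!set[w].reused && set[w].age >= best) { best = set[w].age; victim = w; }
    }
    if (best > 0 || !set[victim].reused) return victim;
    /* Fallback: plain LRU over the whole set. */
    for (int w = 0; w < WAYS; w++)
        if (set[w].age > set[victim].age) victim = w;
    return victim;
}

static void access_line(line_t set[WAYS], int tag) {
    for (int w = 0; w < WAYS; w++) set[w].age++;          /* age everything */
    for (int w = 0; w < WAYS; w++)
        if (set[w].valid && set[w].tag == tag) {          /* hit */
            set[w].reused = 1;
            set[w].age = 0;
            return;
        }
    int v = pick_victim(set);                             /* miss: insert */
    set[v] = (line_t){ 1, tag, 0, 0 };
}

int main(void) {
    line_t set[WAYS] = { { 0, 0, 0, 0 } };
    int trace[] = { 1, 2, 1, 3, 4, 5, 6, 1 };  /* 3..6 are use-once blocks */
    for (int i = 0; i < 8; i++) access_line(set, trace[i]);
    for (int w = 0; w < WAYS; w++)
        printf("way %d: tag %d reused=%d\n", w, set[w].tag, set[w].reused);
    return 0;
}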

Proceedings ArticleDOI
TL;DR: In this paper, CPU cache memory optimization techniques, such as blocking, loop fusion, array merging and array transposition, are adapted to the GPGPU's cache memory to identify performance improvement techniques that go beyond GPGPU best practices.
Abstract: General Purpose Graphics Processing Units (GPGPUs) are widely used for achieving high performance or high throughput in parallel programming. This capability of GPGPUs is very popular in the current era and is mostly used for scientific computing, which requires more processing power than normal personal computers. Therefore, many programmers, researchers and industrial users adopt this concept for their work. However, achieving high performance or high throughput using GPGPUs is not an easy task compared with conventional programming concepts on the CPU side. In this research, CPU cache memory optimization techniques have been adapted to the GPGPU's cache memory to identify performance improvement techniques that go beyond GPGPU best practices. The cache optimization techniques of blocking, loop fusion, array merging and array transposition were tested on GPGPUs to determine their suitability. Finally, we identified that some of the CPU cache optimization techniques work well with the cache memory system of the GPGPU and show performance improvements, while others show the opposite effect on GPGPUs compared with CPUs.

Patent
12 Jan 2018
Abstract: The invention provides a method for dividing a last-level shared cache. The method comprises the following steps: the optimal cache for each processor core during execution is determined; based on the optimal cache, the page coloring number to be allocated to each processor core is determined; based on the optimal cache and the page coloring number, the cache line number to be allocated to each processor core is calculated; and in descending order of the page coloring numbers and the cache line numbers corresponding to the processor cores, the last-level shared cache is divided. Two-dimensional division of the last-level cache is achieved based on the page coloring numbers and the cache line numbers, so that the division granularity is refined and extensibility is improved. The invention further provides a system for dividing the last-level shared cache, and the system has the above advantages.