
Showing papers on "Cache coloring published in 2019"


Proceedings ArticleDOI
16 Apr 2019
TL;DR: This paper presents a framework of software-based techniques to restore memory access determinism in high-performance embedded systems that leverages OS-transparent and DMA-friendly cache coloring, in combination with an invalidation-driven allocation technique.
Abstract: One of the main predictability bottlenecks of modern multi-core embedded systems is contention for access to shared memory resources. Partitioning and software-driven allocation of memory resources is an effective strategy to mitigate contention in the memory hierarchy. Unfortunately, however, many of the strategies adopted so far can have unforeseen side-effects when practically implemented on latest-generation, high-performance embedded platforms. Predictability is further jeopardized by cache eviction policies based on random replacement, targeting average performance instead of timing determinism. In this paper, we present a framework of software-based techniques to restore memory access determinism in high-performance embedded systems. Our approach leverages OS-transparent and DMA-friendly cache coloring, in combination with an invalidation-driven allocation (IDA) technique. The proposed method allows protecting important cache blocks from (i) external eviction by tasks concurrently executing on different cores, and (ii) internal eviction by tasks running on the same core. A working implementation obtained by extending the Jailhouse partitioning hypervisor is presented and evaluated with a combination of synthetic and real benchmarks.
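
The common building block behind this line of work is deriving a page's cache color from its physical address. A minimal sketch is shown below, assuming an illustrative 2 MiB, 16-way LLC with 64 B lines and 4 KiB pages (the constants are assumptions, not parameters from the paper): pages with different colors can never map to the same LLC sets, so a coloring allocator that hands each partition only pages from its assigned color set keeps partitions from evicting each other.

```c
#include <stdint.h>

/* Illustrative geometry: 4 KiB pages, 2 MiB 16-way LLC with 64 B lines.
 * 2 MiB / 16 ways / 64 B = 2048 sets -> 11 set-index bits (bits 6..16).
 * The set-index bits above the 12-bit page offset (bits 12..16) form the
 * page color: pages with different colors never share an LLC set. */
#define PAGE_SHIFT   12
#define LINE_SHIFT    6
#define SET_BITS     11
#define COLOR_BITS   (LINE_SHIFT + SET_BITS - PAGE_SHIFT)   /* = 5  */
#define NUM_COLORS   (1u << COLOR_BITS)                     /* = 32 */

static inline unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1));
}
```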

50 citations


Proceedings ArticleDOI
16 Apr 2019
TL;DR: This paper presents Fractional GPUs (FGPUs), a software-only mechanism to partition both compute and memory resources of a GPU to allow parallel execution of GPU workloads with performance isolation.
Abstract: GPUs are increasingly being used in real-time systems, such as autonomous vehicles, due to the vast performance benefits that they offer. As more and more applications use GPUs, more than one application may need to run on the same GPU in parallel. However, real-time systems also require predictable performance from each individual application, which GPUs do not fully support in a multi-tasking environment. Nvidia recently added a new feature in their latest GPU architecture that allows limited resource provisioning. This feature is provided in the form of a closed-source kernel module called the Multi-Process Service (MPS). However, MPS only provides the capability to partition the compute resources of a GPU and does not provide any mechanism to avoid inter-application conflicts within the shared memory hierarchy. In our experiments, we find that compute resource partitioning alone is not sufficient for performance isolation. In the worst case, due to interference from co-running GPU tasks, read/write transactions can observe a slowdown of more than 10x. In this paper, we present Fractional GPUs (FGPUs), a software-only mechanism to partition both compute and memory resources of a GPU to allow parallel execution of GPU workloads with performance isolation. As many details of the GPU memory hierarchy are not publicly available, we first reverse-engineer the information through various micro-benchmarks. We find that the GPU memory hierarchy is different from that of the CPU, making it well-suited for page coloring. Based on our findings, we were able to partition both the L2 cache and DRAM for multiple Nvidia GPUs. Furthermore, we show that a better strategy exists for partitioning compute resources than the one used by MPS. An FGPU combines both this strategy and memory coloring to provide superior isolation. We compare our FGPU implementation with Nvidia MPS. Compared to MPS, FGPU reduces the average variation in application runtime, in a multi-tenancy environment, from 135% to 9%. To allow multiple applications to use FGPUs seamlessly, we ported Caffe, a popular framework used for machine learning, to use our FGPU API.
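
As the abstract notes, the memory-hierarchy details had to be reverse-engineered through micro-benchmarks. Below is a CPU-side analogue of the classic pointer-chase latency probe, sketched in plain C for POSIX systems (the paper's probes are CUDA kernels; buffer sizes and iteration counts here are illustrative assumptions): per-access latency jumps as the working set crosses a capacity boundary expose each level of the hierarchy.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; char pad[56]; };   /* one node per 64 B line */

/* Time a random pointer chase over n cache-line-sized nodes; a jump in the
 * per-access latency as n grows past a cache level reveals its capacity. */
static double chase_ns(size_t n, size_t iters)
{
    struct node *buf = calloc(n, sizeof *buf);
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)                    /* link nodes into one cycle */
        buf[perm[i]].next = &buf[perm[(i + 1) % n]];

    struct timespec t0, t1;
    struct node *p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) p = p->next;   /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
    if (!p) puts("");                                 /* keep the chase live */
    free(perm); free(buf);
    return ns;
}

int main(void)
{
    for (size_t kib = 64; kib <= 16 * 1024; kib *= 2)   /* 64 KiB .. 16 MiB */
        printf("%6zu KiB: %.2f ns/access\n", kib, chase_ns(kib * 1024 / 64, 20000000));
    return 0;
}
```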

38 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: This paper provides a full-stack, working implementation on a latest-generation MPSoC platform, and shows results based on both a set of data intensive tasks, as well as a case study based on an image processing benchmark application.
Abstract: Multiprocessor Systems-on-Chip (MPSoC) integrating hard processing cores with programmable logic (PL) are becoming increasingly common. While these platforms were originally designed for high performance computing applications, their rich feature set can be exploited to efficiently implement mixed criticality domains serving both critical hard real-time tasks and soft real-time tasks. In this paper, we take a deep look at commercially available heterogeneous MPSoCs that incorporate PL and a multicore processor. We show how one can tailor these processors to support a mixed criticality system, where cores are strictly isolated to avoid contention on shared resources such as Last-Level Cache (LLC) and main memory. In order to avoid conflicts in the last-level cache, we propose the use of cache coloring, implemented in the Jailhouse hypervisor. In addition, we employ ScratchPad Memory (SPM) inside the PL to support a multi-phase execution model for real-time tasks that avoids conflicts in shared memory. We provide a full-stack, working implementation on a latest-generation MPSoC platform, and show results based on both a set of data-intensive tasks and a case study based on an image processing benchmark application.
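
The multi-phase execution model mentioned above splits each task into memory phases and a compute phase that touches only the scratchpad. A hedged sketch in C follows; the SPM base address, its size, and the placeholder workload are assumptions, not values from the paper. Because only the two memory phases touch shared DRAM, they can be scheduled so that they never overlap with the memory phases of tasks on other cores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scratchpad window exposed through the PL; base address and
 * size are placeholders. */
#define SPM_BASE  ((volatile uint8_t *)(uintptr_t)0xA0000000u)
#define SPM_SIZE  (128u * 1024u)

/* Three-phase task: load, compute, unload. */
void task_body(const uint8_t *in_dram, uint8_t *out_dram, size_t len)
{
    if (len > SPM_SIZE) len = SPM_SIZE;

    /* Phase 1 (memory): stage the working set into the scratchpad. */
    for (size_t i = 0; i < len; i++) SPM_BASE[i] = in_dram[i];

    /* Phase 2 (compute): operate on the SPM only, immune to LLC/DRAM
     * contention from co-running cores. */
    for (size_t i = 0; i < len; i++) SPM_BASE[i] ^= 0xFFu;   /* placeholder work */

    /* Phase 3 (memory): write the results back to main memory. */
    for (size_t i = 0; i < len; i++) out_dram[i] = SPM_BASE[i];
}
```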

26 citations


Journal ArticleDOI
TL;DR: Online algorithms that dynamically adjust partition sizes in order to maximize the overall utility are developed, proven to converge to optimal solutions, and shown through numerical evaluations to be effective.
Abstract: In this paper, we consider the problem of allocating cache resources among multiple content providers. The cache can be partitioned into slices and each partition can be dedicated to a particular content provider or shared among a number of them. It is assumed that each partition employs the least recently used policy for managing content. We propose utility-driven partitioning, where we associate with each content provider a utility that is a function of the hit rate observed by the content provider. We consider two scenarios: 1) content providers serve disjoint sets of files and 2) there is some overlap in the content served by multiple content providers. In the first case, we prove that cache partitioning outperforms cache sharing as the cache size and the number of contents served by providers go to infinity. In the second case, it can be beneficial to have separate partitions for overlapped content. In the case of two providers, it is usually beneficial to allocate one cache partition to serve all overlapped content and separate partitions to serve the non-overlapped contents of both providers. We establish conditions under which this is true asymptotically, but also present an example where it is not true asymptotically. We develop online algorithms that dynamically adjust partition sizes in order to maximize the overall utility, prove that they converge to optimal solutions, and show through numerical evaluations that they are effective.
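
As a rough illustration of utility-driven partition adjustment (not the paper's exact update rule), the sketch below periodically moves capacity from the provider with the smallest estimated marginal utility per cache slot to the one with the largest, keeping total capacity constant. The structure fields and the step size are assumptions for the sketch.

```c
#include <stddef.h>

struct partition {
    size_t slots;            /* current capacity of this provider's slice    */
    double marginal_utility; /* estimated dU_i/d(slots) from observed hits   */
};

/* One re-partitioning step: shift `step` slots from the slice with the lowest
 * marginal utility to the slice with the highest one. */
void rebalance(struct partition *p, size_t n, size_t step)
{
    size_t hi = 0, lo = 0;
    for (size_t i = 1; i < n; i++) {
        if (p[i].marginal_utility > p[hi].marginal_utility) hi = i;
        if (p[i].marginal_utility < p[lo].marginal_utility) lo = i;
    }
    if (hi != lo && p[lo].slots > step) {
        p[lo].slots -= step;
        p[hi].slots += step;
    }
}
```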

15 citations


Journal ArticleDOI
TL;DR: The proposed scheme reduces data replacement time in the event of changes in topology or cache data replacement using the concept of a temporal cache, and has a higher cache hit ratio and lower cost for data replacement and query processing than existing schemes.
Abstract: In this paper, we propose a cooperative caching scheme for multimedia data via clusters based on peer connectivity in mobile P2P networks. In the proposed scheme, a cluster is organized for cache sharing among mobile peers with long-term connectivity, and metadata are disseminated to neighbor peers for efficient multimedia data search performance. It reduces data duplication and uses cache space efficiently through integrative cache management of peers inside the cluster. The proposed scheme reduces data replacement time in the event of changes in topology or cache data replacement using the concept of a temporal cache. It performs data recovery and cluster adjustment through cluster management in the event of an abrupt disconnection of a peer. In this scheme, metadata of popular multimedia data are disseminated to neighbor peers for efficient data searching. In a data search, queries are processed in the order of local cache, metadata, the cluster to which the peer belongs, and neighbor clusters, in accordance with the cooperative caching strategy. Performance evaluation results show that the proposed scheme has a higher cache hit ratio, and lower cost for data replacement and query processing, than existing schemes.
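
The query-processing order described above amounts to a lookup cascade. The sketch below is purely illustrative; the helper functions are hypothetical stubs standing in for the local cache, the disseminated metadata table, and remote cluster queries.

```c
#include <stdbool.h>

/* Placeholder lookup helpers; in a real peer these would consult the local
 * cache, the metadata disseminated by neighbors, and remote peers. */
static bool lookup_local_cache(const char *k, void **d)      { (void)k; (void)d; return false; }
static bool lookup_metadata_hint(const char *k, void **d)    { (void)k; (void)d; return false; }
static bool query_own_cluster(const char *k, void **d)       { (void)k; (void)d; return false; }
static bool query_neighbor_clusters(const char *k, void **d) { (void)k; (void)d; return false; }

enum source { SRC_LOCAL, SRC_METADATA, SRC_OWN_CLUSTER, SRC_NEIGHBOR, SRC_MISS };

/* Search order from the paper: local cache, metadata, own cluster, neighbors. */
enum source find_item(const char *key, void **data_out)
{
    if (lookup_local_cache(key, data_out))      return SRC_LOCAL;
    if (lookup_metadata_hint(key, data_out))    return SRC_METADATA;
    if (query_own_cluster(key, data_out))       return SRC_OWN_CLUSTER;
    if (query_neighbor_clusters(key, data_out)) return SRC_NEIGHBOR;
    return SRC_MISS;
}
```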

6 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: A cache partition controller called LLC-PC is proposed that uses the Palloc page coloring framework to decrease the cache partition sizes for applications during runtime, allowing more cache space to be allocated to other applications.
Abstract: The current trend in automotive systems is to integrate more software applications into fewer ECUs to decrease cost and increase efficiency. This means more applications share the same resources, which in turn can cause congestion on resources such as caches. Shared resource congestion may cause problems for time-critical applications due to unpredictable interference among applications. It is possible to reduce the effects of shared resource congestion using cache partitioning techniques, which assign dedicated cache lines to different applications. We propose a cache partition controller called LLC-PC that uses the Palloc page coloring framework to decrease the cache partition sizes for applications during runtime. LLC-PC creates cache partitioning directives for the Palloc tool by evaluating the performance gained from increasing the cache partition size. We have evaluated LLC-PC using three different applications, including the SIFT image processing algorithm, which is commonly used for feature detection in vision systems. We show that LLC-PC is able to decrease the amount of cache allocated to applications while maintaining their performance, allowing more cache space to be allocated to other applications.
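
A hedged sketch of the sizing logic such a controller could use is shown below (this is not the authors' implementation): colors are granted one at a time as long as the measured relative speedup from the previous grant stays above a threshold, and the resulting count would then be written out as a Palloc-style color directive. The measurement hook and the threshold are assumptions.

```c
/* Grow an application's color allocation while the measured relative speedup
 * from the previous grant exceeds min_gain. `measure` is a caller-supplied
 * profiling hook (an assumption of this sketch) returning the runtime
 * observed with a given number of colors. */
int size_partition(int max_colors, double min_gain,
                   double (*measure)(int ncolors))
{
    int ncolors = 1;
    double prev = measure(ncolors);
    while (ncolors < max_colors) {
        double cur = measure(ncolors + 1);
        if ((prev - cur) / prev < min_gain)   /* marginal gain too small: stop */
            break;
        prev = cur;
        ncolors++;
    }
    return ncolors;   /* emitted as a Palloc-style partitioning directive */
}
```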

6 citations


Patent
16 Jul 2019
TL;DR: In this article, a data structure such as a linked list is maintained to track information representative of hash-mapped cache locations of a hash-mapped cache, in which the information tracks the sequential order of entering data into each hash-mapped cache location.
Abstract: The described technology is directed towards efficiently invalidating cached data (e.g., expired data) in a hash-mapped cache, e.g., on a timed basis. As a result, data can be returned from the cache without checking whether that data is expired (if desired and acceptable), because if expired, the data has only been expired briefly since the last invalidation run. To this end, a data structure such as a linked list is maintained to track information representative of hash-mapped cache locations of a hash-mapped cache, in which the information tracks the sequential order of entering data into each hash-mapped cache location. An invalidation run is performed on part of the hash-mapped cache, including using the tracking information to invalidate a sequence of one or more cache locations, e.g., only the sequence of those locations that contain expired data.
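
A minimal sketch of the tracking idea follows, assuming a uniform TTL so that insertion order matches expiry order (the layout and constants are illustrative, not taken from the patent): slots are threaded onto a list in the order they were filled, and an invalidation run clears entries from the oldest end until it reaches one that has not yet expired.

```c
#include <time.h>

#define NSLOTS       1024
#define TTL_SECONDS  60           /* uniform TTL assumed for this sketch */

struct slot {
    int          used;
    time_t       inserted;
    struct slot *next_inserted;   /* next slot in order of insertion */
};

static struct slot  table[NSLOTS];     /* the hash-mapped cache slots   */
static struct slot *oldest, *newest;   /* insertion-order tracking list */

/* Called after data is stored into a slot: append it to the tracking list. */
void track_insert(struct slot *s)
{
    s->used = 1;
    s->inserted = time(NULL);
    s->next_inserted = NULL;
    if (newest) newest->next_inserted = s; else oldest = s;
    newest = s;
}

/* Timed invalidation run: walk from the oldest entry and invalidate slots
 * until the first one that has not yet expired. */
void invalidation_run(void)
{
    time_t now = time(NULL);
    while (oldest && now - oldest->inserted >= TTL_SECONDS) {
        struct slot *s = oldest;
        oldest = s->next_inserted;
        s->used = 0;
    }
    if (!oldest) newest = NULL;
}
```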

4 citations


Journal ArticleDOI
TL;DR: This work proposes Synergy, a hypervisor-managed caching system to improve memory efficiency in over-commitment scenarios, and implements a novel file-level eviction policy that prevents hypervisor caching benefits from being squandered away due to partial cache hits.
Abstract: Efficient system-wide memory management is an important challenge for over-commitment based hosting in virtualized systems. Due to the limitation of memory domains considered for sharing, current deduplication solutions simply cannot achieve system-wide deduplication. Popular memory management techniques like sharing and ballooning enable important memory usage optimizations individually. However, they do not complement each other and, in fact, may degrade individual benefits when combined. We propose Synergy, a hypervisor-managed caching system to improve memory efficiency in over-commitment scenarios. Synergy builds on an exclusive caching framework to achieve, for the first time, system-wide memory deduplication. Synergy also enables the co-existence of the mutually agnostic ballooning and sharing techniques within hypervisor-managed systems. Finally, Synergy implements a novel file-level eviction policy that prevents hypervisor caching benefits from being squandered away due to partial cache hits. Synergy's cache is flexible, with configuration knobs for cache sizing and data storage options, and a utility-based cache partitioning scheme. Our evaluation shows that Synergy consistently uses 10 to 75 percent less memory by exploiting system-wide deduplication as compared to inclusive caching techniques and achieves application speedups of 2x to 23x. We also demonstrate the capabilities of Synergy to increase VM packing density and support dynamic reconfiguration of cache partitioning policies.

3 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: The vertical allocation of L2 cache and LLC in page coloring under hyper-threading is rethought, the impact of color allocation on programs is discussed, and slice information is fully exploited to propose the Partial Conflict Color (PCC).
Abstract: On modern multi-core machines, page coloring has been used to alleviate competition at the Last Level Cache (LLC). However, the latest developments in CPU architecture have brought new issues to page coloring. First, in the case of a three-level cache, previous work on page coloring did not discuss the impact of color allocation on the L2 cache, nor the competition for the L2 cache under hyper-threading. In addition, as the last-level cache structure has changed from shared to slice-based and an undocumented hash function is applied, page coloring becomes more complex and slice information is not fully utilized. This paper presents solutions to these issues. First, by making small changes to traditional page coloring, the problem that page coloring may waste L2 cache is alleviated. At the same time, we rethink the vertical allocation of L2 cache and LLC in page coloring under hyper-threading, and discuss the impact of color allocation on programs, especially those with different sensitivity to the L2 cache and the LLC. Finally, we make full use of slice information and propose the Partial Conflict Color (PCC), together with a fast method to obtain it. Experiments show that using PCC can improve system performance when the number of colors is insufficient.
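
To make the slice issue concrete: on recent CPUs the LLC slice is selected by an undocumented hash of physical address bits, commonly modeled in reverse-engineering work as XOR folds over bit masks. The sketch below uses placeholder masks (not the hash of any particular CPU); the point is only that two pages with the same conventional page color can still land in different slices, which is the extra information a slice-aware coloring scheme can exploit.

```c
#include <stdint.h>

/* XOR-fold of the physical-address bits selected by a mask; returns their
 * parity (0 or 1). __builtin_parityll is a GCC/Clang builtin. */
static inline unsigned xor_fold(uint64_t addr, uint64_t mask)
{
    return (unsigned)__builtin_parityll(addr & mask);
}

/* Two-bit slice index built from two placeholder masks; real CPUs use an
 * undocumented, model-specific hash, so these masks are purely illustrative. */
unsigned llc_slice(uint64_t phys_addr)
{
    const uint64_t mask_bit0 = 0x0F0F0F0F000ull;
    const uint64_t mask_bit1 = 0x33333333000ull;
    return (xor_fold(phys_addr, mask_bit1) << 1) | xor_fold(phys_addr, mask_bit0);
}
```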

2 citations


Journal ArticleDOI
TL;DR: This article proposes an integrated architectural scheme to optimize memory accesses and thereby boost the performance and energy efficiency of GPUs; it groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor to dedicated memory banks.
Abstract: Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, memory becomes a bottleneck of GPU's performance and energy efficiency. In this article, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. First, we propose a thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor (SM) to dedicated memory banks. After that, TEMP dispatches the thread batch to an SM to ensure high-parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of the mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.

2 citations


Posted Content
TL;DR: To increase the predictability of COTS components, the authors use cache coloring, a technique widely used to partition cache memory; their main contribution is a WCET-aware heuristic that partitions the cache according to the needs of each task.
Abstract: The predictability of a system is the condition for giving safe bounds on the worst-case execution time (WCET) of the real-time tasks running on it. Commercial off-the-shelf (COTS) processors are increasingly used in embedded systems and contain shared cache memory. This component has hard-to-predict behavior because its state depends on the execution history of the system. To increase the predictability of COTS components, we use cache coloring, a technique widely used to partition cache memory. Our main contribution is a WCET-aware heuristic that partitions the cache according to the needs of each task. Our experiments are carried out with the ILP solver CPLEX on randomly generated task sets running on a preemptive system scheduled with earliest deadline first (EDF).
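
In the spirit of the described heuristic (though not its actual algorithm), the sketch below distributes cache colors greedily: every task starts with one color, and each remaining color goes to the task whose EDF utilization C_i(k_i)/T_i drops the most when it receives one more color. The WCET-versus-colors table is assumed to come from prior analysis; names and the greedy rule are assumptions of this sketch.

```c
#include <stddef.h>

/* wcet[i][k] is task i's WCET when given k colors (k = 0..ncolors), assumed
 * known from prior WCET analysis and non-increasing in k; period[i] is T_i. */
void assign_colors(size_t ntasks, size_t ncolors,
                   const double *const *wcet, const double *period,
                   size_t *colors_out)
{
    if (ncolors < ntasks) return;                 /* not enough colors to start */
    for (size_t i = 0; i < ntasks; i++) colors_out[i] = 1;

    for (size_t left = ncolors - ntasks; left > 0; left--) {
        size_t best = 0;
        double best_gain = -1.0;
        for (size_t i = 0; i < ntasks; i++) {
            size_t k = colors_out[i];
            double gain = (wcet[i][k] - wcet[i][k + 1]) / period[i];
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        colors_out[best]++;                       /* grant one more color */
    }
}
```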

01 Jan 2019
TL;DR: This paper provides a full-stack, working implementation on a latest-generation SoC platform, and shows results based on both a set of data intensive tasks, as well as a complete case study based on an anomaly detection application for an autonomous vehicle.
Abstract: Systems-on-a-Chip (SoC) devices integrating hard processing cores with programmable logic (PL) are becoming increasingly available. While these platforms were originally designed for high performance computing applications, their rich feature set can be exploited to efficiently implement mixed criticality domains serving both critical hard real-time tasks and soft real-time tasks. In this paper, we take a deep look at COTS-based heterogeneous SoCs that incorporate PL and a multicore processor. We show how one can tailor these processors to support a mixed/full criticality system, where cores are strictly isolated to avoid contention on shared resources such as Last-Level Cache (LLC) and main memory. In order to avoid conflicts in the LLC, we propose the use of cache coloring, implemented in the Jailhouse hypervisor. In addition, we employ ScratchPad Memory (SPM) inside the PL to support a multi-phase execution model for real-time tasks that avoids conflicts in shared memory. We provide a full-stack, working implementation on a latest-generation SoC platform, and show results based on both a set of data-intensive tasks and a complete case study based on an anomaly detection application for an autonomous vehicle.

Journal ArticleDOI
TL;DR: The results obtained with the proposed framework indicate that the management of video encoding parameters combined with application-tuned cache specifications has a high potential to reduce energy consumption of video coding systems while keeping video quality.
Abstract: This article presents a framework for assessing the behavior and energy impact of cache hierarchies when encoding HEVC on general-purpose processors. The memory energy estimation framework estimates the energy consumption of cache hierarchies based on mathematical models combined with memory access profiling tools. The energy analysis of several cache hierarchies targeting HEVC encoders with different input parameters is also carried out. This article provides relevant information on the energy consumption of HEVC encoders by taking into account the different tradeoffs between energy efficiency, coding efficiency, and other important cache memory design parameters, such as miss rates and access latency. The first analysis explores cache performance for different specifications, such as capacity and line size. Results show that most of the energy is spent on read operations (almost 73% in the first-level cache), indicating that HEVC encoders could benefit from memory technologies with low read energy costs. This analysis also showed that increasing the capacity affects the energy of the first-level cache the most, which on average consumes 34.78% more energy than the last-level cache. Based on this investigation, we report the most suitable cache specifications for HEVC encoders for each video resolution. The second analysis discusses the impact of HEVC input parameters on cache performance, demonstrating that it is possible to save up to 30% of energy with a small increase of 2% in BD-BR. A comparative analysis between the HM (HEVC model) and x265 (H.265 video codec) HEVC software models is presented, demonstrating that x265 is faster (speedups of up to 648x) and more cache-efficient, consuming less memory energy (31.38% on average) than the HM implementation. The results obtained with the proposed framework indicate that the management of video encoding parameters combined with application-tuned cache specifications has a high potential to reduce the energy consumption of video coding systems while keeping video quality.
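
The kind of per-level energy model such a framework builds on can be written down in a few lines. The sketch below is a generic illustration, not the article's model; the per-event energy constants would come from a memory model (e.g. a CACTI-like tool), and the names and the split into read/write/miss terms are assumptions.

```c
/* Per-level cache energy from profiled access counts and per-event energies. */
struct cache_stats  { double reads, writes, read_misses, write_misses; };
struct cache_energy { double e_read, e_write, e_miss_penalty; };   /* nJ per event */

double level_energy(const struct cache_stats *s, const struct cache_energy *e)
{
    return s->reads  * e->e_read
         + s->writes * e->e_write
         + (s->read_misses + s->write_misses) * e->e_miss_penalty;
}
```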

Posted Content
TL;DR: An integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU is proposed and results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
Abstract: Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, the memory becomes a bottleneck of GPU's performance and energy efficiency. In this work, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. Firstly, we propose a thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor (SM) to dedicated memory banks. After that, TEMP dispatches the thread batch to an SM to ensure high-parallel memory-access streaming from the different thread blocks. Secondly, a thread batch-aware scheduling (TBAS) scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of the mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.