
Showing papers on "Cache coloring published in 2019"


Proceedings ArticleDOI
16 Apr 2019
TL;DR: This paper presents a framework of software-based techniques to restore memory access determinism in high-performance embedded systems that leverages OS-transparent and DMA-friendly cache coloring, in combination with an invalidation-driven allocation technique.
Abstract: One of the main predictability bottlenecks of modern multi-core embedded systems is contention for access to shared memory resources. Partitioning and software-driven allocation of memory resources is an effective strategy to mitigate contention in the memory hierarchy. Unfortunately, however, many of the strategies adopted so far can have unforeseen side-effects when practically implemented on latest-generation, high-performance embedded platforms. Predictability is further jeopardized by cache eviction policies based on random replacement, targeting average performance instead of timing determinism. In this paper, we present a framework of software-based techniques to restore memory access determinism in high-performance embedded systems. Our approach leverages OS-transparent and DMA-friendly cache coloring, in combination with an invalidation-driven allocation (IDA) technique. The proposed method allows protecting important cache blocks from (i) external eviction by tasks concurrently executing on different cores, and (ii) internal eviction by tasks running on the same core. A working implementation obtained by extending the Jailhouse partitioning hypervisor is presented and evaluated with a combination of synthetic and real benchmarks.
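
The common building block behind this line of work is deriving a page's cache color from its physical address. A minimal sketch is shown below, assuming an illustrative 2 MiB, 16-way LLC with 64 B lines and 4 KiB pages (the constants are assumptions, not parameters from the paper): pages with different colors can never map to the same LLC sets, so a coloring allocator that hands each partition only pages from its assigned color set keeps partitions from evicting each other.

```c
#include <stdint.h>

/* Illustrative geometry: 4 KiB pages, 2 MiB 16-way LLC with 64 B lines.
 * 2 MiB / 16 ways / 64 B = 2048 sets -> 11 set-index bits (bits 6..16).
 * The set-index bits above the 12-bit page offset (bits 12..16) form the
 * page color: pages with different colors never share an LLC set. */
#define PAGE_SHIFT   12
#define LINE_SHIFT    6
#define SET_BITS     11
#define COLOR_BITS   (LINE_SHIFT + SET_BITS - PAGE_SHIFT)   /* = 5  */
#define NUM_COLORS   (1u << COLOR_BITS)                     /* = 32 */

static inline unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1));
}
```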

50 citations


Proceedings ArticleDOI
16 Apr 2019
TL;DR: This paper presents Fractional GPUs (FGPUs), a software-only mechanism to partition both compute and memory resources of a GPU to allow parallel execution of GPU workloads with performance isolation.
Abstract: GPUs are increasingly being used in real-time systems, such as autonomous vehicles, due to the vast performance benefits that they offer. As more and more applications use GPUs, more than one application may need to run on the same GPU in parallel. However, real-time systems also require predictable performance from each individual application, which GPUs do not fully support in a multi-tasking environment. Nvidia recently added a new feature in their latest GPU architecture that allows limited resource provisioning. This feature is provided in the form of a closed-source kernel module called the Multi-Process Service (MPS). However, MPS only provides the capability to partition the compute resources of a GPU and does not provide any mechanism to avoid inter-application conflicts within the shared memory hierarchy. In our experiments, we find that compute resource partitioning alone is not sufficient for performance isolation. In the worst case, due to interference from co-running GPU tasks, read/write transactions can observe a slowdown of more than 10x. In this paper, we present Fractional GPUs (FGPUs), a software-only mechanism to partition both compute and memory resources of a GPU to allow parallel execution of GPU workloads with performance isolation. As many details of the GPU memory hierarchy are not publicly available, we first reverse-engineer the information through various micro-benchmarks. We find that the GPU memory hierarchy is different from that of the CPU, making it well-suited for page coloring. Based on our findings, we were able to partition both the L2 cache and DRAM for multiple Nvidia GPUs. Furthermore, we show that a better strategy exists for partitioning compute resources than the one used by MPS. An FGPU combines both this strategy and memory coloring to provide superior isolation. We compare our FGPU implementation with Nvidia MPS. Compared to MPS, FGPU reduces the average variation in application runtime, in a multi-tenancy environment, from 135% to 9%. To allow multiple applications to use FGPUs seamlessly, we ported Caffe, a popular framework used for machine learning, to use our FGPU API.
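
As the abstract notes, the memory-hierarchy details had to be reverse-engineered through micro-benchmarks. Below is a CPU-side analogue of the classic pointer-chase latency probe, sketched in plain C for POSIX systems (the paper's probes are CUDA kernels; buffer sizes and iteration counts here are illustrative assumptions): per-access latency jumps as the working set crosses a capacity boundary expose each level of the hierarchy.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; char pad[56]; };   /* one node per 64 B line */

/* Time a random pointer chase over n cache-line-sized nodes; a jump in the
 * per-access latency as n grows past a cache level reveals its capacity. */
static double chase_ns(size_t n, size_t iters)
{
    struct node *buf = calloc(n, sizeof *buf);
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)                    /* link nodes into one cycle */
        buf[perm[i]].next = &buf[perm[(i + 1) % n]];

    struct timespec t0, t1;
    struct node *p = buf;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) p = p->next;   /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
    if (!p) puts("");                                 /* keep the chase live */
    free(perm); free(buf);
    return ns;
}

int main(void)
{
    for (size_t kib = 64; kib <= 16 * 1024; kib *= 2)   /* 64 KiB .. 16 MiB */
        printf("%6zu KiB: %.2f ns/access\n", kib, chase_ns(kib * 1024 / 64, 20000000));
    return 0;
}
```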

38 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: This paper provides a full-stack, working implementation on a latest-generation MPSoC platform, and shows results based on both a set of data intensive tasks, as well as a case study based on an image processing benchmark application.
Abstract: Multiprocessor Systems-on-Chip (MPSoC) integrating hard processing cores with programmable logic (PL) are becoming increasingly common. While these platforms were originally designed for high performance computing applications, their rich feature set can be exploited to efficiently implement mixed criticality domains serving both critical hard real-time tasks and soft real-time tasks. In this paper, we take a deep look at commercially available heterogeneous MPSoCs that incorporate PL and a multicore processor. We show how one can tailor these processors to support a mixed criticality system, where cores are strictly isolated to avoid contention on shared resources such as Last-Level Cache (LLC) and main memory. In order to avoid conflicts in the last-level cache, we propose the use of cache coloring, implemented in the Jailhouse hypervisor. In addition, we employ ScratchPad Memory (SPM) inside the PL to support a multi-phase execution model for real-time tasks that avoids conflicts in shared memory. We provide a full-stack, working implementation on a latest-generation MPSoC platform, and show results based on both a set of data-intensive tasks and a case study based on an image processing benchmark application.
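
The multi-phase execution model mentioned above splits each task into memory phases and a compute phase that touches only the scratchpad. A hedged sketch in C follows; the SPM base address, its size, and the placeholder workload are assumptions, not values from the paper. Because only the two memory phases touch shared DRAM, they can be scheduled so that they never overlap with the memory phases of tasks on other cores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scratchpad window exposed through the PL; base address and
 * size are placeholders. */
#define SPM_BASE  ((volatile uint8_t *)(uintptr_t)0xA0000000u)
#define SPM_SIZE  (128u * 1024u)

/* Three-phase task: load, compute, unload. */
void task_body(const uint8_t *in_dram, uint8_t *out_dram, size_t len)
{
    if (len > SPM_SIZE) len = SPM_SIZE;

    /* Phase 1 (memory): stage the working set into the scratchpad. */
    for (size_t i = 0; i < len; i++) SPM_BASE[i] = in_dram[i];

    /* Phase 2 (compute): operate on the SPM only, immune to LLC/DRAM
     * contention from co-running cores. */
    for (size_t i = 0; i < len; i++) SPM_BASE[i] ^= 0xFFu;   /* placeholder work */

    /* Phase 3 (memory): write the results back to main memory. */
    for (size_t i = 0; i < len; i++) out_dram[i] = SPM_BASE[i];
}
```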

26 citations


Journal ArticleDOI
TL;DR: Online algorithms that dynamically adjust partition sizes in order to maximize the overall utility are developed, proven to converge to optimal solutions, and shown through numerical evaluations to be effective.
Abstract: In this paper, we consider the problem of allocating cache resources among multiple content providers. The cache can be partitioned into slices and each partition can be dedicated to a particular content provider or shared among a number of them. It is assumed that each partition employs the least recently used policy for managing content. We propose utility-driven partitioning, where we associate with each content provider a utility that is a function of the hit rate observed by the content provider. We consider two scenarios: 1) content providers serve disjoint sets of files and 2) there is some overlap in the content served by multiple content providers. In the first case, we prove that cache partitioning outperforms cache sharing as the cache size and the number of contents served by providers go to infinity. In the second case, it can be beneficial to have separate partitions for overlapped content. In the case of two providers, it is usually beneficial to allocate one cache partition to serve all overlapped content and separate partitions to serve the non-overlapped contents of both providers. We establish conditions under which this is true asymptotically, but also present an example where it is not true asymptotically. We develop online algorithms that dynamically adjust partition sizes in order to maximize the overall utility, prove that they converge to optimal solutions, and show through numerical evaluations that they are effective.
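
As a rough illustration of utility-driven partition adjustment (not the paper's exact update rule), the sketch below periodically moves capacity from the provider with the smallest estimated marginal utility per cache slot to the one with the largest, keeping total capacity constant. The structure fields and the step size are assumptions for the sketch.

```c
#include <stddef.h>

struct partition {
    size_t slots;            /* current capacity of this provider's slice    */
    double marginal_utility; /* estimated dU_i/d(slots) from observed hits   */
};

/* One re-partitioning step: shift `step` slots from the slice with the lowest
 * marginal utility to the slice with the highest one. */
void rebalance(struct partition *p, size_t n, size_t step)
{
    size_t hi = 0, lo = 0;
    for (size_t i = 1; i < n; i++) {
        if (p[i].marginal_utility > p[hi].marginal_utility) hi = i;
        if (p[i].marginal_utility < p[lo].marginal_utility) lo = i;
    }
    if (hi != lo && p[lo].slots > step) {
        p[lo].slots -= step;
        p[hi].slots += step;
    }
}
```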

15 citations


Journal ArticleDOI
TL;DR: The proposed scheme reduces data replacement time in the event of changes in topology or cache data replacement using the concept of a temporal cache, and has a higher cache hit ratio and lower cost for data replacement and query processing than existing schemes.
Abstract: In this paper, we propose a cooperative caching scheme for multimedia data via clusters based on peer connectivity in mobile P2P networks. In the proposed scheme, a cluster is organized for cache sharing among mobile peers with long-term connectivity, and metadata are disseminated to neighbor peers for efficient multimedia data search performance. It reduces data duplication and uses cache space efficiently through integrative cache management of peers inside the cluster. The proposed scheme reduces data replacement time in the event of changes in topology or cache data replacement using the concept of a temporal cache. It performs data recovery and cluster adjustment through cluster management in the event of an abrupt disconnection of a peer. In this scheme, metadata of popular multimedia data are disseminated to neighbor peers for efficient data searching. In a data search, queries are processed in the order of local cache, metadata, the cluster to which the peer belongs, and neighbor clusters, in accordance with the cooperative caching strategy. Performance evaluation results show that the proposed scheme has a higher cache hit ratio, and lower cost for data replacement and query processing, than existing schemes.
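
The query-processing order described above amounts to a lookup cascade. The sketch below is purely illustrative; the helper functions are hypothetical stubs standing in for the local cache, the disseminated metadata table, and remote cluster queries.

```c
#include <stdbool.h>

/* Placeholder lookup helpers; in a real peer these would consult the local
 * cache, the metadata disseminated by neighbors, and remote peers. */
static bool lookup_local_cache(const char *k, void **d)      { (void)k; (void)d; return false; }
static bool lookup_metadata_hint(const char *k, void **d)    { (void)k; (void)d; return false; }
static bool query_own_cluster(const char *k, void **d)       { (void)k; (void)d; return false; }
static bool query_neighbor_clusters(const char *k, void **d) { (void)k; (void)d; return false; }

enum source { SRC_LOCAL, SRC_METADATA, SRC_OWN_CLUSTER, SRC_NEIGHBOR, SRC_MISS };

/* Search order from the paper: local cache, metadata, own cluster, neighbors. */
enum source find_item(const char *key, void **data_out)
{
    if (lookup_local_cache(key, data_out))      return SRC_LOCAL;
    if (lookup_metadata_hint(key, data_out))    return SRC_METADATA;
    if (query_own_cluster(key, data_out))       return SRC_OWN_CLUSTER;
    if (query_neighbor_clusters(key, data_out)) return SRC_NEIGHBOR;
    return SRC_MISS;
}
```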

6 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: A cache partition controller called LLC-PC is proposed that uses the Palloc page coloring framework to decrease the cache partition sizes for applications during runtime, allowing more cache space to be allocated to other applications.
Abstract: The current trend in automotive systems is to integrate more software applications into fewer ECUs to decrease cost and increase efficiency. This means more applications share the same resources, which in turn can cause congestion on resources such as caches. Shared resource congestion may cause problems for time-critical applications due to unpredictable interference among applications. It is possible to reduce the effects of shared resource congestion using cache partitioning techniques, which assign dedicated cache lines to different applications. We propose a cache partition controller called LLC-PC that uses the Palloc page coloring framework to decrease the cache partition sizes for applications during runtime. LLC-PC creates cache partitioning directives for the Palloc tool by evaluating the performance gained from increasing the cache partition size. We have evaluated LLC-PC using three different applications, including the SIFT image processing algorithm, which is commonly used for feature detection in vision systems. We show that LLC-PC is able to decrease the amount of cache allocated to applications while maintaining their performance, allowing more cache space to be allocated to other applications.
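
A hedged sketch of the sizing logic such a controller could use is shown below (this is not the authors' implementation): colors are granted one at a time as long as the measured relative speedup from the previous grant stays above a threshold, and the resulting count would then be written out as a Palloc-style color directive. The measurement hook and the threshold are assumptions.

```c
/* Grow an application's color allocation while the measured relative speedup
 * from the previous grant exceeds min_gain. `measure` is a caller-supplied
 * profiling hook (an assumption of this sketch) returning the runtime
 * observed with a given number of colors. */
int size_partition(int max_colors, double min_gain,
                   double (*measure)(int ncolors))
{
    int ncolors = 1;
    double prev = measure(ncolors);
    while (ncolors < max_colors) {
        double cur = measure(ncolors + 1);
        if ((prev - cur) / prev < min_gain)   /* marginal gain too small: stop */
            break;
        prev = cur;
        ncolors++;
    }
    return ncolors;   /* emitted as a Palloc-style partitioning directive */
}
```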

6 citations


Patent
16 Jul 2019
TL;DR: In this article, a data structure such as a linked list is maintained to track information representative of hash-mapped cache locations of a hash-mapped cache, in which the information tracks the sequential order of entering data into each hash-mapped cache location.
Abstract: The described technology is directed towards efficiently invalidating cached data (e.g., expired data) in a hash-mapped cache, e.g., on a timed basis. As a result, data can be returned from the cache without checking whether that data is expired (if desired and acceptable), because if expired, the data has only been expired briefly since the last invalidation run. To this end, a data structure such as a linked list is maintained to track information representative of hash-mapped cache locations of a hash-mapped cache, in which the information tracks the sequential order of entering data into each hash-mapped cache location. An invalidation run is performed on part of the hash-mapped cache, including using the tracking information to invalidate a sequence of one or more cache locations, e.g., only the sequence of those locations that contain expired data.
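
A minimal sketch of the tracking idea follows, assuming a uniform TTL so that insertion order matches expiry order (the layout and constants are illustrative, not taken from the patent): slots are threaded onto a list in the order they were filled, and an invalidation run clears entries from the oldest end until it reaches one that has not yet expired.

```c
#include <time.h>

#define NSLOTS       1024
#define TTL_SECONDS  60           /* uniform TTL assumed for this sketch */

struct slot {
    int          used;
    time_t       inserted;
    struct slot *next_inserted;   /* next slot in order of insertion */
};

static struct slot  table[NSLOTS];     /* the hash-mapped cache slots   */
static struct slot *oldest, *newest;   /* insertion-order tracking list */

/* Called after data is stored into a slot: append it to the tracking list. */
void track_insert(struct slot *s)
{
    s->used = 1;
    s->inserted = time(NULL);
    s->next_inserted = NULL;
    if (newest) newest->next_inserted = s; else oldest = s;
    newest = s;
}

/* Timed invalidation run: walk from the oldest entry and invalidate slots
 * until the first one that has not yet expired. */
void invalidation_run(void)
{
    time_t now = time(NULL);
    while (oldest && now - oldest->inserted >= TTL_SECONDS) {
        struct slot *s = oldest;
        oldest = s->next_inserted;
        s->used = 0;
    }
    if (!oldest) newest = NULL;
}
```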

4 citations


Journal ArticleDOI
TL;DR: This work proposes Synergy, a hypervisor-managed caching system to improve memory efficiency in over-commitment scenarios, and implements a novel file-level eviction policy that prevents hypervisor caching benefits from being squandered away due to partial cache hits.
Abstract: Efficient system-wide memory management is an important challenge for over-commitment based hosting in virtualized systems. Due to the limitation of memory domains considered for sharing, current deduplication solutions simply cannot achieve system-wide deduplication. Popular memory management techniques like sharing and ballooning enable important memory usage optimizations individually. However, they do not complement each other and, in fact, may degrade individual benefits when combined. We propose Synergy, a hypervisor-managed caching system to improve memory efficiency in over-commitment scenarios. Synergy builds on an exclusive caching framework to achieve, for the first time, system-wide memory deduplication. Synergy also enables the co-existence of the mutually agnostic ballooning and sharing techniques within hypervisor-managed systems. Finally, Synergy implements a novel file-level eviction policy that prevents hypervisor caching benefits from being squandered away due to partial cache hits. Synergy's cache is flexible, with configuration knobs for cache sizing and data storage options, and a utility-based cache partitioning scheme. Our evaluation shows that Synergy consistently uses 10 to 75 percent less memory by exploiting system-wide deduplication as compared to inclusive caching techniques and achieves application speedups of 2x to 23x. We also demonstrate the capabilities of Synergy to increase VM packing density and support dynamic reconfiguration of cache partitioning policies.

3 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: The vertical allocation of L2 cache and LLC in page coloring under hyper-threading is rethought, the impact of color allocation on programs is discussed, and slice information is fully exploited to propose the Partial Conflict Color (PCC).
Abstract: On modern multi-core machines, page coloring has been used to alleviate competition at the Last Level Cache (LLC). However, the latest developments in CPU architecture have brought new issues to page coloring. First, in the case of a three-level cache, previous work on page coloring did not discuss the impact of color allocation on the L2 cache, nor the competition for the L2 cache under hyper-threading. In addition, as the last-level cache structure has changed from shared to slice-based and an undocumented hash function is applied, page coloring becomes more complex and slice information is not fully utilized. This paper presents solutions to these issues. First, by making small changes to traditional page coloring, the problem that page coloring may waste L2 cache is alleviated. At the same time, we rethink the vertical allocation of L2 cache and LLC in page coloring under hyper-threading, and discuss the impact of color allocation on programs, especially those with different sensitivity to the L2 cache and the LLC. Finally, we make full use of slice information and propose the Partial Conflict Color (PCC), together with a fast method to obtain it. Experiments show that using PCC can improve system performance when the number of colors is insufficient.
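
To make the slice issue concrete: on recent CPUs the LLC slice is selected by an undocumented hash of physical address bits, commonly modeled in reverse-engineering work as XOR folds over bit masks. The sketch below uses placeholder masks (not the hash of any particular CPU); the point is only that two pages with the same conventional page color can still land in different slices, which is the extra information a slice-aware coloring scheme can exploit.

```c
#include <stdint.h>

/* XOR-fold of the physical-address bits selected by a mask; returns their
 * parity (0 or 1). __builtin_parityll is a GCC/Clang builtin. */
static inline unsigned xor_fold(uint64_t addr, uint64_t mask)
{
    return (unsigned)__builtin_parityll(addr & mask);
}

/* Two-bit slice index built from two placeholder masks; real CPUs use an
 * undocumented, model-specific hash, so these masks are purely illustrative. */
unsigned llc_slice(uint64_t phys_addr)
{
    const uint64_t mask_bit0 = 0x0F0F0F0F000ull;
    const uint64_t mask_bit1 = 0x33333333000ull;
    return (xor_fold(phys_addr, mask_bit1) << 1) | xor_fold(phys_addr, mask_bit0);
}
```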

2 citations


Journal ArticleDOI
TL;DR: This article proposes an integrated architectural scheme to optimize memory accesses and thereby boost the performance and energy efficiency of GPUs; it groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor to dedicated memory banks.
Abstract: Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, memory becomes a bottleneck of GPU's performance and energy efficiency. In this article, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. First, we propose a thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor (SM) to dedicated memory banks. After that, TEMP dispatches the thread batch to an SM to ensure high-parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of the mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.

2 citations


Posted Content
TL;DR: To increase the predictability of COTS components, the authors use cache coloring, a technique widely used to partition cache memory; their main contribution is a WCET-aware heuristic that partitions the cache according to the needs of each task.
Abstract: The predictability of a system is the condition for giving safe bounds on the worst-case execution time (WCET) of the real-time tasks running on it. Commercial off-the-shelf (COTS) processors are increasingly used in embedded systems and contain shared cache memory. This component has hard-to-predict behavior because its state depends on the execution history of the system. To increase the predictability of COTS components, we use cache coloring, a technique widely used to partition cache memory. Our main contribution is a WCET-aware heuristic that partitions the cache according to the needs of each task. Our experiments are carried out with the ILP solver CPLEX on randomly generated task sets running on a preemptive system scheduled with earliest deadline first (EDF).
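
In the spirit of the described heuristic (though not its actual algorithm), the sketch below distributes cache colors greedily: every task starts with one color, and each remaining color goes to the task whose EDF utilization C_i(k_i)/T_i drops the most when it receives one more color. The WCET-versus-colors table is assumed to come from prior analysis; names and the greedy rule are assumptions of this sketch.

```c
#include <stddef.h>

/* wcet[i][k] is task i's WCET when given k colors (k = 0..ncolors), assumed
 * known from prior WCET analysis and non-increasing in k; period[i] is T_i. */
void assign_colors(size_t ntasks, size_t ncolors,
                   const double *const *wcet, const double *period,
                   size_t *colors_out)
{
    if (ncolors < ntasks) return;                 /* not enough colors to start */
    for (size_t i = 0; i < ntasks; i++) colors_out[i] = 1;

    for (size_t left = ncolors - ntasks; left > 0; left--) {
        size_t best = 0;
        double best_gain = -1.0;
        for (size_t i = 0; i < ntasks; i++) {
            size_t k = colors_out[i];
            double gain = (wcet[i][k] - wcet[i][k + 1]) / period[i];
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        colors_out[best]++;                       /* grant one more color */
    }
}
```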

01 Jan 2019
TL;DR: This paper provides a full-stack, working implementation on a latest-generation SoC platform, and shows results based on both a set of data intensive tasks, as well as a complete case study based on an anomaly detection application for an autonomous vehicle.
Abstract: Systems-on-a-Chip (SoC) devices integrating hard processing cores with programmable logic (PL) are becoming increasingly available. While these platforms were originally designed for high performance computing applications, their rich feature set can be exploited to efficiently implement mixed criticality domains serving both critical hard real-time tasks and soft real-time tasks. In this paper, we take a deep look at COTS-based heterogeneous SoCs that incorporate PL and a multicore processor. We show how one can tailor these processors to support a mixed/full criticality system, where cores are strictly isolated to avoid contention on shared resources such as Last-Level Cache (LLC) and main memory. In order to avoid conflicts in the LLC, we propose the use of cache coloring, implemented in the Jailhouse hypervisor. In addition, we employ ScratchPad Memory (SPM) inside the PL to support a multi-phase execution model for real-time tasks that avoids conflicts in shared memory. We provide a full-stack, working implementation on a latest-generation SoC platform, and show results based on both a set of data-intensive tasks and a complete case study based on an anomaly detection application for an autonomous vehicle.

Journal ArticleDOI
TL;DR: The results obtained with the proposed framework indicate that the management of video encoding parameters combined with application-tuned cache specifications has a high potential to reduce energy consumption of video coding systems while keeping video quality.
Abstract: This article presents a framework for assessing the behavior and energy impact of cache hierarchies when encoding HEVC on general-purpose processors. The memory energy estimation framework estimates the energy consumption of cache hierarchies based on mathematical models combined with memory access profiling tools. The energy analysis of several cache hierarchies targeting HEVC encoders with different input parameters is also carried out. This article provides relevant information on the energy consumption of HEVC encoders by taking into account the different tradeoffs between energy efficiency, coding efficiency, and other important cache memory design parameters, such as miss rates and access latency. The first analysis explores cache performance for different specifications, such as capacity and line size. Results show that most of the energy is spent on read operations (almost 73% in the first-level cache), indicating that HEVC encoders could benefit from memory technologies with low read energy costs. This analysis also showed that increasing the capacity affects the energy of the first-level cache the most, which on average consumes 34.78% more energy than the last-level cache. Based on this investigation, we report the most suitable cache specifications for HEVC encoders for each video resolution. The second analysis discusses the impact of HEVC input parameters on cache performance, demonstrating that it is possible to save up to 30% of energy with a small increase of 2% in BD-BR. A comparative analysis between the HM (HEVC model) and x265 (H.265 video codec) HEVC software models is presented, demonstrating that x265 is faster (speedups of up to 648x) and more cache-efficient, consuming less memory energy (31.38% on average) than the HM implementation. The results obtained with the proposed framework indicate that the management of video encoding parameters combined with application-tuned cache specifications has a high potential to reduce the energy consumption of video coding systems while keeping video quality.
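
The kind of per-level energy model such a framework builds on can be written down in a few lines. The sketch below is a generic illustration, not the article's model; the per-event energy constants would come from a memory model (e.g. a CACTI-like tool), and the names and the split into read/write/miss terms are assumptions.

```c
/* Per-level cache energy from profiled access counts and per-event energies. */
struct cache_stats  { double reads, writes, read_misses, write_misses; };
struct cache_energy { double e_read, e_write, e_miss_penalty; };   /* nJ per event */

double level_energy(const struct cache_stats *s, const struct cache_energy *e)
{
    return s->reads  * e->e_read
         + s->writes * e->e_write
         + (s->read_misses + s->write_misses) * e->e_miss_penalty;
}
```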

Posted Content
TL;DR: An integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU is proposed and results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
Abstract: Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, the memory becomes a bottleneck of GPU's performance and energy efficiency. In this work, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. Firstly, we propose a thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each stream multiprocessor (SM) to dedicated memory banks. After that, TEMP dispatches the thread batch to an SM to ensure high-parallel memory-access streaming from the different thread blocks. Secondly, a thread batch-aware scheduling (TBAS) scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of the mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.