Journal Article

Optimization of Intercache Traffic Entanglement in Tagless Caches With Tiling Opportunities

TLDR
New replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching and victim tag buffer caching, are proposed to incorporate L4 eviction costs into L3 replacement decisions efficiently and to address entanglement overheads and pathologies.
Abstract
So-called “tagless” caches have become common as a means of managing the vast L4 last-level caches (LLCs) enabled by increasing device density, emerging memory technologies, and advanced integration capabilities (e.g., 3-D). Tagless schemes often result in intercache entanglement between the tagless cache (L4) and the cache (L3) that stewards its metadata. We explore new cache organization policies that mitigate the overheads stemming from this intercache-level replacement entanglement. We add support for explicit tiling shapes that better match software access patterns, improving the spatial and temporal locality of large block allocations in many essential computational kernels. To address entanglement overheads and pathologies, we propose new replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching (RBC) and victim tag buffer caching (VBC), which incorporate L4 eviction costs into L3 replacement decisions efficiently. We evaluate our schemes on a range of software-tiled linear algebra kernels. RBC and VBC reduce memory traffic by 83/4.4/67% and 69/35.5/76% for 8/32/64 MB L4s, respectively. In addition, RBC and VBC provide speedups of 16/0.3/0.6% and 15.7/1.8/0.8%, respectively, for systems with 8/32/64 MB L4s, over a tagless cache with an LRU policy in the L3. We also show that matching the shape of each tagless region's superblock allocation to the access order of the software tile improves latency by 13.4% over the baseline tagless cache and reduces memory traffic by 51% relative to linear superblocks.
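As a concrete illustration of the two ideas the abstract combines, the sketch below pairs a software-tiled kernel with a toy address-to-superblock mapping: a linear superblock groups consecutive addresses, while a "shaped" superblock covers a TILE x TILE rectangle of a row-major matrix. All names and parameters (N, TILE, SB_BYTES, the sb_* functions) are illustrative assumptions, not the paper's actual hardware scheme.

```c
/* Minimal sketch (C): software tiling plus a toy superblock mapping.
 * N, TILE, SB_BYTES, and both sb_* functions are illustrative
 * assumptions, not the paper's actual hardware allocation scheme. */
#include <stdint.h>
#include <stdio.h>

#define N        1024               /* matrix dimension (row-major doubles)  */
#define TILE     32                 /* software tile edge                    */
#define SB_BYTES (TILE * TILE * 8)  /* superblock size = one tile of doubles */

/* C += A * B, blocked so each TILE x TILE working set stays resident;
 * this is the "software tiled" kernel style the evaluation refers to. */
void matmul_tiled(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[(size_t)i * N + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[(size_t)i * N + j] += a * B[(size_t)k * N + j];
                    }
}

/* Linear superblock: consecutive bytes share one superblock. */
static uint64_t sb_linear(uint64_t addr) { return addr / SB_BYTES; }

/* "Shaped" superblock: one TILE x TILE rectangle of the row-major
 * matrix maps to one superblock, matching the software tile. */
static uint64_t sb_shaped(int row, int col)
{
    return (uint64_t)(row / TILE) * (N / TILE) + (uint64_t)(col / TILE);
}

int main(void)
{
    char seen_lin[4096] = {0}, seen_shp[4096] = {0};
    int n_lin = 0, n_shp = 0;

    /* Count distinct superblocks touched by the first TILE x TILE tile. */
    for (int r = 0; r < TILE; r++)
        for (int c = 0; c < TILE; c++) {
            uint64_t addr = ((uint64_t)r * N + c) * sizeof(double);
            uint64_t l = sb_linear(addr);
            uint64_t s = sb_shaped(r, c);
            if (!seen_lin[l]) { seen_lin[l] = 1; n_lin++; }
            if (!seen_shp[s]) { seen_shp[s] = 1; n_shp++; }
        }

    printf("one %dx%d tile touches %d linear vs %d shaped superblocks\n",
           TILE, TILE, n_lin, n_shp);
    return 0;
}
```

With these toy parameters, one 32x32 tile of doubles spans 32 linear superblocks (its rows sit N doubles apart) but exactly one shaped superblock; this is the locality effect behind the abstract's reported 51% traffic reduction over linear superblocks.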


Citations
Proceedings Article

Trends and Opportunities for SRAM Based In-Memory and Near-Memory Computation

TL;DR: In this article, an I-NMC accelerator is proposed for sparse matrix multiplication (SMM), which speeds up index handling by 10x-60x and improves energy efficiency by 10x-70x, depending on the workload dimensions.
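For context on what "index handling" means in sparse matrix multiplication, a minimal CSR-format kernel is sketched below; the row_ptr/col_idx indirections it performs are the bookkeeping such accelerators speed up. The function and array names are generic assumptions, not taken from the cited paper.

```c
/* Minimal sketch (C): CSR sparse matrix x dense vector, y = A*x.
 * The row_ptr/col_idx indirections are the "index handling" an
 * in-/near-memory accelerator targets; names are illustrative. */
void spmv_csr(int n_rows,
              const int *row_ptr,    /* n_rows + 1 entries    */
              const int *col_idx,    /* one entry per nonzero */
              const double *vals,    /* one entry per nonzero */
              const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double acc = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += vals[k] * x[col_idx[k]];  /* indirect gather via col_idx */
        y[i] = acc;
    }
}
```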
References
Proceedings Article

Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM

TL;DR: A comparison of these technologies through full-system simulation shows that the proposed refresh-reduction method makes eDRAM a viable, energy-efficient technology for implementing L3Cs, with SRAM optimized for low leakage and STT-RAM for low write energy.
Journal Article

A Survey of Architectural Approaches for Managing Embedded DRAM and Non-Volatile On-Chip Caches

TL;DR: This paper surveys the architectural approaches proposed for designing memory systems and, specifically, caches with emerging memory technologies, and presents a classification of these technologies and architectural approaches based on their key characteristics.
Proceedings Article

A one transistor cell on bulk substrate (1T-Bulk) for low-cost and high density eDRAM

TL;DR: In this article, a 1T cell for high-density eDRAM is successfully developed on a bulk silicon substrate for the first time, and the integration of the memory cell in a matrix arrangement is evaluated.
Proceedings Article

Nonvolatile memory design based on ferroelectric FETs

TL;DR: This work proposes a 2-transistor (2T) FeFET-based nonvolatile memory with separate read and write paths that achieves nondestructive reads and lower write power at iso-write speed compared with standard FeRAM.
Proceedings Article

5.9 Haswell: A family of IA 22nm processors

TL;DR: The primary goals of the Haswell program are platform integration and low power to enable smaller form factors; the design adds the Intel AVX2 instruction set, which supports floating-point multiply-add (FMA) and 256b SIMD integer operations, achieving 2× the number of floating-point and integer operations over its predecessor.
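As a small illustration of the FMA capability the summary mentions, the sketch below issues one fused multiply-add over four packed doubles. It assumes an FMA-capable x86 CPU and a compiler flag such as gcc's -mfma; the values are arbitrary.

```c
/* Minimal sketch (C): one fused multiply-add over four packed doubles,
 * the AVX2-era FMA operation credited with doubling FP throughput.
 * Assumes an FMA-capable x86 CPU; compile with -mfma. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set1_pd(2.0);
    __m256d b = _mm256_set1_pd(3.0);
    __m256d c = _mm256_set1_pd(1.0);
    __m256d r = _mm256_fmadd_pd(a, b, c);  /* r = a*b + c in one instruction */

    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]); /* 7.0 */
    return 0;
}
```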