Journal Article
Optimization of Intercache Traffic Entanglement in Tagless Caches With Tiling Opportunities
S. R. Swamy Saranam Chongala, Sumitha George, Hariram Thirucherai Govindarajan, Jagadish B. Kotra, Madhu Mutyam, John Sampson, Mahmut Kandemir, Vijaykrishnan Narayanan
TLDR
New replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching and victim tag buffer caching, are proposed to incorporate L4 eviction costs into L3 replacement decisions efficiently and to address entanglement overheads and pathologies.
Abstract:
So-called “tagless” caches have become common as a means to deal with the vast L4 last-level caches (LLCs) enabled by increasing device density, emerging memory technologies, and advanced integration capabilities (e.g., 3-D). Tagless schemes often result in intercache entanglement between the tagless cache (L4) and the cache (L3) stewarding its metadata. We explore new cache organization policies that mitigate overheads stemming from the intercache-level replacement entanglement. We incorporate support for explicit tiling shapes that can better match software access patterns to improve the spatial and temporal locality of large block allocations in many essential computational kernels. To address entanglement overheads and pathologies, we propose new replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching (RBC) and victim tag buffer caching (VBC), to incorporate L4 eviction costs into L3 replacement decisions efficiently. We evaluate our schemes on a range of linear algebra kernels that are software tiled. RBC and VBC demonstrate a reduction in memory traffic of 83/4.4/67% and 69/35.5/76% for 8/32/64 MB L4s, respectively. In addition, RBC and VBC provide speedups of 16/0.3/0.6% and 15.7/1.8/0.8%, respectively, for systems with 8/32/64 MB L4s, over a tagless cache with an LRU policy in the L3. We also show that matching the shape of the hardware allocation for each tagless region's superblocks to the access order of the software tile improves latency by 13.4% over the baseline tagless cache, with reductions in memory traffic of 51% over linear superblocks.
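For readers unfamiliar with the "software tiled" kernels the abstract refers to, the following is a minimal sketch of standard loop tiling for matrix multiplication. It is not taken from the paper and does not model the proposed hardware (RBC, VBC, or shaped superblocks); the function names `tiled_matmul` and `tile_offsets` and the tile size are assumptions chosen for illustration.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply two square matrices (lists of lists) using loop tiling.

    Illustrative only: `tile` is a hypothetical tile edge length.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # The outer three loops walk over tiles; the inner three loops touch
    # at most one tile x tile block of each matrix at a time.
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c


def tile_offsets(row0, col0, tile, n):
    """Row-major element offsets touched by one tile of an n x n matrix."""
    return [r * n + col
            for r in range(row0, row0 + tile)
            for col in range(col0, col0 + tile)]
```

In this sketch, `tile_offsets(0, 0, 2, 4)` returns `[0, 1, 4, 5]`: a 2 x 2 tile of a row-major 4 x 4 matrix touches two separate address runs rather than one contiguous range. This kind of mismatch between a tile's footprint and a linear block of addresses is the general motivation for matching allocation shape to tile access order.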
Citations
Proceedings Article
Trends and Opportunities for SRAM Based In-Memory and Near-Memory Computation
Srivatsa Srinivasa, Akshay Krishna Ramanathan, Jainaveen Sundaram, Kurian Dileep J, Srinivasan Gopal, Nilesh Jain, Anuradha Srinivasan, Ravi Iyer, Vijaykrishnan Narayanan, Tanay Karnik
TL;DR: In this article, an I-NMC accelerator is proposed for Sparse Matrix Multiplication (SMM) that can speed up index handling by 10x-60x and improve energy efficiency by 10x-70x, depending on the workload dimensions.
References
Proceedings Article
Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM
TL;DR: This work optimizes SRAM for low leakage and STT-RAM for low write energy, and a comparison of these technologies through full-system simulation shows that the proposed refresh-reduction method makes eDRAM a viable, energy-efficient technology for implementing L3Cs.
Journal Article
A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-Volatile On-Chip Caches
TL;DR: This paper surveys the architectural approaches proposed for designing memory systems and, specifically, caches with emerging memory technologies, and presents a classification of these technologies and architectural approaches based on their key characteristics.
Proceedings Article
A one transistor cell on bulk substrate (1T-Bulk) for low-cost and high density eDRAM
R. Ranica, Alexandre Villaret, Pierre Malinge, Pascale Mazoyer, Damien Lenoble, Philippe Candelier, Francois Jacquet, Pascal Masson, R. Bouchakour, Richard Fournel, J.P. Schoellkopf, Thomas Skotnicki
TL;DR: In this article, a 1T cell for high-density eDRAM has been successfully developed on bulk silicon substrate for the first time, and the integration of the memory cell in a matrix arrangement is evaluated.
Proceedings Article
Nonvolatile memory design based on ferroelectric FETs
Sumitha George, Kaisheng Ma, Ahmedullah Aziz, Xueqing Li, Asif Islam Khan, Sayeef Salahuddin, Meng-Fan Chang, Suman Datta, Jack Sampson, Sumeet Kumar Gupta, Vijaykrishnan Narayanan
TL;DR: This work proposes a 2-transistor (2T) FEFET-based nonvolatile memory with separate read and write paths that achieves non-destructive read and lower write power at iso-write speed compared to standard FE-RAM.
Proceedings Article
5.9 Haswell: A family of IA 22nm processors
Nasser A. Kurd, Muntaquim Chowdhury, Edward A. Burton, Thomas P. Thomas, Christopher P. Mozak, Brent R. Boswell, Manoj B. Lal, Anant Deval, Jonathan P. Douglas, Mahmoud Elassal, Nalamalpu Ankireddy, Timothy M. Wilson, Matthew C. Merten, Srinivas Chennupaty, Gomes Wilfred, Raghavan Kumar
TL;DR: The primary goals for the Haswell program are platform integration and low power to enable smaller form factors. The architecture includes an Intel AVX2 instruction set that supports floating-point multiply-add (FMA) and 256b SIMD integer, achieving 2× the number of floating-point and integer operations over its predecessor.