Proceedings ArticleDOI
Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache
Djordje Jevdjic, Stavros Volos, Babak Falsafi
Vol. 41, Iss. 3, pp. 404-415
TL;DR: This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.

Abstract:
Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories: block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip.

This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache, i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
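The footprint mechanism in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the class name `FootprintPredictor`, the history table indexed by (trigger PC, first-touched block offset), and the block count are all hypothetical.

```python
# Illustrative sketch of footprint prediction for a page-based DRAM cache.
# Names, table organization, and sizes are assumptions, not the paper's design.

PAGE_BLOCKS = 64  # e.g., a 4KB page managed as 64B blocks


class FootprintPredictor:
    """Predicts which blocks of a page will be touched while it is cached."""

    def __init__(self):
        # History table: (trigger PC, first-touched block offset) -> bit vector
        # of blocks observed in use during a previous residency of some page.
        self.history = {}

    def predict(self, pc, offset):
        # On a page miss, fetch only the predicted footprint; with no
        # history yet, fall back to fetching just the triggering block.
        return self.history.get((pc, offset), 1 << offset)

    def train(self, pc, offset, observed):
        # On page eviction, record which blocks were actually touched.
        self.history[(pc, offset)] = observed


def blocks_to_fetch(footprint):
    """Expand a footprint bit vector into the list of block indices to fetch."""
    return [b for b in range(PAGE_BLOCKS) if footprint & (1 << b)]
```

With no history, only the triggering block is fetched; after an eviction trains the table, a repeat miss from the same instruction fetches the learned footprint, which is how a design of this kind can keep a page-based cache's hit ratio without its off-chip bandwidth cost.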
Citations
Proceedings ArticleDOI
TOP-PIM: throughput-oriented programmable processing in memory
Dong Ping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, Michael Ignatowski
TL;DR: This work explores the use of 3D die stacking to move memory-intensive computations closer to memory and introduces a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware.
Proceedings ArticleDOI
Practical Near-Data Processing for In-Memory Analytics Frameworks
TL;DR: This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that it is critical to optimize software frameworks for spatial locality, as it leads to 2.9x efficiency improvements for NDP.
Proceedings ArticleDOI
HRL: Efficient and flexible reconfigurable logic for near-data processing
Mingyu Gao, Christos Kozyrakis
TL;DR: Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays, and achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.
Proceedings ArticleDOI
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities, and employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads.
Proceedings ArticleDOI
Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories
Mitesh R. Meswani, Sergey Blagodurov, David G. Roberts, John Slice, Mike Ignatowski, Gabriel H. Loh
TL;DR: This work explores the challenges of exposing the stacked DRAM as part of the system's physical address space, and presents an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.
References
Journal ArticleDOI
DRAMSim2: A Cycle Accurate Memory System Simulator
TL;DR: The process of validating DRAMSim2 timing against manufacturer Verilog models in an effort to prove the accuracy of simulation results is described.
Proceedings ArticleDOI
Clearing the clouds: a study of emerging scale-out workloads on modern hardware
Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, Babak Falsafi
TL;DR: This work identifies the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
Journal ArticleDOI
3D-Stacked Memory Architectures for Multi-core Processors
TL;DR: This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.
Proceedings ArticleDOI
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling
TL;DR: The Sampling Microarchitecture Simulation (SMARTS) framework enables fast and accurate performance measurements of full-length benchmarks, accelerating simulation by selectively measuring in detail only an appropriate benchmark subset.
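The sampling idea behind SMARTS can be shown with a toy sketch. The parameters and function names below are illustrative assumptions, not SMARTS's actual configuration: simulate in detail only one small measurement unit out of every sampling period, fast-forward through the rest, and report the sample mean with its standard deviation as an error proxy.

```python
import statistics


def sample_cpi(cpi_trace, unit=1000, period=10):
    """Systematic sampling: measure in detail one `unit`-instruction window
    out of every `period` windows, skipping (fast-forwarding) the rest.
    `cpi_trace` stands in for per-instruction CPI values that detailed
    simulation would produce; `unit` and `period` are illustrative."""
    samples = []
    for start in range(0, len(cpi_trace), unit * period):
        window = cpi_trace[start:start + unit]  # detailed measurement unit
        if window:
            samples.append(sum(window) / len(window))
    return samples


def estimate(samples):
    """Mean CPI estimate, with sample standard deviation as an error proxy."""
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, spread
```

In this toy setup, only one tenth of one percent of the trace is simulated in detail, yet for well-behaved workloads the sampled mean tracks the true mean, which is the core trade-off the framework quantifies rigorously.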
Proceedings ArticleDOI
Reactive NUCA: near-optimal block placement and replication in distributed caches
TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Related Papers
Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design
Moinuddin K. Qureshi, Gabriel H. Loh
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches
Gabriel H. Loh, Mark D. Hill