Proceedings ArticleDOI

Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache

TL;DR
This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.
Abstract
Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip.

This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
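To make the mechanism concrete, the following is a minimal Python sketch of the footprint idea, under simplifying assumptions rather than as the paper's actual hardware: the footprint history table below, keyed by the PC and in-page offset of the access that triggers a page allocation, and the page map standing in for the tag array are schematic stand-ins.

```python
# Illustrative sketch (not the paper's hardware): a page-granularity cache
# that fetches only a predicted subset of each page's blocks. A footprint
# history table (FHT), keyed here by the PC and in-page offset of the access
# that triggers the allocation, supplies the prediction; the footprint
# observed at eviction trains future predictions.

BLOCKS_PER_PAGE = 64            # 4 KB page of 64 B blocks

class PageEntry:
    def __init__(self, key, fetched):
        self.key = key          # (pc, offset) that triggered the allocation
        self.fetched = fetched  # bitmask of blocks brought on-chip
        self.touched = 0        # bitmask of blocks actually accessed

class FootprintCache:
    def __init__(self):
        self.pages = {}         # page number -> PageEntry (stands in for tags)
        self.fht = {}           # (pc, offset) -> predicted footprint bitmask

    def access(self, page, block, pc):
        entry = self.pages.get(page)
        if entry is None:       # page miss: allocate, fetch predicted footprint
            key = (pc, block)
            predicted = self.fht.get(key, 0) | (1 << block)
            entry = self.pages[page] = PageEntry(key, predicted)
        elif not (entry.fetched >> block) & 1:
            entry.fetched |= 1 << block    # block miss within a resident page
        entry.touched |= 1 << block

    def evict(self, page):
        entry = self.pages.pop(page)       # train the predictor with reality
        self.fht[entry.key] = entry.touched

cache = FootprintCache()
cache.access(page=7, block=3, pc=0x400)
cache.access(page=7, block=4, pc=0x404)
cache.evict(7)                  # FHT now maps (0x400, 3) -> blocks {3, 4}
```

Fetching only the predicted bitmask on a page miss is what removes the excess off-chip traffic of page-based designs, while tracking residency at page granularity preserves the small tag array and fast lookup.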


Citations
Proceedings ArticleDOI

TOP-PIM: throughput-oriented programmable processing in memory

TL;DR: This work explores the use of 3D die stacking to move memory-intensive computations closer to memory and introduces a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware.
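The analytical methodology invites a back-of-the-envelope illustration. The sketch below is a hypothetical roofline-style estimate, not the paper's actual model: it takes per-kernel operation and traffic counts (of the kind obtainable from GPU profiling counters) and bounds execution time by whichever of compute throughput or memory bandwidth binds. All parameter values are made up.

```python
# Hypothetical roofline-style estimate (not the paper's model): predict a
# kernel's runtime on an in-memory processor from counts measured on a GPU.

def predicted_pim_time(flops, bytes_moved,
                       pim_gflops=250.0,    # assumed PIM compute throughput
                       pim_gbps=640.0):     # assumed in-stack bandwidth
    compute_s = flops / (pim_gflops * 1e9)        # compute-throughput bound
    memory_s = bytes_moved / (pim_gbps * 1e9)     # memory-bandwidth bound
    return max(compute_s, memory_s)               # the binding constraint wins

print(predicted_pim_time(flops=2e12, bytes_moved=1e12))
```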
Proceedings ArticleDOI

Practical Near-Data Processing for In-Memory Analytics Frameworks

TL;DR: This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks, and shows that optimizing software frameworks for spatial locality is critical, yielding 2.9x efficiency improvements for NDP.
Proceedings ArticleDOI

HRL: Efficient and flexible reconfigurable logic for near-data processing

TL;DR: This paper presents Heterogeneous Reconfigurable Logic (HRL), a reconfigurable array for NDP systems that improves on both FPGA and CGRA arrays and achieves 92% of the peak performance of an NDP system based on custom accelerators for each application.
Proceedings ArticleDOI

Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities, and employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads.
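A minimal sketch of the tags-in-DRAM idea follows, with an assumed geometry (the set count, associativity, and row layout are illustrative, not Unison Cache's actual organization): each set's tags live in the same stacked-DRAM row as its data, so a single row activation serves both the tag check and, on a hit, the data itself, and no dedicated SRAM tag array limits capacity.

```python
# Illustrative sketch of tags co-located with data in stacked DRAM.
# Geometry below is assumed, not Unison Cache's actual organization.

from dataclasses import dataclass, field

PAGE_BITS = 12                  # 4 KB allocation units
NUM_SETS = 1 << 14              # hypothetical cache geometry
WAYS = 4

@dataclass
class DramRow:                  # one row holds a whole set: tags plus data
    tags: list = field(default_factory=lambda: [None] * WAYS)
    data: list = field(default_factory=lambda: [None] * WAYS)

dram = [DramRow() for _ in range(NUM_SETS)]   # no separate SRAM tag array

def lookup(paddr):
    page = paddr >> PAGE_BITS
    set_index, tag = page % NUM_SETS, page // NUM_SETS
    row = dram[set_index]       # one activation fetches tags and data together
    for way, stored in enumerate(row.tags):
        if stored == tag:
            return row.data[way]   # hit: data is already in the row buffer
    return None                    # miss: go off-chip, then install the page
```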
Proceedings ArticleDOI

Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories

TL;DR: This work explores the challenges of exposing the stacked DRAM as part of the system's physical address space, and presents an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.
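As a rough illustration of the software side, here is a hypothetical hot-page rebalancing policy, not the paper's algorithm: each epoch it keeps the most-accessed pages in the stacked tier and demotes the rest, with the actual page copies and page-table updates left out.

```python
# Hypothetical hot-page migration policy (not the paper's mechanism):
# keep the globally hottest pages in the stacked (fast) tier each epoch.

from collections import Counter

STACKED_CAPACITY_PAGES = 4      # tiny capacity, for illustration only

def rebalance(access_counts: Counter, in_stacked: set) -> set:
    hottest = {p for p, _ in access_counts.most_common(STACKED_CAPACITY_PAGES)}
    promote = hottest - in_stacked      # copy these into stacked DRAM
    demote = in_stacked - hottest       # copy these back off-package
    # A real system would issue the copies and remap page-table entries here.
    return (in_stacked - demote) | promote

counts = Counter({1: 90, 2: 80, 3: 5, 4: 70, 5: 60, 6: 1})
print(rebalance(counts, in_stacked={3, 6}))   # {1, 2, 4, 5}
```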
References
Journal ArticleDOI

DRAMSim2: A Cycle Accurate Memory System Simulator

TL;DR: This paper describes the process of validating DRAMSim2's timing against manufacturer Verilog models to demonstrate the accuracy of its simulation results.
Proceedings ArticleDOI

Clearing the clouds: a study of emerging scale-out workloads on modern hardware

TL;DR: This work identifies the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
Journal ArticleDOI

3D-Stacked Memory Architectures for Multi-core Processors

TL;DR: This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.
Proceedings ArticleDOI

SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

TL;DR: This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework, which enables fast and accurate performance measurement of full-length benchmarks by selectively simulating in detail only an appropriately chosen subset of each benchmark.
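The sampling idea is easy to see in miniature. The sketch below is a simplified illustration of systematic sampling with a normal-approximation confidence interval, not the SMARTS implementation (which also functionally warms microarchitectural state before each detailed measurement unit).

```python
# Simplified illustration of SMARTS-style systematic sampling: simulate small
# measurement units in detail at a fixed period, skip the rest, and report
# the sampled mean CPI with a 95% confidence interval.

import math
import random

def sample_cpi(cpi_trace, period=1000, unit=10):
    means = []
    for start in range(0, len(cpi_trace) - unit, period):
        window = cpi_trace[start:start + unit]   # detailed-simulation window
        means.append(sum(window) / unit)
    n = len(means)
    mean = sum(means) / n
    var = sum((m - mean) ** 2 for m in means) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)       # normal approximation
    return mean, half_width

trace = [1.0 + 0.3 * random.random() for _ in range(1_000_000)]
print(sample_cpi(trace))    # close to the true mean CPI of ~1.15
```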
Proceedings ArticleDOI

Reactive NUCA: near-optimal block placement and replication in distributed caches

TL;DR: This paper proposes Reactive NUCA (R-NUCA), a distributed cache design that reacts to the class of each cache access and places blocks at the appropriate location in the cache.
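As a rough sketch of the class-based policy (the cluster size and interleaving below are schematic assumptions, not R-NUCA's exact indexing): private data goes to the requesting core's local cache slice, shared data is address-interleaved across all slices as a single copy, and instructions are replicated within small clusters of slices.

```python
# Schematic sketch of class-based block placement in a distributed cache
# (cluster size and interleaving are assumptions, not R-NUCA's indexing).

def place(block_addr, access_class, core_id, num_slices=16, cluster=4):
    if access_class == "private":       # local slice: minimal access latency
        return core_id
    if access_class == "instruction":   # replicate within a nearby cluster
        base = (core_id // cluster) * cluster
        return base + block_addr % cluster
    # shared data: one copy, address-interleaved across all slices
    return block_addr % num_slices

print(place(0xdead, "instruction", core_id=5))   # a slice in cluster 4..7
```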