Showing papers by "Todd C. Mowry" published in 2013


Proceedings ArticleDOI
07 Dec 2013
TL;DR: RowClone is proposed, a new and simple mechanism to perform bulk copy and initialization completely within DRAM — eliminating the need to transfer any data over the memory channel to perform such operations.
Abstract: Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform such operations. As a result, bulk data operations consume high latency, bandwidth, and energy — degrading both system performance and energy efficiency. In this work, we propose RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM — eliminating the need to transfer any data over the memory channel to perform such operations. Our key observation is that DRAM can internally and efficiently transfer a large quantity of data (multiple KBs) between a row of DRAM cells and the associated row buffer. Based on this, our primary mechanism can quickly copy an entire row of data from a source row to a destination row by first copying the data from the source row to the row buffer and then from the row buffer to the destination row, via two back-to-back activate commands. This mechanism, which we call the Fast Parallel Mode of RowClone, reduces the latency and energy consumption of a 4KB bulk copy operation by 11.6× and 74.4×, respectively, and a 4KB bulk zeroing operation by 6.0× and 41.5×, respectively. To efficiently copy data between rows that do not share a row buffer, we propose a second mode of RowClone, the Pipelined Serial Mode, which uses the shared internal bus of a DRAM chip to quickly copy data between two banks. RowClone requires only a 0.01% increase in DRAM chip area. We quantitatively evaluate the benefits of RowClone by focusing on fork, one of the frequently invoked system calls, and five other copy and initialization intensive applications. Our results show that RowClone can significantly improve both single-core and multi-core system performance, while also significantly reducing main memory bandwidth and energy consumption.

385 citations
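
The Fast Parallel Mode described in the RowClone abstract above (source row to row buffer, then row buffer to destination row, via two back-to-back activate commands) can be illustrated with a toy software model. This is only a sketch under stated assumptions: `ToyBank`, `activate`, `precharge`, and `fpm_copy` are hypothetical names chosen for illustration, and the model captures data movement only, not DRAM timing, energy, or circuit behavior.

```python
class ToyBank:
    """Toy model (hypothetical) of DRAM rows that share one row buffer."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = bytearray(row_bytes)
        self.buffer_valid = False  # True while the row buffer holds latched data

    def activate(self, row):
        if self.buffer_valid:
            # Back-to-back activate with no precharge in between:
            # the latched buffer contents are written into the newly opened row.
            self.rows[row][:] = self.row_buffer
        else:
            # First activate: the whole row is transferred into the row buffer.
            self.row_buffer[:] = self.rows[row]
            self.buffer_valid = True

    def precharge(self):
        # Close the bank so the next activate starts from a clean state.
        self.buffer_valid = False


def fpm_copy(bank, src, dst):
    """Copy one full row inside the bank using two back-to-back activates."""
    bank.activate(src)  # source row -> row buffer
    bank.activate(dst)  # row buffer -> destination row
    bank.precharge()


bank = ToyBank(num_rows=4, row_bytes=8)
bank.rows[1][:] = b"ABCDEFGH"
fpm_copy(bank, src=1, dst=3)
print(bytes(bank.rows[3]))
```

Note that no data crosses a memory channel in this model; the copy is expressed purely as two row-level operations, which is the property the abstract emphasizes.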


Proceedings ArticleDOI
07 Dec 2013
TL;DR: It is shown that any compression algorithm can be adapted to fit the requirements of LCP, and two previously proposed compression algorithms are adapted to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression.
Abstract: Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient. By leveraging the key insight that all cache lines within a page should be compressed to the same size, this paper proposes a new approach to main memory compression — Linearly Compressed Pages (LCP) — that avoids the performance degradation problem without requiring costly or energy-inefficient hardware. We show that any compression algorithm can be adapted to fit the requirements of LCP, and we specifically adapt two previously-proposed compression algorithms to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression. Evaluations using benchmarks from SPEC CPU2006 and five server benchmarks show that our approach can significantly increase the effective memory capacity (by 69% on average). In addition to the capacity gains, we evaluate the benefit of transferring consecutive compressed cache lines between the memory controller and main memory. Our new mechanism considerably reduces the memory bandwidth requirements of most of the evaluated benchmarks (by 24% on average), and improves overall performance (by 6.1%/13.9%/10.7% for single-/two-/four-core workloads on average) compared to a baseline system that does not employ main memory compression. LCP also decreases energy consumed by the main memory subsystem (by 9.5% on average over the best prior mechanism).

153 citations
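
The key insight in the LCP abstract above is that compressing every cache line in a page to the same size makes locating a line a single multiply-and-add, rather than a computation over the sizes of all preceding lines. The sketch below contrasts the two cases. It is a simplified illustration, not the paper's full design: the function names, the page base addresses, and the example sizes are assumptions, and any additional page layout details that LCP may need are not modeled.

```python
def lcp_line_address(page_base, line_index, compressed_line_bytes):
    """Fixed-size compressed lines: the address is a linear function of the index."""
    return page_base + line_index * compressed_line_bytes


def variable_size_line_address(page_base, line_index, line_sizes):
    """Variable-size compressed lines: all preceding sizes must be summed first."""
    return page_base + sum(line_sizes[:line_index])


# Every line compressed to 16 bytes (assumed value): one multiply and one add.
print(hex(lcp_line_address(0x1000, 5, 16)))

# Per-line sizes differ (assumed values): locating line 3 needs a running sum.
print(hex(variable_size_line_address(0x2000, 3, [8, 64, 16, 32])))
```

The second function is the kind of non-trivial lookup the abstract says a memory controller would otherwise have to perform on each access.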


01 Jan 2013
TL;DR: The key idea is to utilize and extend the row-granularity data transfer to quickly initialize or move data one row at a time within a DRAM chip; this new mechanism is called RowClone.
Abstract: Many programs initialize or copy large amounts of memory data. Initialization and copying are forms of memory operations that do not require computation in order to derive their data-values – they either deal with known data-values (e.g., initialize to zero) or simply move data-values that already exist elsewhere in memory (e.g., copy). Therefore, initialization/copying can potentially be performed entirely within the main memory subsystem without involving the processor or the DMA engines. Unfortunately, existing main memory subsystems are unable to take advantage of this fact. Instead, they unnecessarily transfer large amounts of data between main memory and the processor (or the DMA engine) – thereby consuming large amounts of latency, bandwidth, and energy. In this paper, we make the key observation that DRAM chips – the predominant substrate for main memory – already have the capability to transfer large amounts of data within themselves. Internally, a DRAM chip consists of rows of bits and a row-buffer. To access data from any portion of a particular row, the DRAM chip transfers the entire row (e.g., 4 Kbits) into the equally-sized row-buffer, and vice versa. While this internal data-transfer between a row and the row-buffer occurs in bulk (i.e., all 4 Kbits at once), an external data-transfer (to/from the processor) is severely serialized due to the very narrow width of the DRAM chip’s data-pins (e.g., 8 bits). Our key idea is to utilize and extend the row-granularity data-transfer in order to quickly initialize or move data one row at a time within a DRAM chip. We call this new mechanism RowClone. By making several relatively unintrusive changes to DRAM chip design (0.026% die-size increase), we accelerate a one-page (4 KByte) copying operation by 11.5x, and a one-page zeroing operation by 5.8x, while also conserving memory bandwidth. In addition, we achieve large energy reductions – 41.5x/74.4x energy reductions for one-page zeroing/copying, respectively. We show that RowClone improves performance on an 8-core system by 27% averaged across 64 copy/initialization-intensive workloads.
Draft dated April 25, 2013.

27 citations
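
The serialization contrast in the abstract above (a 4 Kbit row moved to the row-buffer all at once, versus 8-bit-wide data pins for external transfers) can be made concrete with a small calculation. This is a back-of-the-envelope sketch: the function name is hypothetical, 1 Kbit is assumed to mean 1024 bits, and the result counts pin-width transfers only, not actual latency.

```python
def external_transfers(row_bits, pin_width_bits):
    """Number of pin-width transfers needed to move one row off the chip."""
    # Ceiling division, in case the row size is not a multiple of the pin width.
    return -(-row_bits // pin_width_bits)


ROW_BITS = 4 * 1024   # "4 Kbits" example from the abstract (1 Kbit = 1024 bits assumed)
PIN_WIDTH_BITS = 8    # "8 bits" data-pin example from the abstract

print(external_transfers(ROW_BITS, PIN_WIDTH_BITS))  # 512
```

Under these example figures, one internal row transfer corresponds to 512 sequential transfers over the data pins, which is the gap the row-granularity mechanism aims to avoid.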