Open Access Dissertation (DOI)

Practical Mechanisms for Reducing Processor–Memory Data Movement in Modern Workloads

TLDR
In this thesis, the author proposes a series of practical mechanisms, including a coherence mechanism for near-data processing (NDP), to reduce data movement between the memory system and computation units in modern workloads.
Abstract
Data movement between the memory system and computation units is one of the most critical challenges in designing high-performance and energy-efficient computing systems. The high cost of data movement is forcing architects to rethink the fundamental design of computer systems. Recent advances in memory design enable the opportunity for architects to avoid unnecessary data movement by performing processing-in-memory (PIM), also known as near-data processing (NDP). While PIM can allow many data-intensive applications to avoid moving data from memory to the CPU, it introduces new challenges for system architects and programmers. Our goal in this thesis is to make PIM effective and practical in conventional computing systems. Toward this end, this thesis presents three major directions: (1) examining the suitability of PIM across key workloads, (2) addressing major system challenges for adopting PIM in computing systems, and (3) redesigning applications to be aware of PIM capability. In line with these three major directions, we propose a series of practical mechanisms to reduce processor–memory data movement in modern workloads.

First, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads. We find that PIM can significantly reduce data movement for all of these workloads by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement and significantly reduces total system energy and execution time.
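The offload decision described above boils down to a simple trade-off: offloading a primitive helps when the off-chip movement energy it avoids outweighs any extra compute energy paid by the simpler PIM logic. A back-of-the-envelope sketch of that trade-off follows; all function names and energy constants here are illustrative assumptions, not measurements from the thesis.

```python
def data_movement_energy(bytes_moved, pj_per_byte=60.0):
    """Energy (picojoules) to move data over the off-chip memory channel.

    The per-byte cost is an assumed illustrative constant, not a measured value.
    """
    return bytes_moved * pj_per_byte


def offload_wins(bytes_touched, cpu_compute_pj, pim_compute_pj):
    """Offloading pays off when the movement energy saved by computing
    in memory exceeds the extra compute energy of the PIM logic."""
    saved = data_movement_energy(bytes_touched)       # movement avoided
    extra = pim_compute_pj - cpu_compute_pj           # cost of simpler logic
    return saved > extra
```

For a memcopy-like primitive that touches megabytes of data but does trivial compute, the saved-movement term dominates, which is why such primitives are attractive offload targets even under tight area and power budgets.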
Second, we address one of the key system challenges for communication with PIM logic by proposing efficient cache coherence support for near-data accelerators (NDAs). We find that enforcing coherence with the rest of the system, which is already a major challenge for on-chip accelerators, becomes more difficult for NDAs. This is because (1) the cost of communication between NDAs and CPUs is high, and (2) NDA applications generate a large amount of off-chip data movement. As a result, as we show in this work, existing coherence mechanisms eliminate most of the benefits of NDAs. Based on our observations, we propose CoNDA, a coherence mechanism that lets an NDA optimistically execute an NDA kernel, under the assumption that the NDA has all necessary coherence permissions. This optimistic execution allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system. CoNDA exploits this information to avoid performing unnecessary coherence requests, and thus significantly reduces data movement for coherence. We show that CoNDA significantly improves performance and reduces energy consumption compared to prior coherence mechanisms.

Third, we propose a PIM-aware hardware–software co-design approach for edge machine learning (ML) accelerators to enable energy-efficient and high-performance inference execution. We analyze a commercial Edge TPU (tensor processing unit) using 24 Google edge neural network (NN) models (including CNNs, LSTMs, transducers, and RCNNs), and find that the accelerator suffers from three shortcomings, in terms of computational throughput, energy efficiency, and memory access handling.
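CoNDA's optimistic execution, described above, can be pictured as a small conflict-detection loop: the NDA runs a kernel without upfront coherence traffic, records the addresses it touches, and at the end either commits its writes (no conflict with CPU-dirtied data) or discards them for re-execution. The sketch below is a simplified software analogy of that idea; the function name and data structures are invented for illustration and do not reflect CoNDA's actual hardware signatures.

```python
def run_kernel_optimistically(kernel, memory, cpu_dirty):
    """Execute an NDA kernel without upfront coherence requests, recording
    the addresses it touches; commit only if no conflict with CPU writes.

    kernel    : list of (op, addr, value) tuples, op in {"read", "write"}
    memory    : dict mapping address -> value (shared main memory)
    cpu_dirty : set of addresses the CPU wrote during the kernel's execution
    """
    read_sig, write_sig = set(), set()   # addresses touched by the NDA
    staged = {}                          # NDA writes held back until commit
    for op, addr, value in kernel:
        if op == "read":
            read_sig.add(addr)
            _ = staged.get(addr, memory.get(addr))  # NDA sees its own writes
        else:
            write_sig.add(addr)
            staged[addr] = value
    # Commit check: did the CPU dirty anything the NDA read or wrote?
    if (read_sig | write_sig) & cpu_dirty:
        return False                     # conflict: discard and re-execute
    memory.update(staged)                # no conflict: commit all at once
    return True
```

Because conflicts between CPU and NDA working sets are rare in practice, most kernels commit on the first try, so the coherence traffic collapses to one compressed exchange per kernel instead of per-cache-line requests.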
We comprehensively study the characteristics of each NN layer in all of the Google edge models, and find that these shortcomings arise from the one-size-fits-all approach of the accelerator, as there is a high amount of heterogeneity in key layer characteristics both across different models and across different layers in the same model. To combat this inefficiency, we propose a new acceleration framework called Mensa. Mensa incorporates multiple heterogeneous ML edge accelerators (including both on-chip and near-data accelerators), each of which caters to the characteristics of a particular subset of models. At runtime, Mensa schedules each layer to run on the best-suited accelerator, accounting for both efficiency and inter-layer dependencies. We show that Mensa significantly improves inference energy efficiency and throughput, while reducing hardware cost and improving area efficiency over the Edge TPU and Eyeriss v2, two state-of-the-art edge ML accelerators.

Lastly, we propose to redesign emerging modern hybrid databases to be aware of PIM capability, to enable real-time analysis. Hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant drops in transactional and/or analytical throughput compared to performing only transactions or only analytics in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation, and (3) consistency costs. We propose Polynesia, a hardware–software co-designed system for in-memory HTAP databases.
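Mensa's runtime decision described above — dispatch each layer, in dependency order, to the accelerator whose design matches that layer's characteristics — can be sketched as a simple classify-and-dispatch loop. The accelerator names, the reuse threshold, and the single `flops_per_byte` feature below are invented for illustration; Mensa's real scheduler considers richer layer characteristics.

```python
# Hypothetical accelerator pool: a compute-centric on-chip accelerator for
# high-data-reuse layers, and a near-data accelerator for bandwidth-bound ones.
ACCELERATORS = {
    "on_chip_compute": {"good_for": "high_reuse"},
    "near_data":       {"good_for": "low_reuse"},
}

def schedule(layers):
    """Assign each layer to the best-suited accelerator.

    Layers are processed in the order given, which preserves inter-layer
    (sequential) dependencies in this simplified model.
    """
    plan = []
    for layer in layers:
        # Crude arithmetic-intensity test: MAC-heavy layers reuse data a lot;
        # layers dominated by parameter traffic (e.g. LSTM gates) do not.
        kind = "high_reuse" if layer["flops_per_byte"] > 10 else "low_reuse"
        target = next(name for name, acc in ACCELERATORS.items()
                      if acc["good_for"] == kind)
        plan.append((layer["name"], target))
    return plan
```

The point of the heterogeneity finding is visible even in this toy: a convolution and an LSTM gate in the same model land on different accelerators, which a one-size-fits-all design cannot do.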
Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements custom algorithms and hardware to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement. We show that Polynesia significantly outperforms three state-of-the-art HTAP systems and reduces energy consumption.
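Polynesia's island split described above can be pictured as two loosely coupled replicas: a row-oriented transactional side that appends its updates to a log, and a column-oriented analytical side (the PIM-resident island) that applies the log before scanning. The class and method names below are invented for illustration and stand in for Polynesia's custom hardware update-propagation and consistency mechanisms.

```python
class TransactionalIsland:
    """Row-oriented store optimized for point updates (transactions)."""
    def __init__(self):
        self.rows = {}
        self.update_log = []          # updates pending propagation to analytics

    def put(self, key, value):
        self.rows[key] = value
        self.update_log.append((key, value))


class AnalyticalIsland:
    """Column-oriented replica; in Polynesia this lives near memory (PIM)."""
    def __init__(self):
        self.columns = {}

    def apply(self, log):
        """Drain the transactional island's log before running analytics."""
        for key, value in log:
            self.columns[key] = value
        log.clear()

    def scan_sum(self):
        # Full-column scan: the kind of bandwidth-bound work PIM accelerates.
        return sum(self.columns.values())
```

Keeping the two islands separate lets transactions proceed without stalling on analytical scans, while batched log application bounds the update-propagation and consistency costs that cripple conventional HTAP designs.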


Citations
Journal Article (DOI)

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

TL;DR: In this paper, the authors perform a large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory.
Proceedings Article (DOI)

Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design

TL;DR: Polynesia is proposed, a hardware–software co-designed system for in-memory HTAP databases that avoids the large throughput losses of traditional HTAP systems and reduces energy consumption by 48% over the prior lowest-energy HTAP system.
Journal Article (DOI)

DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips

TL;DR: DRAM Bender as discussed by the authors is an FPGA-based infrastructure that enables experimental studies on state-of-the-art DRAM chips and exposes easy-to-use C++ and Python programming interfaces.