Open Access Dissertation (DOI)

Practical Mechanisms for Reducing Processor–Memory Data Movement in Modern Workloads

TLDR
In this thesis, the author proposes a series of practical mechanisms, including a coherence mechanism for near-data processing (NDP), to reduce data movement between the memory system and computation units in modern workloads.
Abstract
Data movement between the memory system and computation units is one of the most critical challenges in designing high-performance and energy-efficient computing systems. The high cost of data movement is forcing architects to rethink the fundamental design of computer systems. Recent advances in memory design enable the opportunity for architects to avoid unnecessary data movement by performing processing-in-memory (PIM), also known as near-data processing (NDP). While PIM can allow many data-intensive applications to avoid moving data from memory to the CPU, it introduces new challenges for system architects and programmers. Our goal in this thesis is to make PIM effective and practical in conventional computing systems. Toward this end, this thesis presents three major directions: (1) examining the suitability of PIM across key workloads, (2) addressing major system challenges for adopting PIM in computing systems, and (3) redesigning applications to be aware of PIM capability. In line with these three major directions, we propose a series of practical mechanisms to reduce processor–memory data movement in modern workloads.

First, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads. We find that PIM can significantly reduce data movement for all of these workloads by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute to a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement and significantly reduces total system energy and execution time.
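The offload decision described above boils down to a simple trade-off: offloading a primitive helps when the off-chip movement energy it avoids outweighs any extra compute energy paid by the simpler PIM logic. A back-of-the-envelope sketch of that trade-off follows; all function names and energy constants here are illustrative assumptions, not measurements from the thesis.

```python
def data_movement_energy(bytes_moved, pj_per_byte=60.0):
    """Energy (picojoules) to move data over the off-chip memory channel.

    The per-byte cost is an assumed illustrative constant, not a measured value.
    """
    return bytes_moved * pj_per_byte


def offload_wins(bytes_touched, cpu_compute_pj, pim_compute_pj):
    """Offloading pays off when the movement energy saved by computing
    in memory exceeds the extra compute energy of the PIM logic."""
    saved = data_movement_energy(bytes_touched)       # movement avoided
    extra = pim_compute_pj - cpu_compute_pj           # cost of simpler logic
    return saved > extra
```

For a memcopy-like primitive that touches megabytes of data but does trivial compute, the saved-movement term dominates, which is why such primitives are attractive offload targets even under tight area and power budgets.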
Second, we address one of the key system challenges for communication with PIM logic by proposing efficient cache coherence support for near-data accelerators (NDAs). We find that enforcing coherence with the rest of the system, which is already a major challenge for on-chip accelerators, becomes more difficult for NDAs. This is because (1) the cost of communication between NDAs and CPUs is high, and (2) NDA applications generate a large amount of off-chip data movement. As a result, as we show in this work, existing coherence mechanisms eliminate most of the benefits of NDAs. Based on our observations, we propose CoNDA, a coherence mechanism that lets an NDA optimistically execute an NDA kernel, under the assumption that the NDA has all necessary coherence permissions. This optimistic execution allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system. CoNDA exploits this information to avoid performing unnecessary coherence requests, and thus significantly reduces data movement for coherence. We show that CoNDA significantly improves performance and reduces energy consumption compared to prior coherence mechanisms.

Third, we propose a PIM-aware hardware–software co-design approach for edge machine learning (ML) accelerators to enable energy-efficient and high-performance inference execution. We analyze a commercial Edge TPU (tensor processing unit) using 24 Google edge neural network (NN) models (including CNNs, LSTMs, transducers, and RCNNs), and find that the accelerator suffers from three shortcomings, in terms of computational throughput, energy efficiency, and memory access handling.
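CoNDA's optimistic execution, described above, can be pictured as a small conflict-detection loop: the NDA runs a kernel without upfront coherence traffic, records the addresses it touches, and at the end either commits its writes (no conflict with CPU-dirtied data) or discards them for re-execution. The sketch below is a simplified software analogy of that idea; the function name and data structures are invented for illustration and do not reflect CoNDA's actual hardware signatures.

```python
def run_kernel_optimistically(kernel, memory, cpu_dirty):
    """Execute an NDA kernel without upfront coherence requests, recording
    the addresses it touches; commit only if no conflict with CPU writes.

    kernel    : list of (op, addr, value) tuples, op in {"read", "write"}
    memory    : dict mapping address -> value (shared main memory)
    cpu_dirty : set of addresses the CPU wrote during the kernel's execution
    """
    read_sig, write_sig = set(), set()   # addresses touched by the NDA
    staged = {}                          # NDA writes held back until commit
    for op, addr, value in kernel:
        if op == "read":
            read_sig.add(addr)
            _ = staged.get(addr, memory.get(addr))  # NDA sees its own writes
        else:
            write_sig.add(addr)
            staged[addr] = value
    # Commit check: did the CPU dirty anything the NDA read or wrote?
    if (read_sig | write_sig) & cpu_dirty:
        return False                     # conflict: discard and re-execute
    memory.update(staged)                # no conflict: commit all at once
    return True
```

Because conflicts between CPU and NDA working sets are rare in practice, most kernels commit on the first try, so the coherence traffic collapses to one compressed exchange per kernel instead of per-cache-line requests.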
We comprehensively study the characteristics of each NN layer in all of the Google edge models, and find that these shortcomings arise from the one-size-fits-all approach of the accelerator, as there is a high amount of heterogeneity in key layer characteristics both across different models and across different layers in the same model. To combat this inefficiency, we propose a new acceleration framework called Mensa. Mensa incorporates multiple heterogeneous ML edge accelerators (including both on-chip and near-data accelerators), each of which caters to the characteristics of a particular subset of models. At runtime, Mensa schedules each layer to run on the best-suited accelerator, accounting for both efficiency and inter-layer dependencies. We show that Mensa significantly improves inference energy efficiency and throughput, while reducing hardware cost and improving area efficiency over the Edge TPU and Eyeriss v2, two state-of-the-art edge ML accelerators.

Lastly, we propose to redesign emerging modern hybrid databases to be aware of PIM capability, to enable real-time analysis. Hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant drops in transactional and/or analytical throughput compared to performing only transactions or only analytics in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation, and (3) consistency costs. We propose Polynesia, a hardware–software co-designed system for in-memory HTAP databases.
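Mensa's runtime decision described above — dispatch each layer, in dependency order, to the accelerator whose design matches that layer's characteristics — can be sketched as a simple classify-and-dispatch loop. The accelerator names, the reuse threshold, and the single `flops_per_byte` feature below are invented for illustration; Mensa's real scheduler considers richer layer characteristics.

```python
# Hypothetical accelerator pool: a compute-centric on-chip accelerator for
# high-data-reuse layers, and a near-data accelerator for bandwidth-bound ones.
ACCELERATORS = {
    "on_chip_compute": {"good_for": "high_reuse"},
    "near_data":       {"good_for": "low_reuse"},
}

def schedule(layers):
    """Assign each layer to the best-suited accelerator.

    Layers are processed in the order given, which preserves inter-layer
    (sequential) dependencies in this simplified model.
    """
    plan = []
    for layer in layers:
        # Crude arithmetic-intensity test: MAC-heavy layers reuse data a lot;
        # layers dominated by parameter traffic (e.g. LSTM gates) do not.
        kind = "high_reuse" if layer["flops_per_byte"] > 10 else "low_reuse"
        target = next(name for name, acc in ACCELERATORS.items()
                      if acc["good_for"] == kind)
        plan.append((layer["name"], target))
    return plan
```

The point of the heterogeneity finding is visible even in this toy: a convolution and an LSTM gate in the same model land on different accelerators, which a one-size-fits-all design cannot do.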
Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements custom algorithms and hardware to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement. We show that Polynesia significantly outperforms three state-of-the-art HTAP systems and reduces energy consumption.
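Polynesia's island split described above can be pictured as two loosely coupled replicas: a row-oriented transactional side that appends its updates to a log, and a column-oriented analytical side (the PIM-resident island) that applies the log before scanning. The class and method names below are invented for illustration and stand in for Polynesia's custom hardware update-propagation and consistency mechanisms.

```python
class TransactionalIsland:
    """Row-oriented store optimized for point updates (transactions)."""
    def __init__(self):
        self.rows = {}
        self.update_log = []          # updates pending propagation to analytics

    def put(self, key, value):
        self.rows[key] = value
        self.update_log.append((key, value))


class AnalyticalIsland:
    """Column-oriented replica; in Polynesia this lives near memory (PIM)."""
    def __init__(self):
        self.columns = {}

    def apply(self, log):
        """Drain the transactional island's log before running analytics."""
        for key, value in log:
            self.columns[key] = value
        log.clear()

    def scan_sum(self):
        # Full-column scan: the kind of bandwidth-bound work PIM accelerates.
        return sum(self.columns.values())
```

Keeping the two islands separate lets transactions proceed without stalling on analytical scans, while batched log application bounds the update-propagation and consistency costs that cripple conventional HTAP designs.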


Citations
Journal Article (DOI)

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

TL;DR: In this paper, the authors perform a large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory.
Proceedings Article (DOI)

Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design

TL;DR: Polynesia is proposed, a hardware–software co-designed system for in-memory HTAP databases that avoids the large throughput losses of traditional HTAP systems and reduces energy consumption by 48% over the prior lowest-energy HTAP system.
Journal Article (DOI)

DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips

TL;DR: DRAM Bender as discussed by the authors is an FPGA-based infrastructure that enables experimental studies on state-of-the-art DRAM chips and exposes easy-to-use C++ and Python programming interfaces.