A scalable processing-in-memory accelerator for parallel graph processing (2015) | Junwhan Ahn

Citations


Proceedings Article

Memory Coalescing for Hybrid Memory Cube

Xi Wang¹, John D. Leidel¹, Yong Chen¹

Texas Tech University¹

13 Aug 2018

TL;DR: A novel memory coalescer methodology is introduced that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC.


Abstract: Arguably, many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when applications exhibit non-contiguous, irregular, and small memory access patterns. The long memory access latency can dramatically slow down the overall performance of applications. The growing demand for high memory bandwidth and low-latency access has stimulated the advent of novel 3D-stacked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth compared with the conventional JEDEC DDR devices. Even though many existing studies have been devoted to achieving high bandwidth throughput of HMC, the bandwidth potential cannot be fully exploited due to the lack of highly efficient memory coalescing and interfacing methodology for HMC devices. In this research, we introduce a novel memory coalescer methodology that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% of memory accesses to HMC and improves the overall performance by 13.14% on average.
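As a rough illustration of request coalescing in general (not the coalescer design evaluated in this paper), the Python sketch below merges small accesses that fall in the same aligned block into one larger request. The 64-byte block size and the names `coalesce` and `eliminated_fraction` are assumptions chosen for illustration only.

```python
def coalesce(addresses, block_size=64):
    """Group byte addresses by the aligned block that contains them.

    Returns a dict mapping each block base address to the original
    addresses it covers; one memory request per key would be issued.
    block_size is an assumed value, not taken from the paper.
    """
    blocks = {}
    for addr in addresses:
        base = addr - (addr % block_size)  # align down to the block boundary
        blocks.setdefault(base, []).append(addr)
    return blocks


def eliminated_fraction(addresses, block_size=64):
    """Fraction of the original requests removed by coalescing."""
    if not addresses:
        return 0.0
    merged = coalesce(addresses, block_size)
    return 1.0 - len(merged) / len(addresses)


if __name__ == "__main__":
    trace = [0, 8, 16, 64, 72, 200]  # hypothetical small accesses
    print(sorted(coalesce(trace)))
    print(eliminated_fraction(trace))
```

In this toy trace, accesses that share a block collapse into a single request, so the fraction eliminated depends entirely on how much spatial locality the trace has.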


4 citations

Cites background from "A scalable processing-in-memory acc..."

...Given that many data-intensive applications rarely achieve the desired performance with traditional architectures, research and development efforts for efficiently handling massive memory accesses have drawn increasing attention in recent years [8, 41]....

Posted Content

The Bitlet Model: A Parameterized Analytical Model to Compare PIM and CPU Systems.


Ronny Ronen, Adi Eliahu, Orian Leitersdorf, Natan Peled, Kunal Korgaonkar, Anupam Chattopadhyay, Ben Perach, Shahar Kvatinsky

21 Jul 2021-arXiv: Hardware Architecture

TL;DR: Bitlet is an analytical modeling tool that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing.


Abstract: Nowadays, data-intensive applications are gaining popularity and, together with this trend, processing-in-memory (PIM)-based systems are being given more attention and have become more relevant. This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing. The tool uncovers interesting tradeoffs between, mainly, the PIM computation complexity (cycles required to perform a computation through PIM), the amount of memory used for PIM, the system memory bandwidth, and the data transfer size. Despite its simplicity, the model reveals new insights when applied to real-life examples. The model is demonstrated for several synthetic examples and then applied to explore the influence of different parameters on two systems - IMAGING and FloatPIM. Based on the demonstrations, insights about PIM and its combination with CPU are concluded.


4 citations

Proceedings Article

MessageFusion: On-path Message Coalescing for Energy Efficient and Scalable Graph Analytics


Leul Belayneh¹, Abraham Addisie¹, Valeria Bertacco¹

University of Michigan¹

29 Jul 2019

TL;DR: This work proposes MessageFusion, a domain-specific architecture that greatly reduces network traffic by computing many vertex-updates at the source node, as well as in the network, and leverages a novel edge-reordering mechanism to boost the number of partial update operations that can be completed before reaching their destination.


Abstract: The natural ability of graphs to capture complex relationships within a large amount of data makes graph-based algorithms critical kernels for a wide range of data-analytics applications. Recent processing-in-memory solutions based on a network of 3D-stacked memory, such as Hybrid Memory Cubes (HMCs), have been shown to be a good fit to run graph-based algorithms. However, the communication bandwidth of the network limits their energy and performance efficiencies. In this work, we propose MessageFusion, a domain-specific architecture that greatly reduces network traffic by computing many vertex-updates at the source node, as well as in the network. To this end, we observe that, for many algorithms, vertex-updates need not be atomic, but can be decomposed and computed in a distributed manner. MessageFusion leverages a novel edge-reordering mechanism to boost the number of partial update operations that can be completed before reaching their destination. In addition, to counteract the power overhead introduced by MessageFusion’s edge-reordering mechanism, our solution employs module-level, utilization-based power-gating techniques. Our experimental evaluation shows that MessageFusion achieves a 3× energy savings over a highly-optimized processing-in-memory solution, while also improving performance by 2.1×, on average.
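The observation that vertex-updates can be decomposed can be illustrated with a small Python sketch. This is not MessageFusion's hardware mechanism; it only shows, under the assumption that the update operator is associative and commutative (for example `min` in shortest-path style algorithms), how updates bound for the same destination vertex can be combined at the source before they enter the network. The function name `fuse_messages` and the message format are hypothetical.

```python
def fuse_messages(messages, reduce_op=min):
    """Combine updates that target the same destination vertex.

    messages:  iterable of (destination_vertex, value) pairs.
    reduce_op: an associative, commutative two-argument operator
               (assumed; e.g. min, or a lambda for addition).
    Returns a dict with one fused value per destination vertex.
    """
    fused = {}
    for dst, value in messages:
        if dst in fused:
            fused[dst] = reduce_op(fused[dst], value)  # partial update
        else:
            fused[dst] = value
    return fused


if __name__ == "__main__":
    outgoing = [(3, 5), (1, 2), (3, 4), (1, 7)]  # hypothetical updates
    print(fuse_messages(outgoing))
    print(fuse_messages(outgoing, reduce_op=lambda a, b: a + b))
```

Because the operator is associative and commutative, the destination obtains the same final value whether it receives every individual update or only the fused ones, which is why fewer messages need to cross the network.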


4 citations

Cites background or methods from "A scalable processing-in-memory acc..."

...5%, leading to a power density of 124 mW/mm², which is still under the thermal constraint (133 mW/mm²) reported in [5]....

...We refer to prior works for estimation of the logic layer, including the SerDes links, and the DRAM layers of the HMC [5],...

...(rather than conventional caches of [5])....

...Finally, our baseline architecture includes an edge prefetcher as in [5] to stream the edgeArray from the local vault’s memory....

...Specifically, we simulated Tesseract [5], a domain-specific architecture, with 16 HMCs and 32 vaults per HMC, on an infrastructure based on...

Journal Article

Using Chiplet Encapsulation Technology to Achieve Processing-in-Memory Functions


Wenchao Tian, Bin Li, Zhao Li, Hao Cui, Jing Shi, Yongkun Wang, Jingrong Zhao

01 Oct 2022-Micromachines

TL;DR: This paper reviews Chiplet packaging technologies (2.5D, 3D, and fan-out packaging) that combine processor cores and memory chips to realize the function of PIM, and analyzes some of their application results.


Abstract: With the rapid development of 5G, artificial intelligence (AI), and high-performance computing (HPC), there is a huge increase in the data exchanged between the processor and memory. However, the “storage wall” caused by the von Neumann architecture severely limits the computational performance of the system. To efficiently process such large amounts of data and break up the “storage wall”, it is necessary to develop processing-in-memory (PIM) technology. Chiplet combines processor cores and memory chips with advanced packaging technologies, such as 2.5D, 3 dimensions (3D), and fan-out packaging. This improves the quality and bandwidth of signal transmission and alleviates the “storage wall” problem. This paper reviews the Chiplet packaging technology that has achieved the function of PIM in recent years and analyzes some of its application results. First, the research status and development direction of PIM are presented and summarized. Second, the Chiplet packaging technologies that can realize the function of PIM are introduced, which are divided into 2.5D, 3D packaging, and fan-out packaging according to their physical form. Further, the form and characteristics of their implementation of PIM are summarized. Finally, this paper is concluded, and the future development of Chiplet in the field of PIM is discussed.


4 citations

Journal Article

DStore: A Holistic Key-Value Store Exploring Near-Data Processing and On-Demand Scheduling for Compaction Optimization


Hui Sun¹, Wei Liu¹, Zhi Qiao², Song Fu², Weisong Shi³

Anhui University¹, University of North Texas², Wayne State University³

04 Oct 2018-IEEE Access

TL;DR: A holistic key-value store, named DStore, that explores near-data processing (NDP) and on-demand scheduling for compaction optimization in an LSM-tree key-value store; it not only accomplishes compaction for key-value stores but also improves system performance.


Abstract: Log-structured merge tree (LSM-tree)-based key-value stores are widely deployed in large-scale storage systems. The underlying reason is that the traditional relational databases cannot reach the high performance required by big-data applications. As high-throughput alternatives to relational databases, LSM-tree-based key-value stores can support high-throughput write operations and provide high sequential bandwidth in storage systems. However, the compaction process triggers write amplification and is confronted with the degraded write performance, especially under update-intensive workloads. To address this issue, we design a holistic key-value store to explore near-data processing (NDP) and on-demand scheduling for compaction optimization in an LSM-tree key-value store, named DStore. DStore makes full use of various computing capacities in the host-side and device-side subsystems. DStore dynamically divides the whole host-side compaction tasks into the above two-side subsystems according to two-side different computing capabilities. Meanwhile, the device must be featured with an NDP model. The divided compaction tasks are performed by the host and the device in parallel. In DStore, the NDP-based devices exhibit low-latency and high-bandwidth performance, thus facilitating key-value stores. DStore not only accomplishes compaction for key-value stores but also improves the system performance. We implement our DStore prototype in a real-world platform, and different kinds of testbeds are employed in our experiment. LevelDB and a static compaction optimization using the NDP model (called Co-KV) are used to compare with the DStore in our evaluation. Results show that DStore achieves about $3.7\times$ performance improvement over LevelDB under the db_bench workload. In addition, DStore-enabled key-value stores outperform LevelDB by a factor of about $3.3\times$ and 77% in terms of throughput and latency under YCSB benchmark, respectively.


4 citations
