A scalable processing-in-memory accelerator for parallel graph processing
Citations
205 citations
Cites background from "A scalable processing-in-memory acc..."
...IMPICA addresses the key challenges of (1) how to achieve high parallelism in the presence of serial accesses in pointer chasing, and (2) how to effectively perform virtual-to-physical address translation on the memory side without requiring expensive accesses to the CPU’s memory management unit....
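The serial-access problem the excerpt refers to is visible in a few lines of code: in a linked-list traversal, each load depends on the pointer returned by the previous one, so the accesses cannot be overlapped. A minimal sketch of that dependent-load pattern (the node layout here is illustrative, not IMPICA's actual interface):

```python
class Node:
    """Singly linked list node; following `next` is a dependent load."""
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

def chase(head, target):
    """Traverse the list until `target` is found.

    Each iteration must wait for the previous pointer to resolve
    before the next node can even be addressed, which is why
    pointer chasing is inherently serial.
    """
    hops = 0
    node = head
    while node is not None:
        if node.key == target:
            return hops
        node = node.next
        hops += 1
    return -1

# Build a 4-node list: 3 -> 1 -> 4 -> 1
head = None
for key in [1, 4, 1, 3]:
    head = Node(key, head)

print(chase(head, 4))  # 2 (two dependent loads before the key is found)
```

Each `node.next` dereference here models one memory access whose address is unknown until the prior access completes; this is the dependency chain that limits parallelism on the host side.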
[...]
...IMPICA also significantly reduces overall system energy consumption (by 41%, 23%, and 10% for the three commonly-used data structures, and by 6% for DBx1000)....
[...]
...We then discuss opportunities for acceleration within 3D-stacked memory....
[...]
200 citations
Cites background from "A scalable processing-in-memory acc..."
..., Processing-In-Memory (PIM) [3,4,27,33], also known as Processing-Near-Memory (PNM) or Near-Data Computing (NDC) [13]....
[...]
...With the significant advances in the adoption of 3D-stacked memory technology that tightly combines a logic layer and DRAM layers [3, 4, 48, 64, 71, 86, 109], this limitation has been overcome and PIM has become a likely viable approach to improve system design....
[...]
...3D-stacked memory technology brings new dimensions and better feasibility to PIM-based architectures [3, 4, 14, 15, 27, 39, 41, 44, 70, 71, 73, 87, 109]....
[...]
...As we discussed in Section 1, 3D-stacked memory technology makes it possible to place computational units in the base logic layer underneath the memory stacks [3,4,48,64,71,86,109]....
[...]
193 citations
Cites background from "A scalable processing-in-memory acc..."
..., [4,5,8,9,24]) have been proposed to exploit the logic layer to implement some computation close to DRAM....
[...]
184 citations
Cites background or methods from "A scalable processing-in-memory acc..."
...For NDP systems running MapReduce, graph processing, and deep neural networks, HRL achieves 92% of the peak performance of an NDP system based on custom accelerators for each application....
[...]
...CGRA arrays incur high power overheads due to the powerful interconnect for complicated data flow support [21], and are typically inefficient for irregular data and control flow patterns....
[...]
179 citations
Cites methods or result from "A scalable processing-in-memory acc..."
...[16] have tried to use METIS [17] to obtain a better partitioning for TESSERACT, but the results were not promising....
[...]
...TESSERACT [16] is a PIM-enabled parallel graph processing architecture....
[...]
...In fact, the results in [16] confirm this observation: the bandwidth utilization of TESSERACT is usually less than 40%....
[...]
References
14,696 citations
"A scalable processing-in-memory acc..." refers to methods in this paper
...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....
[...]
13,327 citations
5,629 citations
"A scalable processing-in-memory acc..." refers to methods in this paper
...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in recent prior work [51]....
[...]
...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....
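The "multi-constraint" objective in the excerpt means each vertex carries several weights (here: itself, its out-degree) that must all be balanced across partitions simultaneously. The toy sketch below illustrates that objective with a simple greedy heuristic; it is not METIS's multilevel algorithm, and the vertex data is made up for illustration:

```python
def greedy_multi_constraint(vertices, k):
    """Assign (vertex, out_degree) pairs to k partitions, greedily
    balancing both vertex count and total out-degree per partition.

    A toy heuristic illustrating the multi-constraint objective;
    METIS itself uses multilevel coarsening and refinement.
    """
    loads = [[0, 0] for _ in range(k)]  # [vertex count, edge count] per partition
    assignment = {}
    # Place heavy vertices first so the greedy choice has room to balance.
    for v, deg in sorted(vertices, key=lambda x: -x[1]):
        # Pick the partition with the smallest combined load.
        target = min(range(k), key=lambda p: loads[p][0] + loads[p][1])
        assignment[v] = target
        loads[target][0] += 1
        loads[target][1] += deg
    return assignment, loads

# Hypothetical vertices with out-degrees.
verts = [("a", 5), ("b", 1), ("c", 4), ("d", 2), ("e", 3), ("f", 3)]
assignment, loads = greedy_multi_constraint(verts, 2)
print(loads)  # [[3, 9], [3, 9]]: both constraints balanced across the 2 partitions
```

Even with perfectly balanced weights, the excerpt notes that Tesseract with METIS still spends most of its time at synchronization barriers, since balanced partition weights do not guarantee balanced communication or runtime.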
[...]
4,019 citations
"A scalable processing-in-memory acc..." refers to methods in this paper
...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....
[...]
3,840 citations
"A scalable processing-in-memory acc..." refers to methods in this paper
...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....
[...]
...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
[...]