A scalable processing-in-memory accelerator for parallel graph processing
Citations
Cites background or methods from "A scalable processing-in-memory acc..."
..., [2, 52]) treat an entire application thread as a PIM kernel, in order to minimize the amount of synchronization and data sharing that takes place between the main CPU and main compute-capable memory....
[...]
...We believe ample future work potential exists on examining other solutions for these two challenges as well as our solutions for them within the context of other in-memory accelerators, such as those described in [2, 20, 68, 92, 93, 195, 196, 199, 200]....
[...]
...Several works [2, 3, 21, 68] provide a detailed explanation of this process....
[...]
...Ultimately, there is significant time and energy wasted on moving data between the CPU and memory, many times with little benefit in return, especially in workloads where caching is not very effective [2, 3]....
[...]
...Examples of 3D-stacked DRAM include High-Bandwidth Memory (HBM) [75, 115] and the Hybrid Memory Cube (HMC) [2, 71, 72]....
[...]
Cites background or methods from "A scalable processing-in-memory acc..."
..., the index of the first appearance of A, T, C, G in sorted BR; • OR[len + 1][4]: the occurrence array, i.e., ...
[...]
...• CR[4]: the accumulative count array, i.e., ...
[...]
...We expect applications such as graph processing [4], database searching [31], and sparse matrix computing [41] will also benefit from the proposed techniques....
[...]
...To perform the task described in Algorithm 1, the accelerator contains
• registers to store the query sequence q,
• a 4×64-bit register file to store CR[4],
• a data reorganization engine to calculate OR[x] from its stored data structure,
• two 64-bit unsigned adders to update I_upper and I_lower, ...
[...]
...1 Preprocess: Derive BR[len], SR[len], CR[4], and OR[len + 1][4]; 2 while I_lower <= I do...
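The excerpts above describe the standard FM-index backward search over a BWT: CR is the cumulative count array, OR[len + 1][4] the occurrence array (one row per prefix of BR), and the query interval [I_lower, I_upper] is narrowed one query character at a time. A minimal Python sketch of that generic algorithm, using a half-open interval and dictionaries for CR/OR for readability (variable names follow the excerpts; this is an illustration of the textbook search, not the cited accelerator's implementation):

```python
# Illustrative FM-index backward search over a BWT string BR.
# CR[c]    : count of characters in BR that sort strictly before c
# OR[i][c] : number of occurrences of c in BR[0:i]  (len(BR)+1 rows)

def build_index(BR):
    alphabet = sorted(set(BR))
    CR, total = {}, 0
    for c in alphabet:                # cumulative counts in sorted order
        CR[c] = total
        total += BR.count(c)
    OR = [dict.fromkeys(alphabet, 0)]
    for ch in BR:                     # one occurrence row per BWT prefix
        row = dict(OR[-1])
        row[ch] += 1
        OR.append(row)
    return CR, OR

def backward_search(q, BR, CR, OR):
    """Return the half-open row interval [I_lower, I_upper) matching q."""
    I_lower, I_upper = 0, len(BR)
    for c in reversed(q):             # process the query back to front
        I_lower = CR[c] + OR[I_lower][c]
        I_upper = CR[c] + OR[I_upper][c]
        if I_lower >= I_upper:        # empty interval: q does not occur
            break
    return I_lower, I_upper
```

The interval width I_upper − I_lower is the number of occurrences of q in the indexed text; the excerpt's inclusive-bound loop condition is an equivalent variant of the same narrowing step.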
[...]
References
"A scalable processing-in-memory acc..." refers methods in this paper
...Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems....
[...]
"A scalable processing-in-memory acc..." refers methods in this paper
...For this purpose, we use METIS [27] to perform 512-way multi-constraint partitioning to balance the number of vertices, outgoing edges, and incoming edges of each partition, as done in a recent previous work [51]....
[...]
...This is confirmed by the observation that Tesseract with METIS spends 59% of execution time waiting for synchronization barriers....
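The quoted methodology balances three constraints at once: the number of vertices, outgoing edges, and incoming edges in each partition. Treating the partitioner (METIS) as a black box, a small Python sketch of how such multi-constraint balance can be measured after partitioning, where an imbalance ratio of 1.0 for a constraint means perfectly even load (the function and names are invented for illustration, not taken from the paper):

```python
from collections import defaultdict

def partition_imbalance(edges, assignment, nparts):
    """For each constraint (vertices, out-edges, in-edges), return the
    maximum per-partition load divided by the ideal (average) load."""
    verts, out_e, in_e = defaultdict(int), defaultdict(int), defaultdict(int)
    for v, p in assignment.items():
        verts[p] += 1                     # constraint 1: vertices
    for (u, v) in edges:
        out_e[assignment[u]] += 1         # constraint 2: outgoing edges
        in_e[assignment[v]] += 1          # constraint 3: incoming edges

    def ratio(load):
        ideal = sum(load.values()) / nparts
        return max(load.get(p, 0) for p in range(nparts)) / ideal

    return ratio(verts), ratio(out_e), ratio(in_e)
```

A multi-constraint partitioner tries to keep all three ratios close to 1.0 simultaneously, which is harder than balancing any single one of them.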
[...]
"A scalable processing-in-memory acc..." refers methods in this paper
...We evaluate our architecture using an in-house cycle-accurate x86-64 simulator whose frontend is Pin [38]....
[...]
"A scalable processing-in-memory acc..." refers methods in this paper
...It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model....
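The excerpt describes hardware prefetchers driven by hints from the programming model. As a rough software analogue, a toy Python model in which a hint (the list of addresses a function will touch) is used to pull data into a small buffer ahead of use (the class, buffer policy, and sizes are invented for illustration; the paper's prefetchers are hardware units inside the memory stack):

```python
from collections import OrderedDict

class HintPrefetcher:
    """Toy model: a hint names the data a function will need, and the
    prefetcher pulls those words into a small buffer before the function
    body runs, so its loads hit the buffer instead of going to DRAM."""

    def __init__(self, memory, capacity=8):
        self.memory = memory          # backing store: address -> value
        self.buffer = OrderedDict()   # prefetch buffer, FIFO eviction
        self.capacity = capacity

    def prefetch(self, addresses):
        for addr in addresses:        # issued when the hint arrives
            if addr not in self.buffer:
                if len(self.buffer) >= self.capacity:
                    self.buffer.popitem(last=False)
                self.buffer[addr] = self.memory[addr]

    def load(self, addr):
        if addr in self.buffer:       # hit: served from the buffer
            return True, self.buffer[addr]
        return False, self.memory[addr]   # miss: go to memory
```

The point of the hint is that graph accesses (e.g., a vertex's neighbor list) are irregular, so a stride predictor alone cannot anticipate them; the programming model can.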
[...]