An effective GPU implementation of breadth-first search
Frequently Asked Questions (13)
Q2. How do the authors get the offsets of B-Frontiers?
The authors use atomic operations to obtain the offsets of the B-Frontiers, so that the copy from each B-Frontier into the G-Frontier can be performed in parallel.
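A minimal CPU sketch of this offset scheme, using `std::atomic` as a stand-in for CUDA's `atomicAdd` (function and variable names here are hypothetical, not the paper's): each block reserves a contiguous range in the G-Frontier with a single atomic fetch-add on a shared tail counter, after which the per-block copies are independent.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

// Hypothetical CPU analogue of the paper's CUDA technique: one atomic
// operation per block reserves that block's offset range in the
// G-Frontier; the copies then proceed in parallel without conflicts.
std::vector<int> merge_frontiers(const std::vector<std::vector<int>>& b_frontiers) {
    std::size_t total = 0;
    for (const auto& b : b_frontiers) total += b.size();
    std::vector<int> g_frontier(total);
    std::atomic<int> g_tail{0};   // stand-in for the global tail counter (atomicAdd target)

    std::vector<std::thread> workers;
    for (const auto& b : b_frontiers) {
        const std::vector<int>* bp = &b;
        workers.emplace_back([&g_frontier, &g_tail, bp] {
            // One atomic per block, not per vertex: reserve a contiguous range.
            int offset = g_tail.fetch_add(static_cast<int>(bp->size()));
            std::copy(bp->begin(), bp->end(), g_frontier.begin() + offset);
        });
    }
    for (auto& w : workers) w.join();
    return g_frontier;
}
```

Because the range reservation is the only synchronized step, contention is one atomic per block rather than one per frontier vertex.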
Q3. What is the way to speed up a graph algorithm?
Although parallel execution on a GPU can easily achieve speedups of tens or hundreds of times over straightforward CPU implementations, accelerating intelligently designed and well-optimized CPU algorithms is much more difficult.
Q4. What is the way to build the frontier hierarchy?
A natural way to build the frontier hierarchy is to follow the two-level thread hierarchy, i.e., to build the grid-level frontier from the block-level frontiers.
Q5. Why do the authors not expect many neighbors for each frontier vertex?
Since many EDA problems, such as circuit simulation, static timing analysis, and routing, are formulated on sparse graphs, the authors do not expect many neighbors for each frontier vertex.
Q6. what are the methods proposed in this paper?
Both methods proposed in this paper – hierarchical queue management and hierarchical kernel arrangement – are potentially applicable to GPU implementations of other types of algorithms as well.
Q7. How many threads can be synchronized in a single block?
Once the frontier outgrows the capacity of a single block, the authors switch to the second GPU synchronization strategy, which can handle frontiers of up to 15,360 vertices.
Q8. What is the general solution to synchronize the W-Frontier elements in parallel?
A general solution is to launch one kernel per level and rely on the implicit global barrier between two consecutive kernel launches, which incurs substantial kernel-launch overhead.
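The kernel-per-level baseline described above can be sketched as a plain C++ loop (the real code is CUDA; names here are illustrative): each iteration of the outer loop corresponds to one kernel launch, and returning to the host between iterations plays the role of the global barrier.

```cpp
#include <vector>
#include <utility>

// Level-synchronous BFS: one "kernel launch" per level.
// Returns the BFS level of every vertex (-1 if unreachable).
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> level(adj.size(), -1);
    std::vector<int> frontier{src};
    level[src] = 0;
    int depth = 0;
    while (!frontier.empty()) {
        // In the GPU baseline, this loop body is one kernel launch;
        // the return to the host afterwards is the global barrier.
        std::vector<int> next;
        for (int u : frontier)
            for (int v : adj[u])
                if (level[v] == -1) {
                    level[v] = depth + 1;
                    next.push_back(v);
                }
        frontier = std::move(next);
        ++depth;
    }
    return level;
}
```

Since the number of launches equals the number of BFS levels, graphs with large diameters pay the launch overhead many times over, which is what motivates the paper's cheaper in-kernel synchronization strategies.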
Q9. How many threads can be launched from the host CPU?
When launching such a single-block kernel from the host CPU, the authors always launch 512 threads regardless of the size of the current frontier.
Q10. What is the idea of a hierarchical queue structure?
The idea is that once the lower-level queues have been built quickly, the exact location of each element in the higher-level queue is known, so the elements can be copied to the higher-level queue in parallel.
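A sketch of this idea (hypothetical names, sequential C++ standing in for the parallel GPU code): an exclusive prefix sum over the lower-level queue sizes gives each queue's start offset in the higher-level queue, so every element's final position is known and the copies are mutually independent.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Merge lower-level queues (e.g. per-warp W-Frontiers) into one
// higher-level queue. The exclusive prefix sum of sizes fixes each
// element's destination, so each copy below is independent and could
// run in parallel on the GPU.
std::vector<int> flatten_queues(const std::vector<std::vector<int>>& lower) {
    std::vector<std::size_t> offsets(lower.size() + 1, 0);
    for (std::size_t i = 0; i < lower.size(); ++i)
        offsets[i + 1] = offsets[i] + lower[i].size();   // exclusive prefix sum
    std::vector<int> higher(offsets.back());
    for (std::size_t i = 0; i < lower.size(); ++i)       // independent copies
        std::copy(lower[i].begin(), lower[i].end(), higher.begin() + offsets[i]);
    return higher;
}
```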
Q11. How many threads are in a warp?
The authors do not know the scheduling order among warps, but once a warp is scheduled, it always runs T1-T8 first, followed by T9-T16, T17-T24, and finally T25-T32.
Q12. What is the way to maintain the new frontier in a queue?
It is difficult to maintain the new frontier in a queue because different threads all write to the end of the same queue and therefore end up executing in sequence.
Q13. What is the difference between the two BFS procedures?
In this BFS procedure, each frontier propagation can be transformed into a matrix-vector multiplication; hence there are in total L multiplications, where L is the number of levels.
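One frontier propagation in this formulation can be sketched as a Boolean matrix-vector product (a dense sketch with illustrative names, not the paper's implementation): the next frontier is the adjacency matrix applied to the current frontier vector under (OR, AND), masked so that already-visited vertices are excluded; with L BFS levels, this product is evaluated L times in total.

```cpp
#include <vector>
#include <cstddef>

// One BFS frontier propagation as a Boolean matrix-vector product:
// next[v] = OR over u of (adj[v][u] AND frontier[u]), with visited
// vertices masked out so they are not re-enqueued.
std::vector<bool> propagate(const std::vector<std::vector<bool>>& adj,
                            const std::vector<bool>& frontier,
                            const std::vector<bool>& visited) {
    std::size_t n = adj.size();
    std::vector<bool> next(n, false);
    for (std::size_t v = 0; v < n; ++v) {
        if (visited[v]) continue;               // mask: skip visited vertices
        for (std::size_t u = 0; u < n; ++u)
            if (adj[v][u] && frontier[u]) {     // AND inside, OR across u
                next[v] = true;
                break;
            }
    }
    return next;
}
```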