Proceedings ArticleDOI

Performance Evaluation of Priority Queues for Fine-Grained Parallel Tasks on GPUs

TL;DR: This work presents a performance evaluation of GPU-based priority queue implementations for two applications, discrete-event simulation and parallel A* path search on grids, with measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature.
Abstract: Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. At the core of many of these irregular tasks are priority queues that guide the progress of the computations and can easily become the bottleneck of an application. To our knowledge, no systematic comparison of priority queue implementations on GPUs currently exists in the literature. We close this gap with a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids. We focus on scenarios requiring large numbers of priority queues holding up to a few thousand items each. We present performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature. The measurement results show that up to about 500 items per queue, circular buffers frequently outperform tree-based queues for the considered applications, particularly under a simple parallelization of individual item enqueue operations. We analyze profiling metrics to explore classical queue designs in light of the importance of high hardware utilization as well as homogeneous computations and memory accesses across GPU threads.
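The circular-buffer result is easy to picture with a small sketch. The following is a minimal, illustrative C++ version of a sorted ring buffer used as a priority queue (our illustration, not the paper's GPU code): dequeuing the minimum is O(1) at the head, and an enqueue shifts at most the current item count, which stays cheap at the few-hundred-item queue sizes the measurements cover.

    // Illustrative sketch, not the paper's implementation: a sorted circular
    // buffer used as a small priority queue. dequeue_min pops the head in
    // O(1); enqueue shifts at most `count` items, which stays cheap at the
    // queue sizes (a few hundred items) for which circular buffers win.
    #include <cstddef>

    template <typename T, std::size_t Capacity>
    struct CircularBufferPQ {
        T items[Capacity];
        std::size_t head = 0;   // index of the smallest item
        std::size_t count = 0;

        bool enqueue(const T& v) {
            if (count == Capacity) return false;
            // Walk back from the tail, shifting larger items up one slot.
            std::size_t i = count;
            while (i > 0 && items[(head + i - 1) % Capacity] > v) {
                items[(head + i) % Capacity] = items[(head + i - 1) % Capacity];
                --i;
            }
            items[(head + i) % Capacity] = v;
            ++count;
            return true;
        }

        bool dequeue_min(T& out) {
            if (count == 0) return false;
            out = items[head];
            head = (head + 1) % Capacity;
            --count;
            return true;
        }
    };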


Citations
Peter Sanders
01 Jan 1999
TL;DR: A fast priority queue based on k-way merging and suited to external memory and cached memory is engineered; it improves previous external memory algorithms by constant factors crucial for transferring it to cached memory.
Abstract: The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms that perform well in practice. This paper advocates the adaptation of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation, the algorithm is at least two times faster than optimized implementations of binary heaps and 4-ary heaps for large inputs.

74 citations
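The k-way merging step at the heart of this design can be sketched briefly. The fragment below is an illustrative C++ version (names are ours, not Sanders' API) that merges k sorted runs through a small min-heap of their current front elements; the sequence heap organizes its deletions around exactly this primitive.

    // Illustrative sketch: merge k sorted runs via a min-heap of run heads.
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
        using Entry = std::pair<int, std::size_t>;        // (value, run index)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heads;
        std::vector<std::size_t> pos(runs.size(), 0);
        for (std::size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty()) heads.push({runs[r][0], r});

        std::vector<int> out;
        while (!heads.empty()) {
            auto [v, r] = heads.top();                    // smallest front element
            heads.pop();
            out.push_back(v);
            if (++pos[r] < runs[r].size())                // advance that run
                heads.push({runs[r][pos[r]], r});
        }
        return out;
    }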

Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview and categorisation of the literature on agent-based simulations on hardware accelerators according to the applied techniques, and sketch directions for future research towards automating the hardware mapping and execution.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as Field Programmable Gate Arrays. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

52 citations

Proceedings ArticleDOI
15 Oct 2018
TL;DR: Results show that a CPU-based parallelisation closely approaches the results of partial offloading, while full offloading substantially outperforms the other approaches and achieves a speedup of up to 28.7x over the sequential execution on a CPU.
Abstract: Microscopic traffic simulation is associated with substantial runtimes, limiting the feasibility of large-scale evaluation of traffic scenarios. Even though heterogeneous hardware composed of CPUs, graphics processing units (GPUs), and fused CPU-GPU devices is nowadays inexpensive and widely available, common traffic simulators still rely purely on CPU-based execution, leaving substantial acceleration potential untapped. A number of existing works have considered the execution of traffic simulations on accelerators, but have relied on simplified models of road networks and driver behaviour tailored to the given hardware platform. Thus, the existing approaches cannot directly benefit from the vast body of research on the validity of common traffic simulation models. In this paper, we explore the performance gains achievable through the use of heterogeneous hardware when relying on typical traffic simulation models used in CPU-based simulators. We propose a partial offloading approach that relies either on a dedicated GPU or a fused CPU-GPU device. Further, we present a traffic simulation running fully on a many-core GPU and discuss the challenges of this approach. Our results show that a CPU-based parallelisation closely approaches the results of partial offloading, while full offloading substantially outperforms the other approaches. We achieve a speedup of up to 28.7x over the sequential execution on a CPU.

13 citations


Cites methods from "Performance Evaluation of Priority ..."

  • ...The implementation based on ring buffers and the synchronisation based on atomic operations closely resemble GPU-based discrete-event simulations, which have been shown to achieve high speedup over a CPU-based execution [35], [36]....



Posted Content
TL;DR: This paper presents the parBucketHeap, a parallel, cache-efficient data structure designed for modern GPU architectures that supports standard priority queue operations as well as bulk updates, and uses it to solve the single-source shortest path (SSSP) problem.
Abstract: The high computational throughput of modern graphics processing units (GPUs) makes them the de facto architecture for high-performance computing applications. However, to achieve peak performance, GPUs require highly parallel workloads, as well as memory access patterns that exhibit good locality of reference. As a result, many state-of-the-art algorithms and data structures designed for GPUs sacrifice work-optimality to achieve the necessary parallelism. Furthermore, some abstract data types are avoided completely due to there being no corresponding data structure that performs well on the GPU. One such abstract data type is the priority queue. Many well-known algorithms rely on priority queue operations as a building block. While various priority queue structures have been developed that are parallel, cache-aware, or cache-oblivious, none has been shown to be efficient on GPUs. In this paper, we present the parBucketHeap, a parallel, cache-efficient data structure designed for modern GPU architectures that supports standard priority queue operations, as well as bulk update. We analyze the structure in several well-known computational models and show that it provides both optimal parallelism and cache efficiency. We implement the parBucketHeap and, using it, we solve the single-source shortest path (SSSP) problem. Experimental results indicate that, for sufficiently large, dense graphs with high diameter, we outperform current state-of-the-art SSSP algorithms on the GPU by up to a factor of 5. Unlike existing GPU SSSP algorithms, our approach is work-optimal and places significantly less load on the GPU, reducing power consumption.

2 citations


Cites background from "Performance Evaluation of Priority ..."

  • ...However, many classical algorithms and data structures that were developed for sequential machines generate workloads that are not well-suited for highly parallel GPUs....


  • ...In 2007, Harish et al. [22] demonstrated that a Bellman-Ford approach is well-suited to GPUs, achieving performance gains over sequential approaches for scale-free graphs....


  • ...[13] more recently demonstrated that, for small queues of up to 500 items, simple circular buffers out-perform tree-based queues for a range of applications....


  • ...In the past decade, graphics processing units (GPUs) have emerged as the most effective hardware architectures for solving computationally intensive problems....


  • ...While various priority queue structures have been developed that are parallel, cache-aware, or cache-oblivious, none has been shown to be efficient on GPUs....

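The bulk-update idea from the abstract can be illustrated with a small, sequential sketch. The function below is not the parBucketHeap itself: it runs a label-correcting SSSP that pops entries in batches and skips stale ones, which stays correct even when a batch reaches past the strict minimum (assuming non-negative edge weights); the batch size and naming are illustrative assumptions.

    // Sequential sketch of the bulk-update idea, not the parBucketHeap.
    #include <functional>
    #include <limits>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Edge { int to; int w; };

    std::vector<int> sssp(const std::vector<std::vector<Edge>>& g, int src,
                          std::size_t batch = 32) {
        const int INF = std::numeric_limits<int>::max();
        std::vector<int> dist(g.size(), INF);
        using Entry = std::pair<int, int>;                // (distance, vertex)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
        dist[src] = 0;
        pq.push({0, src});
        while (!pq.empty()) {
            std::vector<Entry> round;                     // one "bulk" extraction
            for (std::size_t i = 0; i < batch && !pq.empty(); ++i) {
                round.push_back(pq.top());
                pq.pop();
            }
            for (auto [d, u] : round) {
                if (d != dist[u]) continue;               // stale entry, skip
                for (const Edge& e : g[u])
                    if (d + e.w < dist[e.to]) {           // relax and re-insert
                        dist[e.to] = d + e.w;
                        pq.push({dist[e.to], e.to});
                    }
            }
        }
        return dist;
    }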

References
Book
01 Jan 1968
TL;DR: The arrangement of this invention provides a strong vibration free hold-down mechanism while avoiding a large pressure drop to the flow of coolant fluid.
Abstract: A fuel pin hold-down and spacing apparatus for use in nuclear reactors is disclosed. Fuel pins forming a hexagonal array are spaced apart from each other and held-down at their lower end, securely attached at two places along their length to one of a plurality of vertically disposed parallel plates arranged in horizontally spaced rows. These plates are in turn spaced apart from each other and held together by a combination of spacing and fastening means. The arrangement of this invention provides a strong vibration free hold-down mechanism while avoiding a large pressure drop to the flow of coolant fluid. This apparatus is particularly useful in connection with liquid cooled reactors such as liquid metal cooled fast breeder reactors.

17,939 citations

Journal ArticleDOI
TL;DR: This paper describes how heuristic information from the problem domain can be incorporated into a formal mathematical theory of graph searching and demonstrates an optimality property of a class of search strategies.
Abstract: Although the problem of determining the minimum cost path through a graph arises naturally in a number of interesting applications, there has been no underlying theory to guide the development of efficient search procedures. Moreover, there is no adequate conceptual framework within which the various ad hoc search strategies proposed to date can be compared. This paper describes how heuristic information from the problem domain can be incorporated into a formal mathematical theory of graph searching and demonstrates an optimality property of a class of search strategies.

10,366 citations


"Performance Evaluation of Priority ..." refers methods in this paper

  • ...2) A* Path Search: The A* algorithm [18] is an extension to Dijkstra’s algorithm for finding shortest paths in a graph....

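As the quoted passage notes, A* is Dijkstra's algorithm plus an admissible heuristic. A minimal C++ sketch for a 4-connected grid with a Manhattan-distance heuristic (our illustration, not the evaluated GPU implementation) looks as follows.

    // Minimal A* on a 4-connected grid with unit edge costs.
    #include <cstdlib>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Returns the shortest path length from (sx,sy) to (gx,gy), or -1.
    int astar(const std::vector<std::vector<int>>& grid,  // 0 = free, 1 = wall
              int sx, int sy, int gx, int gy) {
        const int H = static_cast<int>(grid.size());
        const int W = static_cast<int>(grid[0].size());
        auto h = [&](int x, int y) { return std::abs(x - gx) + std::abs(y - gy); };
        std::vector<int> g(W * H, -1);                    // cost from start
        using Entry = std::pair<int, int>;                // (f = g + h, cell)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> open;
        g[sy * W + sx] = 0;
        open.push({h(sx, sy), sy * W + sx});
        const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
        while (!open.empty()) {
            auto [f, cell] = open.top();
            open.pop();
            int x = cell % W, y = cell / W;
            if (x == gx && y == gy) return g[cell];
            if (f > g[cell] + h(x, y)) continue;          // stale entry, skip
            for (int d = 0; d < 4; ++d) {
                int nx = x + dx[d], ny = y + dy[d];
                if (nx < 0 || ny < 0 || nx >= W || ny >= H || grid[ny][nx]) continue;
                int n = ny * W + nx;
                if (g[n] == -1 || g[cell] + 1 < g[n]) {
                    g[n] = g[cell] + 1;
                    open.push({g[n] + h(nx, ny), n});
                }
            }
        }
        return -1;                                        // goal unreachable
    }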

Journal ArticleDOI
TL;DR: The splay tree, a self-adjusting form of binary search tree, is developed and analyzed, and shown to be as efficient as balanced trees when total running time is the measure of interest.
Abstract: The splay tree, a self-adjusting form of binary search tree, is developed and analyzed. The binary search tree is a data structure for representing tables and lists so that accessing, inserting, and deleting items is easy. On an n-node splay tree, all the standard search tree operations have an amortized time bound of O(log n) per operation, where by “amortized time” is meant the time per operation averaged over a worst-case sequence of operations. Thus splay trees are as efficient as balanced trees when total running time is the measure of interest. In addition, for sufficiently long access sequences, splay trees are as efficient, to within a constant factor, as static optimum search trees. The efficiency of splay trees comes not from an explicit structural constraint, as with balanced trees, but from applying a simple restructuring heuristic, called splaying, whenever the tree is accessed. Extensions of splaying give simplified forms of two other data structures: lexicographic or multidimensional search trees and link/cut trees.

1,321 citations


"Performance Evaluation of Priority ..." refers background in this paper

  • ...At larger item counts, while the conclusions vary, heaps, splay trees [23], or more complex proposals such as the calendar queue [24] or the ladder queue [25] achieved the highest performance....


  • ...Splay Tree: Splay trees, proposed by Sleator and Tarjan [23], are binary search trees that heuristically adjust to the access patterns to the tree’s items....

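The self-adjusting behaviour referenced above comes from the splay operation, which rotates an accessed key to the root so that frequently accessed items stay near the top. Below is a compact top-down splay in C++ modelled on Sleator and Tarjan's classic formulation; it sketches only the rotation logic, not the queue variant measured in the paper.

    // Compact top-down splay: brings the accessed key (or its in-order
    // neighbour, if absent) to the root of the tree.
    struct Node {
        int key;
        Node* left = nullptr;
        Node* right = nullptr;
        explicit Node(int k) : key(k) {}
    };

    Node* splay(Node* t, int key) {
        if (!t) return nullptr;
        Node header{0};                        // temporary assembly node
        Node* l = &header;                     // right edge of the left tree
        Node* r = &header;                     // left edge of the right tree
        for (;;) {
            if (key < t->key) {
                if (!t->left) break;
                if (key < t->left->key) {      // zig-zig: rotate right
                    Node* y = t->left;
                    t->left = y->right;
                    y->right = t;
                    t = y;
                    if (!t->left) break;
                }
                r->left = t; r = t;            // link t into the right tree
                t = t->left;
            } else if (key > t->key) {
                if (!t->right) break;
                if (key > t->right->key) {     // zig-zig: rotate left
                    Node* y = t->right;
                    t->right = y->left;
                    y->left = t;
                    t = y;
                    if (!t->right) break;
                }
                l->right = t; l = t;           // link t into the left tree
                t = t->right;
            } else {
                break;                         // found the key
            }
        }
        l->right = t->left;                    // reassemble the three parts
        r->left = t->right;
        t->left = header.right;
        t->right = header.left;
        return t;
    }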

Proceedings ArticleDOI
25 Feb 2012
TL;DR: This work presents a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.
Abstract: Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.

541 citations


"Performance Evaluation of Priority ..." refers background in this paper

  • ...presented a parallelization of the breadth-first search problem achieving asymptotically optimal work complexity [12]....

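The "fine-grained task management constructed from efficient prefix sum" can be illustrated by one BFS expansion step. In the sequential C++ sketch below (our illustration, not Merrill et al.'s code), the out-degrees of the frontier are prefix-summed so that every vertex owns a disjoint slice of the next frontier; on a GPU, this lets threads write neighbours without atomic contention. Visited-filtering and de-duplication are omitted for brevity.

    // One BFS expansion step over a CSR graph: row_offsets has |V|+1
    // entries, col_indices holds the edge targets.
    #include <numeric>
    #include <vector>

    std::vector<int> expand_frontier(const std::vector<int>& row_offsets,
                                     const std::vector<int>& col_indices,
                                     const std::vector<int>& frontier) {
        // 1. Gather the out-degree of every frontier vertex.
        std::vector<int> degrees(frontier.size());
        for (std::size_t i = 0; i < frontier.size(); ++i)
            degrees[i] = row_offsets[frontier[i] + 1] - row_offsets[frontier[i]];

        // 2. Exclusive prefix sum yields each vertex's write offset.
        std::vector<int> offsets(frontier.size() + 1, 0);
        std::partial_sum(degrees.begin(), degrees.end(), offsets.begin() + 1);

        // 3. Scatter neighbours into disjoint slices of the next frontier.
        std::vector<int> next(offsets.back());
        for (std::size_t i = 0; i < frontier.size(); ++i) {
            int src = row_offsets[frontier[i]];
            for (int j = 0; j < degrees[i]; ++j)
                next[offsets[i] + j] = col_indices[src + j];
        }
        return next;
    }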