Proceedings ArticleDOI

Exploiting the parallelism of large-scale application-layer networks by adaptive GPU-based simulation

07 Dec 2014, pp. 3471-3482
TL;DR: A GPU-based simulator engine is presented that performs all steps of large-scale network simulations on a commodity many-core GPU and adapts its configuration at runtime to balance parallelism and overheads, achieving high performance for a given network model and scenario.
Abstract: We present a GPU-based simulator engine that performs all steps of large-scale network simulations on a commodity many-core GPU. Overhead is reduced by avoiding unnecessary data transfers between graphics memory and main memory. On the example of a widely deployed peer-to-peer network, we analyze the parallelism in large-scale application-layer networks, which suggests the use of thousands of concurrent processor cores for simulation. The proposed simulator employs the vast number of parallel cores in modern GPUs to exploit the identified parallelism and enables substantial simulation speedup. The simulator adapts its configuration at runtime in order to balance parallelism and overheads to achieve high performance for a given network model and scenario. A performance evaluation for simulations of networks comprising up to one million peers demonstrates a speedup of up to 19.5 compared with an efficient sequential implementation and shows the effectiveness of the runtime adaptation to different network conditions.
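
The runtime adaptation described in the abstract is, at its core, an optimization loop over the simulator configuration, in particular over how many simulated peers are grouped into each logical process (LP). The sketch below is a minimal, hedged illustration of such a loop in CUDA/C++ host code; adapt_lp_size, run_interval, and the doubling/halving policy are assumptions made for illustration and do not reproduce the authors' adaptation heuristic.

```cuda
// Hedged sketch (not the authors' implementation): host-side hill climbing over
// the number of peers aggregated into each logical process (LP). run_interval
// is an assumed callback that simulates one slice of simulated time on the GPU
// with the given LP size and returns the measured event rate (events/second).
#include <functional>

using RunInterval = std::function<double(int lpSize)>;

int adapt_lp_size(const RunInterval &run_interval, int lpSize, int numIntervals) {
    double bestRate = run_interval(lpSize);
    bool grow = true;                                   // try larger LPs first
    for (int i = 1; i < numIntervals; ++i) {
        int candidate = grow ? lpSize * 2 : (lpSize > 1 ? lpSize / 2 : 1);
        double rate = run_interval(candidate);
        if (rate > bestRate) { lpSize = candidate; bestRate = rate; }
        else                 { grow = !grow; }          // reverse search direction
    }
    return lpSize;
}
```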


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview and categorisation of the literature according to the applied techniques for agent-based simulations on hardware accelerators, and sketch directions for future research towards automating the hardware mapping and execution.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as Field Programmable Gate Arrays. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

52 citations

Proceedings ArticleDOI
16 May 2017
TL;DR: This work presents the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm, and shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks.
Abstract: The parallel execution of discrete-event simulations on commodity GPUs has been shown to achieve high event rates. Most previous proposals have focused on conservative synchronization, which typically extracts only limited parallelism in cases of low event density in simulated time. We present the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm. The optimistic simulator implementation is compared with an otherwise identical implementation using conservative synchronization. Our evaluation shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks. To reduce the cost of state keeping, we show how XORWOW, the default pseudo-random number generator in CUDA, can be reversed based solely on its current state. Since the optimal configuration of multiple performance-critical simulator parameters depends on the behavior of the simulation model, these parameters are adapted dynamically based on performance measurements and heuristic optimization at runtime. We evaluate the simulator using the PHOLD benchmark model and a simplified model of peer-to-peer networks using the Kademlia protocol. On a commodity GPU, the optimistic simulator achieves event rates of up to 814 million events per second and a speedup of up to 36 compared with conservative synchronization.
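
The XORWOW reversal mentioned above rests on the fact that xorshift-style updates of the form x ^= x << a (and the right-shift variant) are bijections on 32-bit words and can be undone from the current value alone, so rollbacks do not need per-event snapshots of the generator state. The sketch below shows only this building block and is a hedged illustration; fully reversing the cuRAND XORWOW state additionally requires undoing its word rotation and Weyl-sequence increment, which is not reproduced here.

```cuda
// Hedged sketch: undoing the elementary xorshift steps that generators such as
// XORWOW are composed of. For y = x ^ (x << a), the inverse is
// x = y ^ (y << a) ^ (y << 2a) ^ ... because the series telescopes; the
// right-shift case is symmetric. Valid for 32-bit words and 0 < a < 32.
__host__ __device__ unsigned int undo_xorshift_left(unsigned int y, int a) {
    unsigned int x = y;
    for (int s = a; s < 32; s += a) x ^= (y << s);
    return x;
}

__host__ __device__ unsigned int undo_xorshift_right(unsigned int y, int a) {
    unsigned int x = y;
    for (int s = a; s < 32; s += a) x ^= (y >> s);
    return x;
}
```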

13 citations


Cites background from "Exploiting the parallelism of large..."

  • ...proposed a fully GPU-based conservative simulator implementation that adapts the LP size at runtime to balance parallelism and event management overheads [1]....

    [...]

  • ...In fully GPU-based simulation [1, 13, 21, 22, 28, 32], the simulator core is executed on the GPU as well....

    [...]

Posted Content
TL;DR: This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as FPGAs. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

10 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...Representation of irregular data structures by arrays and grids APU [180], GPU [47, 69, 98, 114, 144–146, 152, 166, 177] [7, 14, 95, 109, 140, 141, 159, 168, 183, 196], FPGA [121, 149]...

    [...]

  • ...Other works assume a minimum time delta between an event and its creation (lookahead) to guarantee the correctness of the simulation results [7, 157, 196]....

    [...]

  • ...Instead, the set of events is considered jointly in an unsorted fashion [168], split by model segment [159] or simulated entity [7, 109, 196], split according to a fixed policy [141, 183], or split randomly [121]....

    [...]

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents a performance evaluation of GPU-based priority queue implementations for two applications, discrete-event simulation and parallel A* path searches on grids, covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature.
Abstract: Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. The core data structures in many of these irregular tasks are priority queues that guide the progress of the computations and which can easily become the bottleneck of an application. To our knowledge, currently no systematic comparison of priority queue implementations on GPUs exists in the literature. We close this gap by a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids. We focus on scenarios requiring large numbers of priority queues holding up to a few thousand items each. We present performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature. The measurement results show that up to about 500 items per queue, circular buffers frequently outperform tree-based queues for the considered applications, particularly under a simple parallelization of individual item enqueue operations. We analyze profiling metrics to explore classical queue designs in light of the importance of high hardware utilization as well as homogeneous computations and memory accesses across GPU threads.
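
The result that circular buffers frequently outperform tree-based queues up to roughly 500 items reflects how simple the per-operation work is, as the sketch below illustrates. It is a hedged illustration in CUDA, with the capacity, field names, and one-thread-per-queue assumption chosen for this example rather than taken from the paper: events are kept sorted by timestamp, so the minimum is removed by advancing the head, and an insert shifts at most count entries.

```cuda
// Hedged sketch: a fixed-capacity ring buffer holding events sorted by
// timestamp, operated on by a single thread (e.g. one queue per LP).
struct Event { float ts; int payload; };

struct RingQueue {
    static const int CAP = 512;   // illustrative capacity
    Event buf[CAP];
    int head;                     // index of the earliest event
    int count;                    // number of stored events

    __device__ void init() { head = 0; count = 0; }

    __device__ bool push(Event e) {
        if (count == CAP) return false;
        int i = count;            // walk back from the tail, shifting later events
        while (i > 0 && buf[(head + i - 1) % CAP].ts > e.ts) {
            buf[(head + i) % CAP] = buf[(head + i - 1) % CAP];
            --i;
        }
        buf[(head + i) % CAP] = e;
        ++count;
        return true;
    }

    __device__ bool pop_min(Event *out) {
        if (count == 0) return false;
        *out = buf[head];
        head = (head + 1) % CAP;
        --count;
        return true;
    }
};
```

Keeping one such queue per simulated entity or LP, with one thread driving it, also keeps the memory access pattern regular across threads, which is the kind of homogeneity the paper's profiling analysis highlights.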

8 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...The literature proposes two solutions: first, merging queues of multiple simulated entities increases the probability of having events that can safely be executed [35]....

    [...]

  • ...Autotuning approaches, which have previously been shown to be highly beneficial in the GPU context [49], [35], might help in selecting a suitable queue....

    [...]

  • ...In [35], the number of simulated entities assigned to each LP is adapted to balance idle threads and the cost of queue operations....

    [...]

  • ...[35] store each LP’s events in a separate array....

    [...]

  • ...The considered parameter combinations were chosen according to our previous works in GPU-based simulation [35], [48] to cover cases of low utilization where the GPU could be outperformed by a single CPU core, up to configurations approaching full GPU utilization....

    [...]

Proceedings ArticleDOI
29 May 2019
TL;DR: This paper proposes a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware with only limited modifications to an existing simulator code base, and without changes to model code.
Abstract: Spiking neural networks (SNN) are among the most computationally intensive types of simulation models, with node counts on the order of up to 10^11. Currently, there is intensive research into hardware platforms suitable to support large-scale SNN simulations, whereas several of the most widely used simulators still rely purely on the execution on CPUs. Enabling the execution of these established simulators on heterogeneous hardware allows new studies to exploit the many-core hardware prevalent in modern supercomputing environments, while still being able to reproduce and compare with results from a vast body of existing literature. In this paper, we propose a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware (e.g., CPUs, GPUs, and FPGAs) with only limited modifications to an existing simulator code base, and without changes to model code. Our approach relies on manual porting of a small number of core simulator functionalities as found in common SNN simulators, whereas unmodified model code is analyzed and transformed automatically. We apply our approach to the well-known simulator NEST and make a version executable on heterogeneous hardware available to the community. Our measurements show that at full utilization, a single GPU achieves the performance of about 9 CPU cores.

4 citations


Cites background from "Exploiting the parallelism of large..."

  • ..., [14, 19, 35]), including several types of network simulations [2, 3, 46]....

    [...]

References
Book ChapterDOI
07 Mar 2002
TL;DR: In this paper, the authors describe a peer-to-peer distributed hash table with provable consistency and performance in a fault-prone environment, which routes queries and locates nodes using a novel XOR-based metric topology.
Abstract: We describe a peer-to-peer distributed hash table with provable consistency and performance in a fault-prone environment. Our system routes queries and locates nodes using a novel XOR-based metric topology that simplifies the algorithm and facilitates our proof. The topology has the property that every message exchanged conveys or reinforces useful contact information. The system exploits this information to send parallel, asynchronous query messages that tolerate node failures without imposing timeout delays on users.
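
To make the XOR-based metric concrete, the hedged sketch below computes the Kademlia distance between two 160-bit node IDs byte by byte and returns the index of the most significant differing bit, which selects the k-bucket for a contact. The function name, the big-endian byte layout, and the bucket numbering are illustrative assumptions, not taken from the paper.

```cuda
// Hedged sketch: Kademlia's XOR metric over 160-bit IDs stored as 20 bytes,
// assumed big-endian (byte 0 is the most significant byte).
#include <cstdint>
#include <cstddef>

constexpr std::size_t ID_BYTES = 20;   // 160-bit identifiers

// Returns the index (0..159, counted from the least significant bit) of the
// most significant bit in which the two IDs differ, or -1 if they are equal.
int bucket_index(const std::uint8_t a[ID_BYTES], const std::uint8_t b[ID_BYTES]) {
    for (std::size_t i = 0; i < ID_BYTES; ++i) {
        std::uint8_t d = a[i] ^ b[i];              // XOR distance, byte by byte
        if (d != 0) {
            int bit = 7;
            while (((d >> bit) & 1) == 0) --bit;   // most significant set bit
            return static_cast<int>((ID_BYTES - 1 - i) * 8) + bit;
        }
    }
    return -1;
}
```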

3,196 citations

Proceedings ArticleDOI
Shucai Xiao, Wu-chun Feng
19 Apr 2010
TL;DR: This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. It evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract: While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Interblock communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the microbenchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
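
The atomic-counter variant of the proposed barrier can be sketched compactly. The CUDA code below is a simplified, hedged rendering of the idea rather than the authors' implementation: it is single-use, requires the counter to be zeroed before the launch, and is only correct if every block of the grid is resident on the GPU at the same time.

```cuda
// Hedged sketch of a software inter-block barrier built on an atomic counter
// (single-use; *count must be zeroed before launch; all blocks must be resident).
__device__ void grid_barrier(volatile unsigned int *count, unsigned int numBlocks) {
    __syncthreads();                          // all threads of this block arrive
    if (threadIdx.x == 0) {
        __threadfence();                      // publish this block's prior writes
        atomicAdd((unsigned int *)count, 1u); // check in
        while (*count < numBlocks) { }        // spin until every block checked in
    }
    __syncthreads();                          // release the rest of the block
}

__global__ void two_phase_kernel(float *data, unsigned int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;               // phase 1
    grid_barrier(count, gridDim.x);           // global synchronization point
    if (i < n) data[i] *= 2.0f;               // phase 2 sees all phase-1 results
}
```

On current CUDA versions, the supported way to obtain the same effect is a cooperative launch together with cooperative_groups::this_grid().sync(), which enforces the residency requirement that the hand-written barrier above merely assumes.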

299 citations

Journal ArticleDOI
TL;DR: It is shown that on large problems—those for which parallel processing is ideally suited—there is often enough parallel workload so that processors are not usually idle, and the method is within a constant factor of optimal.
Abstract: This paper analytically studies the performance of a synchronous conservative parallel discrete-event simulation protocol. The class of models considered simulates activity in a physical domain and possesses a limited ability to predict future behavior. Using a stochastic model, it is shown that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approaches the complexity of the average per-event overhead of a serial simulation, sometimes rapidly. The method is therefore within a constant factor of optimal. The result holds for the worst-case "fully-connected" communication topology, where an event in any portion of the domain can cause an event in any other portion of the domain. Our analysis demonstrates that on large problems—those for which parallel processing is ideally suited—there is often enough parallel workload so that processors are not usually idle. It also demonstrates the viability of the method empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed-memory multiprocessor.
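
The protocol analysed is synchronous and conservative: lookahead bounds a global window of simulated time within which all pending events are independent and can be executed in parallel. The CUDA kernel below is a minimal, hedged sketch of that execution rule; the kernel name, the per-LP nextEventTs array, and handle_event are illustrative assumptions, and tMin would be computed beforehand by a parallel min-reduction (e.g. with Thrust or CUB).

```cuda
// Hedged sketch of one round of a synchronous conservative protocol: any event
// with timestamp below tMin + lookahead is safe, since no pending event can
// create a new event earlier than its own timestamp plus the lookahead.
__global__ void execute_safe_events(const float *nextEventTs, int numLps,
                                    float tMin, float lookahead) {
    int lp = blockIdx.x * blockDim.x + threadIdx.x;
    if (lp >= numLps) return;
    if (nextEventTs[lp] < tMin + lookahead) {
        // handle_event(lp);   // hypothetical model-specific event handler
    }
}
```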

202 citations

01 Jan 2011
TL;DR: This paper describes the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs using a divide-and-conquer approach and demonstrates that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts.
Abstract: Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
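
The divide-and-conquer structure bottoms out in intra-warp scan routines. As a modernised, hedged illustration (the paper itself builds its warp-level scans on shared memory, predating the shuffle intrinsics used here), the CUDA device function below performs an inclusive Kogge-Stone scan within a warp.

```cuda
// Hedged sketch: inclusive intra-warp scan using shuffle intrinsics. Each step
// adds the running prefix of the lane `offset` positions lower in the warp.
__device__ int warp_inclusive_scan(int value) {
    unsigned int lane = threadIdx.x & 31u;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int up = __shfl_up_sync(0xffffffffu, value, offset);
        if (lane >= static_cast<unsigned int>(offset)) value += up;
    }
    return value;
}
```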

160 citations

Proceedings ArticleDOI
24 May 2006
TL;DR: Initial performance results on simulation of a diffusion process show that DES-style execution on GPGPU runs faster than DES on CPU and also significantly faster than time-stepped simulations on either CPU or GPGPU.
Abstract: Graphics cards, traditionally designed as accelerators for computer graphics, have evolved to support more general-purpose computation. General Purpose Graphical Processing Units (GPGPUs) are now being used as highly efficient, cost-effective platforms for executing certain simulation applications. While most of these applications belong to the category of timestepped simulations, little is known about the applicability of GPGPUs to discrete event simulation (DES). Here, we identify some of the issues & challenges that the GPGPU stream-based interface raises for DES, and present some possible approaches to moving DES to GPGPUs. Initial performance results on simulation of a diffusion process show that DES-style execution on GPGPU runs faster than DES on CPU and also significantly faster than time-stepped simulations on either CPU or GPGPU.

75 citations


"Exploiting the parallelism of large..." refers background in this paper

  • ...Park, A., R. M. Fujimoto, and K. S. Perumalla....

    [...]

  • ...Perumalla, K. S. 2006....

    [...]

  • ...In some cases, a large speedup compared with a sequential simulation was achieved (Park, Fujimoto, and Perumalla 2004), while in other cases there were modest or no performance gains (Dinh, Lees, Theodoropoulos, and Minson 2008, Quinson, Rosa, and Thiery 2012)....

    [...]

  • ...In 2006, Perumalla (Perumalla 2006) proposed alternatives for GPU-based discrete-event simulations improving on a time-stepped execution method....

    [...]