Proceedings ArticleDOI

Exploiting the parallelism of large-scale application-layer networks by adaptive GPU-based simulation

07 Dec 2014, pp. 3471-3482
TL;DR: A GPU-based simulator engine is presented that performs all steps of large-scale network simulations on a commodity many-core GPU and adapts its configuration at runtime to balance parallelism and overheads, achieving high performance for a given network model and scenario.
Abstract: We present a GPU-based simulator engine that performs all steps of large-scale network simulations on a commodity many-core GPU. Overhead is reduced by avoiding unnecessary data transfers between graphics memory and main memory. On the example of a widely deployed peer-to-peer network, we analyze the parallelism in large-scale application-layer networks, which suggests the use of thousands of concurrent processor cores for simulation. The proposed simulator employs the vast number of parallel cores in modern GPUs to exploit the identified parallelism and enables substantial simulation speedup. The simulator adapts its configuration at runtime in order to balance parallelism and overheads to achieve high performance for a given network model and scenario. A performance evaluation for simulations of networks comprising up to one million peers demonstrates a speedup of up to 19.5 compared with an efficient sequential implementation and shows the effectiveness of the runtime adaptation to different network conditions.
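
The runtime adaptation described in the abstract is, at its core, an optimization loop over the simulator configuration, in particular over how many simulated peers are grouped into each logical process (LP). The sketch below is a minimal, hedged illustration of such a loop in CUDA/C++ host code; adapt_lp_size, run_interval, and the doubling/halving policy are assumptions made for illustration and do not reproduce the authors' adaptation heuristic.

```cuda
// Hedged sketch (not the authors' implementation): host-side hill climbing over
// the number of peers aggregated into each logical process (LP). run_interval
// is an assumed callback that simulates one slice of simulated time on the GPU
// with the given LP size and returns the measured event rate (events/second).
#include <functional>

using RunInterval = std::function<double(int lpSize)>;

int adapt_lp_size(const RunInterval &run_interval, int lpSize, int numIntervals) {
    double bestRate = run_interval(lpSize);
    bool grow = true;                                   // try larger LPs first
    for (int i = 1; i < numIntervals; ++i) {
        int candidate = grow ? lpSize * 2 : (lpSize > 1 ? lpSize / 2 : 1);
        double rate = run_interval(candidate);
        if (rate > bestRate) { lpSize = candidate; bestRate = rate; }
        else                 { grow = !grow; }          // reverse search direction
    }
    return lpSize;
}
```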


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview and categorisation of the literature according to the applied techniques for agent-based simulations on hardware accelerators, and sketch directions for future research towards automating the hardware mapping and execution.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as Field Programmable Gate Arrays. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

52 citations

Proceedings ArticleDOI
16 May 2017
TL;DR: This work presents the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm, and shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks.
Abstract: The parallel execution of discrete-event simulations on commodity GPUs has been shown to achieve high event rates. Most previous proposals have focused on conservative synchronization, which typically extracts only limited parallelism in cases of low event density in simulated time. We present the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm. The optimistic simulator implementation is compared with an otherwise identical implementation using conservative synchronization. Our evaluation shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks. To reduce the cost of state keeping, we show how XORWOW, the default pseudo-random number generator in CUDA, can be reversed based solely on its current state. Since the optimal configuration of multiple performance-critical simulator parameters depends on the behavior of the simulation model, these parameters are adapted dynamically based on performance measurements and heuristic optimization at runtime. We evaluate the simulator using the PHOLD benchmark model and a simplified model of peer-to-peer networks using the Kademlia protocol. On a commodity GPU, the optimistic simulator achieves event rates of up to 814 million events per second and a speedup of up to 36 compared with conservative synchronization.
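
The XORWOW reversal mentioned above rests on the fact that xorshift-style updates of the form x ^= x << a (and the right-shift variant) are bijections on 32-bit words and can be undone from the current value alone, so rollbacks do not need per-event snapshots of the generator state. The sketch below shows only this building block and is a hedged illustration; fully reversing the cuRAND XORWOW state additionally requires undoing its word rotation and Weyl-sequence increment, which is not reproduced here.

```cuda
// Hedged sketch: undoing the elementary xorshift steps that generators such as
// XORWOW are composed of. For y = x ^ (x << a), the inverse is
// x = y ^ (y << a) ^ (y << 2a) ^ ... because the series telescopes; the
// right-shift case is symmetric. Valid for 32-bit words and 0 < a < 32.
__host__ __device__ unsigned int undo_xorshift_left(unsigned int y, int a) {
    unsigned int x = y;
    for (int s = a; s < 32; s += a) x ^= (y << s);
    return x;
}

__host__ __device__ unsigned int undo_xorshift_right(unsigned int y, int a) {
    unsigned int x = y;
    for (int s = a; s < 32; s += a) x ^= (y >> s);
    return x;
}
```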

13 citations


Cites background from "Exploiting the parallelism of large..."

  • ...proposed a fully GPU-based conservative simulator implementation that adapts the LP size at runtime to balance parallelism and event management overheads [1]....

    [...]

  • ...In fully GPU-based simulation [1, 13, 21, 22, 28, 32], the simulator core is executed on the GPU as well....

    [...]

Posted Content
TL;DR: This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as FPGAs. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

10 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...Representation of irregular data structures by arrays and grids APU [180], GPU [47, 69, 98, 114, 144–146, 152, 166, 177] [7, 14, 95, 109, 140, 141, 159, 168, 183, 196], FPGA [121, 149]...

    [...]

  • ...Other works assume a minimum time delta between an event and its creation (lookahead) to guarantee the correctness of the simulation results [7, 157, 196]....

    [...]

  • ...Instead, the set of events is considered jointly in an unsorted fashion [168], split by model segment [159] or simulated entity [7, 109, 196], split according to a fixed policy [141, 183], or split randomly [121]....

    [...]

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents a performance evaluation of GPU-based priority queue implementations for two applications, discrete-event simulation and parallel A* path searches on grids, covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature.
Abstract: Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. The core data structures in many of these irregular tasks are priority queues that guide the progress of the computations and which can easily become the bottleneck of an application. To our knowledge, currently no systematic comparison of priority queue implementations on GPUs exists in the literature. We close this gap by a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids. We focus on scenarios requiring large numbers of priority queues holding up to a few thousand items each. We present performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature. The measurement results show that up to about 500 items per queue, circular buffers frequently outperform tree-based queues for the considered applications, particularly under a simple parallelization of individual item enqueue operations. We analyze profiling metrics to explore classical queue designs in light of the importance of high hardware utilization as well as homogeneous computations and memory accesses across GPU threads.
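
The result that circular buffers frequently outperform tree-based queues up to roughly 500 items reflects how simple the per-operation work is, as the sketch below illustrates. It is a hedged illustration in CUDA, with the capacity, field names, and one-thread-per-queue assumption chosen for this example rather than taken from the paper: events are kept sorted by timestamp, so the minimum is removed by advancing the head, and an insert shifts at most count entries.

```cuda
// Hedged sketch: a fixed-capacity ring buffer holding events sorted by
// timestamp, operated on by a single thread (e.g. one queue per LP).
struct Event { float ts; int payload; };

struct RingQueue {
    static const int CAP = 512;   // illustrative capacity
    Event buf[CAP];
    int head;                     // index of the earliest event
    int count;                    // number of stored events

    __device__ void init() { head = 0; count = 0; }

    __device__ bool push(Event e) {
        if (count == CAP) return false;
        int i = count;            // walk back from the tail, shifting later events
        while (i > 0 && buf[(head + i - 1) % CAP].ts > e.ts) {
            buf[(head + i) % CAP] = buf[(head + i - 1) % CAP];
            --i;
        }
        buf[(head + i) % CAP] = e;
        ++count;
        return true;
    }

    __device__ bool pop_min(Event *out) {
        if (count == 0) return false;
        *out = buf[head];
        head = (head + 1) % CAP;
        --count;
        return true;
    }
};
```

Keeping one such queue per simulated entity or LP, with one thread driving it, also keeps the memory access pattern regular across threads, which is the kind of homogeneity the paper's profiling analysis highlights.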

8 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...The literature proposes two solutions: first, merging queues of multiple simulated entities increases the probability of having events that can safely be executed [35]....

    [...]

  • ...Autotuning approaches, which have previously been shown to be highly beneficial in the GPU context [49], [35], might help in selecting a suitable queue....

    [...]

  • ...In [35], the number of simulated entities assigned to each LP is adapted to balance idle threads and the cost of queue operations....

    [...]

  • ...[35] store each LP’s events in a separate array....

    [...]

  • ...The considered parameter combinations were chosen according to our previous works in GPU-based simulation [35], [48] to cover cases of low utilization where the GPU could be outperformed by a single CPU core, up to configurations approaching full GPU utilization....

    [...]

Proceedings ArticleDOI
29 May 2019
TL;DR: This paper proposes a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware with only limited modifications to an existing simulator code base, and without changes to model code.
Abstract: Spiking neural networks (SNN) are among the most computationally intensive types of simulation models, with node counts on the order of up to 10^11. Currently, there is intensive research into hardware platforms suitable to support large-scale SNN simulations, whereas several of the most widely used simulators still rely purely on the execution on CPUs. Enabling the execution of these established simulators on heterogeneous hardware allows new studies to exploit the many-core hardware prevalent in modern supercomputing environments, while still being able to reproduce and compare with results from a vast body of existing literature. In this paper, we propose a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware (e.g., CPUs, GPUs, and FPGAs) with only limited modifications to an existing simulator code base, and without changes to model code. Our approach relies on manual porting of a small number of core simulator functionalities as found in common SNN simulators, whereas unmodified model code is analyzed and transformed automatically. We apply our approach to the well-known simulator NEST and make a version executable on heterogeneous hardware available to the community. Our measurements show that at full utilization, a single GPU achieves the performance of about 9 CPU cores.

4 citations


Cites background from "Exploiting the parallelism of large..."

  • ..., [14, 19, 35]), including several types of network simulations [2, 3, 46]....

    [...]

References
Book ChapterDOI
07 Mar 2002
TL;DR: In this paper, the authors describe a peer-to-peer distributed hash table with provable consistency and performance in a fault-prone environment, which routes queries and locates nodes using a novel XOR-based metric topology.
Abstract: We describe a peer-to-peer distributed hash table with provable consistency and performance in a fault-prone environment. Our system routes queries and locates nodes using a novel XOR-based metric topology that simplifies the algorithm and facilitates our proof. The topology has the property that every message exchanged conveys or reinforces useful contact information. The system exploits this information to send parallel, asynchronous query messages that tolerate node failures without imposing timeout delays on users.
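
To make the XOR-based metric concrete, the hedged sketch below computes the Kademlia distance between two 160-bit node IDs byte by byte and returns the index of the most significant differing bit, which selects the k-bucket for a contact. The function name, the big-endian byte layout, and the bucket numbering are illustrative assumptions, not taken from the paper.

```cuda
// Hedged sketch: Kademlia's XOR metric over 160-bit IDs stored as 20 bytes,
// assumed big-endian (byte 0 is the most significant byte).
#include <cstdint>
#include <cstddef>

constexpr std::size_t ID_BYTES = 20;   // 160-bit identifiers

// Returns the index (0..159, counted from the least significant bit) of the
// most significant bit in which the two IDs differ, or -1 if they are equal.
int bucket_index(const std::uint8_t a[ID_BYTES], const std::uint8_t b[ID_BYTES]) {
    for (std::size_t i = 0; i < ID_BYTES; ++i) {
        std::uint8_t d = a[i] ^ b[i];              // XOR distance, byte by byte
        if (d != 0) {
            int bit = 7;
            while (((d >> bit) & 1) == 0) --bit;   // most significant set bit
            return static_cast<int>((ID_BYTES - 1 - i) * 8) + bit;
        }
    }
    return -1;
}
```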

3,196 citations

Proceedings ArticleDOI
Shucai Xiao, Wu-chun Feng
19 Apr 2010
TL;DR: This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. It evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform, dynamic programming, and bitonic sort.
Abstract: While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Interblock communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms — Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the microbenchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
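
The atomic-counter variant of the proposed barrier can be sketched compactly. The CUDA code below is a simplified, hedged rendering of the idea rather than the authors' implementation: it is single-use, requires the counter to be zeroed before the launch, and is only correct if every block of the grid is resident on the GPU at the same time.

```cuda
// Hedged sketch of a software inter-block barrier built on an atomic counter
// (single-use; *count must be zeroed before launch; all blocks must be resident).
__device__ void grid_barrier(volatile unsigned int *count, unsigned int numBlocks) {
    __syncthreads();                          // all threads of this block arrive
    if (threadIdx.x == 0) {
        __threadfence();                      // publish this block's prior writes
        atomicAdd((unsigned int *)count, 1u); // check in
        while (*count < numBlocks) { }        // spin until every block checked in
    }
    __syncthreads();                          // release the rest of the block
}

__global__ void two_phase_kernel(float *data, unsigned int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;               // phase 1
    grid_barrier(count, gridDim.x);           // global synchronization point
    if (i < n) data[i] *= 2.0f;               // phase 2 sees all phase-1 results
}
```

On current CUDA versions, the supported way to obtain the same effect is a cooperative launch together with cooperative_groups::this_grid().sync(), which enforces the residency requirement that the hand-written barrier above merely assumes.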

299 citations

Journal ArticleDOI
TL;DR: It is shown that on large problems—those for which parallel processing is ideally suited—there is often enough parallel workload so that processors are not usually idle, and the method is within a constant factor of optimal.
Abstract: This paper analytically studies the performance of a synchronous conservative parallel discrete-event simulation protocol. The class of models considered simulates activity in a physical domain and possesses a limited ability to predict future behavior. Using a stochastic model, it is shown that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approaches the complexity of the average per-event overhead of a serial simulation, sometimes rapidly. The method is therefore within a constant factor of optimal. The result holds for the worst-case "fully-connected" communication topology, where an event in any portion of the domain can cause an event in any other portion of the domain. Our analysis demonstrates that on large problems—those for which parallel processing is ideally suited—there is often enough parallel workload so that processors are not usually idle. It also demonstrates the viability of the method empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed-memory multiprocessor.
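
The protocol analysed is synchronous and conservative: lookahead bounds a global window of simulated time within which all pending events are independent and can be executed in parallel. The CUDA kernel below is a minimal, hedged sketch of that execution rule; the kernel name, the per-LP nextEventTs array, and handle_event are illustrative assumptions, and tMin would be computed beforehand by a parallel min-reduction (e.g. with Thrust or CUB).

```cuda
// Hedged sketch of one round of a synchronous conservative protocol: any event
// with timestamp below tMin + lookahead is safe, since no pending event can
// create a new event earlier than its own timestamp plus the lookahead.
__global__ void execute_safe_events(const float *nextEventTs, int numLps,
                                    float tMin, float lookahead) {
    int lp = blockIdx.x * blockDim.x + threadIdx.x;
    if (lp >= numLps) return;
    if (nextEventTs[lp] < tMin + lookahead) {
        // handle_event(lp);   // hypothetical model-specific event handler
    }
}
```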

202 citations

01 Jan 2011
TL;DR: This paper describes the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs using a divide-and-conquer approach and demonstrates that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts.
Abstract: Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
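
The divide-and-conquer structure bottoms out in intra-warp scan routines. As a modernised, hedged illustration (the paper itself builds its warp-level scans on shared memory, predating the shuffle intrinsics used here), the CUDA device function below performs an inclusive Kogge-Stone scan within a warp.

```cuda
// Hedged sketch: inclusive intra-warp scan using shuffle intrinsics. Each step
// adds the running prefix of the lane `offset` positions lower in the warp.
__device__ int warp_inclusive_scan(int value) {
    unsigned int lane = threadIdx.x & 31u;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int up = __shfl_up_sync(0xffffffffu, value, offset);
        if (lane >= static_cast<unsigned int>(offset)) value += up;
    }
    return value;
}
```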

160 citations

Proceedings ArticleDOI
24 May 2006
TL;DR: Initial performance results on simulation of a diffusion process show that DES-style execution on GPGPU runs faster than DES on CPU and also significantly faster than time-stepped simulations on either CPU or GPGPU.
Abstract: Graphics cards, traditionally designed as accelerators for computer graphics, have evolved to support more general-purpose computation. General Purpose Graphical Processing Units (GPGPUs) are now being used as highly efficient, cost-effective platforms for executing certain simulation applications. While most of these applications belong to the category of timestepped simulations, little is known about the applicability of GPGPUs to discrete event simulation (DES). Here, we identify some of the issues & challenges that the GPGPU stream-based interface raises for DES, and present some possible approaches to moving DES to GPGPUs. Initial performance results on simulation of a diffusion process show that DES-style execution on GPGPU runs faster than DES on CPU and also significantly faster than time-stepped simulations on either CPU or GPGPU.

75 citations


"Exploiting the parallelism of large..." refers background in this paper

  • ...Park, A., R. M. Fujimoto, and K. S. Perumalla....

    [...]

  • ...Perumalla, K. S. 2006....

    [...]

  • ...In some cases, a large speedup compared with a sequential simulation was achieved (Park, Fujimoto, and Perumalla 2004), while in other cases there were modest or no performance gains (Dinh, Lees, Theodoropoulos, and Minson 2008, Quinson, Rosa, and Thiery 2012)....

    [...]

  • ...In 2006, Perumalla (Perumalla 2006) proposed alternatives for GPU-based discrete-event simulations improving on a time-stepped execution method....

    [...]