Proceedings ArticleDOI

Exploiting the parallelism of large-scale application-layer networks by adaptive GPU-based simulation

07 Dec 2014, pp. 3471-3482
TL;DR: This paper presents a GPU-based simulator engine that performs all steps of large-scale network simulations on a commodity many-core GPU and adapts its configuration at runtime to balance parallelism and overheads, achieving high performance for a given network model and scenario.
Abstract: We present a GPU-based simulator engine that performs all steps of large-scale network simulations on a commodity many-core GPU. Overhead is reduced by avoiding unnecessary data transfers between graphics memory and main memory. On the example of a widely deployed peer-to-peer network, we analyze the parallelism in large-scale application-layer networks, which suggests the use of thousands of concurrent processor cores for simulation. The proposed simulator employs the vast number of parallel cores in modern GPUs to exploit the identified parallelism and enables substantial simulation speedup. The simulator adapts its configuration at runtime in order to balance parallelism and overheads to achieve high performance for a given network model and scenario. A performance evaluation for simulations of networks comprising up to one million peers demonstrates a speedup of up to 19.5 compared with an efficient sequential implementation and shows the effectiveness of the runtime adaptation to different network conditions.
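
To make the runtime adaptation concrete, the following CPU-side sketch illustrates one way such a feedback loop could look: the simulator periodically measures the achieved event rate and increases the number of simulated entities aggregated into each logical process (LP) as long as the rate improves. All names (runSliceAndMeasure, entitiesPerLP) and the doubling heuristic are illustrative assumptions, not the authors' implementation.

```cpp
// Minimal sketch of runtime adaptation of the aggregation level
// (simulated entities per logical process). Hypothetical names; the
// hill-climbing heuristic below is an illustration, not the paper's policy.
#include <cstdio>

// Stand-in for running the simulator for a fixed wall-clock slice with the
// given aggregation level and measuring the achieved event rate.
double runSliceAndMeasure(int entitiesPerLP) {
    // ... in the real simulator: launch GPU kernels, advance the model,
    // and count committed events ...
    double x = entitiesPerLP / 16.0;               // synthetic stand-in with
    return 1.0e6 * entitiesPerLP / (1.0 + x * x);  // a peak at 16 entities/LP
}

int main() {
    int entitiesPerLP = 1;        // start fully parallel: one entity per LP
    double bestRate = 0.0;
    int bestConfig = entitiesPerLP;

    // Keep doubling the number of entities per LP while the measured event
    // rate improves, then settle on the best configuration seen so far.
    for (int step = 0; step < 10; ++step) {
        double rate = runSliceAndMeasure(entitiesPerLP);
        printf("entities/LP = %4d  ->  %.0f events/s\n", entitiesPerLP, rate);
        if (rate > bestRate) {
            bestRate = rate;
            bestConfig = entitiesPerLP;
            entitiesPerLP *= 2;   // more aggregation, fewer but larger LPs
        } else {
            break;                // rate dropped: previous config was better
        }
    }
    printf("selected configuration: %d entities per LP\n", bestConfig);
    return 0;
}
```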


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview and categorisation of the literature according to the applied techniques for agent-based simulations on hardware accelerators, and sketch directions for future research towards automating the hardware mapping and execution.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as Field Programmable Gate Arrays. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

52 citations

Proceedings ArticleDOI
16 May 2017
TL;DR: This work presents the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm, and shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks.
Abstract: The parallel execution of discrete-event simulations on commodity GPUs has been shown to achieve high event rates. Most previous proposals have focused on conservative synchronization, which typically extracts only limited parallelism in cases of low event density in simulated time. We present the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm. The optimistic simulator implementation is compared with an otherwise identical implementation using conservative synchronization. Our evaluation shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks. To reduce the cost of state keeping, we show how XORWOW, the default pseudo-random number generator in CUDA, can be reversed based solely on its current state. Since the optimal configuration of multiple performance-critical simulator parameters depends on the behavior of the simulation model, these parameters are adapted dynamically based on performance measurements and heuristic optimization at runtime. We evaluate the simulator using the PHOLD benchmark model and a simplified model of peer-to-peer networks using the Kademlia protocol. On a commodity GPU, the optimistic simulator achieves event rates of up to 814 million events per second and a speedup of up to 36 compared with conservative synchronization.
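
The reversal of XORWOW mentioned in the abstract can be illustrated on the published xorwow recurrence (Marsaglia, 2003): every step is an invertible linear transformation over GF(2) plus a Weyl increment, so the previous state, and thus previous random draws, can be recomputed from the current state alone. The sketch below is a minimal CPU-side illustration of that idea; it does not claim to match the internal state layout of CUDA's curand implementation or the paper's exact code.

```cpp
// Forward and reverse steps of the xorwow generator (Marsaglia 2003),
// illustrating why keeping only the RNG state is enough to roll back draws.
// Assumed to correspond to curand's XORWOW only up to renaming of the state.
#include <cstdint>
#include <cstdio>

struct XorwowState { uint32_t x, y, z, w, v, d; };

uint32_t xorwow_next(XorwowState &s) {
    uint32_t t = s.x ^ (s.x >> 2);
    s.x = s.y; s.y = s.z; s.z = s.w; s.w = s.v;
    s.v = (s.v ^ (s.v << 4)) ^ (t ^ (t << 1));
    s.d += 362437;                 // Weyl sequence
    return s.d + s.v;
}

// u = t ^ (t << 1) and u = t ^ (t >> 2) are invertible GF(2)-linear maps.
static uint32_t invert_xor_lshift1(uint32_t u) {
    u ^= u << 1; u ^= u << 2; u ^= u << 4; u ^= u << 8; u ^= u << 16;
    return u;
}
static uint32_t invert_xor_rshift2(uint32_t u) {
    u ^= u >> 2; u ^= u >> 4; u ^= u >> 8; u ^= u >> 16;
    return u;
}

void xorwow_prev(XorwowState &s) {
    s.d -= 362437;
    uint32_t v_old = s.w;                         // w took v's previous value
    uint32_t t = invert_xor_lshift1(s.v ^ v_old ^ (v_old << 4));
    uint32_t x_old = invert_xor_rshift2(t);
    s.v = v_old; s.w = s.z; s.z = s.y; s.y = s.x; s.x = x_old;
}

int main() {
    XorwowState s{123456789u, 362436069u, 521288629u, 88675123u, 5783321u, 6615241u};
    XorwowState before = s;
    for (int i = 0; i < 5; ++i) xorwow_next(s);
    for (int i = 0; i < 5; ++i) xorwow_prev(s);   // roll the draws back
    printf("state restored: %s\n",
           (s.x == before.x && s.v == before.v && s.d == before.d) ? "yes" : "no");
    return 0;
}
```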

13 citations


Cites background from "Exploiting the parallelism of large..."

  • ...proposed a fully GPU-based conservative simulator implementation that adapts the LP size at runtime to balance parallelism and event management overheads [1]....


  • ...In fully GPU-based simulation [1, 13, 21, 22, 28, 32], the simulator core is executed on the GPU as well....


Posted Content
TL;DR: This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as FPGAs. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

10 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...Representation of irregular data structures by arrays and grids APU [180], GPU [47, 69, 98, 114, 144–146, 152, 166, 177] [7, 14, 95, 109, 140, 141, 159, 168, 183, 196], FPGA [121, 149]...


  • ...Other works assume a minimum time delta between an event and its creation (lookahead) to guarantee the correctness of the simulation results [7, 157, 196]....


  • ...Instead, the set of events is considered jointly in an unsorted fashion [168], split by model segment [159] or simulated entity [7, 109, 196], split according to a fixed policy [141, 183], or split randomly [121]....


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work performs a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids and presents performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature.
Abstract: Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. The core data structures in many of these irregular tasks are priority queues that guide the progress of the computations and which can easily become the bottleneck of an application. To our knowledge, currently no systematic comparison of priority queue implementations on GPUs exists in the literature. We close this gap with a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids. We focus on scenarios requiring large numbers of priority queues holding up to a few thousand items each. We present performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature. The measurement results show that up to about 500 items per queue, circular buffers frequently outperform tree-based queues for the considered applications, particularly under a simple parallelization of individual item enqueue operations. We analyze profiling metrics to explore classical queue designs in light of the importance of high hardware utilization as well as homogeneous computations and memory accesses across GPU threads.
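
As an illustration of the array-backed designs discussed above, the following CPU-side sketch shows a small sorted ring buffer used as a per-LP event queue: enqueue shifts larger timestamps back by one slot, dequeue-min pops at the head. The layout, capacity, and names are illustrative assumptions, not the implementations measured in the cited paper.

```cpp
// CPU-side sketch of a small sorted ring buffer ("circular buffer") used as
// a per-LP event queue. Linear-time insertion is cheap for the small queue
// sizes (a few hundred events) discussed in the abstract.
#include <cstdio>

struct Event { double time; int payload; };

template <int CAP>
struct RingEventQueue {
    Event slots[CAP];
    int head = 0, count = 0;

    bool enqueue(Event e) {
        if (count == CAP) return false;              // queue full
        int i = count;
        // Shift events with larger timestamps one slot towards the tail.
        while (i > 0 && slots[(head + i - 1) % CAP].time > e.time) {
            slots[(head + i) % CAP] = slots[(head + i - 1) % CAP];
            --i;
        }
        slots[(head + i) % CAP] = e;
        ++count;
        return true;
    }

    Event dequeueMin() {                             // caller checks count > 0
        Event e = slots[head];
        head = (head + 1) % CAP;
        --count;
        return e;
    }
};

int main() {
    RingEventQueue<512> q;
    q.enqueue({3.0, 1}); q.enqueue({1.0, 2}); q.enqueue({2.0, 3});
    while (q.count > 0) {
        Event e = q.dequeueMin();
        printf("t=%.1f payload=%d\n", e.time, e.payload);
    }
    return 0;
}
```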

8 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...The literature proposes two solutions: first, merging queues of multiple simulated entities increases the probability of having events that can safely be executed [35]....


  • ...Autotuning approaches, which have previously been shown to be highly beneficial in the GPU context [49], [35], might help in selecting a suitable queue....


  • ...In [35], the number of simulated entities assigned to each LP is adapted to balance idle threads and the cost of queue operations....


  • ...[35] store each LP’s events in a separate array....


  • ...The considered parameter combinations were chosen according to our previous works in GPU-based simulation [35], [48] to cover cases of low utilization where the GPU could be outperformed by a single CPU core, up to configurations approaching full GPU utilization....


Proceedings ArticleDOI
29 May 2019
TL;DR: This paper proposes a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware with only limited modifications to an existing simulator code base, and without changes to model code.
Abstract: Spiking neural networks (SNN) are among the most computationally intensive types of simulation models, with node counts on the order of up to 10^11. Currently, there is intensive research into hardware platforms suitable to support large-scale SNN simulations, whereas several of the most widely used simulators still rely purely on the execution on CPUs. Enabling the execution of these established simulators on heterogeneous hardware allows new studies to exploit the many-core hardware prevalent in modern supercomputing environments, while still being able to reproduce and compare with results from a vast body of existing literature. In this paper, we propose a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware (e.g., CPUs, GPUs, and FPGAs) with only limited modifications to an existing simulator code base, and without changes to model code. Our approach relies on manual porting of a small number of core simulator functionalities as found in common SNN simulators, whereas unmodified model code is analyzed and transformed automatically. We apply our approach to the well-known simulator NEST and make a version executable on heterogeneous hardware available to the community. Our measurements show that at full utilization, a single GPU achieves the performance of about 9 CPU cores.

4 citations


Cites background from "Exploiting the parallelism of large..."

  • ..., [14, 19, 35]), including several types of network simulations [2, 3, 46]....


References
Journal ArticleDOI
01 Oct 2010
TL;DR: This work has found that irregular time advances of the sort common in discrete event models can be successfully mapped to a GPU, thus making it possible to execute discrete event systems on an inexpensive personal computer platform at speedups close to 10x.
Abstract: The graphics processing unit (GPU) has evolved into a flexible and powerful processor of relatively low cost, compared to processors used for other available parallel computing systems. The majority of studies using the GPU within the graphics and simulation communities have focused on the use of the GPU for models that are traditionally simulated using regular time increments, whether these increments are accomplished through the addition of a time delta (i.e., numerical integration) or event scheduling using the delta (i.e., discrete event approximations of continuous-time systems). These types of models have the property of being decomposable over a variable or parameter space. In prior studies, discrete event simulation has been characterized as being an inefficient application for the GPU primarily due to the inherent synchronicity of the GPU organization and an apparent mismatch between the classic event scheduling cycle and the GPU’s basic functionality. However, we have found that irregular time advances of the sort common in discrete event models can be successfully mapped to a GPU, thus making it possible to execute discrete event systems on an inexpensive personal computer platform at speedups close to 10x. This speedup is achieved through the development of a special purpose code library we developed that uses an approximate time-based event scheduling approach. We present the design and implementation of this library, which is based on the compute unified device architecture (CUDA) general purpose parallel applications programming interface for the NVIDIA class of GPUs.
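
The approximate time-based scheduling idea can be sketched as follows: events whose timestamps fall into the same window of width delta are treated as simultaneous and executed as one batch, trading a bounded timestamp error for massive parallelism. The code below is a plain CPU stand-in for that batching step under these assumptions; the cited CUDA library's API and policies are not reproduced.

```cpp
// Sketch of approximate, time-bucketed event scheduling: events whose
// timestamps fall into the same window of width `delta` are treated as
// concurrent and can be executed as one parallel batch (a plain loop stands
// in for the GPU kernel here). Illustration only.
#include <cstdio>
#include <vector>

struct Event { double time; int target; };

int main() {
    std::vector<Event> pending = {{0.4, 0}, {0.9, 1}, {1.1, 2}, {2.7, 3}};
    const double delta = 1.0;        // bucket width = allowed timestamp error
    double now = 0.0;

    while (!pending.empty()) {
        std::vector<Event> batch, rest;
        for (const Event &e : pending) {
            // Events within [now, now + delta) form one "concurrent" batch.
            (e.time < now + delta ? batch : rest).push_back(e);
        }
        // On a GPU, each event of the batch would be handled by one thread.
        for (const Event &e : batch)
            printf("bucket [%.1f,%.1f): execute event at t=%.2f on entity %d\n",
                   now, now + delta, e.time, e.target);
        pending = rest;
        now += delta;
    }
    return 0;
}
```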

64 citations


"Exploiting the parallelism of large..." refers background in this paper

  • ...In 2010, Park et al. (Park and Fishwick 2010) proposed a framework for purely GPU-based discrete-event simulations, achieving a speedup close to 10....


Proceedings ArticleDOI
16 May 2004
TL;DR: This analysis and initial performance measurements suggest that for scenarios simulating scaled network models with constant number of input and output channels per logical process, an optimized null message algorithm offers better scalability than efficient global reduction based synchronous protocols.
Abstract: Parallel discrete event simulation techniques have enabled the realization of large-scale models of communication networks containing millions of end hosts and routers. However, the performance of these parallel simulators could be severely degraded if proper synchronization algorithms are not utilized. In this paper, we compare the performance and scalability of synchronous and asynchronous algorithms for conservative parallel network simulation. We develop an analytical model to evaluate the efficiency and scalability of certain variations of the well-known null message algorithm, and present experimental data to verify the accuracy of this model. This analysis and initial performance measurements on parallel machines containing hundreds of processors suggest that for scenarios simulating scaled network models with constant number of input and output channels per logical process, an optimized null message algorithm offers better scalability than efficient global reduction based synchronous protocols.
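
The null message algorithm referenced above can be summarized in a few lines: each logical process advertises a lower bound on the timestamps of its future messages (its local clock plus lookahead) on every output channel, and may safely execute all events below the minimum bound received over its input channels. The following two-LP exchange is a minimal illustration of that rule, not the optimized variant analyzed in the paper.

```cpp
// Minimal sketch of the null message (Chandy-Misra-Bryant) idea.
// Illustration only; real implementations interleave this with event exchange.
#include <algorithm>
#include <cstdio>
#include <vector>

struct LP {
    double clock = 0.0;        // local simulation time
    double lookahead = 0.5;    // minimum delay added to any outgoing message
    std::vector<double> channelBounds;  // latest null-message bound per input channel

    double outputBound() const { return clock + lookahead; }

    double safeTime() const {  // events strictly below this time are safe
        double b = 1e300;
        for (double cb : channelBounds) b = std::min(b, cb);
        return b;
    }
};

int main() {
    // Two LPs connected in both directions, one input channel each.
    LP a, b;
    a.clock = 1.0; b.clock = 2.0;
    a.channelBounds = {0.0};
    b.channelBounds = {0.0};

    for (int round = 0; round < 3; ++round) {
        // Exchange null messages carrying the output bounds.
        a.channelBounds[0] = b.outputBound();
        b.channelBounds[0] = a.outputBound();
        printf("round %d: LP A may advance to %.1f, LP B may advance to %.1f\n",
               round, a.safeTime(), b.safeTime());
        // Each LP processes its safe events and advances its clock (stub).
        a.clock = std::max(a.clock, a.safeTime());
        b.clock = std::max(b.clock, b.safeTime());
    }
    return 0;
}
```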

40 citations


"Exploiting the parallelism of large..." refers result in this paper

  • ...In some cases, a large speedup compared with a sequential simulation was achieved (Park, Fujimoto, and Perumalla 2004), while in other cases there were modest or no performance gains (Dinh, Lees, Theodoropoulos, and Minson 2008, Quinson, Rosa, and Thiery 2012)....


Proceedings ArticleDOI
25 Jul 2011
TL;DR: This paper presents and discusses four different architectures that can be used to exploit GPU-based signal processing in discrete event-based simulations, and shows that the runtime costs can not be cut down completely, but significant speedups can be expected compared to a non-GPU-based solution.
Abstract: In recent years, a trend towards the usage of physical layer models with increased accuracy can be observed within the wireless network community. This trend has several reasons. The consideration of signals - instead of packets - as the smallest unit of a wireless network simulation enables the ability to reflect complex radio propagation characteristics properly, and to study novel PHY/MAC/NET cross-layer optimizations that were not directly possible before, e.g. cognitive radio networks and interference cancelation. Yet, there is a price to pay for the increase of accuracy, namely a significant decrease of runtime performance due to computationally expensive signal processing. In this paper we study whether this price can be reduced - or even eliminated - if GPU-based signal processing is employed. In particular, we present and discuss four different architectures that can be used to exploit GPU-based signal processing in discrete event-based simulations. Our evaluation shows that the runtime costs cannot be cut down completely, but significant speedups can be expected compared to a non-GPU-based solution.

27 citations

Proceedings ArticleDOI
15 Jul 2012
TL;DR: This paper presents a parallel discrete event simulation scheme that enables cost- and time-efficient execution of large-scale parameter studies on GPUs, including an event aggregation strategy based on external parallelism that generates workloads suitable for GPUs.
Abstract: Developing complex technical systems requires a systematic exploration of the given design space in order to identify optimal system configurations. However, studying the effects and interactions of even a small number of system parameters often requires an extensive number of simulation runs. This in turn results in excessive runtime demands which severely hamper thorough design space explorations. In this paper, we present a parallel discrete event simulation scheme that enables cost- and time-efficient execution of large scale parameter studies on GPUs. In order to efficiently accommodate the stream-processing paradigm of GPUs, our parallelization scheme exploits two orthogonal levels of parallelism: External parallelism among the inherently independent simulations of a parameter study and internal parallelism among independent events within each individual simulation of a parameter study. Specifically, we design an event aggregation strategy based on external parallelism that generates workloads suitable for GPUs. In addition, we define a pipelined event execution mechanism based on internal parallelism to hide the transfer latencies between host- and GPU-memory. We analyze the performance characteristics of our parallelization scheme by means of a prototype implementation and show a 25-fold performance improvement over purely CPU-based execution.
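
The external-parallelism idea can be sketched by gathering, in each round, the next event of every independent simulation instance into one batch that would be shipped to the GPU as a single workload (a plain loop stands in for the kernel launch below). The round-based batching policy and all names are illustrative assumptions; the paper's pipelined transfer mechanism is not shown.

```cpp
// Sketch of "external parallelism": a parameter study runs many independent
// simulation instances, and in each step the next event of every instance is
// aggregated into one batch that a GPU could process with one thread per
// instance. Illustration only.
#include <cstdio>
#include <queue>
#include <vector>

struct Event { double time; int data; };
struct Cmp { bool operator()(const Event &a, const Event &b) const { return a.time > b.time; } };

int main() {
    // Three independent simulation instances, e.g. three parameter settings.
    std::vector<std::priority_queue<Event, std::vector<Event>, Cmp>> instances(3);
    instances[0].push({0.2, 10}); instances[0].push({0.8, 11});
    instances[1].push({0.5, 20});
    instances[2].push({0.1, 30}); instances[2].push({0.3, 31});

    int round = 0;
    bool anyLeft = true;
    while (anyLeft) {
        anyLeft = false;
        std::vector<Event> batch(instances.size(), Event{-1.0, -1});
        for (size_t i = 0; i < instances.size(); ++i) {
            if (!instances[i].empty()) {
                batch[i] = instances[i].top();   // next event of instance i
                instances[i].pop();
                anyLeft = true;
            }
        }
        if (!anyLeft) break;
        // The aggregated batch would be shipped to the GPU as one workload.
        for (size_t i = 0; i < batch.size(); ++i)
            if (batch[i].data >= 0)
                printf("round %d, instance %zu: event t=%.1f data=%d\n",
                       round, i, batch[i].time, batch[i].data);
        ++round;
    }
    return 0;
}
```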

25 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: This paper revisits the classical PDES methods in the light of distributed system simulation and proposes a new parallelization design specifically suited to this context, and an OS-inspired architecture is proposed.
Abstract: Discrete Event Simulation (DES) is one of the major experimental methodologies in several scientific and engineering domains. Parallel Discrete Event Simulation (PDES) constitutes a very active research field for at least three decades, to surpass speed and size limitations. In the context of Peer-to-Peer (P2P) protocols, most studies rely on simulation. Surprisingly enough, none of the mainstream P2P discrete event simulators allows parallel simulation although the tool scalability is considered as the major quality metric by several authors. This paper revisits the classical PDES methods in the light of distributed system simulation and proposes a new parallelization design specifically suited to this context. The constraints posed on the simulator internals are presented, and an OS-inspired architecture is proposed. In addition, a new thread synchronization mechanism is introduced for efficiency despite the very fine grain parallelism inherent to the target scenarios. This new architecture was implemented into the general-purpose open-source simulation framework SimGrid. We show that the new design does not hinder the tool scalability. In fact, the sequential version of SimGrid remains orders of magnitude more scalable than state of the art simulators, while the parallel execution saves up to 33% of the execution time on Chord simulations.

21 citations


"Exploiting the parallelism of large..." refers result in this paper

  • ...In some cases, a large speedup compared with a sequential simulation was achieved (Park, Fujimoto, and Perumalla 2004), while in other cases there were modest or no performance gains (Dinh, Lees, Theodoropoulos, and Minson 2008, Quinson, Rosa, and Thiery 2012)....
