Proceedings ArticleDOI

Exploiting the parallelism of large-scale application-layer networks by adaptive GPU-based simulation

07 Dec 2014, pp. 3471-3482
TL;DR: This paper presents a GPU-based simulator engine that performs all steps of large-scale network simulations on a commodity many-core GPU and adapts its configuration at runtime to balance parallelism and overheads, achieving high performance for a given network model and scenario.
Abstract: We present a GPU-based simulator engine that performs all steps of large-scale network simulations on a commodity many-core GPU. Overhead is reduced by avoiding unnecessary data transfers between graphics memory and main memory. On the example of a widely deployed peer-to-peer network, we analyze the parallelism in large-scale application-layer networks, which suggests the use of thousands of concurrent processor cores for simulation. The proposed simulator employs the vast number of parallel cores in modern GPUs to exploit the identified parallelism and enables substantial simulation speedup. The simulator adapts its configuration at runtime in order to balance parallelism and overheads to achieve high performance for a given network model and scenario. A performance evaluation for simulations of networks comprising up to one million peers demonstrates a speedup of up to 19.5 compared with an efficient sequential implementation and shows the effectiveness of the runtime adaptation to different network conditions.
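
To make the runtime adaptation concrete, the following CPU-side sketch illustrates one way such a feedback loop could look: the simulator periodically measures the achieved event rate and increases the number of simulated entities aggregated into each logical process (LP) as long as the rate improves. All names (runSliceAndMeasure, entitiesPerLP) and the doubling heuristic are illustrative assumptions, not the authors' implementation.

```cpp
// Minimal sketch of runtime adaptation of the aggregation level
// (simulated entities per logical process). Hypothetical names; the
// hill-climbing heuristic below is an illustration, not the paper's policy.
#include <cstdio>

// Stand-in for running the simulator for a fixed wall-clock slice with the
// given aggregation level and measuring the achieved event rate.
double runSliceAndMeasure(int entitiesPerLP) {
    // ... in the real simulator: launch GPU kernels, advance the model,
    // and count committed events ...
    double x = entitiesPerLP / 16.0;               // synthetic stand-in with
    return 1.0e6 * entitiesPerLP / (1.0 + x * x);  // a peak at 16 entities/LP
}

int main() {
    int entitiesPerLP = 1;        // start fully parallel: one entity per LP
    double bestRate = 0.0;
    int bestConfig = entitiesPerLP;

    // Keep doubling the number of entities per LP while the measured event
    // rate improves, then settle on the best configuration seen so far.
    for (int step = 0; step < 10; ++step) {
        double rate = runSliceAndMeasure(entitiesPerLP);
        printf("entities/LP = %4d  ->  %.0f events/s\n", entitiesPerLP, rate);
        if (rate > bestRate) {
            bestRate = rate;
            bestConfig = entitiesPerLP;
            entitiesPerLP *= 2;   // more aggregation, fewer but larger LPs
        } else {
            break;                // rate dropped: previous config was better
        }
    }
    printf("selected configuration: %d entities per LP\n", bestConfig);
    return 0;
}
```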


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview and categorisation of the literature according to the applied techniques for agent-based simulations on hardware accelerators, and sketch directions for future research towards automating the hardware mapping and execution.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as Field Programmable Gate Arrays. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

52 citations

Proceedings ArticleDOI
16 May 2017
TL;DR: This work presents the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm, and shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks.
Abstract: The parallel execution of discrete-event simulations on commodity GPUs has been shown to achieve high event rates. Most previous proposals have focused on conservative synchronization, which typically extracts only limited parallelism in cases of low event density in simulated time. We present the design and implementation of an optimistic fully GPU-based parallel discrete-event simulator based on the Time Warp synchronization algorithm. The optimistic simulator implementation is compared with an otherwise identical implementation using conservative synchronization. Our evaluation shows that in most cases, the increase in parallelism when using optimistic synchronization significantly outweighs the increased overhead for state keeping and rollbacks. To reduce the cost of state keeping, we show how XORWOW, the default pseudo-random number generator in CUDA, can be reversed based solely on its current state. Since the optimal configuration of multiple performance-critical simulator parameters depends on the behavior of the simulation model, these parameters are adapted dynamically based on performance measurements and heuristic optimization at runtime. We evaluate the simulator using the PHOLD benchmark model and a simplified model of peer-to-peer networks using the Kademlia protocol. On a commodity GPU, the optimistic simulator achieves event rates of up to 814 million events per second and a speedup of up to 36 compared with conservative synchronization.
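
The reversal of XORWOW mentioned in the abstract can be illustrated on the published xorwow recurrence (Marsaglia, 2003): every step is an invertible linear transformation over GF(2) plus a Weyl increment, so the previous state, and thus previous random draws, can be recomputed from the current state alone. The sketch below is a minimal CPU-side illustration of that idea; it does not claim to match the internal state layout of CUDA's curand implementation or the paper's exact code.

```cpp
// Forward and reverse steps of the xorwow generator (Marsaglia 2003),
// illustrating why keeping only the RNG state is enough to roll back draws.
// Assumed to correspond to curand's XORWOW only up to renaming of the state.
#include <cstdint>
#include <cstdio>

struct XorwowState { uint32_t x, y, z, w, v, d; };

uint32_t xorwow_next(XorwowState &s) {
    uint32_t t = s.x ^ (s.x >> 2);
    s.x = s.y; s.y = s.z; s.z = s.w; s.w = s.v;
    s.v = (s.v ^ (s.v << 4)) ^ (t ^ (t << 1));
    s.d += 362437;                 // Weyl sequence
    return s.d + s.v;
}

// u = t ^ (t << 1) and u = t ^ (t >> 2) are invertible GF(2)-linear maps.
static uint32_t invert_xor_lshift1(uint32_t u) {
    u ^= u << 1; u ^= u << 2; u ^= u << 4; u ^= u << 8; u ^= u << 16;
    return u;
}
static uint32_t invert_xor_rshift2(uint32_t u) {
    u ^= u >> 2; u ^= u >> 4; u ^= u >> 8; u ^= u >> 16;
    return u;
}

void xorwow_prev(XorwowState &s) {
    s.d -= 362437;
    uint32_t v_old = s.w;                         // w took v's previous value
    uint32_t t = invert_xor_lshift1(s.v ^ v_old ^ (v_old << 4));
    uint32_t x_old = invert_xor_rshift2(t);
    s.v = v_old; s.w = s.z; s.z = s.y; s.y = s.x; s.x = x_old;
}

int main() {
    XorwowState s{123456789u, 362436069u, 521288629u, 88675123u, 5783321u, 6615241u};
    XorwowState before = s;
    for (int i = 0; i < 5; ++i) xorwow_next(s);
    for (int i = 0; i < 5; ++i) xorwow_prev(s);   // roll the draws back
    printf("state restored: %s\n",
           (s.x == before.x && s.v == before.v && s.d == before.d) ? "yes" : "no");
    return 0;
}
```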

13 citations


Cites background from "Exploiting the parallelism of large..."

  • ...proposed a fully GPU-based conservative simulator implementation that adapts the LP size at runtime to balance parallelism and event management overheads [1]....


  • ...In fully GPU-based simulation [1, 13, 21, 22, 28, 32], the simulator core is executed on the GPU as well....


Posted Content
TL;DR: This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.
Abstract: Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as FPGAs. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorisation of the literature according to the applied techniques. Since, at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still addressed in a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.

10 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...Representation of irregular data structures by arrays and grids APU [180], GPU [47, 69, 98, 114, 144–146, 152, 166, 177] [7, 14, 95, 109, 140, 141, 159, 168, 183, 196], FPGA [121, 149]...


  • ...Other works assume a minimum time delta between an event and its creation (lookahead) to guarantee the correctness of the simulation results [7, 157, 196]....


  • ...Instead, the set of events is considered jointly in an unsorted fashion [168], split by model segment [159] or simulated entity [7, 109, 196], split according to a fixed policy [141, 183], or split randomly [121]....


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work performs a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids and presents performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature.
Abstract: Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. The core data structures in many of these irregular tasks are priority queues that guide the progress of the computations and which can easily become the bottleneck of an application. To our knowledge, currently no systematic comparison of priority queue implementations on GPUs exists in the literature. We close this gap with a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids. We focus on scenarios requiring large numbers of priority queues holding up to a few thousand items each. We present performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature. The measurement results show that up to about 500 items per queue, circular buffers frequently outperform tree-based queues for the considered applications, particularly under a simple parallelization of individual item enqueue operations. We analyze profiling metrics to explore classical queue designs in light of the importance of high hardware utilization as well as homogeneous computations and memory accesses across GPU threads.
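
As an illustration of the array-backed designs discussed above, the following CPU-side sketch shows a small sorted ring buffer used as a per-LP event queue: enqueue shifts larger timestamps back by one slot, dequeue-min pops at the head. The layout, capacity, and names are illustrative assumptions, not the implementations measured in the cited paper.

```cpp
// CPU-side sketch of a small sorted ring buffer ("circular buffer") used as
// a per-LP event queue. Linear-time insertion is cheap for the small queue
// sizes (a few hundred events) discussed in the abstract.
#include <cstdio>

struct Event { double time; int payload; };

template <int CAP>
struct RingEventQueue {
    Event slots[CAP];
    int head = 0, count = 0;

    bool enqueue(Event e) {
        if (count == CAP) return false;              // queue full
        int i = count;
        // Shift events with larger timestamps one slot towards the tail.
        while (i > 0 && slots[(head + i - 1) % CAP].time > e.time) {
            slots[(head + i) % CAP] = slots[(head + i - 1) % CAP];
            --i;
        }
        slots[(head + i) % CAP] = e;
        ++count;
        return true;
    }

    Event dequeueMin() {                             // caller checks count > 0
        Event e = slots[head];
        head = (head + 1) % CAP;
        --count;
        return e;
    }
};

int main() {
    RingEventQueue<512> q;
    q.enqueue({3.0, 1}); q.enqueue({1.0, 2}); q.enqueue({2.0, 3});
    while (q.count > 0) {
        Event e = q.dequeueMin();
        printf("t=%.1f payload=%d\n", e.time, e.payload);
    }
    return 0;
}
```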

8 citations


Cites background or methods from "Exploiting the parallelism of large..."

  • ...The literature proposes two solutions: first, merging queues of multiple simulated entities increases the probability of having events that can safely be executed [35]....


  • ...Autotuning approaches, which have previously been shown to be highly beneficial in the GPU context [49], [35], might help in selecting a suitable queue....


  • ...In [35], the number of simulated entities assigned to each LP is adapted to balance idle threads and the cost of queue operations....


  • ...[35] store each LP’s events in a separate array....


  • ...The considered parameter combinations were chosen according to our previous works in GPU-based simulation [35], [48] to cover cases of low utilization where the GPU could be outperformed by a single CPU core, up to configurations approaching full GPU utilization....


Proceedings ArticleDOI
29 May 2019
TL;DR: This paper proposes a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware with only limited modifications to an existing simulator code base, and without changes to model code.
Abstract: Spiking neural networks (SNN) are among the most computationally intensive types of simulation models, with node counts on the order of up to 10^11. Currently, there is intensive research into hardware platforms suitable to support large-scale SNN simulations, whereas several of the most widely used simulators still rely purely on the execution on CPUs. Enabling the execution of these established simulators on heterogeneous hardware allows new studies to exploit the many-core hardware prevalent in modern supercomputing environments, while still being able to reproduce and compare with results from a vast body of existing literature. In this paper, we propose a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware (e.g., CPUs, GPUs, and FPGAs) with only limited modifications to an existing simulator code base, and without changes to model code. Our approach relies on manual porting of a small number of core simulator functionalities as found in common SNN simulators, whereas unmodified model code is analyzed and transformed automatically. We apply our approach to the well-known simulator NEST and make a version executable on heterogeneous hardware available to the community. Our measurements show that at full utilization, a single GPU achieves the performance of about 9 CPU cores.

4 citations


Cites background from "Exploiting the parallelism of large..."

  • ..., [14, 19, 35]), including several types of network simulations [2, 3, 46]....


References
Journal ArticleDOI
01 Oct 2010
TL;DR: This work has found that irregular time advances of the sort common in discrete event models can be successfully mapped to a GPU, thus making it possible to execute discrete event systems on an inexpensive personal computer platform at speedups close to 10x.
Abstract: The graphics processing unit (GPU) has evolved into a flexible and powerful processor of relatively low cost, compared to processors used for other available parallel computing systems. The majority of studies using the GPU within the graphics and simulation communities have focused on the use of the GPU for models that are traditionally simulated using regular time increments, whether these increments are accomplished through the addition of a time delta (i.e., numerical integration) or event scheduling using the delta (i.e., discrete event approximations of continuous-time systems). These types of models have the property of being decomposable over a variable or parameter space. In prior studies, discrete event simulation has been characterized as being an inefficient application for the GPU primarily due to the inherent synchronicity of the GPU organization and an apparent mismatch between the classic event scheduling cycle and the GPU’s basic functionality. However, we have found that irregular time advances of the sort common in discrete event models can be successfully mapped to a GPU, thus making it possible to execute discrete event systems on an inexpensive personal computer platform at speedups close to 10x. This speedup is achieved through the development of a special purpose code library we developed that uses an approximate time-based event scheduling approach. We present the design and implementation of this library, which is based on the compute unified device architecture (CUDA) general purpose parallel applications programming interface for the NVIDIA class of GPUs.
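
The approximate time-based scheduling idea can be sketched as follows: events whose timestamps fall into the same window of width delta are treated as simultaneous and executed as one batch, trading a bounded timestamp error for massive parallelism. The code below is a plain CPU stand-in for that batching step under these assumptions; the cited CUDA library's API and policies are not reproduced.

```cpp
// Sketch of approximate, time-bucketed event scheduling: events whose
// timestamps fall into the same window of width `delta` are treated as
// concurrent and can be executed as one parallel batch (a plain loop stands
// in for the GPU kernel here). Illustration only.
#include <cstdio>
#include <vector>

struct Event { double time; int target; };

int main() {
    std::vector<Event> pending = {{0.4, 0}, {0.9, 1}, {1.1, 2}, {2.7, 3}};
    const double delta = 1.0;        // bucket width = allowed timestamp error
    double now = 0.0;

    while (!pending.empty()) {
        std::vector<Event> batch, rest;
        for (const Event &e : pending) {
            // Events within [now, now + delta) form one "concurrent" batch.
            (e.time < now + delta ? batch : rest).push_back(e);
        }
        // On a GPU, each event of the batch would be handled by one thread.
        for (const Event &e : batch)
            printf("bucket [%.1f,%.1f): execute event at t=%.2f on entity %d\n",
                   now, now + delta, e.time, e.target);
        pending = rest;
        now += delta;
    }
    return 0;
}
```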

64 citations


"Exploiting the parallelism of large..." refers background in this paper

  • ...In 2010, Park et al. (Park and Fishwick 2010) proposed a framework for purely GPU-based discrete-event simulations, achieving a speedup close to 10....


Proceedings ArticleDOI
16 May 2004
TL;DR: This analysis and initial performance measurements suggest that for scenarios simulating scaled network models with constant number of input and output channels per logical process, an optimized null message algorithm offers better scalability than efficient global reduction based synchronous protocols.
Abstract: Parallel discrete event simulation techniques have enabled the realization of large-scale models of communication networks containing millions of end hosts and routers. However, the performance of these parallel simulators could be severely degraded if proper synchronization algorithms are not utilized. In this paper, we compare the performance and scalability of synchronous and asynchronous algorithms for conservative parallel network simulation. We develop an analytical model to evaluate the efficiency and scalability of certain variations of the well-known null message algorithm, and present experimental data to verify the accuracy of this model. This analysis and initial performance measurements on parallel machines containing hundreds of processors suggest that for scenarios simulating scaled network models with constant number of input and output channels per logical process, an optimized null message algorithm offers better scalability than efficient global reduction based synchronous protocols.
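
The null message algorithm referenced above can be summarized in a few lines: each logical process advertises a lower bound on the timestamps of its future messages (its local clock plus lookahead) on every output channel, and may safely execute all events below the minimum bound received over its input channels. The following two-LP exchange is a minimal illustration of that rule, not the optimized variant analyzed in the paper.

```cpp
// Minimal sketch of the null message (Chandy-Misra-Bryant) idea.
// Illustration only; real implementations interleave this with event exchange.
#include <algorithm>
#include <cstdio>
#include <vector>

struct LP {
    double clock = 0.0;        // local simulation time
    double lookahead = 0.5;    // minimum delay added to any outgoing message
    std::vector<double> channelBounds;  // latest null-message bound per input channel

    double outputBound() const { return clock + lookahead; }

    double safeTime() const {  // events strictly below this time are safe
        double b = 1e300;
        for (double cb : channelBounds) b = std::min(b, cb);
        return b;
    }
};

int main() {
    // Two LPs connected in both directions, one input channel each.
    LP a, b;
    a.clock = 1.0; b.clock = 2.0;
    a.channelBounds = {0.0};
    b.channelBounds = {0.0};

    for (int round = 0; round < 3; ++round) {
        // Exchange null messages carrying the output bounds.
        a.channelBounds[0] = b.outputBound();
        b.channelBounds[0] = a.outputBound();
        printf("round %d: LP A may advance to %.1f, LP B may advance to %.1f\n",
               round, a.safeTime(), b.safeTime());
        // Each LP processes its safe events and advances its clock (stub).
        a.clock = std::max(a.clock, a.safeTime());
        b.clock = std::max(b.clock, b.safeTime());
    }
    return 0;
}
```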

40 citations


"Exploiting the parallelism of large..." refers result in this paper

  • ...In some cases, a large speedup compared with a sequential simulation was achieved (Park, Fujimoto, and Perumalla 2004), while in other cases there were modest or no performance gains (Dinh, Lees, Theodoropoulos, and Minson 2008, Quinson, Rosa, and Thiery 2012)....


Proceedings ArticleDOI
25 Jul 2011
TL;DR: This paper presents and discusses four different architectures that can be used to exploit GPU-based signal processing in discrete event-based simulations, and shows that the runtime costs can not be cut down completely, but significant speedups can be expected compared to a non-GPU-based solution.
Abstract: In recent years, a trend towards the usage of physical layer models with increased accuracy can be observed within the wireless network community. This trend has several reasons. The consideration of signals - instead of packets - as the smallest unit of a wireless network simulation enables the ability to reflect complex radio propagation characteristics properly, and to study novel PHY/MAC/NET cross-layer optimizations that were not directly possible before, e.g. cognitive radio networks and interference cancelation. Yet, there is a price to pay for the increase of accuracy, namely a significant decrease of runtime performance due to computationally expensive signal processing. In this paper we study whether this price can be reduced - or even eliminated - if GPU-based signal processing is employed. In particular, we present and discuss four different architectures that can be used to exploit GPU-based signal processing in discrete event-based simulations. Our evaluation shows that the runtime costs cannot be cut down completely, but significant speedups can be expected compared to a non-GPU-based solution.

27 citations

Proceedings ArticleDOI
15 Jul 2012
TL;DR: This paper presents a parallel discrete event simulation scheme that enables cost- and time-efficient execution of large-scale parameter studies on GPUs, including an event aggregation strategy based on external parallelism that generates workloads suitable for GPUs.
Abstract: Developing complex technical systems requires a systematic exploration of the given design space in order to identify optimal system configurations. However, studying the effects and interactions of even a small number of system parameters often requires an extensive number of simulation runs. This in turn results in excessive runtime demands which severely hamper thorough design space explorations. In this paper, we present a parallel discrete event simulation scheme that enables cost- and time-efficient execution of large scale parameter studies on GPUs. In order to efficiently accommodate the stream-processing paradigm of GPUs, our parallelization scheme exploits two orthogonal levels of parallelism: External parallelism among the inherently independent simulations of a parameter study and internal parallelism among independent events within each individual simulation of a parameter study. Specifically, we design an event aggregation strategy based on external parallelism that generates workloads suitable for GPUs. In addition, we define a pipelined event execution mechanism based on internal parallelism to hide the transfer latencies between host- and GPU-memory. We analyze the performance characteristics of our parallelization scheme by means of a prototype implementation and show a 25-fold performance improvement over purely CPU-based execution.
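
The external-parallelism idea can be sketched by gathering, in each round, the next event of every independent simulation instance into one batch that would be shipped to the GPU as a single workload (a plain loop stands in for the kernel launch below). The round-based batching policy and all names are illustrative assumptions; the paper's pipelined transfer mechanism is not shown.

```cpp
// Sketch of "external parallelism": a parameter study runs many independent
// simulation instances, and in each step the next event of every instance is
// aggregated into one batch that a GPU could process with one thread per
// instance. Illustration only.
#include <cstdio>
#include <queue>
#include <vector>

struct Event { double time; int data; };
struct Cmp { bool operator()(const Event &a, const Event &b) const { return a.time > b.time; } };

int main() {
    // Three independent simulation instances, e.g. three parameter settings.
    std::vector<std::priority_queue<Event, std::vector<Event>, Cmp>> instances(3);
    instances[0].push({0.2, 10}); instances[0].push({0.8, 11});
    instances[1].push({0.5, 20});
    instances[2].push({0.1, 30}); instances[2].push({0.3, 31});

    int round = 0;
    bool anyLeft = true;
    while (anyLeft) {
        anyLeft = false;
        std::vector<Event> batch(instances.size(), Event{-1.0, -1});
        for (size_t i = 0; i < instances.size(); ++i) {
            if (!instances[i].empty()) {
                batch[i] = instances[i].top();   // next event of instance i
                instances[i].pop();
                anyLeft = true;
            }
        }
        if (!anyLeft) break;
        // The aggregated batch would be shipped to the GPU as one workload.
        for (size_t i = 0; i < batch.size(); ++i)
            if (batch[i].data >= 0)
                printf("round %d, instance %zu: event t=%.1f data=%d\n",
                       round, i, batch[i].time, batch[i].data);
        ++round;
    }
    return 0;
}
```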

25 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: This paper revisits the classical PDES methods in the light of distributed system simulation and proposes a new parallelization design specifically suited to this context, and an OS-inspired architecture is proposed.
Abstract: Discrete Event Simulation (DES) is one of the major experimental methodologies in several scientific and engineering domains. Parallel Discrete Event Simulation (PDES) constitutes a very active research field for at least three decades, to surpass speed and size limitations. In the context of Peer-to-Peer (P2P) protocols, most studies rely on simulation. Surprisingly enough, none of the mainstream P2P discrete event simulators allows parallel simulation although the tool scalability is considered as the major quality metric by several authors. This paper revisits the classical PDES methods in the light of distributed system simulation and proposes a new parallelization design specifically suited to this context. The constraints posed on the simulator internals are presented, and an OS-inspired architecture is proposed. In addition, a new thread synchronization mechanism is introduced for efficiency despite the very fine grain parallelism inherent to the target scenarios. This new architecture was implemented into the general-purpose open-source simulation framework SimGrid. We show that the new design does not hinder the tool scalability. In fact, the sequential version of SimGrid remains orders of magnitude more scalable than state of the art simulators, while the parallel execution saves up to 33% of the execution time on Chord simulations.

21 citations


"Exploiting the parallelism of large..." refers result in this paper

  • ...In some cases, a large speedup compared with a sequential simulation was achieved (Park, Fujimoto, and Perumalla 2004), while in other cases there were modest or no performance gains (Dinh, Lees, Theodoropoulos, and Minson 2008, Quinson, Rosa, and Thiery 2012)....
