Journal ArticleDOI

Design and implementation of dynamic load balancing algorithms for rollback reduction in optimistic PDES

01 Jan 1999 - VLSI Design (Hindawi Publishing Corporation) - Vol. 9, Iss. 3, pp. 271-290
TL;DR: Two algorithms for dynamic load balancing are proposed that reduce the number of rollbacks in an optimistic PDES system: one based on a load transfer mechanism between LPs, and another based on process migration, which moves logical processes between pairs of physical processors.
Abstract: In an optimistic parallel simulation, logical processes (LPs) proceed with their computation without any constraints. However, if the computing requirements of different LPs are not balanced, or if the processors are not homogeneous, some LPs may lag behind in simulation time while others surge forward. In other words, if the simulation clocks of different LPs do not progress at the same rate, cascading rollbacks may occur, nullifying the potential benefit of an optimistic parallel discrete event simulation (PDES). Hence it is necessary to balance the computational load on the LPs so that their local simulation clocks advance at nearly the same rate. In this paper, we propose two algorithms for dynamic load balancing that reduce the number of rollbacks in an optimistic PDES system. Our first algorithm is based on a load transfer mechanism between LPs, while the second, based on the principle of evolutionary strategy, migrates logical processes between several pairs of physical processors. We have implemented both algorithms on a cluster of heterogeneous workstations and studied their performance. The experimental results show that the load-transfer algorithm is effective when the grain size is greater than 10 milliseconds, whereas the process-migration algorithm yields good performance only for grain sizes of 20 milliseconds or larger. In both cases the speedup mostly ranges between and 2 using four processors.
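As a rough illustration of the load-transfer idea described in the abstract (not the authors' actual algorithm), the sketch below sheds load from a processor whose LPs lag in simulated time to one that has surged ahead; all names and the threshold are hypothetical.

```python
# Rough illustration only: a clock-based heuristic for deciding when to shed
# load from a processor whose LPs lag in simulated time to one that has
# surged ahead. Names and the threshold are hypothetical, not from the paper.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Processor:
    name: str
    lp_clocks: List[float]  # local virtual times of the LPs hosted here

def pick_transfer(processors: List[Processor],
                  lag_threshold: float = 0.2) -> Optional[Tuple[Processor, Processor]]:
    """Return (source, destination): shift work away from the processor whose
    slowest LP lags furthest behind, toward the processor furthest ahead."""
    laggard = min(processors, key=lambda p: min(p.lp_clocks))
    leader = max(processors, key=lambda p: min(p.lp_clocks))
    lvt_lag = min(laggard.lp_clocks)
    lvt_lead = min(leader.lp_clocks)
    # Only transfer when the clock gap is large relative to overall progress;
    # otherwise the transfer overhead would outweigh the rollback savings.
    if lvt_lead - lvt_lag > lag_threshold * max(lvt_lead, 1e-9):
        return laggard, leader
    return None

procs = [Processor("P0", [120.0, 135.0]), Processor("P1", [310.0, 298.0])]
decision = pick_transfer(procs)
if decision:
    src, dst = decision
    print(f"transfer some of {src.name}'s load to {dst.name}")
```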


Citations
Journal ArticleDOI
TL;DR: The results obtained clearly indicate that a partitioning which makes use of simulated annealing significantly reduces the running time of a conservative simulation and decreases the synchronization overhead of the simulation model when compared to the Nandy-Loucks partitioning algorithm.
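To make the simulated-annealing partitioning idea concrete, here is a minimal sketch that anneals an LP-to-processor assignment against a cost combining cross-processor communication and load imbalance; the cost model, cooling schedule, and data are assumptions for illustration, not taken from the cited work.

```python
# Illustrative simulated-annealing partitioner: assigns LPs to processors so
# as to reduce cross-processor message traffic plus load imbalance. The cost
# model and cooling schedule are assumptions, not those of the cited paper.

import math
import random

def cost(assign, comm, loads, n_procs, alpha=1.0):
    """assign: {lp: proc}; comm: {(lp_a, lp_b): messages}; loads: {lp: work}."""
    cut = sum(w for (a, b), w in comm.items() if assign[a] != assign[b])
    per_proc = [0.0] * n_procs
    for lp, p in assign.items():
        per_proc[p] += loads[lp]
    return cut + alpha * (max(per_proc) - min(per_proc))

def anneal(lps, n_procs, comm, loads, steps=5000, t0=10.0, cooling=0.999):
    assign = {lp: random.randrange(n_procs) for lp in lps}
    cur = cost(assign, comm, loads, n_procs)
    best, best_cost = dict(assign), cur
    temp = t0
    for _ in range(steps):
        lp = random.choice(lps)
        old = assign[lp]
        assign[lp] = random.randrange(n_procs)        # propose moving one LP
        new = cost(assign, comm, loads, n_procs)
        if new < cur or random.random() < math.exp((cur - new) / temp):
            cur = new                                 # accept (possibly uphill)
            if cur < best_cost:
                best, best_cost = dict(assign), cur
        else:
            assign[lp] = old                          # reject: undo the move
        temp *= cooling
    return best, best_cost

lps = ["lp0", "lp1", "lp2", "lp3"]
comm = {("lp0", "lp1"): 50, ("lp1", "lp2"): 5, ("lp2", "lp3"): 40}
loads = {"lp0": 3.0, "lp1": 2.0, "lp2": 3.0, "lp3": 2.0}
print(anneal(lps, 2, comm, loads))
```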

14 citations


Cites background from "Design and implementation of dynami..."

  • ...Sarkar and Das [34] proposed two algorithms for dynamic load balancing which reduce the number of rollbacks in an optimistic PDES system....


Proceedings ArticleDOI
01 Oct 2016
TL;DR: A dynamic load-profiling and segment-aware scheduling algorithm with optimized thread dispatching is proposed to maximize parallel SystemC simulation speed; it can generally be applied to all work-sharing PDES approaches.
Abstract: The SystemC IEEE standard is widely used for system design. While the sequential reference simulator is based on Discrete Event Simulation (DES), Parallel DES (PDES) approaches have been proposed for multi-core platforms. This paper proposes a dynamic load-profiling and segment-aware scheduling algorithm with optimized thread dispatching to maximize parallel SystemC simulation speed, which generally can be applied to all work-sharing PDES approaches. Based on a compile-time generated Segment Graph (SG), our scheduler can accurately predict the run time of the thread segments ahead and thus make better dispatching decisions. In the systematic evaluation, our segment-aware scheduler consistently shows a significant performance gain on top of the order-of-magnitude speedup of PDES, when compared with the previous scheduling policies.
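The longest-job-first flavor of dispatching described in this abstract can be sketched in a few lines. The sketch below is an assumed simplification, with the runtime table standing in for predictions derived from a Segment Graph; it is not the paper's scheduler.

```python
# Minimal sketch of Longest Job First (LJF) dispatch: ready threads are
# started in decreasing order of predicted segment run time, so long segments
# begin early and short ones fill the remaining core time. The runtime table
# is an assumed stand-in for Segment Graph based predictions.

def dispatch_order(ready_threads, predicted_runtime):
    """Order ready threads longest-predicted-job first."""
    return sorted(ready_threads, key=lambda t: predicted_runtime[t], reverse=True)

runtimes_ms = {"seg_A": 8.5, "seg_B": 2.1, "seg_C": 5.0}  # hypothetical predictions
print(dispatch_order(["seg_A", "seg_B", "seg_C"], runtimes_ms))
# ['seg_A', 'seg_C', 'seg_B']
```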

5 citations


Cites background from "Design and implementation of dynami..."

  • ...[9] presents a dynamic load migration algorithm for reducing the total number of rollbacks in an optimistic PDES environment....


01 Jan 2017
TL;DR: This dissertation proposes a computation- and communication-aware approach to optimize thread mapping for parallel ESL simulation, with the aims of load balancing and communication minimization, and shows a significant performance gain on top of the order-of-magnitude speedup of PDES.
Abstract: In hardware/software codesign, Discrete Event Simulation (DES) has been in use for decades to verify and validate the functionality of Electronic System Level (ESL) models. Since parallel computing platforms are readily available today, many Parallel Discrete Event Simulation (PDES) approaches have been proposed to improve simulation performance. However, as thread parallelism increases in ESL designs and core counts multiply on multi-core and many-core platforms, thread-to-core mapping becomes critical in PDES.

In this dissertation, we propose a computation- and communication-aware approach to optimize thread mapping for parallel ESL simulation, with the aims of load balancing and communication minimization. Having identified that the order of dispatching parallel threads has a significant influence on the total simulation time, and that Longest Job First (LJF) performs better than the default Linux thread dispatch policy, we first propose a segment-aware LJF scheduler for PDES. Our segment-aware scheduler can accurately predict the run time of the thread segments ahead and thus make better dispatching decisions. Next, we define the concept of core distance for multi-core and many-core architectures, which quantifies core-to-core communication latency and characterizes processor hierarchies. For many-core architectures using directory-based cache coherence protocols, we observe that core-to-core transfers are not always significantly faster than main memory accesses, and that the core-to-core communication latency depends not only on the physical placement on the chip, but also on the location of the distributed cache tag directory. Thus, using a ping-pong memory benchmark, we quantify the core distance on a ring-network many-core platform and propose an algorithm to optimize thread-to-core mapping in order to minimize on-chip communication overhead. Altogether, based on a static analysis of communication patterns and core distance and a dynamic profiling of computation load, our proposed framework utilizes a heuristic graph partitioning algorithm and automatically generates an optimized thread mapping which minimizes inter-chip communication overhead. In our systematic evaluation, our approach consistently shows a significant performance gain on top of the order-of-magnitude speedup of PDES.

The contributions of this dissertation include a segment-aware multi-core scheduler, core distance profiling, a communication-aware thread mapping framework, and an open-source software package for Out-of-Order PDES.
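The core-distance idea in this abstract can be illustrated with a small greedy mapper that places heavily communicating thread pairs on nearby cores. This is only a hypothetical sketch with assumed inputs; the dissertation instead builds the mapping with a heuristic graph-partitioning algorithm.

```python
# Hypothetical greedy thread-to-core mapper guided by a core-distance matrix:
# the heaviest-traffic thread pairs are placed on the closest available cores.
# Inputs (traffic, core_dist) are assumed; the dissertation instead uses a
# heuristic graph-partitioning algorithm.

from itertools import permutations

def greedy_map(traffic, cores, core_dist):
    """traffic: {(t1, t2): bytes exchanged}; core_dist: {(c1, c2): latency}."""
    mapping, free = {}, set(cores)
    for a, b in sorted(traffic, key=traffic.get, reverse=True):
        if a in mapping and b in mapping:
            continue
        if a in mapping or b in mapping:
            placed, new = (a, b) if a in mapping else (b, a)
            # Put the unplaced thread on the free core closest to its partner.
            mapping[new] = min(free, key=lambda c: core_dist[(mapping[placed], c)])
            free.discard(mapping[new])
        else:
            # Neither thread placed yet: take the closest pair of free cores.
            c1, c2 = min(permutations(free, 2), key=lambda pair: core_dist[pair])
            mapping[a], mapping[b] = c1, c2
            free -= {c1, c2}
    return mapping

cores = ["c0", "c1", "c2", "c3"]
core_dist = {(x, y): abs(int(x[1]) - int(y[1])) for x in cores for y in cores if x != y}
traffic = {("t0", "t1"): 900, ("t1", "t2"): 400, ("t2", "t3"): 100}
print(greedy_map(traffic, cores, core_dist))
```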

4 citations


Cites background from "Design and implementation of dynami..."

  • ...[73] presents a dynamic load migration algorithm for reducing the total number of rollbacks in an optimistic PDES environment....

