Journal ArticleDOI

A block-asynchronous relaxation method for graphics processing units

TL;DR: This paper develops asynchronous iteration algorithms in CUDA and compares them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems and identifies the high potential of the asynchronous methods for Exascale computing.
About: This article is published in Journal of Parallel and Distributed Computing. The article was published on 2013-12-01 and is currently open access. It has received 28 citations to date. The article focuses on the topics: Asynchronous communication & CUDA.

Summary (3 min read)

Introduction

  • The authors analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs).
  • The latest developments in hardware architectures show an enormous increase in the number of processing units (computing cores) that form one processor.
  • Numerical algorithms, on the other hand, usually require synchronization between their computational steps.
  • In the following section the authors analyze the experimental results, with a focus on the convergence behavior and the iteration times for the different matrix systems.

B. Asynchronous Iteration Methods

  • For computing the next iteration in a relaxation method, one usually requires the latest values of all components.
  • The question of interest that the authors want to investigate is what happens if this order is not adhered to.
  • Since the synchronization usually thwarts the overall performance, the asynchronous iteration schemes may compensate for their inferior convergence behavior with superior scalability.
  • Furthermore, the following conditions can be defined to guarantee the well-posedness of the algorithm [17]: 1) the update function u(·) takes each of the values l for 1 ≤ l ≤ N infinitely often (see the sketch after this list).
  • The AGS method uses new values of unknowns in its subsequent updates as soon as they are computed in the same iteration, while the AJ method uses only values that are set at the beginning of an iteration.
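A brief sketch of the asynchronous iteration in standard notation may clarify the setting. The component functions f_j and the shift (delay) functions s_l below are illustrative names in the spirit of the asynchronous-iteration literature cited as [17]; they are not necessarily the paper's own symbols:

    x_j^{(m+1)} =
      \begin{cases}
        f_j\bigl( x_1^{(s_1(m))}, \dots, x_N^{(s_N(m))} \bigr) & \text{if } j = u(m), \\
        x_j^{(m)}                                              & \text{otherwise},
      \end{cases}
    \qquad s_l(m) \le m .

The condition quoted above (the update function u(·) selects every component infinitely often) is usually paired with a second requirement that the shifts eventually use only newer information, i.e., s_l(m) → ∞ as m → ∞.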

A. Linear Systems of Equations

  • In their experiments, the authors search for the approximate solutions of linear systems of equations, where the respective matrices are taken from the University of Florida Matrix Collection (UFMC; see http://www.cise.ufl.edu/research/sparse/matrices/).
  • Due to the convergence properties of the iterative methods considered, the test matrices have to be chosen appropriately.
  • While for the Jacobi method a sufficient condition for convergence is ρ(M) = ρ(I − D⁻¹A) < 1 (i.e., the spectral radius of the iteration matrix M has to be smaller than one; see the sketch after this list), the convergence theory for asynchronous iteration methods is more involved (and is not the subject of this paper).
  • The matrices and their descriptions are summarized in Table II; their structures can be found in Figure 1.
  • The authors furthermore take the number of right-hand sides to be one for all linear systems.
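For reference, a short sketch of the Jacobi iteration behind this condition, using the standard splitting A = D + L + U into diagonal, strictly lower, and strictly upper parts (this splitting notation is assumed here, not quoted from the paper):

    x^{(k+1)} = D^{-1}\bigl( b - (L + U)\, x^{(k)} \bigr)
              = x^{(k)} + D^{-1}\bigl( b - A\, x^{(k)} \bigr),
    \qquad M = I - D^{-1} A .

The stationary iteration x^{(k+1)} = M x^{(k)} + D^{-1} b converges for every starting vector whenever ρ(M) < 1.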

B. Hardware and Software Issues

  • The experiments were conducted on a heterogeneous GPU-accelerated multicore system located at the University of Tennessee, Knoxville.
  • In the synchronous implementation of Gauss-Seidel on the CPU, 4 cores are used for the matrix-vector operations that can be parallelized.
  • The component updates were coded in CUDA, using thread blocks of size 512.
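To illustrate how such component updates map onto CUDA, the following is a minimal sketch of a Jacobi-style update kernel launched with 512-thread blocks. The kernel name, its arguments, and the CSR storage format are assumptions for illustration; they are not taken from the paper.

    // Hypothetical sketch: one thread updates one component x_new[i]
    // from the previous iterate x_old; the matrix is stored in CSR format.
    __global__ void jacobi_component_update(int n,
                                            const int    *row_ptr,
                                            const int    *col_idx,
                                            const double *val,
                                            const double *diag,
                                            const double *b,
                                            const double *x_old,
                                            double       *x_new)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        double sigma = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_idx[k];
            if (j != i)
                sigma += val[k] * x_old[j];   // off-diagonal contributions
        }
        x_new[i] = (b[i] - sigma) / diag[i];  // D^{-1} (b - (L+U) x_old)
    }

    // Launched with thread blocks of size 512, as in the experiments:
    //   int threads = 512;
    //   int blocks  = (n + threads - 1) / threads;
    //   jacobi_component_update<<<blocks, threads>>>(n, row_ptr, col_idx, val,
    //                                                diag, b, x_old, x_new);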

C. An asynchronous iteration method for GPUs

  • The asynchronous iteration method for GPUs that the authors propose is split into two levels.
  • This two-level design reflects the architecture of graphics processing units and the CUDA programming model.
  • For these thread blocks, a PA iteration method is used, while on each thread block, a Jacobi-like iteration method is performed.
  • During the local iterations the x values used from outside the block are kept constant (equal to their values at the beginning of the local iterations).
  • The shift function ν(m + 1, j) denotes the iteration shift for component j; it can be positive or negative, depending on whether the respective other thread block has already conducted more or fewer iterations.
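The following CUDA sketch illustrates the two-level idea described above: each thread block owns a contiguous subdomain of components, performs a few Jacobi-like local sweeps in shared memory, and does not synchronize with other blocks. All identifiers are hypothetical, and one simplification should be noted: here the off-block values are re-read from global memory in every local sweep, whereas the paper keeps them constant at their values from the beginning of the local iterations (this could be matched by caching the required off-block entries once at the start). The shift function ν is not explicit in the code; it arises implicitly from blocks reading whatever values currently reside in global memory.

    #define BLOCK_SIZE 512   // thread block size used in the experiments

    // Hypothetical sketch of the two-level scheme: thread block blockIdx.x
    // owns the contiguous components [first, first + BLOCK_SIZE).
    __global__ void block_async_jacobi(int n,
                                       const int    *row_ptr,
                                       const int    *col_idx,
                                       const double *val,
                                       const double *diag,
                                       const double *b,
                                       double       *x,        // global iterate
                                       int           local_iters)
    {
        __shared__ double x_local[BLOCK_SIZE];

        int first = blockIdx.x * BLOCK_SIZE;
        int i     = first + threadIdx.x;       // global index of my component
        if (i < n)
            x_local[threadIdx.x] = x[i];       // load my component into shared memory
        __syncthreads();

        for (int it = 0; it < local_iters; ++it) {
            double xi_new = 0.0;
            if (i < n) {
                double sigma = 0.0;
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                    int j = col_idx[k];
                    if (j == i) continue;
                    double xj = (j >= first && j < first + BLOCK_SIZE)
                                  ? x_local[j - first]  // inside the block: latest local value
                                  : x[j];               // outside: possibly stale global value
                    sigma += val[k] * xj;
                }
                xi_new = (b[i] - sigma) / diag[i];
            }
            __syncthreads();                   // all reads of x_local finished
            if (i < n)
                x_local[threadIdx.x] = xi_new; // Jacobi-like local update
            __syncthreads();
        }

        if (i < n)
            x[i] = x_local[threadIdx.x];       // publish result, no global synchronization
    }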

A. Stochastic impact of chaotic behavior of asynchronous iteration methods

  • At this point it should be mentioned that only the synchronous Gauss-Seidel and Jacobi methods are deterministic.
  • For the asynchronous iteration method on the GPU, the results are not reproducible, since every iteration run conducts a unique pattern of component updates.
  • It is possible that a different component update order results in faster or slower convergence.
  • The results reported in Table IV are based on 100 simulation runs on the test matrix FV3.
  • Analyzing the results, the authors observe only small variations in the convergence behavior: for 1000 iterations the relative residual is improved by 5 · 10⁻², while the maximal variation between the fastest and the slowest convergence rate is on the order of 10⁻⁵.

B. Convergence rate of the asynchronous iteration method

  • In the next experiment, the authors analyze the convergence behavior of the asynchronous iteration method and compare it with the convergence rate of the Gauss-Seidel and Jacobi method.
  • The experiment results, summarized in Figures 4, 5, 6, 7 and 9, show that for the test systems CHEM97ZTZ, FV1, FV2, FV3 and TREFETHEN 2000 the synchronous Gauss-Seidel algorithm converges in considerably fewer iterations.
  • This superior convergence behavior is intuitively expected, since the synchronization after each component update allows the updated components to be used immediately for the next update.
  • Still, for all test cases the authors observe convergence rates similar to the synchronous Jacobi counterpart, whose iteration count is in turn almost double that of Gauss-Seidel.
  • The results for test matrix S1RMT3M1 show an example where neither of the methods is suitable for direct use.

C. Block-asynchronous iteration method

  • The authors now consider a block-asynchronous iteration method which additionally performs a few Jacobi-like iterations on every subdomain.
  • A motivation for this approach is hardware related – specifically, this is the fact that the additional local iterations almost come for free (as the subdomains are relatively small and the data needed largely fits into the multiprocessors’ caches).
  • The case for TREFETHEN 2000 is similar – although there is improvement compared to Jacobi, the convergence rate of async-(5) is not twice as good as that of Gauss-Seidel, and the reason is again the structure of the local matrices.
  • Since the overhead of kernel launches and memory transfers dominates when only a small number of iterations is performed, the average computation time per iteration decreases significantly in cases where a large number of iterations is conducted.
  • Overall, the authors observe that the average iteration time for the async-(5) method using the GPU is only a fraction of the time needed to conduct one iteration of the synchronous Gauss-Seidel on the CPU.
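A possible host-side driver for async-(5) is sketched below, reusing the hypothetical block_async_jacobi kernel and BLOCK_SIZE from the earlier kernel sketch; the relative_residual helper and the frequency of residual checks are likewise assumptions, not details from the paper.

    #include <cuda_runtime.h>

    // Assumed helper: computes the relative residual ||b - A x|| / ||b|| on the device.
    double relative_residual(int n, const int *row_ptr, const int *col_idx,
                             const double *val, const double *b, const double *x);

    // Each kernel launch is one global sweep in which every thread block runs
    // 5 local Jacobi-like iterations on its subdomain.
    void block_async_solve(int n, const int *d_row_ptr, const int *d_col_idx,
                           const double *d_val, const double *d_diag,
                           const double *d_b, double *d_x,
                           int max_global_iters, double tol)
    {
        int threads = BLOCK_SIZE;                        // 512
        int blocks  = (n + threads - 1) / threads;

        for (int m = 0; m < max_global_iters; ++m) {
            block_async_jacobi<<<blocks, threads>>>(n, d_row_ptr, d_col_idx,
                                                    d_val, d_diag, d_b, d_x,
                                                    /*local_iters=*/5);
            if ((m + 1) % 10 == 0) {                     // check the residual only occasionally
                cudaDeviceSynchronize();
                if (relative_residual(n, d_row_ptr, d_col_idx,
                                      d_val, d_b, d_x) < tol)
                    break;
            }
        }
        cudaDeviceSynchronize();
    }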

D. Performance of the block-asynchronous iteration method

  • To analyze the performance of the block-asynchronous iteration method, the authors show in Figures 17, 18, 19, 20, and 21 the average time needed for the synchronous Gauss-Seidel, the synchronous Jacobi and the block-asynchronous iteration method to provide a solution approximation of certain accuracy, relative to the initial residual.
  • Since the convergence of the Gauss-Seidel method for S1RMT3M1 is almost negligible, and the Jacobi and the asynchronous iteration do not converge at all, the authors limit this analysis to the linear systems of equations CHEM97ZTZ, FV1, FV2, FV3 and TREFETHEN 2000.
  • This is due to the fact that for these high iteration numbers the overhead triggered by memory transfers and GPU kernel calls has only a minor impact.
  • For linear equation systems with a considerable off-diagonal part, the block-asynchronous method requires more iterations; still, due to the faster kernel execution, it provides the solution approximation in a shorter time.

V. CONCLUSIONS

  • The authors developed asynchronous relaxation methods for highly parallel architectures.
  • The experiments have revealed the potential of using them on GPUs.
  • The absence of synchronization points enables not only high scalability, but also an efficient use of the GPU architecture.
  • Nevertheless, the numerical properties of asynchronous iteration pose some restrictions on the usage.
  • The presented approach could be embedded in a multigrid framework, replacing the traditional Gauss-Seidel based smoothers.


Citations
Journal ArticleDOI
30 Oct 2020-PLOS ONE
TL;DR: A novel strategy to the method of finite elements (FEM) of linear elastic problems of very high resolution on graphic processing units (GPU) that exploits regularities in the system matrix that occur in regular hexahedral grids to achieve cache-friendly matrix-free FEM.
Abstract: In this study, we present a novel strategy to the method of finite elements (FEM) of linear elastic problems of very high resolution on graphic processing units (GPU). The approach exploits regularities in the system matrix that occur in regular hexahedral grids to achieve cache-friendly matrix-free FEM. The node-by-node method lies in the class of block-iterative Gauss-Seidel multigrid solvers. Our method significantly improves convergence times in cases where an ordered distribution of distinct materials is present in the dataset. The method was evaluated on three real world datasets: An aluminum-silicon (AlSi) alloy and a dual phase steel material sample, both captured by scanning electron tomography, and a clinical computed tomography (CT) scan of a tibia. The caching scheme leads to a speed-up factor of ×2-×4 compared to the same code without the caching scheme. Additionally, it facilitates the computation of high-resolution problems that cannot be computed otherwise due to memory consumption.
Journal ArticleDOI
01 Oct 2019
TL;DR: Results show a time reduction of up to 58.4 % in relation to the parallel Power method, when a small number of local updates is performed before each global synchronization, outperforming both the two-stage algorithms and the extrapolation algorithms, more sharply as the number of processes increases.
Abstract: In this work, a non-stationary technique based on the Power method for accelerating the parallel computation of the PageRank vector is proposed and its theoretical convergence analyzed. This iterative non-stationary model, which uses the eigenvector formulation of the PageRank problem, reduces the needed computations for obtaining the PageRank vector by eliminating synchronization points among processes, in such a way that, at each iteration of the Power method, the block of iterate vector assigned to each process can be locally updated more than once, before performing a global synchronization. The parallel implementation of several strategies combining this novel non-stationary approach and the extrapolation methods has been developed using hybrid MPI/OpenMP programming. The experiments have been carried out on a cluster made up of 12 nodes, each one equipped with two Intel Xeon hexacore processors. The behaviour of the proposed parallel algorithms has been studied with realistic datasets, highlighting their performance compared with other parallel techniques for solving the PageRank problem. Concretely, the experimental results show a time reduction of up to 58.4 % in relation to the parallel Power method, when a small number of local updates is performed before each global synchronization, outperforming both the two-stage algorithms and the extrapolation algorithms, more sharply as the number of processes increases.
References
Book
01 Apr 2003
TL;DR: This chapter discusses methods related to the normal equations of linear algebra, and some of the techniques used in this chapter were derived from previous chapters of this book.
Abstract: Preface 1. Background in linear algebra 2. Discretization of partial differential equations 3. Sparse matrices 4. Basic iterative methods 5. Projection methods 6. Krylov subspace methods Part I 7. Krylov subspace methods Part II 8. Methods related to the normal equations 9. Preconditioned iterations 10. Preconditioning techniques 11. Parallel implementations 12. Parallel preconditioners 13. Multigrid methods 14. Domain decomposition methods Bibliography Index.

13,484 citations


"A block-asynchronous relaxation met..." refers methods in this paper

  • ...For example, when solving linear systems of equations with iterative methods like the Conjugate Gradient or GMRES, the parallelism is usually limited to the matrix-vector and the vector-vector operations (with synchronization required between them) [4] [5] [6]....


Journal ArticleDOI
TL;DR: An iterative method for solving linear systems, which has the property of minimizing at every step the norm of the residual vector over a Krylov subspace.
Abstract: We present an iterative method for solving linear systems, which has the property of minimizing at every step the norm of the residual vector over a Krylov subspace. The algorithm is derived from t...

10,907 citations

Book
01 Jan 1987
TL;DR: Preface, How to Get the Software, Part I.
Abstract: Preface How to Get the Software Part I. Linear Equations. 1. Basic Concepts and Stationary Iterative Methods 2. Conjugate Gradient Iteration 3. GMRES Iteration Part II. Nonlinear Equations. 4. Basic Concepts and Fixed Point Iteration 5. Newton's Method 6. Inexact Newton Methods 7. Broyden's Method 8. Global Convergence Bibliography Index.

2,531 citations


"A block-asynchronous relaxation met..." refers methods in this paper

  • ...For some algorithms, e.g. Gauss-Seidel [7], even the already computed values of the current iteration step are used....


  • ...Also, methods that are based on component-wise updates like Jacobi or Gauss-Seidel have synchronization between the iteration steps [7], [8]: no component is updated twice (or more) before all other components are updated....


Journal ArticleDOI
01 Feb 2011
TL;DR: The work of the community to prepare for the challenges of exascale computing is described, ultimately combining their efforts in a coordinated International Exascale Software Project.
Abstract: Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.

736 citations


"A block-asynchronous relaxation met..." refers methods in this paper

  • ...The reason for this varies from various physical limitations to energy minimization considerations that are at odds with further scaling up of processors' frequencies – the basic acceleration method used in the architecture designs for the last decades [1]....


Journal ArticleDOI
TL;DR: A class of asynchronous iterative methods is presented for solving a system of equations, corresponding to a parallel implementation on a multiprocessor system with no synchronization between cooperating processes, to show clearly the advantage of purely asynchronous iterative methods.
Abstract: A class of asynchronous iterative methods is presented for solving a system of equations. Existing iterative methods are identified in terms of asynchronous iterations, and new schemes are introduced corresponding to a parallel implementation on a multiprocessor system with no synchronization between cooperating processes. A sufficient condition is given to guarantee the convergence of any asynchronous iterations, and results are extended to include iterative methods with memory. Asynchronous iterative methods are then evaluated from a computational point of view, and bounds are derived for the efficiency. The bounds are compared with actual measurements obtained by running various asynchronous iterations on a multiprocessor, and the experimental results show clearly the advantage of purely asynchronous iterative methods. (Author)

539 citations