
Journal ArticleDOI

A block-asynchronous relaxation method for graphics processing units

01 Dec 2013-Journal of Parallel and Distributed Computing (Academic Press, Inc.)-Vol. 73, Iss: 12, pp 1613-1626

TL;DR: This paper develops asynchronous iteration algorithms in CUDA and compares them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems and identifies the high potential of the asynchronous methods for Exascale computing.

Abstract: In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPU's computing power and is, despite its lower convergence rate compared to the Gauss-Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss-Seidel running on CPUs or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, by enhancing the most basic asynchronous approach with hybrid schemes, using multiple iterations within the "subdomain" handled by a GPU thread block, we manage not only to recover the loss of global convergence but often to accelerate convergence by up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure indicates the high potential of asynchronous methods for Exascale computing.

Topics: Asynchronous communication (59%), CUDA (54%), Rate of convergence (51%), Relaxation (iterative method) (51%), Scalability (50%)

Summary (3 min read)

Introduction

  • The authors analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs).
  • The latest developments in hardware architectures show an enormous increase in the number of processing units (computing cores) that form one processor.
  • On the other hand, numerical algorithms usually require this synchronization.
  • In the following section the authors analyze the experiment results with a focus on the convergence behavior and the iteration times for the different matrix systems.

B. Asynchronous Iteration Methods

  • For computing the next iteration in a relaxation method, one usually requires the latest values of all components.
  • The question of interest that the authors want to investigate is what happens if this order is not adhered to.
  • Since the synchronization usually thwarts the overall performance, it may be true that the asynchronous iteration schemes overcompensate the inferior convergence behavior by superior scalability.
  • Furthermore, conditions can be defined to guarantee the well-posedness of the algorithm [17], for example that the update function u(·) takes each of the values l for 1 ≤ l ≤ N infinitely often; the underlying update rule is restated in the formula after this list.
  • The AGS method uses new values of unknowns in its subsequent updates as soon as they are computed in the same iteration, while the AJ method uses only values that are set at the beginning of an iteration.
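
For reference, the asynchronous (chaotic) update rule analyzed by the authors (Eq. (3) in Section II-B of the paper) can be stated compactly as below. The notation follows the paper: u(·) is the update function, s(·,·) the shift function, B = (b_{l,m}) the iteration matrix and d the corresponding right-hand side vector; this is a restatement of the paper's formula, not an addition to the method.

    % Chazan-Miranker style asynchronous update: at step \nu only component
    % u(\nu) is recomputed, using component values that may be s(\nu,m) steps old.
    x_l^{\nu+1} =
    \begin{cases}
      \sum_{m=1}^{N} b_{l,m} \, x_m^{\nu - s(\nu,m)} + d_l & \text{if } l = u(\nu), \\
      x_l^{\nu} & \text{if } l \neq u(\nu).
    \end{cases}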

A. Linear Systems of Equations

  • In their experiments, the authors search for approximate solutions of linear systems of equations, where the respective matrices are taken from the University of Florida Matrix Collection (UFMC; see http://www.cise.ufl.edu/research/sparse/matrices/).
  • Due to the convergence properties of the iterative methods considered, the experiment matrices have to be chosen properly.
  • While for the Jacobi method a sufficient condition for convergence is clearly ρ(M) = ρ(I - D^{-1}A) < 1 (i.e., the spectral radius of the iteration matrix M must be smaller than one), the convergence theory for asynchronous iteration methods is more involved (and is not the subject of this paper).
  • The matrices and their descriptions are summarized in Table II, their structures can be found in Figure 1.
  • The authors furthermore take the number of right-hand sides to be one for all linear systems.

B. Hardware and Software Issues

  • The experiments were conducted on a heterogeneous GPU-accelerated multicore system located at the University of Tennessee, Knoxville.
  • In the synchronous implementation of Gauss-Seidel on the CPU, 4 cores are used for the matrix-vector operations that can be parallelized.
  • The component updates were coded in CUDA, using thread blocks of size 512.

C. An asynchronous iteration method for GPUs

  • The asynchronous iteration method for GPUs that the authors propose is split into two levels.
  • This is due to the design of graphics processing units and the CUDA programming language.
  • Across these thread blocks, a PA iteration method is used, while within each thread block, a Jacobi-like iteration method is performed.
  • During the local iterations the x values used from outside the block are kept constant (equal to their values at the beginning of the local iterations).
  • The shift function ν(m + 1, j) denotes the iteration shift for component j; this can be positive or negative, depending on whether the respective other thread block has already conducted more or fewer iterations (a hedged CUDA sketch of this two-level scheme follows this list).
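
To make the two-level structure concrete, here is a minimal CUDA sketch of one block-asynchronous sweep. It assumes CSR storage, one thread per row, a separately stored diagonal, and off-block values that are read once and kept constant during the local iterations; the kernel name, its parameters and the constants BLOCK_SIZE and LOCAL_ITERS are illustrative choices, not the authors' actual implementation.

    #include <cuda_runtime.h>

    #define BLOCK_SIZE  512   // rows per thread block, one thread per row (the paper uses blocks of 512)
    #define LOCAL_ITERS 5     // local Jacobi-like sweeps per block, cf. the async-(5) variant

    // One block-asynchronous sweep: thread block b owns rows [b*BLOCK_SIZE, (b+1)*BLOCK_SIZE).
    __global__ void block_async_sweep(int n,
                                      const int    *row_ptr,  // CSR row pointers (n+1 entries)
                                      const int    *col_idx,  // CSR column indices
                                      const double *val,      // CSR values
                                      const double *diag,     // diagonal entries of A
                                      const double *b,        // right-hand side
                                      double       *x)        // iterate, updated in place
    {
        __shared__ double x_local[BLOCK_SIZE];                // local part of the iterate

        const int  row       = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        const int  blk_begin = blockIdx.x * BLOCK_SIZE;
        const int  blk_end   = min(blk_begin + BLOCK_SIZE, n);
        const bool active    = (row < n);

        // Read the current global values of the local subdomain once.
        if (active) x_local[threadIdx.x] = x[row];
        __syncthreads();

        // Contribution of components outside the subdomain: computed once and kept
        // constant during the local iterations. Whatever the other blocks have
        // already published in x is used here; there is no synchronization between blocks.
        double sigma_ext = 0.0;
        if (active) {
            for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
                const int col = col_idx[k];
                if (col < blk_begin || col >= blk_end)
                    sigma_ext += val[k] * x[col];
            }
        }

        // Jacobi-like local iterations on the subdomain.
        for (int it = 0; it < LOCAL_ITERS; ++it) {
            double x_new = 0.0;
            if (active) {
                double sigma = sigma_ext;
                for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
                    const int col = col_idx[k];
                    if (col >= blk_begin && col < blk_end && col != row)
                        sigma += val[k] * x_local[col - blk_begin];
                }
                x_new = (b[row] - sigma) / diag[row];
            }
            __syncthreads();                        // finish all reads of x_local ...
            if (active) x_local[threadIdx.x] = x_new;
            __syncthreads();                        // ... before the next local sweep
        }

        if (active) x[row] = x_local[threadIdx.x];  // publish the subdomain update
    }

    // Host side (sketch): one "global" iteration per kernel launch, e.g.
    //   int grid = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    //   block_async_sweep<<<grid, BLOCK_SIZE>>>(n, d_row_ptr, d_col_idx, d_val, d_diag, d_b, d_x);
    // repeated until the residual norm is small enough.

The iteration shift ν(m + 1, j) from the last bullet is implicit in such a sketch: it is determined by how many sweeps the block owning component j has already completed when its value is read from global memory.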

A. Stochastic impact of chaotic behavior of asynchronous iteration methods

  • At this point it should be mentioned that only the synchronous Gauss-Seidel and Jacobi methods are deterministic.
  • For the asynchronous iteration method on the GPU, the results are not reproducible, since every iteration run conducts a unique pattern of component updates.
  • It is possible that a different component update order results in faster or slower convergence.
  • The results reported in Table IV are based on 100 simulation runs on the test matrix FV3.
  • Analyzing the results, the authors observe only small variations in the convergence behavior: for 1000 iterations the relative residual is improved by 5 · 10^{-2}, and the maximal variation between the fastest and the slowest convergence rate is on the order of 10^{-5}.

B. Convergence rate of the asynchronous iteration method

  • In the next experiment, the authors analyze the convergence behavior of the asynchronous iteration method and compare it with the convergence rate of the Gauss-Seidel and Jacobi method.
  • The experiment results, summarized in Figures 4, 5, 6, 7 and 9, show that for the test systems CHEM97ZTZ, FV1, FV2, FV3 and TREFETHEN 2000 the synchronous Gauss-Seidel algorithm converges in considerably fewer iterations.
  • This superior convergence behavior is intuitively expected, since the synchronization after each component update allows the updated components to be used immediately for the next update.
  • Still, for all test cases the authors observe convergence rates similar to those of the synchronized counterpart (Jacobi), which in turn requires almost twice as many iterations as Gauss-Seidel.
  • The results for test matrix S1RMT3M1 show an example where neither of the methods is suitable for direct use.

C. Block-asynchronous iteration method

  • The authors now consider a block-asynchronous iteration method which additionally performs a few Jacobi-like iterations on every subdomain.
  • A motivation for this approach is hardware related: the additional local iterations come almost for free, since the subdomains are relatively small and the data needed largely fits into the multiprocessors' caches.
  • The case of TREFETHEN 2000 is similar: although there is improvement compared to Jacobi, the rate of convergence for async-(5) is not twice that of Gauss-Seidel, and the reason is again the structure of the local matrices.
  • Because the fixed overhead (memory transfers, kernel launches) dominates when only a few iterations are performed, the average computation time per iteration decreases significantly when a large number of iterations is conducted.
  • Overall, the authors observe that the average iteration time for the async-(5) method using the GPU is only a fraction of the time needed to conduct one iteration of the synchronous Gauss-Seidel on the CPU.

D. Performance of the block-asynchronous iteration method

  • To analyze the performance of the block-asynchronous iteration method, the authors show in Figures 17, 18, 19, 20, and 21 the average time needed for the synchronous Gauss-Seidel, the synchronous Jacobi and the block-asynchronous iteration method to provide a solution approximation of certain accuracy, relative to the initial residual.
  • Since the convergence of the Gauss-Seidel method for S1RMT3M1 is almost negligible, and the Jacobi and the asynchronous iteration do not converge at all, the authors limit this analysis to the linear systems of equations CHEM97ZTZ, FV1, FV2, FV3 and TREFETHEN 2000.
  • This is due to the fact that for these high iteration numbers the overhead triggered by memory transfer and GPU kernel call has minor impact.
  • For linear equation systems with considerable off-diagonal part e.g.
  • Still, due to the faster kernel execution the block-asynchronous iteration provides the solution approximation in shorter time.

V. CONCLUSIONS

  • The authors developed asynchronous relaxation methods for highly parallel architectures.
  • The experiments have revealed the potential of using them on GPUs.
  • The absence of synchronization points makes it possible not only to reach high scalability, but also to use the GPU architecture efficiently.
  • Nevertheless, the numerical properties of asynchronous iteration pose some restrictions on the usage.
  • The presented approach could be embedded in a multigrid framework, replacing the traditional Gauss-Seidel based smoothers.


A Block-Asynchronous Relaxation
Method for Graphics Processing Units
H. Anzt, J. Dongarra, V. Heuveline, S. Tomov
No. 2011-14
KIT University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
www.emcl.kit.edu
Preprint Series of the Engineering Mathematics and Computing Lab (EMCL)

Preprint Series of the Engineering Mathematics and Computing Lab (EMCL)
ISSN 2191–0693
No. 2011-14
Impressum
Karlsruhe Institute of Technology (KIT)
Engineering Mathematics and Computing Lab (EMCL)
Fritz-Erler-Str. 23, building 01.86
76133 Karlsruhe
Germany
KIT University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
Published on the Internet under the following Creative Commons License:
http://creativecommons.org/licenses/by-nc-nd/3.0/de
www.emcl.kit.edu

A Block-Asynchronous Relaxation Method for Graphics Processing Units
Hartwig Anzt, Stanimire Tomov, Jack Dongarra and Vincent Heuveline
Karlsruhe Institute of Technology, Germany
University of Tennessee, Knoxville, USA
Oak Ridge National Laboratory, USA
University of Manchester, UK
{hartwig.anzt, vincent.heuveline}@kit.edu
{tomov, dongarra}@cs.utk.edu
Abstract—In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). For this purpose, we developed a set of asynchronous iteration algorithms in CUDA and compared them with a parallel implementation of synchronous relaxation methods on CPU-based systems. For a set of test matrices taken from the University of Florida Matrix Collection we monitor the convergence behavior, the average iteration time and the total time-to-solution. Analyzing the results, we observe that even for our most basic asynchronous relaxation scheme, despite its lower convergence rate compared to the Gauss-Seidel relaxation (which we expected), the asynchronous iteration running on GPUs is still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss-Seidel running on CPUs. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, by enhancing the most basic asynchronous approach with hybrid schemes (using multiple iterations within the "subdomain" handled by a GPU thread block and Jacobi-like asynchronous updates across the "boundaries", subject to tuning various parameters), we manage not only to recover the loss of global convergence but often to accelerate convergence by up to two times (compared to the standard but difficult to parallelize Gauss-Seidel type of schemes), while keeping the execution time of a global iteration practically the same. This shows the high potential of the asynchronous methods not only as a stand-alone numerical solver for linear systems of equations fulfilling certain convergence conditions but, more importantly, as a smoother in multigrid methods. Due to the explosion of parallelism in today's architecture designs, the significance of and the need for asynchronous methods, as the ones described in this work, is expected to grow.
Keywords—Asynchronous Relaxation; Chaotic Iteration; Graphics Processing Units (GPUs); Jacobi Method
I. INTRODUCTION
The latest developments in hardware architectures show an enormous increase in the number of processing units (computing cores) that form one processor. The reasons for this range from physical limitations to energy minimization considerations that are at odds with further scaling up of processor frequencies, the basic acceleration method used in architecture designs for the last decades [1]. Only by merging multiple processing units into one processor does further acceleration seem possible. One example where this core gathering is carried to extremes is the GPU. The current high-end products of the leading GPU providers consist of 448 CUDA cores for the NVIDIA Fermi generation [2] and 3072 stream processors for the Northern Islands generation from ATI [3]. While the original purpose of GPUs was graphics processing, their enormous computing power also suggests their usage as accelerators when performing parallel computations. Yet, the design and characteristics of these devices pose some challenges for their efficient use. In particular, since the synchronization between the individual processing units usually triggers considerable overhead, it is attractive to employ algorithms that have a high degree of parallelism and only very few synchronization points.
On the other hand, numerical algorithms usually require this synchronization. For example, when solving linear systems of equations with iterative methods like the Conjugate Gradient or GMRES, the parallelism is usually limited to the matrix-vector and the vector-vector operations (with synchronization required between them) [4] [5] [6]. Also, methods that are based on component-wise updates like Jacobi or Gauss-Seidel have synchronization between the iteration steps [7] [8]: no component is updated twice (or more) before all other components are updated. Still, it is possible to ignore these synchronization steps, which results in a chaotic or asynchronous iteration process. Despite the fact that the numerical robustness and convergence properties severely suffer from this chaotic behavior, such methods may be interesting for specific applications, since the absence of synchronization points makes them perfect candidates for highly parallel hardware platforms. The result is a trade-off: while the algorithm's convergence may suffer from the asynchronism, the performance can benefit from the superior scalability.
In this paper, we want to analyze the potential of employing asynchronous iteration methods on GPUs by analyzing convergence behavior and time-to-solution when iteratively solving linear systems of equations. We split this paper into the following parts: First, we shortly recall the mathematical idea of the Jacobi iteration method and derive the component-wise iteration algorithm. Then the idea of an asynchronous relaxation method is derived, and some basic characteristics concerning the convergence demands are summarized. The section about the experiment framework first provides information about the linear systems of equations we target. The matrices affiliated with these systems are taken from the University of Florida Matrix Collection. Then we describe the asynchronous iteration method for GPUs that we designed. In the following section we analyze the experiment results with a focus on the convergence behavior and the iteration times for the different matrix systems. In Section V we summarize the results and provide an outlook on future work in this field.
II. MATHEMATICAL BACKGROUND
A. Jacobi Method
The Jacobi method is an iterative algorithm for finding the approximate solution of a linear system of equations

    Ax = b,    (1)

where A is strictly or irreducibly diagonally dominant. One can rewrite the system as (L + D + U)x = b, where D denotes the diagonal entries of A while L and U denote the strictly lower and strictly upper triangular parts of A, respectively. Using the form Dx = b - (L + U)x, the Jacobi method is derived as the iterative scheme

    x^{m+1} = D^{-1} (b - (L + U) x^{m}).

Denoting the error at iteration m+1 by e^{m+1} := x^{m+1} - x, this scheme can also be rewritten as e^{m+1} = (I - D^{-1}A) e^{m}. The matrix M := I - D^{-1}A is often referred to as the iteration matrix. The Jacobi method provides a sequence of solution approximations with increasing accuracy when the spectral radius of the iteration matrix M is less than one (i.e., \rho(M) < 1) [9].
The Jacobi method can also be rewritten in the following component-wise form:

    x_i^{m+1} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{m} \right).    (2)
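
A minimal CUDA sketch of one synchronous Jacobi sweep, following the component-wise form (2) under the assumption of CSR storage and one thread per row, could look as follows; the kernel name and its parameters are illustrative and are not the implementation described later in this paper.

    #include <cuda_runtime.h>

    // One synchronous Jacobi sweep, cf. Eq. (2):
    //   x_new_i = (b_i - sum_{j != i} a_ij * x_old_j) / a_ii.
    // Reading only from x_old and writing only to x_new keeps the sweep synchronous.
    __global__ void jacobi_sweep(int n,
                                 const int    *row_ptr,  // CSR row pointers (n+1 entries)
                                 const int    *col_idx,  // CSR column indices
                                 const double *val,      // CSR values
                                 const double *b,        // right-hand side
                                 const double *x_old,    // iterate x^m (read only)
                                 double       *x_new)    // iterate x^{m+1} (write only)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        double diag  = 0.0;
        double sigma = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            const int j = col_idx[k];
            if (j == i) diag = val[k];              // pick up a_ii
            else        sigma += val[k] * x_old[j]; // off-diagonal contribution
        }
        x_new[i] = (b[i] - sigma) / diag;
    }

    // Host side (sketch): swap the roles of x_old and x_new after every launch, e.g.
    //   jacobi_sweep<<<(n + 255) / 256, 256>>>(n, d_row_ptr, d_col_idx, d_val, d_b, d_xold, d_xnew);
    // so that iteration m+1 only ever reads values of iteration m.
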
B. Asynchronous Iteration Methods
For computing the next iteration in a relaxation method, one usually requires the latest values of all components. For some algorithms, e.g. Gauss-Seidel [7], even the already computed values of the current iteration step are used. This requires a strict order of the component updates, limiting the parallelization potential to a stage where no component can be updated several times before all the other components are updated.
The question of interest that we want to investigate is what happens if this order is not adhered to. Since in this case the individual components are updated independently and without consideration of the current state of the other components, the resulting algorithm is called a chaotic or asynchronous iteration method. Back in the 1970s, Chazan and Miranker analyzed some basic properties of these methods and established convergence theory [10]. In the last 30 years, these algorithms were the subject of dedicated research activities [11], [12], [13], [14], [15], [16]. However, they did not play a significant role in high-performance computing, due to the superior convergence properties of synchronized iteration methods. Today, due to the complexity of heterogeneous hardware platforms and the high number of computing units in parallel devices like GPUs, these schemes may become interesting again: they do not require explicit synchronization between the computing cores, which may even be located in distinct hardware devices. Since the synchronization usually thwarts the overall performance, it may be true that the asynchronous iteration schemes overcompensate the inferior convergence behavior by superior scalability.
The chaotic, or asynchronous, relaxation scheme defined by Chazan and Miranker [10] can be characterized by two functions, an update function u(·) and a shift function s(·,·). For each non-negative integer ν, the component of the solution approximation x that is updated at step ν is given by u(ν). For the update at step ν, the value of the m-th component used in this step is the one from s(ν, m) steps back. All the other components are kept. This can be expressed as:

    x_l^{\nu+1} =
    \begin{cases}
      \sum_{m=1}^{N} b_{l,m} \, x_m^{\nu - s(\nu,m)} + d_l & \text{if } l = u(\nu), \\
      x_l^{\nu} & \text{if } l \neq u(\nu).
    \end{cases}    (3)

Furthermore, the following conditions can be defined to guarantee the well-posedness of the algorithm [17]:
1) The update function u(·) takes each of the values l for 1 ≤ l ≤ N infinitely often.
2) The shift function s(·,·) is bounded by some s̄ such that 0 ≤ s(ν, m) ≤ s̄ for all ν ∈ {1, 2, ...} and m ∈ {1, 2, ..., N}. For the initial step, we additionally require s(ν, m) ≤ ν.
3) The shift function s(·,·) is independent of m.
If these conditions are satisfied and ρ(|M|) < 1 (i.e., the spectral radius of the iteration matrix with its elements replaced by their absolute values is smaller than one), the convergence of the asynchronous method is guaranteed [17].
Depending on the exchange of the updated components, Baudet classified the asynchronous iterative methods into three sub-methods [18]:
1) the purely asynchronous method (PA);
2) the asynchronous Jacobi method (AJ);
3) the asynchronous Gauss-Seidel method (AGS).
The PA method releases each new value immediately after its computation, while the AJ and AGS methods exchange new values only at the end of each iteration. The only difference between the AJ and AGS methods is the choice of the values of the unknowns within each iteration. The AGS method uses new values of unknowns in its subsequent updates as soon as they are computed in the same iteration, while the AJ method uses only values that are set at the beginning of an iteration. In general, the term asynchronous iteration method that we use refers to the PA method. The basic properties are summarized in Table I.

    method | broadcast          | used values            | bound for shift
    PA     | immediately        | latest available       | |s(ν, m)| < s̄
    AJ     | end of iteration   | beginning of iteration | 0 ≤ s(ν, m) < s̄
    AGS    | end of iteration   | latest available       | 1 ≤ s(ν, m) < s̄

Table I: Basic properties of the different subclasses of asynchronous iteration methods.

Since the barrier synchronization between the iterations is usually daunting when using highly parallel devices, the purely asynchronous method is most suitable for both communication- and synchronization-avoiding iterative implementations.
The GPU implementation of the asynchronous iteration method that we consider in III-C is of purely asynchronous nature. For convenience, from now on we will use the term asynchronous iteration method if we refer to the PA iteration.
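
In its simplest form, a purely asynchronous update on a GPU can be realized by letting every thread read from and write to the same vector, so that each component update uses whatever mixture of old and new values happens to be visible in global memory. The kernel below is a hypothetical sketch under the same CSR/one-thread-per-row assumptions as the Jacobi sweep above, not the implementation considered in III-C.

    #include <cuda_runtime.h>

    // Simplest asynchronous relaxation sweep: x is read and written in place,
    // so an update may see old values, values updated earlier in the same
    // launch, or values written concurrently by other threads. No ordering
    // between component updates is enforced.
    __global__ void async_sweep(int n,
                                const int    *row_ptr,  // CSR row pointers (n+1 entries)
                                const int    *col_idx,  // CSR column indices
                                const double *val,      // CSR values
                                const double *b,        // right-hand side
                                double       *x)        // iterate, updated in place
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        double diag  = 0.0;
        double sigma = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            const int j = col_idx[k];
            if (j == i) diag = val[k];
            else        sigma += val[k] * x[j];   // latest value visible to this thread
        }
        x[i] = (b[i] - sigma) / diag;             // published without synchronization
    }

Repeated launches of such a kernel realize the chaotic update pattern described above; convergence then hinges on conditions like ρ(|M|) < 1, recalled in the next section, rather than on any particular update order.
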
III. EXPERIMENT FRAMEWORK
A. Linear Systems of Equations
In our experiments, we search for the approximate solutions of linear systems of equations, where the respective matrices are taken from the University of Florida Matrix Collection (UFMC; see http://www.cise.ufl.edu/research/sparse/matrices/). Due to the convergence properties of the iterative methods considered, the experiment matrices have to be chosen properly. While for the Jacobi method a sufficient condition for convergence is clearly ρ(M) = ρ(I - D^{-1}A) < 1 (i.e., the spectral radius of the iteration matrix M has to be smaller than one), the convergence theory for asynchronous iteration methods is more involved (and is not the subject of this paper). In [17], John C. Strikwerda has shown that a sufficient condition for the asynchronous iteration to converge for all update and shift functions satisfying conditions (1), (2) and (3) in II-B is ρ(|M|) < 1, where |M| is derived from M by replacing its elements by their corresponding absolute values.
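
Note that the two conditions are not interchangeable: ρ(M) < 1 does not imply ρ(|M|) < 1. As a small illustration (a constructed 2x2 example, not taken from the test set), consider

    M = \begin{pmatrix} 0.5 & 0.6 \\ -0.6 & 0.5 \end{pmatrix}, \qquad
    |M| = \begin{pmatrix} 0.5 & 0.6 \\ 0.6 & 0.5 \end{pmatrix},

    \rho(M) = |0.5 \pm 0.6i| = \sqrt{0.61} \approx 0.78 < 1, \qquad
    \rho(|M|) = 0.5 + 0.6 = 1.1 > 1.

Here the Jacobi-type condition holds while the sufficient condition for asynchronous convergence does not; since the criterion is only sufficient, this does not prove divergence, but it shows why the asynchronous analysis requires the stronger assumption.
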
Due to these considerations, we choose to analyze only symmetric, positive definite systems for which the Jacobi method converges. The matrices and their descriptions are summarized in Table II, and their structures can be found in Figure 1. Table III additionally provides some of the convergence-related characteristics of the test matrices as well as of their corresponding iteration matrices.
We furthermore take the number of right-hand sides to be one for all linear systems.
B. Hardware and Software Issues
The experiments were conducted on a heterogeneous GPU-accelerated multicore system located at the University of Tennessee, Knoxville. The system's CPU is a one-socket Intel Core Quad Q9300 @ 2.50 GHz, and the GPU is a Fermi C2050 (14 multiprocessors x 32 CUDA cores @ 1.15 GHz, 3 GB memory). The GPU is connected to the CPU host through a PCIe x16 interface.
In the synchronous implementation of Gauss-Seidel on the CPU, 4 cores are used for the matrix-vector operations that can be parallelized. Intel compiler 11.1.069 [19] is used with optimization flag "-O3". The GPU implementations of the asynchronous iteration and the Jacobi method are based on CUDA [20], while the respective libraries used are from CUDA 4.0.17 [21]. The component updates were coded in CUDA, using thread blocks of size 512. The kernels are then ...

    Matrix name    | Description           | #n    | #nnz
    CHEM97ZTZ      | statistical problem   | 2,541 | 7,361
    FV1            | 2D/3D problem         | 9,604 | 85,264
    FV2            | 2D/3D problem         | 9,801 | 87,025
    FV3            | 2D/3D problem         | 9,801 | 87,025
    S1RMT3M1       | structural problem    | 5,489 | 262,411
    TREFETHEN 2000 | combinatorial problem | 2,000 | 41,906

Table II: Dimension and characteristics of the SPD test matrices.

    Matrix name    | cond(A) | cond(D^{-1}A) | ρ(M)
    CHEM97ZTZ      | 1.3e+03 | 7.2e+03       | 0.7889
    FV1            | 9.3e+04 | 12.76         | 0.8541
    FV2            | 9.5e+04 | 12.76         | 0.8541
    FV3            | 3.6e+07 | 4.4e+03       | 0.9993
    S1RMT3M1       | 2.2e+06 | 7.2e+06       | 2.65
    TREFETHEN 2000 | 5.1e+04 | 6.1579        | 0.8601

Table III: Convergence characteristics of the test matrices and of their corresponding iteration matrices.

Figure 1: Sparsity plots of the test matrices: (a) CHEM97ZTZ, (b) FV1, FV2, FV3, (c) S1RMT3M1, (d) TREFETHEN 2000.

Citations

Proceedings ArticleDOI
17 Nov 2013
TL;DR: It is shown how to use the idea of self-stabilization, which originates in the context of distributed control, to make fault-tolerant iterative solvers; self-stabilization has promise to become a useful tool for constructing resilient solvers more generally.
Abstract: We show how to use the idea of self-stabilization, which originates in the context of distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property imbues the system with a natural means of tolerating transient faults. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require small amounts of fault-detection, e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults. Beyond the specific findings of this paper, we believe self-stabilization has promise to become a useful tool for constructing resilient solvers more generally.

96 citations


Book ChapterDOI
24 Aug 2015
TL;DR: This work proposes using an iterative approach for solving sparse triangular systems when an approximation is suitable, and demonstrates the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method.
Abstract: Sparse triangular solvers are typically parallelized using level-scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations.

62 citations


Book ChapterDOI
12 Jul 2015
TL;DR: This paper presents a GPU implementation of an asynchronous iterative algorithm for computing incomplete factorizations that considers several non-traditional techniques that can be important for asynchronous algorithms to optimize convergence and data locality.
Abstract: This paper presents a GPU implementation of an asynchronous iterative algorithm for computing incomplete factorizations. Asynchronous algorithms, with their ability to tolerate memory latency, form an important class of algorithms for modern computer architectures. Our GPU implementation considers several non-traditional techniques that can be important for asynchronous algorithms to optimize convergence and data locality. These techniques include controlling the order in which variables are updated by controlling the order of execution of thread blocks, taking advantage of cache reuse between thread blocks, and managing the amount of parallelism to control the convergence of the algorithm.

36 citations


Journal ArticleDOI
TL;DR: The results show that the proposed algorithm has achieved an acceptable performance for diagnosis of AML and its common subtypes and can be used as an assistant diagnostic tool for pathologists.
Abstract: Acute myelogenous leukemia (AML) is a subtype of acute leukemia, which is characterized by the accumulation of myeloid blasts in the bone marrow. Careful microscopic examination of stained blood smear or bone marrow aspirate is still the most significant diagnostic methodology for initial AML screening and considered as the first step toward diagnosis. It is time-consuming and due to the elusive nature of the signs and symptoms of AML; wrong diagnosis may occur by pathologists. Therefore, the need for automation of leukemia detection has arisen. In this paper, an automatic technique for identification and detection of AML and its prevalent subtypes, i.e., M2-M5 is presented. At first, microscopic images are acquired from blood smears of patients with AML and normal cases. After applying image preprocessing, color segmentation strategy is applied for segmenting white blood cells from other blood components and then discriminative features, i.e., irregularity, nucleus-cytoplasm ratio, Hausdorff dimension, shape, color, and texture features are extracted from the entire nucleus in the whole images containing multiple nuclei. Images are classified to cancerous and noncancerous images by binary support vector machine (SVM) classifier with 10-fold cross validation technique. Classifier performance is evaluated by three parameters, i.e., sensitivity, specificity, and accuracy. Cancerous images are also classified into their prevalent subtypes by multi-SVM classifier. The results show that the proposed algorithm has achieved an acceptable performance for diagnosis of AML and its common subtypes. Therefore, it can be used as an assistant diagnostic tool for pathologists.

32 citations


Proceedings ArticleDOI
21 May 2012
TL;DR: A set of asynchronous iteration algorithms in CUDA developed and compared with a parallel implementation of synchronous relaxation methods on CPU-based systems shows the high potential of the asynchronous methods not only as a stand alone numerical solver for linear systems of equations fulfilling certain convergence conditions but more importantly as a smoother in multigrid methods.
Abstract: In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). For this purpose, we developed a set of asynchronous iteration algorithms in CUDA and compared them with a parallel implementation of synchronous relaxation methods on CPU-based systems. For a set of test matrices taken from the University of Florida Matrix Collection we monitor the convergence behavior, the average iteration time and the total time-to-solution time. Analyzing the results, we observe that even for our most basic asynchronous relaxation scheme, despite its lower convergence rate compared to the Gauss-Seidel relaxation (that we expected), the asynchronous iteration running on GPUs is still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss-Seidel running on CPUs. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes -- using multiple iterations within the "sub domain" handled by a GPU thread block and Jacobi-like asynchronous updates across the "boundaries", subject to tuning various parameters -- we manage to not only recover the loss of global convergence but often accelerate convergence of up to two times (compared to the standard but difficult to parallelize Gauss-Seidel type of schemes), while keeping the execution time of a global iteration practically the same. This shows the high potential of the asynchronous methods not only as a stand alone numerical solver for linear systems of equations fulfilling certain convergence conditions but more importantly as a smoother in multigrid methods. Due to the explosion of parallelism in today's architecture designs, the significance and the need for asynchronous methods, as the ones described in this work, is expected to grow.

13 citations


Cites background or methods or result from "A block-asynchronous relaxation met..."

  • ...As the linear system combines small dimension with a low condition number, both enabling fast convergence, the overhead triggered by the GPU kernel calls is crucial for small iteration numbers [1]....

    [...]

  • ...Similar tests on diagonal dominant systems with the same sparsity pattern but higher condition number reveal that the latter one has only small impact on the variations between the individual solver runs; see [1]....

    [...]

  • ...This stems from the fact that the entries located outside the subdomains are not taken into account for the local iterations in async(5) [1]....

    [...]

  • ...Even if we iterate every component locally by 9 Jacobi iterations, the overhead is less than 35%, while the total updates for every component is increased by a factor of 9 [1]....

    [...]

  • ...Hence, as long as the asynchronous method converges and the offblock entries are ‘‘small’’, adding local iterations may be used to not only compensate for the convergence loss due to the chaotic behavior, but moreover to gain significant overall convergence improvements [1]....

    [...]


References

Book
01 Apr 2003
TL;DR: This chapter discusses methods related to the normal equations of linear algebra, and some of the techniques used in this chapter were derived from previous chapters of this book.
Abstract: Preface 1. Background in linear algebra 2. Discretization of partial differential equations 3. Sparse matrices 4. Basic iterative methods 5. Projection methods 6. Krylov subspace methods Part I 7. Krylov subspace methods Part II 8. Methods related to the normal equations 9. Preconditioned iterations 10. Preconditioning techniques 11. Parallel implementations 12. Parallel preconditioners 13. Multigrid methods 14. Domain decomposition methods Bibliography Index.

12,575 citations


"A block-asynchronous relaxation met..." refers methods in this paper

  • ...For example, when solving linear systems of equations with iterative methods like the Conjugate Gradient or GMRES, the parallelism is usually limited to the matrix-vector and the vector-vector operations (with synchronization required between them) [4] [5] [6]....

    [...]


Journal ArticleDOI
TL;DR: An iterative method for solving linear systems, which has the property of minimizing at every step the norm of the residual vector over a Krylov subspace.
Abstract: We present an iterative method for solving linear systems, which has the property of minimizing at every step the norm of the residual vector over a Krylov subspace. The algorithm is derived from t...

10,155 citations


Book
01 Jan 1987
TL;DR: Preface How to Get the Software How to get the Software Part I.
Abstract: Preface How to Get the Software Part I. Linear Equations. 1. Basic Concepts and Stationary Iterative Methods 2. Conjugate Gradient Iteration 3. GMRES Iteration Part II. Nonlinear Equations. 4. Basic Concepts and Fixed Point Iteration 5. Newton's Method 6. Inexact Newton Methods 7. Broyden's Method 8. Global Convergence Bibliography Index.

2,427 citations


"A block-asynchronous relaxation met..." refers methods in this paper

  • ...For some algorithms, e.g. Gauss-Seidel [7], even the already computed values of the current iteration step are used....

    [...]

  • ...Also, methods that are based on component-wise updates like Jacobi or Gauss- Seidel have synchronization between the iteration steps [7], [8]: no component is updated twice (or more) before all other components are updated....

    [...]


Journal ArticleDOI
01 Feb 2011
TL;DR: The work of the community to prepare for the challenges of exascale computing is described, ultimately combining their efforts in a coordinated International Exascale Software Project.
Abstract: Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.

705 citations


"A block-asynchronous relaxation met..." refers methods in this paper

  • ...The reason for this varies from various physical limitations to energy minimization considerations that are at odds with further scaling up of processor' frequencies – the basic acceleration method used in the architecture designs for the last decades [1]....

    [...]


Journal ArticleDOI
TL;DR: A class of asynchronous iterative methods is presented for solving a system of equations corresponding to a parallel implementation on a multiprocessor system with no synchronization between cooperating processes to show clearly the advantage of purely asynchronous Iterative methods.
Abstract: A class of asynchronous iterative methods is presented for solving a system of equations. Existing iterative methods are identified in terms of asynchronous iterations, and new schemes are introduced corresponding to a parallel implementation on a multiprocessor system with no synchronization between cooperating processes. A sufficient condition is given to guarantee the convergence of any asynchronous iterations, and results are extended to include iterative methods with memory. Asynchronous iterative methods are then evaluated from a computational point of view, and bounds are derived for the efficiency. The bounds are compared with actual measurements obtained by running various asynchronous iterations on a multiprocessor, and the experimental results show clearly the advantage of purely asynchronous iterative methods. (Author)

520 citations