A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
Mitch Horton, Stanimire Tomov and Jack Dongarra
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996
Email: {horton, tomov, dongarra}@eecs.utk.edu
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
School of Mathematics & School of Computer Science, University of Manchester
Abstract—Three out of the top four supercomputers in the November 2010 TOP500 list of the world’s most powerful supercomputers use NVIDIA GPUs to accelerate computations. Ninety-five systems from the list are using processors with six or more cores. Three-hundred-sixty-five systems use quad-core processor-based systems. Thirty-seven systems are using dual-core processors. The large-scale enabling of hybrid graphics processing unit (GPU)-based multicore platforms for computational science by developing fundamental numerical libraries (in particular, libraries in the area of dense linear algebra) for them has been underway for some time. We present a class of algorithms based largely on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithms extend what is currently available in the Matrix Algebra for GPU and Multicore Architectures (MAGMA) Library for performing Cholesky, QR, and LU factorizations using a single core or socket and a single GPU. The extensions occur in two areas. First, panels factored on the CPU using LAPACK are, instead, done in parallel using a highly optimized dynamic asynchronous scheduled algorithm on some number of CPU cores. Second, the remaining CPU cores are used to update the rightmost panels of the matrix in parallel.
Keywords-GPU; multicore; QR; LU; Cholesky;
I. INTRODUCTION
Until roughly 2004, microprocessor manufacturers were
able to achieve higher performance by exploiting higher
degrees of instruction level parallelism (ILP). Based on this
approach, several generations of processors were built where
clock frequencies were higher and higher and pipelines were
deeper and deeper. As a result, applications could benefit
from these innovations and achieve higher performance
simply by relying on compilers that could efficiently exploit
ILP. Due to a number of physical limitations (mostly power
consumption and heat dissipation) this approach cannot be
pushed any further. For this reason, chip designers have
moved their focus from ILP to thread level parallelism (TLP)
where higher performance can be achieved by replicating
execution units (or cores) on the die while keeping the
clock rates in a range where power consumption and heat
dissipation do not represent a problem [1]–[3]. CPU designs
have moved to multicores and are currently going through
a renaissance due to the need for new approaches to man-
age the exponentially increasing (a) appetite for power of
conventional system designs, and (b) gap between compute
and communication speeds. Compute Unified Device Ar-
chitecture (CUDA) [4] based multicore platforms stand out
among a confluence of trends because of their low power
consumption and, at the same time, high compute power
and bandwidth [3]. Because of the prevalence of multicore
and GPU architectures in the TOP500 list [5]; the existence
of current conferences and workshops with emphasis on
multicore and GPU technology [6]–[33]; the long list of
GPU related success stories across academia, industry, and
national research laboratories for specific applications and
algorithms [34]–[46]; books related to general purpose GPU
computing [47]–[49]; the emergence of compilers that un-
derstand GPU directives [50]–[53]; language in the current
Exascale roadmap concerning heterogeneity in general and
general purpose GPU programming in particular [54]; the
fact that NVIDIA did $100 million in revenue from high
performance computing last year, up from zero three years
ago [55]; and relentless architectural advancements [56]–
[63], it is clear that multicore processors and GPUs represent
the future of high performance computing.
As multicore and GPU systems continue to gain ground
in the high performance computing world, linear algebra
algorithms have to be reformulated, or new algorithms
have to be developed, in order to take advantage of the
architectural features on these new architectures [64]. This
work is a contribution to the development of these algorithms
in the area of dense linear algebra, and will be included
in the Matrix Algebra for GPU and Multicore Architec-
tures (MAGMA) Library [38]. Designed to be similar to
LAPACK [65] in functionality, data storage, and interface,
the MAGMA library allows scientists to effortlessly port
their LAPACK-relying software components and to take
advantage of the new hybrid architectures.
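As a toy illustration of that drop-in similarity, the sketch below shows a CPU-only QR call next to what the corresponding MAGMA call would look like. The LAPACKE prototype is standard; the magma_dgeqrf prototype shown in the comment is an assumption based on the stated LAPACK-like interface and should be checked against the MAGMA headers.

```c
/* Porting sketch: a LAPACK-based QR factorization and the MAGMA-style
 * replacement.  Column-major storage in both cases. */
#include <lapacke.h>

/* Existing CPU-only code: factor the m-by-n matrix A in place. */
void qr_cpu(int m, int n, double *A, int lda, double *tau)
{
    LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, lda, tau);
}

/* Hybrid version (assumed prototype, mirroring LAPACK's dgeqrf):
 *
 *   magma_dgeqrf(m, n, A, lda, tau, work, lwork, &info);
 *
 * Same matrix, same layout, same tau array, so the surrounding
 * application code is left unchanged. */
```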
The challenges in developing scalable high performance
algorithms for multicore with GPU accelerators systems
stem from their heterogeneity, massive parallelism, and the
huge gap between the GPUs’ compute power vs. the CPU-
GPU communication speed. We show an approach that is
largely based on software infrastructures that have already
been developed, namely the QUeuing And Runtime for
Kernels (QUARK) dynamic scheduler [66] and the MAGMA
[38] library. The approach extends what is currently avail-
able in the MAGMA Library for performing Cholesky, QR,
and LU factorizations using a single core or socket and a
single GPU. The extensions occur in two areas. First, panels
factored on the CPU using LAPACK are, instead, done in
parallel using a highly optimized dynamic asynchronous
QUARK scheduled algorithm on some number of CPU
cores. Second, the remaining CPU cores are used to update
the rightmost panels of the matrix in parallel. The approach
aims to better utilize all available hardware.
The results of this work are communicated using the
QR algorithm as a framework. The Cholesky and LU
algorithms are similar in implementation. The paper is
organized as follows. Section II provides an overview of
the QR factorization. Section III illustrates how the QR
factorization is performed by the MAGMA library using
a single core or socket and a single GPU. Section IV
describes the new approach, outlining how it differs from
what is currently available in the MAGMA library. Section V
briefly describes the QUARK dynamic scheduler. Section VI
discusses autotuning. In particular, it explains how to choose, for a
given matrix size, precision, architecture, and algorithm, the optimal
number of cores for panel factorization, the number of cores for panel
updates, the panel width, the outer panel width, and the inner panel
width. Section
VII describes algorithm optimization with respect to panel
factorization. Section VIII presents results on two different
architectures: a single NVIDIA GeForce GTX480 GPU
with fifteen cores (streaming multiprocessors) @1.401 GHz
connected to eight six-core Intel Xeon X5660 Westmere
@2.8 GHz processors and a single NVIDIA Tesla M2070
GPU with fourteen cores (streaming multiprocessors) @1.15
GHz connected to two six-core Intel Xeon X5660 Westmere
@2.8 GHz processors. Finally, section IX discusses future
work.
II. BLOCK QR FACTORIZATION
This section contains a high level explanation of the block
QR factorization implemented by LAPACK. The explanation
will facilitate an understanding of the description of the new
approach given in Section IV. A detailed discussion of the
block QR factorization can be found in [67]–[71].
Stewart refers to the QR factorization as "the great
success story of modern computation" [72]. Trefethen and
Bau say, "One algorithmic idea in numerical linear algebra
is more important than all the others: QR factorization" [70].
It is used for solving linear systems of equations [64], [68],
solving the linear least squares problem [65], [73], com-
puting eigenvalues and eigenvectors [72], [74], computing
the SVD [72], [74], and computing an orthonormal basis
for a set of vectors [68]. Stewart says, "the underlying
theory of the method continues to suggest new algorithms" [72].
Golub and Van Loan give QR algorithms based on
Householder, block Householder, Givens, and fast Givens
transformations; Gram-Schmidt orthogonalization, and mod-
ified Gram-Schmidt orthogonalization [68]. The LAPACK
QR factorization is a block Householder transformation
implementation.
The QR factorization is a transformation that factorizes an
m×n matrix A into its factors Q and R where Q is a unitary
matrix of size m×m and R is an upper triangular (trapezoidal) matrix of size
m×n. The LAPACK version of this algorithm achieves higher
performance on architectures with memory hierarchies by
accumulating a number of Householder transformations in
what is called a panel factorization which are, then, applied
all at once by means of high performance Level 3 BLAS
operations.
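For reference, this accumulation is the standard compact WY representation of a product of Householder reflectors (a textbook identity, not something specific to this paper): a panel of $b$ reflectors $H_i = I - \tau_i v_i v_i^T$ is applied as one block,
$$
H_1 H_2 \cdots H_b \;=\; I - V\,T\,V^T,
$$
where $V = (v_1 \cdots v_b)$ is the unit lower triangular array of reflector vectors stored over the panel and $T$ is the $b \times b$ upper triangular matrix formed by xLARFT. Applying $I - VTV^T$ (or its transpose) to the trailing submatrix then reduces to matrix-matrix products, which is exactly the Level 3 BLAS work performed by xLARFB.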
The LAPACK routine that performs the QR factorization
is called xGEQRF where x can be S, D, C, or Z depending on
the precision. Consider a matrix A of size m×n represented
as
$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
$$
where $A_{11}$ is of size $b \times b$, $A_{12}$ is of size $b \times (n-b)$, $A_{21}$
is of size $(m-b) \times b$, and $A_{22}$ is of size $(m-b) \times (n-b)$.
The LAPACK algorithm for QR factorization can be
described as a sequence of steps where, at each step, the
transformation in Equation (1) is performed.
$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
  = \left( \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix},\;
    \begin{pmatrix} R_{11} & R_{12} \\ 0 & \tilde{A}_{22} \end{pmatrix} \right)
\qquad (1)
$$
The transformation in Equation (1) is obtained in two steps:
1) Panel factorization. At this step a QR factorization
of the panel ($A_1$) is performed as in Equation (2):
$$
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}
  = \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix},\; (T_{11}),\; (R_{11})
\qquad (2)
$$
This operation produces $b$ Householder reflectors ($V_1$) and
an upper triangular matrix $R_{11}$ of size $b \times b$, which is a
portion of the final $R$ factor, by means of the xGEQR2
LAPACK routine. At this step, a triangular matrix $T_{11}$ of
size $b \times b$ is produced by the xLARFT LAPACK routine.
Note that $V_{11}$ is a unit lower triangular matrix of size
$b \times b$. The arrays $V_1$ and $R_{11}$ overwrite $A_1$. Temporary
workspace is needed to store $T_{11}$.
2) Trailing submatrix update. At this step, the transfor-
mation that was computed in the panel factorization is
applied to the trailing submatrix as shown in Equation
(3).
$$
\begin{pmatrix} R_{12} \\ \tilde{A}_{22} \end{pmatrix}
  = \left( I - \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix}
    (T_{11}) \begin{pmatrix} V_{11}^T & V_{21}^T \end{pmatrix} \right)
    \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}
\qquad (3)
$$
This operation, performed by the xLARFB LAPACK
routine, produces a portion $R_{12}$ of the final $R$ factor
of size $b \times (n-b)$ and the matrix $\tilde{A}_{22}$.

Figure 1. A typical hybrid pattern of computation and communication
for the one-sided matrix factorizations in MAGMA 1.0. The matrix resides
in GPU memory; hP is a panel buffer in CPU work space and dP the
corresponding panel in GPU memory, with T1 and T2 the two parts of the
trailing submatrix. The computational pattern of a typical one-sided
hybrid factorization is:
1. Copy dP (GPU) to hP (CPU)
2. Factor hP on the CPU using LAPACK
3. Copy the resulting hP to dP
4. Update T1 on the GPU using dP
5. Send next panel (part of T1) to the CPU
6. Start updating T2 on the GPU
7. Start factoring the next panel on the CPU
...
The QR factorization is continued by applying the transformation (1)
to the submatrix $\tilde{A}_{22}$, and then, iteratively,
until the end of the matrix A is reached.
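To make the two steps concrete, the following is a minimal right-looking sketch of the blocked factorization in C, using the LAPACKE wrappers for xGEQR2, xLARFT, and xLARFB (double precision, column-major, no error handling). It is illustrative only and not the actual LAPACK xGEQRF implementation.

```c
#include <stdlib.h>
#include <lapacke.h>

/* Right-looking blocked QR: panel factorization (dgeqr2 + dlarft) followed
 * by the Level 3 BLAS trailing update (dlarfb), as in Equations (1)-(3). */
void blocked_qr(int m, int n, int b, double *A, int lda, double *tau)
{
    double *T = malloc((size_t)b * b * sizeof(double));  /* T_11 workspace */
    int k = (m < n) ? m : n;                             /* # of reflectors */

    for (int j = 0; j < k; j += b) {
        int jb   = (b < k - j) ? b : k - j;   /* width of the current panel  */
        int rows = m - j;                     /* height of the current panel */

        /* Panel factorization: V_1 and R_11 overwrite the panel (Eq. 2). */
        LAPACKE_dgeqr2(LAPACK_COL_MAJOR, rows, jb,
                       &A[j + (size_t)j * lda], lda, &tau[j]);

        if (j + jb < n) {
            /* Form the jb-by-jb triangular factor T_11 of the block reflector. */
            LAPACKE_dlarft(LAPACK_COL_MAJOR, 'F', 'C', rows, jb,
                           &A[j + (size_t)j * lda], lda, &tau[j], T, jb);

            /* Trailing submatrix update (Eq. 3): one Level 3 BLAS-rich call. */
            LAPACKE_dlarfb(LAPACK_COL_MAJOR, 'L', 'T', 'F', 'C',
                           rows, n - j - jb, jb,
                           &A[j + (size_t)j * lda], lda, T, jb,
                           &A[j + (size_t)(j + jb) * lda], lda);
        }
    }
    free(T);
}
```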
Note that xGEQR2 and xLARFT are rich in Level 2
BLAS operations and cannot be efficiently parallelized on
currently available shared memory machines. The speed of
Level 2 BLAS computations is limited by the speed at which
the memory bus can feed the cores. On current multicore
architectures, because of the vast disproportion between the
bus bandwidth and the speed of the cores, a single core
can saturate the bus in double precision so there would be
no advantage to using additional cores for a Level 2 BLAS
operation. See [67] for more details. The LAPACK algorithm
for QR factorization can use any flavor of parallel BLAS to
exploit parallelism from the Level 3 BLAS xLARFB update
on a multicore shared-memory architecture, but the panel
factorization is considered a sequential operation.
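A rough, illustrative calculation (the bandwidth figure is assumed for illustration, not measured on the test machines) makes the point: xGEMV performs about $2mn$ flops while streaming roughly $8mn$ bytes of matrix data in double precision, an arithmetic intensity of about $0.25$ flop/byte, so with a shared memory bandwidth of, say, $25$ GB/s the whole socket is limited to
$$
0.25\ \tfrac{\text{flop}}{\text{byte}} \times 25\ \tfrac{\text{GB}}{\text{s}} \approx 6\ \text{Gflop/s}
$$
no matter how many cores share the bus, whereas a Level 3 BLAS update reuses each panel entry $O(b)$ times and can approach the floating-point peak.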
III. MAGMA QR FACTORIZATION WITH A SINGLE
CORE OR SOCKET AND A SINGLE GPU
The MAGMA QR factorization with a single core or
socket and single GPU (MAGMA 1.0) differs from the
LAPACK QR factorization in 3 major respects. First, pan-
els are factorized using xGEQRF as opposed to xGEQR2.
Second, the xLARFB update is done on the GPU. Third,
the xLARFB update is done in a lookahead fashion. Subtle
differences include asynchronicity with respect to host and
device execution and with respect to memory transfers where
possible.
Figure (1) illustrates this typical pattern of hybridization.
Several groups have observed that panel factorizations are
often faster on the CPU than on the GPU [64]. Therefore,
like the LAPACK QR factorization, the panel factorization
is considered a sequential operation; there is not enough
parallelism to keep the GPU busy. Unlike the LAPACK QR
factorization, the panel factorization is done using the block
QR factorization routine xGEQRF. If parallel BLAS is being
used on the CPU, the panel factorizations are often faster
when using several cores; depending on the architecture, using a
single socket or the entire host for the panel factorization
will result in optimal performance.
Figure 2. Computation splitting for new approach
MAGMA 1.0 uses static scheduling and a right-looking
version of the block QR factorization. The panel factorizations
are scheduled on the CPU using calls to LAPACK,
and the Level 3 BLAS updates on the trailing sub-matrices
are scheduled on the GPU. The trailing matrix updates are
split into two parts: one that updates just the next panel and
a second one updating the rest, i.e., correspondingly sub-matrices
$T_1$ and $T_2$ as given in Figure (1). The next panel
update (i.e., $T_1$) is done first, sent to the CPU, and the panel
factorization on the CPU is overlapped with the second part
of the trailing matrix (i.e., $T_2$). This technique is known as
look-ahead and has been used before [42], [75]–[77]. Its
use enables the overlap of CPU and GPU work (and some
communications).
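The loop below is a minimal sketch of this static schedule with look-ahead, written against the CUDA runtime API. The two helpers, panel_factor_cpu() and larfb_gpu(), are placeholder stubs standing in for the LAPACK panel factorization and the GPU block-reflector update (they are not MAGMA entry points), the panel buffer hP is assumed to be pinned host memory, and error checking is omitted.

```c
#include <cuda_runtime.h>

static void panel_factor_cpu(int m, int n, double *hA, int lda)
{
    /* Stub: the real code calls the LAPACK panel factorization here. */
    (void)m; (void)n; (void)hA; (void)lda;
}

static void larfb_gpu(cudaStream_t s, double *dA, int ldda, int m,
                      int jv, int jc, int ncols)
{
    /* Stub: the real code launches the GPU block-reflector update of
     * columns [jc, jc+ncols) using the reflectors stored at column jv. */
    (void)s; (void)dA; (void)ldda; (void)m; (void)jv; (void)jc; (void)ncols;
}

void hybrid_qr_sketch(int m, int n, int nb, double *dA, int ldda,
                      double *hP, int ldhp)
{
    cudaStream_t stream;
    cudaEvent_t  panel_ready;
    cudaStreamCreate(&stream);
    cudaEventCreate(&panel_ready);

    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        if (j > 0)   /* 4) update only the next panel (T1) with the previous dP */
            larfb_gpu(stream, dA, ldda, m - (j - nb), j - nb, j, jb);

        /* 1/5) copy that panel dP to hP; ordered after the T1 update. */
        cudaMemcpy2DAsync(hP, (size_t)ldhp * sizeof(double),
                          dA + j + (size_t)j * ldda,
                          (size_t)ldda * sizeof(double),
                          (size_t)(m - j) * sizeof(double), jb,
                          cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(panel_ready, stream);

        if (j > 0)   /* 6) start updating the rest of the trailing matrix (T2) */
            larfb_gpu(stream, dA, ldda, m - (j - nb),
                      j - nb, j + jb, n - j - jb);

        /* 2/7) wait only for the panel copy, then factor it on the CPU
         *      while the T2 update keeps the GPU busy. */
        cudaEventSynchronize(panel_ready);
        panel_factor_cpu(m - j, jb, hP, ldhp);

        /* 3) send the factored panel back; the next iteration's T1 update
         *    then uses it. */
        cudaMemcpy2DAsync(dA + j + (size_t)j * ldda,
                          (size_t)ldda * sizeof(double),
                          hP, (size_t)ldhp * sizeof(double),
                          (size_t)(m - j) * sizeof(double), jb,
                          cudaMemcpyHostToDevice, stream);
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(panel_ready);
    cudaStreamDestroy(stream);
}
```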
IV. THE NEW APPROACH
The new approach differs from MAGMA QR factorization
with a single core or socket and a single GPU (MAGMA
1.0) in two areas. First, panels factored on the CPU using
LAPACK are, instead, done in parallel using a highly opti-
mized dynamic asynchronous scheduled algorithm on some
number of CPU cores. Second, the remaining CPU cores are
used to update the rightmost panels of the matrix in parallel.
We think of the new approach as MAGMA QR factorization
with all available cores and a single GPU. While it is true
that if parallel BLAS is being used on the CPU, MAGMA
1.0 can use all available cores for the panel factorization, in
practice, little, if any, additional speedup is attained past the
number of cores on a single socket. It is a better use of the
remaining cores to be tasked for other operations.
The computation is split as in Figure (2). Assuming an
N × N matrix and a two socket, twelve core architecture,
the first N − 6(OB) columns are factorized as described in
the previous section with two exceptions. First, the panels
are factorized using an optimized dynamic asynchronous
scheduled algorithm using six cores. Second, whenever a
panel factorization completes, one thread per core wakes
from a busy wait from the remaining six cores, and the
rightmost 6 panels are updated in parallel on the CPU. Note
there is a final factorization to be done corresponding to the
square in the lower right hand corner of Figure (2). The
number of cores used for the panel factorization, and the
number of cores used for the rightmost panel updates, de-
pend on the architecture, matrix size, and precision. Tuning
is discussed in detail in Section VI. Note that there are five
parameters that are tunable for the new approach: the number
of cores for panel factorization (Q), the number of cores for
rightmost panel updates (P), the panel width for rightmost
panel updates (OB), the panel width for the part of the matrix
that does not include the rightmost panel updates (NB), and
the inner panel width for panel factorizations (IB).
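A small sketch of how these parameters partition the matrix follows; the struct and helper are ours, for illustration only, and are not MAGMA data structures.

```c
/* Tunable parameters of the new approach; names mirror the text. */
typedef struct {
    int Q;   /* CPU cores used for the QUARK panel factorization     */
    int P;   /* CPU cores used for the rightmost panel updates       */
    int NB;  /* panel width for the hybrid (GPU-updated) part        */
    int OB;  /* width of each of the P rightmost CPU-updated panels  */
    int IB;  /* inner blocking of the QUARK panel factorization      */
} hybrid_params_t;

/* For an N x N matrix, the first N - P*OB columns follow the hybrid
 * CPU/GPU path of Section III; the last P*OB columns are split into P
 * panels of width OB, one per remaining CPU core. */
static inline int gpu_part_cols(int N, const hybrid_params_t *p)
{
    int cpu_cols = p->P * p->OB;
    return (cpu_cols < N) ? N - cpu_cols : 0;
}
```

With the settings used for the trace discussed below (N = 5920, P = 8, OB = 172), the hybrid path covers the first 5920 − 8 · 172 = 4544 columns and the remaining 1376 columns are updated on the CPU.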
Figure (3) is a trace of the new approach on a single
NVIDIA Tesla M2070 GPU with fourteen cores @1.15 GHz
connected to two six-core Intel Xeon X5660 Westmere @2.8
GHz processors for single precision when the matrix size is
5920 × 5920 at the optimal parameter settings of Q = 4,
P = 8, NB = 128, OB = 172, and IB = 12. We take
GPU core to mean a single streaming multiprocessor. The
y-axis can be thought of as core number; the bottom 14
green rows are GPU cores; the next 12 rows are CPU cores.
The x-axis is time. White space is idle time. Black space
indicates a busy wait. All other space represents useful work.
The top eight rows of the graph correspond to the CPU
cores dedicated for the rightmost panel updates; again black
corresponds to a busy wait. Cyan corresponds to the up-
date. The middle four rows correspond to the CPU cores
dedicated to the panel factorization. Additionally, the core
corresponding to the ninth row from the top is responsible
for computing an xLARFT; it can be seen in the trace as a
blue rectangle. The core is also responsible for initiating
a synchronous memory copy of the result of the panel
factorization on the host to the GPU; that synchronous
memory copy is seen as a yellow rectangle. Synchronous
memory copies from the host to the GPU do not start until all
kernels that have been launched on the GPU are finished; the
call does not return until the memory copy is finished. The
bottom fourteen rows correspond to the GPU cores dedicated
to panel updates to the left of the eight rightmost panels.
The white space in the middle four rows is when the cores
dedicated for panel factorization are idle. The white space
at the right of the top eight rows depicts the fact that the
cores for rightmost panel updates are not used for the final
factorization.
The flow of the new approach can be seen in the trace.
Initially, four CPU cores are factorizing the first panel while
eight CPU cores are in a busy wait and all GPU cores are
idle. This is seen as the eight leftmost black rectangles in
the top eight rows, the leftmost block of colored rectangles
in rows nine through twelve, and the white space to the
left of the bottom fourteen green rows. Next, eight CPU
cores are woken from a busy wait and the rightmost eight
panel updates begin; this can be seen as the leftmost eight
cyan rectangles on the top eight rows. From Equation (3) it
can be seen that $(V_{11}^T\; V_{21}^T)\,A_2$ can be computed before the
xLARFT is finished. Therefore the rightmost panel updates
can be started early if they are split into two pieces, a piece
that does not use the result from the xLARFT and a piece
that does. While the computation of the first piece of the
rightmost panel updates progresses, an xLARFT is done on a
single CPU core; that can be seen as the leftmost large blue
rectangle in row nine. Because the xLARFT finishes before
the first update piece is done, both update steps are seen as a
single cyan rectangle. Once the xLARFT is done, the result
of the panel factorization is sent to the GPU; this can be seen
as the sliver of yellow to the immediate right of the leftmost
large blue rectangle in row nine. Next, the single panel to
the right of the factorized panel is updated on the GPU; this
can be seen as the leftmost thin column of green rectangles
in the bottom eight rows. Finally, the remaining panels to the
left of the eight rightmost panels are updated on the GPU;
this can be seen as the thick column of green rectangles to
the immediate right of the leftmost thin column of green
rectangles in the bottom eight rows. This process continues
with the remaining panels to the left of the rightmost eight
panels. The last grouping of rectangles in rows nine through
twenty-six correspond to the final factorization depicted in
the lower right hand square of Figure (2).
Note that the peak single precision CPU performance
for the machine described above is 270 Gflop/s. The peak
single precision GPU performance, however, is 1030 Gflop/s
(The heights of the GPU cores in Figure (3) are scaled
accordingly.). The new approach aims to start with a busy
GPU since it is the more powerful of the two. After that, all
that can be done to keep the CPU cores busy is done. This
can be seen in Figure (3). The bottom fourteen rows show
that the GPU has very little idle time. Our approach was
to start with MAGMA 1.0 QR factorization code; this code
has been optimized for the GPU over the course of several
years. We enhanced the MAGMA 1.0 code to make better
use of the CPU cores but the GPU code was unchanged.
This approach differs from what others are currently doing
to merge GPU and multicore code, namely, starting with
highly optimized multicore code and calling GPU kernels where
possible [78].
V. QUARK DYNAMIC SCHEDULER
Panels are factorized using a highly optimized version
of the LAPACK block QR algorithm using the QUARK
dynamic scheduler to schedule the subtasks on some number
of CPU cores. We experimented with many different combi-
nations of subtask granularities and scheduling policies. We
use the version with the highest performance, and it will be
described in detail.

Figure 3. Trace of 5920 × 5920 Matrix
Individual subtasks, along with the dependencies between
subtasks, are communicated to QUARK at subtask inser-
tion time. QUARK is then free to schedule the subtasks
among available cores in any order as long as the subtask
dependencies are not violated. This concept of representing
algorithms and their execution flows as Directed Acyclic
Graphs (DAGs), where nodes represent the subtasks, and
the edges represent the dependencies among them, is nothing
new and is discussed in greater detail in [75], [76].
Optimal performance is observed when the subtasks are
composed from the operations in Section II as follows. A
single subtask is made from the xGEQR2 and the xLARFT
calls from the panel factorization step. The xLARFB call
from the trailing submatrix step is split into three subtasks.
Optimal performance is observed when the subtasks are
scheduled in a left looking fashion as opposed to the right
looking approach described in Section II. Note that the
DAG for QR block factorization with subtasks inserted in
a left looking fashion will be identical to the DAG for QR
block factorization with subtasks inserted in a right looking
fashion. QUARK allows one to give subtasks priorities and
that feature is exploited to achieve a left looking execution
order.
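The sketch below illustrates that decomposition and the priority trick. The insert_task() stub stands in for QUARK_Insert_Task (whose real argument list passes data addresses and access modes so QUARK can build the DAG), and the priority values are made up for illustration.

```c
#include <stdio.h>

static void insert_task(const char *name, int k, int j, int priority)
{
    /* Stand-in for QUARK_Insert_Task: the scheduler would record the task,
     * its data dependencies, and its priority, and run it once the
     * dependencies are satisfied. */
    printf("insert %-12s k=%d j=%d priority=%d\n", name, k, j, priority);
}

/* DAG insertion for the QUARK-scheduled factorization of an nt-step panel. */
void insert_panel_dag(int nt)
{
    for (int k = 0; k < nt; k++) {
        /* One subtask fuses xGEQR2 and xLARFT for step k; it sits on the
         * critical path, so it gets the highest priority. */
        insert_task("geqr2+larft", k, k, /*priority=*/10000);

        for (int j = k + 1; j < nt; j++) {
            /* The xLARFB update of column j is split into three subtasks so
             * several cores can share it; giving higher priority to small j
             * makes the executed order left looking even though the tasks
             * are inserted in a right-looking order. */
            for (int piece = 0; piece < 3; piece++)
                insert_task("larfb/3", k, j, /*priority=*/nt - j);
        }
    }
}
```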
VI. AUTOTUNING
At first blush, it would appear that the tunable parameters
are inextricably entwined such that adjusting one parameter
can alter the effect of the other parameter settings. Therefore
an autotuning approach suggests itself whereby all possible
values of all five parameters are tried, noting what combina-
tion of parameters results in the best performance. However,
even after careful pruning, autotuning a single matrix size for
a single precision on a reasonably fast architecture will take
more than one calendar year. This is clearly unacceptable.
It turns out that the autotuning burden can be greatly mitigated
by assuming orthogonality (that the parameters can be tuned independently)
and noticing some rules of thumb. The first rule of thumb is that the
optimal panel width, for that part of the matrix that does not
include the rightmost panel updates, is the very same panel
width already recorded for MAGMA 1.0. This panel width
exists in a lookup table at runtime and is a function of matrix
size, precision, and algorithm. The orthogonality assumption
allows one to fix this optimal panel width regardless of the
value of the other parameter settings. The second rule of
thumb is that the optimal number of cores for the panel
factorization is the number of cores on a single socket. The
third rule of thumb is that the number of cores for the
rightmost panel updates is the number of remaining cores
for large enough matrices and a linear function of the matrix
size otherwise. Experimenting with the number of cores for
panel factorizations and rightmost panel updates on different
architectures results in slight modifications to the second and
third rules of thumb, depending on the architecture, but the
end result is that the information exists in a lookup table
and is available at runtime. The fourth rule of thumb is that
IB is always 12. This number was observed over the course
of much hand tuning.
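A sketch of what the resulting runtime lookup might look like follows; the table entries below are placeholders for illustration, not the measured values from this work.

```c
/* Hypothetical per-architecture, per-precision tuning table.  Only OB is
 * swept by the autotuner; Q, P, NB, and IB come from the rules of thumb
 * and are read from a table like this at runtime. */
typedef struct { int min_n; int Q, P, NB, IB; } tune_entry_t;

static const tune_entry_t sgeqrf_table[] = {
    /* min_n   Q  P   NB  IB */
    {     0,   4, 4, 128, 12 },  /* small matrices: fewer update cores (rule 3) */
    {  4000,   4, 8, 128, 12 },  /* large matrices: all remaining cores         */
    {  8000,   6, 6, 192, 12 },  /* placeholder entry                           */
};

/* Pick the last entry whose threshold the matrix size meets. */
static tune_entry_t lookup_params(int n)
{
    tune_entry_t best = sgeqrf_table[0];
    for (int i = 0; i < (int)(sizeof sgeqrf_table / sizeof sgeqrf_table[0]); i++)
        if (n >= sgeqrf_table[i].min_n)
            best = sgeqrf_table[i];
    return best;
}
```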
Thus the only parameter that needs to be tuned is OB, the
panel width for the rightmost panel updates. An autotuner
was written that tests a number of values for OB for every
matrix size at every precision and takes two hours to com-
plete on a reasonably fast architecture. All tuning results are
