A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
Mitch Horton, Stanimire Tomov and Jack Dongarra
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996
Email: {horton, tomov, dongarra}@eecs.utk.edu
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
School of Mathematics & School of Computer Science, University of Manchester
Abstract—Three out of the top four supercomputers in the November 2010 TOP500 list of the world’s most powerful supercomputers use NVIDIA GPUs to accelerate computations. Ninety-five systems from the list are using processors with six or more cores. Three-hundred-sixty-five systems use quad-core processor-based systems. Thirty-seven systems are using dual-core processors. The large-scale enabling of hybrid graphics processing unit (GPU)-based multicore platforms for computational science by developing fundamental numerical libraries (in particular, libraries in the area of dense linear algebra) for them has been underway for some time. We present a class of algorithms based largely on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithms extend what is currently available in the Matrix Algebra for GPU and Multicore Architectures (MAGMA) Library for performing Cholesky, QR, and LU factorizations using a single core or socket and a single GPU. The extensions occur in two areas. First, panels factored on the CPU using LAPACK are, instead, done in parallel using a highly optimized dynamic asynchronous scheduled algorithm on some number of CPU cores. Second, the remaining CPU cores are used to update the rightmost panels of the matrix in parallel.
Keywords-GPU; multicore; QR; LU; Cholesky;
I. INTRODUCTION
Until roughly 2004, microprocessor manufacturers were
able to achieve higher performance by exploiting higher
degrees of instruction level parallelism (ILP). Based on this
approach, several generations of processors were built where
clock frequencies were higher and higher and pipelines were
deeper and deeper. As a result, applications could benefit
from these innovations and achieve higher performance
simply by relying on compilers that could efficiently exploit
ILP. Due to a number of physical limitations (mostly power
consumption and heat dissipation) this approach cannot be
pushed any further. For this reason, chip designers have
moved their focus from ILP to thread level parallelism (TLP)
where higher performance can be achieved by replicating
execution units (or cores) on the die while keeping the
clock rates in a range where power consumption and heat
dissipation do not represent a problem [1]–[3]. CPU designs
have moved to multicores and are currently going through
a renaissance due to the need for new approaches to man-
age the exponentially increasing (a) appetite for power of
conventional system designs, and (b) gap between compute
and communication speeds. Compute Unified Device Ar-
chitecture (CUDA) [4] based multicore platforms stand out
among a confluence of trends because of their low power
consumption and, at the same time, high compute power
and bandwidth [3]. Because of the prevalence of multicore
and GPU architectures in the TOP500 list [5]; the existence
of current conferences and workshops with emphasis on
multicore and GPU technology [6]–[33]; the long list of
GPU related success stories across academia, industry, and
national research laboratories for specific applications and
algorithms [34]–[46]; books related to general purpose GPU
computing [47]–[49]; the emergence of compilers that un-
derstand GPU directives [50]–[53]; language in the current
Exascale roadmap concerning heterogeneity in general and
general purpose GPU programming in particular [54]; the
fact that NVIDIA did $100 million in revenue from high
performance computing last year, up from zero three years
ago [55]; and relentless architectural advancements [56]–
[63], it is clear that multicore processors and GPUs represent
the future of high performance computing.
As multicore and GPU systems continue to gain ground
in the high performance computing world, linear algebra
algorithms have to be reformulated, or new algorithms
have to be developed, in order to take advantage of the
architectural features on these new architectures [64]. This
work is a contribution to the development of these algorithms
in the area of dense linear algebra, and will be included
in the Matrix Algebra for GPU and Multicore Architec-
tures (MAGMA) Library [38]. Designed to be similar to
LAPACK [65] in functionality, data storage, and interface,
the MAGMA library allows scientists to effortlessly port
their LAPACK-relying software components and to take
advantage of the new hybrid architectures.
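As a toy illustration of that drop-in similarity, the sketch below shows a CPU-only QR call next to what the corresponding MAGMA call would look like. The LAPACKE prototype is standard; the magma_dgeqrf prototype shown in the comment is an assumption based on the stated LAPACK-like interface and should be checked against the MAGMA headers.

```c
/* Porting sketch: a LAPACK-based QR factorization and the MAGMA-style
 * replacement.  Column-major storage in both cases. */
#include <lapacke.h>

/* Existing CPU-only code: factor the m-by-n matrix A in place. */
void qr_cpu(int m, int n, double *A, int lda, double *tau)
{
    LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, lda, tau);
}

/* Hybrid version (assumed prototype, mirroring LAPACK's dgeqrf):
 *
 *   magma_dgeqrf(m, n, A, lda, tau, work, lwork, &info);
 *
 * Same matrix, same layout, same tau array, so the surrounding
 * application code is left unchanged. */
```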
The challenges in developing scalable high performance
algorithms for multicore with GPU accelerators systems
stem from their heterogeneity, massive parallelism, and the
huge gap between the GPUs’ compute power vs. the CPU-
GPU communication speed. We show an approach that is
largely based on software infrastructures that have already
been developed, namely the QUeuing And Runtime for
Kernels (QUARK) dynamic scheduler [66] and the MAGMA
[38] library. The approach extends what is currently avail-
able in the MAGMA Library for performing Cholesky, QR,
and LU factorizations using a single core or socket and a
single GPU. The extensions occur in two areas. First, panels
factored on the CPU using LAPACK are, instead, done in
parallel using a highly optimized dynamic asynchronous
QUARK scheduled algorithm on some number of CPU
cores. Second, the remaining CPU cores are used to update
the rightmost panels of the matrix in parallel. The approach
aims to better utilize all available hardware.
The results of this work are communicated using the
QR algorithm as a framework. The Cholesky and LU
algorithms are similar in implementation. The paper is
organized as follows. Section II provides an overview of
the QR factorization. Section III illustrates how the QR
factorization is performed by the MAGMA library using
a single core or socket and a single GPU. Section IV
describes the new approach, outlining how it differs from
what is currently available in the MAGMA library. Section V
briefly describes the QUARK dynamic scheduler. Section VI
discusses autotuning. In particular, it explains how to choose, for a
given matrix size, precision, architecture, and algorithm, the optimal
number of cores for panel factorization, the number of cores for panel
updates, the panel width, the outer panel width, and the inner panel
width. Section
VII describes algorithm optimization with respect to panel
factorization. Section VIII presents results on two different
architectures: a single NVIDIA GeForce GTX480 GPU
with fifteen cores (streaming multiprocessors) @1.401 GHz
connected to eight six-core Intel Xeon X5660 Westmere
@2.8 GHz processors and a single NVIDIA Tesla M2070
GPU with fourteen cores (streaming multiprocessors) @1.15
GHz connected to two six-core Intel Xeon X5660 Westmere
@2.8 GHz processors. Finally, section IX discusses future
work.
II. BLOCK QR FACTORIZATION
This section contains a high level explanation of the block
QR factorization implemented by LAPACK. The explanation
will facilitate an understanding of the description of the new
approach given in Section IV. A detailed discussion of the
block QR factorization can be found in [67]–[71].
Stewart refers to the QR factorization as "the great
success story of modern computation" [72]. Trefethen and
Bau say, "One algorithmic idea in numerical linear algebra
is more important than all the others: QR factorization" [70].
It is used for solving linear systems of equations [64], [68],
solving the linear least squares problem [65], [73], com-
puting eigenvalues and eigenvectors [72], [74], computing
the SVD [72], [74], and computing an orthonormal basis
for a set of vectors [68]. Stewart says, "the underlying
theory of the method continues to suggest new algorithms" [72].
Golub and Van Loan give QR algorithms based on
Householder, block Householder, Givens, and fast Givens
transformations; Gram-Schmidt orthogonalization, and mod-
ified Gram-Schmidt orthogonalization [68]. The LAPACK
QR factorization is a block Householder transformation
implementation.
The QR factorization is a transformation that factorizes an
m×n matrix A into its factors Q and R where Q is a unitary
matrix of size m×m and R is an upper triangular (trapezoidal) matrix of size
m×n. The LAPACK version of this algorithm achieves higher
performance on architectures with memory hierarchies by
accumulating a number of Householder transformations in
what is called a panel factorization which are, then, applied
all at once by means of high performance Level 3 BLAS
operations.
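For reference, this accumulation is the standard compact WY representation of a product of Householder reflectors (a textbook identity, not something specific to this paper): a panel of $b$ reflectors $H_i = I - \tau_i v_i v_i^T$ is applied as one block,
$$
H_1 H_2 \cdots H_b \;=\; I - V\,T\,V^T,
$$
where $V = (v_1 \cdots v_b)$ is the unit lower triangular array of reflector vectors stored over the panel and $T$ is the $b \times b$ upper triangular matrix formed by xLARFT. Applying $I - VTV^T$ (or its transpose) to the trailing submatrix then reduces to matrix-matrix products, which is exactly the Level 3 BLAS work performed by xLARFB.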
The LAPACK routine that performs the QR factorization
is called xGEQRF where x can be S, D, C, or Z depending on
the precision. Consider a matrix A of size m×n represented
as
$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
$$
where $A_{11}$ is of size $b \times b$, $A_{12}$ is of size $b \times (n-b)$, $A_{21}$
is of size $(m-b) \times b$, and $A_{22}$ is of size $(m-b) \times (n-b)$.
The LAPACK algorithm for QR factorization can be
described as a sequence of steps where, at each step, the
transformation in Equation (1) is performed.
$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
  = \left( \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix},\;
    \begin{pmatrix} R_{11} & R_{12} \\ 0 & \tilde{A}_{22} \end{pmatrix} \right)
\qquad (1)
$$
The transformation in Equation (1) is obtained in two steps:
1) Panel factorization. At this step a QR factorization
of the panel ($A_1$) is performed as in Equation (2):
$$
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}
  = \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix},\; (T_{11}),\; (R_{11})
\qquad (2)
$$
This operation produces $b$ Householder reflectors ($V_1$) and
an upper triangular matrix $R_{11}$ of size $b \times b$, which is a
portion of the final $R$ factor, by means of the xGEQR2
LAPACK routine. At this step, a triangular matrix $T_{11}$ of
size $b \times b$ is produced by the xLARFT LAPACK routine.
Note that $V_{11}$ is a unit lower triangular matrix of size
$b \times b$. The arrays $V_1$ and $R_{11}$ overwrite $A_1$. Temporary
workspace is needed to store $T_{11}$.
2) Trailing submatrix update. At this step, the transfor-
mation that was computed in the panel factorization is
applied to the trailing submatrix as shown in Equation
(3).
$$
\begin{pmatrix} R_{12} \\ \tilde{A}_{22} \end{pmatrix}
  = \left( I - \begin{pmatrix} V_{11} \\ V_{21} \end{pmatrix}
    (T_{11}) \begin{pmatrix} V_{11}^T & V_{21}^T \end{pmatrix} \right)
    \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}
\qquad (3)
$$
This operation, performed by the xLARFB LAPACK
routine, produces a portion $R_{12}$ of the final $R$ factor
of size $b \times (n-b)$ and the matrix $\tilde{A}_{22}$.

Figure 1. A typical hybrid pattern of computation and communication
for the one-sided matrix factorizations in MAGMA 1.0. The matrix resides
in GPU memory; hP is a panel buffer in CPU work space and dP the
corresponding panel in GPU memory, with T1 and T2 the two parts of the
trailing submatrix. The computational pattern of a typical one-sided
hybrid factorization is:
1. Copy dP (GPU) to hP (CPU)
2. Factor hP on the CPU using LAPACK
3. Copy the resulting hP to dP
4. Update T1 on the GPU using dP
5. Send next panel (part of T1) to the CPU
6. Start updating T2 on the GPU
7. Start factoring the next panel on the CPU
...
The QR factorization is continued by applying the transformation (1)
to the submatrix $\tilde{A}_{22}$, and then, iteratively,
until the end of the matrix A is reached.
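To make the two steps concrete, the following is a minimal right-looking sketch of the blocked factorization in C, using the LAPACKE wrappers for xGEQR2, xLARFT, and xLARFB (double precision, column-major, no error handling). It is illustrative only and not the actual LAPACK xGEQRF implementation.

```c
#include <stdlib.h>
#include <lapacke.h>

/* Right-looking blocked QR: panel factorization (dgeqr2 + dlarft) followed
 * by the Level 3 BLAS trailing update (dlarfb), as in Equations (1)-(3). */
void blocked_qr(int m, int n, int b, double *A, int lda, double *tau)
{
    double *T = malloc((size_t)b * b * sizeof(double));  /* T_11 workspace */
    int k = (m < n) ? m : n;                             /* # of reflectors */

    for (int j = 0; j < k; j += b) {
        int jb   = (b < k - j) ? b : k - j;   /* width of the current panel  */
        int rows = m - j;                     /* height of the current panel */

        /* Panel factorization: V_1 and R_11 overwrite the panel (Eq. 2). */
        LAPACKE_dgeqr2(LAPACK_COL_MAJOR, rows, jb,
                       &A[j + (size_t)j * lda], lda, &tau[j]);

        if (j + jb < n) {
            /* Form the jb-by-jb triangular factor T_11 of the block reflector. */
            LAPACKE_dlarft(LAPACK_COL_MAJOR, 'F', 'C', rows, jb,
                           &A[j + (size_t)j * lda], lda, &tau[j], T, jb);

            /* Trailing submatrix update (Eq. 3): one Level 3 BLAS-rich call. */
            LAPACKE_dlarfb(LAPACK_COL_MAJOR, 'L', 'T', 'F', 'C',
                           rows, n - j - jb, jb,
                           &A[j + (size_t)j * lda], lda, T, jb,
                           &A[j + (size_t)(j + jb) * lda], lda);
        }
    }
    free(T);
}
```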
Note that xGEQR2 and xLARFT are rich in Level 2
BLAS operations and cannot be efficiently parallelized on
currently available shared memory machines. The speed of
Level 2 BLAS computations is limited by the speed at which
the memory bus can feed the cores. On current multicore
architectures, because of the vast disproportion between the
bus bandwidth and the speed of the cores, a single core
can saturate the bus in double precision so there would be
no advantage to using additional cores for a Level 2 BLAS
operation. See [67] for more details. The LAPACK algorithm
for QR factorization can use any flavor of parallel BLAS to
exploit parallelism from the Level 3 BLAS xLARFB update
on a multicore shared-memory architecture, but the panel
factorization is considered a sequential operation.
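A rough, illustrative calculation (the bandwidth figure is assumed for illustration, not measured on the test machines) makes the point: xGEMV performs about $2mn$ flops while streaming roughly $8mn$ bytes of matrix data in double precision, an arithmetic intensity of about $0.25$ flop/byte, so with a shared memory bandwidth of, say, $25$ GB/s the whole socket is limited to
$$
0.25\ \tfrac{\text{flop}}{\text{byte}} \times 25\ \tfrac{\text{GB}}{\text{s}} \approx 6\ \text{Gflop/s}
$$
no matter how many cores share the bus, whereas a Level 3 BLAS update reuses each panel entry $O(b)$ times and can approach the floating-point peak.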
III. MAGMA QR FACTORIZATION WITH A SINGLE
CORE OR SOCKET AND A SINGLE GPU
The MAGMA QR factorization with a single core or
socket and single GPU (MAGMA 1.0) differs from the
LAPACK QR factorization in 3 major respects. First, pan-
els are factorized using xGEQRF as opposed to xGEQR2.
Second, the xLARFB update is done on the GPU. Third,
the xLARFB update is done in a lookahead fashion. Subtle
differences include asynchronicity with respect to host and
device execution and with respect to memory transfers where
possible.
Figure (1) illustrates this typical pattern of hybridization.
Several groups have observed that panel factorizations are
often faster on the CPU than on the GPU [64]. Therefore,
like the LAPACK QR factorization, the panel factorization
is considered a sequential operation; there is not enough
parallelism to keep the GPU busy. Unlike the LAPACK QR
factorization, the panel factorization is done using the block
QR factorization routine xGEQRF. If parallel BLAS is being
used on the CPU, the panel factorizations are often faster
when using several cores; depending on the architecture, using a
single socket or the entire host for the panel factorization
will result in optimal performance.
Figure 2. Computation splitting for new approach
MAGMA 1.0 uses static scheduling and a right-looking
version of the block QR factorization. The panel factorizations
are scheduled on the CPU using calls to LAPACK,
and the Level 3 BLAS updates on the trailing sub-matrices
are scheduled on the GPU. The trailing matrix updates are
split into two parts: one that updates just the next panel and
a second one updating the rest, i.e., correspondingly sub-matrices
$T_1$ and $T_2$ as given in Figure (1). The next panel
update (i.e., $T_1$) is done first, sent to the CPU, and the panel
factorization on the CPU is overlapped with the second part
of the trailing matrix (i.e., $T_2$). This technique is known as
look-ahead and has been used before [42], [75]–[77]. Its
use enables the overlap of CPU and GPU work (and some
communications).
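The loop below is a minimal sketch of this static schedule with look-ahead, written against the CUDA runtime API. The two helpers, panel_factor_cpu() and larfb_gpu(), are placeholder stubs standing in for the LAPACK panel factorization and the GPU block-reflector update (they are not MAGMA entry points), the panel buffer hP is assumed to be pinned host memory, and error checking is omitted.

```c
#include <cuda_runtime.h>

static void panel_factor_cpu(int m, int n, double *hA, int lda)
{
    /* Stub: the real code calls the LAPACK panel factorization here. */
    (void)m; (void)n; (void)hA; (void)lda;
}

static void larfb_gpu(cudaStream_t s, double *dA, int ldda, int m,
                      int jv, int jc, int ncols)
{
    /* Stub: the real code launches the GPU block-reflector update of
     * columns [jc, jc+ncols) using the reflectors stored at column jv. */
    (void)s; (void)dA; (void)ldda; (void)m; (void)jv; (void)jc; (void)ncols;
}

void hybrid_qr_sketch(int m, int n, int nb, double *dA, int ldda,
                      double *hP, int ldhp)
{
    cudaStream_t stream;
    cudaEvent_t  panel_ready;
    cudaStreamCreate(&stream);
    cudaEventCreate(&panel_ready);

    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        if (j > 0)   /* 4) update only the next panel (T1) with the previous dP */
            larfb_gpu(stream, dA, ldda, m - (j - nb), j - nb, j, jb);

        /* 1/5) copy that panel dP to hP; ordered after the T1 update. */
        cudaMemcpy2DAsync(hP, (size_t)ldhp * sizeof(double),
                          dA + j + (size_t)j * ldda,
                          (size_t)ldda * sizeof(double),
                          (size_t)(m - j) * sizeof(double), jb,
                          cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(panel_ready, stream);

        if (j > 0)   /* 6) start updating the rest of the trailing matrix (T2) */
            larfb_gpu(stream, dA, ldda, m - (j - nb),
                      j - nb, j + jb, n - j - jb);

        /* 2/7) wait only for the panel copy, then factor it on the CPU
         *      while the T2 update keeps the GPU busy. */
        cudaEventSynchronize(panel_ready);
        panel_factor_cpu(m - j, jb, hP, ldhp);

        /* 3) send the factored panel back; the next iteration's T1 update
         *    then uses it. */
        cudaMemcpy2DAsync(dA + j + (size_t)j * ldda,
                          (size_t)ldda * sizeof(double),
                          hP, (size_t)ldhp * sizeof(double),
                          (size_t)(m - j) * sizeof(double), jb,
                          cudaMemcpyHostToDevice, stream);
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(panel_ready);
    cudaStreamDestroy(stream);
}
```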
IV. THE NEW APPROACH
The new approach differs from MAGMA QR factorization
with a single core or socket and a single GPU (MAGMA
1.0) in two areas. First, panels factored on the CPU using
LAPACK are, instead, done in parallel using a highly opti-
mized dynamic asynchronous scheduled algorithm on some
number of CPU cores. Second, the remaining CPU cores are
used to update the rightmost panels of the matrix in parallel.
We think of the new approach as MAGMA QR factorization
with all available cores and a single GPU. While it is true
that if parallel BLAS is being used on the CPU, MAGMA
1.0 can use all available cores for the panel factorization, in
practice, little, if any, additional speedup is attained past the
number of cores on a single socket. It is a better use of the
remaining cores to be tasked for other operations.
The computation is split as in Figure (2). Assuming an
N × N matrix and a two socket, twelve core architecture,
the first N − 6(OB) columns are factorized as described in
the previous section with two exceptions. First, the panels
are factorized using an optimized dynamic asynchronous
scheduled algorithm using six cores. Second, whenever a
panel factorization completes, one thread per core wakes
from a busy wait from the remaining six cores, and the
rightmost 6 panels are updated in parallel on the CPU. Note
there is a final factorization to be done corresponding to the
square in the lower right hand corner of Figure (2). The
number of cores used for the panel factorization, and the
number of cores used for the rightmost panel updates, de-
pend on the architecture, matrix size, and precision. Tuning
is discussed in detail in Section VI. Note that there are five
parameters that are tunable for the new approach: the number
of cores for panel factorization (Q), the number of cores for
rightmost panel updates (P), the panel width for rightmost
panel updates (OB), the panel width for the part of the matrix
that does not include the rightmost panel updates (NB), and
the inner panel width for panel factorizations (IB).
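A small sketch of how these parameters partition the matrix follows; the struct and helper are ours, for illustration only, and are not MAGMA data structures.

```c
/* Tunable parameters of the new approach; names mirror the text. */
typedef struct {
    int Q;   /* CPU cores used for the QUARK panel factorization     */
    int P;   /* CPU cores used for the rightmost panel updates       */
    int NB;  /* panel width for the hybrid (GPU-updated) part        */
    int OB;  /* width of each of the P rightmost CPU-updated panels  */
    int IB;  /* inner blocking of the QUARK panel factorization      */
} hybrid_params_t;

/* For an N x N matrix, the first N - P*OB columns follow the hybrid
 * CPU/GPU path of Section III; the last P*OB columns are split into P
 * panels of width OB, one per remaining CPU core. */
static inline int gpu_part_cols(int N, const hybrid_params_t *p)
{
    int cpu_cols = p->P * p->OB;
    return (cpu_cols < N) ? N - cpu_cols : 0;
}
```

With the settings used for the trace discussed below (N = 5920, P = 8, OB = 172), the hybrid path covers the first 5920 − 8 · 172 = 4544 columns and the remaining 1376 columns are updated on the CPU.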
Figure (3) is a trace of the new approach on a single
NVIDIA Tesla M2070 GPU with fourteen cores @1.15 GHz
connected to two six-core Intel Xeon X5660 Westmere @2.8
GHz processors for single precision when the matrix size is
5920 × 5920 at the optimal parameter settings of Q = 4,
P = 8, NB = 128, OB = 172, and IB = 12. We take
GPU core to mean a single streaming multiprocessor. The
y-axis can be thought of as core number; the bottom 14
green rows are GPU cores; the next 12 rows are CPU cores.
The x-axis is time. White space is idle time. Black space
indicates a busy wait. All other space represents useful work.
The top eight rows of the graph correspond to the CPU
cores dedicated for the rightmost panel updates; again black
corresponds to a busy wait. Cyan corresponds to the up-
date. The middle four rows correspond to the CPU cores
dedicated to the panel factorization. Additionally, the core
corresponding to the ninth row from the top is responsible
for computing an xLARFT; it can be seen in the trace as a
blue rectangle. The core is also responsible for initiating
a synchronous memory copy of the result of the panel
factorization on the host to the GPU; that synchronous
memory copy is seen as a yellow rectangle. Synchronous
memory copies from the host to the GPU do not start until all
kernels that have been launched on the GPU are finished; the
call does not return until the memory copy is finished. The
bottom fourteen rows correspond to the GPU cores dedicated
to panel updates to the left of the eight rightmost panels.
The white space in the middle four rows is when the cores
dedicated for panel factorization are idle. The white space
at the right of the top eight rows depicts the fact that the
cores for rightmost panel updates are not used for the final
factorization.
The flow of the new approach can be seen in the trace.
Initially, four CPU cores are factorizing the first panel while
eight CPU cores are in a busy wait and all GPU cores are
idle. This is seen as the eight leftmost black rectangles in
the top eight rows, the leftmost block of colored rectangles
in rows nine through twelve, and the white space to the
left of the bottom fourteen green rows. Next, eight CPU
cores are woken from a busy wait and the rightmost eight
panel updates begin; this can be seen as the leftmost eight
cyan rectangles on the top eight rows. From Equation (3) it
can be seen that $(V_{11}^T\; V_{21}^T)\,A_2$ can be computed before the
xLARFT is finished. Therefore the rightmost panel updates
can be started early if they are split into two pieces, a piece
that does not use the result from the xLARFT and a piece
that does. While the computation of the first piece of the
rightmost panel updates progresses, an xLARFT is done on a
single CPU core; that can be seen as the leftmost large blue
rectangle in row nine. Because the xLARFT finishes before
the first update piece is done, both update steps are seen as a
single cyan rectangle. Once the xLARFT is done, the result
of the panel factorization is sent to the GPU; this can be seen
as the sliver of yellow to the immediate right of the leftmost
large blue rectangle in row nine. Next, the single panel to
the right of the factorized panel is updated on the GPU; this
can be seen as the leftmost thin column of green rectangles
in the bottom eight rows. Finally, the remaining panels to the
left of the eight rightmost panels are updated on the GPU;
this can be seen as the thick column of green rectangles to
the immediate right of the leftmost thin column of green
rectangles in the bottom eight rows. This process continues
with the remaining panels to the left of the rightmost eight
panels. The last grouping of rectangles in rows nine through
twenty-six correspond to the final factorization depicted in
the lower right hand square of Figure (2).
Note that the peak single precision CPU performance
for the machine described above is 270 Gflop/s. The peak
single precision GPU performance, however, is 1030 Gflop/s
(The heights of the GPU cores in Figure (3) are scaled
accordingly.). The new approach aims to start with a busy
GPU since it is the more powerful of the two. After that, all
that can be done to keep the CPU cores busy is done. This
can be seen in Figure (3). The bottom fourteen rows show
that the GPU has very little idle time. Our approach was
to start with MAGMA 1.0 QR factorization code; this code
has been optimized for the GPU over the course of several
years. We enhanced the MAGMA 1.0 code to make better
use of the CPU cores but the GPU code was unchanged.
This approach differs from what others are currently doing
to merge GPU and multicore code, namely, starting with
highly optimized multicore code and calling GPU kernels where
possible [78].
V. QUARK DYNAMIC SCHEDULER
Panels are factorized using a highly optimized version
of the LAPACK block QR algorithm using the QUARK
dynamic scheduler to schedule the subtasks on some number
of CPU cores. We experimented with many different combi-
nations of subtask granularities and scheduling policies. We
use the version with the highest performance, and it will be
described in detail.

Figure 3. Trace of 5920 × 5920 Matrix
Individual subtasks, along with the dependencies between
subtasks, are communicated to QUARK at subtask inser-
tion time. QUARK is then free to schedule the subtasks
among available cores in any order as long as the subtask
dependencies are not violated. This concept of representing
algorithms and their execution flows as Directed Acyclic
Graphs (DAGs), where nodes represent the subtasks, and
the edges represent the dependencies among them, is nothing
new and is discussed in greater detail in [75], [76].
Optimal performance is observed when the subtasks are
composed from the operations in Section II as follows. A
single subtask is made from the xGEQR2 and the xLARFT
calls from the panel factorization step. The xLARFB call
from the trailing submatrix step is split into three subtasks.
Optimal performance is observed when the subtasks are
scheduled in a left looking fashion as opposed to the right
looking approach described in Section II. Note that the
DAG for QR block factorization with subtasks inserted in
a left looking fashion will be identical to the DAG for QR
block factorization with subtasks inserted in a right looking
fashion. QUARK allows one to give subtasks priorities and
that feature is exploited to achieve a left looking execution
order.
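The sketch below illustrates that decomposition and the priority trick. The insert_task() stub stands in for QUARK_Insert_Task (whose real argument list passes data addresses and access modes so QUARK can build the DAG), and the priority values are made up for illustration.

```c
#include <stdio.h>

static void insert_task(const char *name, int k, int j, int priority)
{
    /* Stand-in for QUARK_Insert_Task: the scheduler would record the task,
     * its data dependencies, and its priority, and run it once the
     * dependencies are satisfied. */
    printf("insert %-12s k=%d j=%d priority=%d\n", name, k, j, priority);
}

/* DAG insertion for the QUARK-scheduled factorization of an nt-step panel. */
void insert_panel_dag(int nt)
{
    for (int k = 0; k < nt; k++) {
        /* One subtask fuses xGEQR2 and xLARFT for step k; it sits on the
         * critical path, so it gets the highest priority. */
        insert_task("geqr2+larft", k, k, /*priority=*/10000);

        for (int j = k + 1; j < nt; j++) {
            /* The xLARFB update of column j is split into three subtasks so
             * several cores can share it; giving higher priority to small j
             * makes the executed order left looking even though the tasks
             * are inserted in a right-looking order. */
            for (int piece = 0; piece < 3; piece++)
                insert_task("larfb/3", k, j, /*priority=*/nt - j);
        }
    }
}
```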
VI. AUTOTUNING
At first blush, it would appear that the tunable parameters
are inextricably entwined such that adjusting one parameter
can alter the effect of the other parameter settings. Therefore
an autotuning approach suggests itself whereby all possible
values of all five parameters are tried, noting what combina-
tion of parameters results in the best performance. However,
even after careful pruning, autotuning a single matrix size for
a single precision on a reasonably fast architecture will take
more than one calendar year. This is clearly unacceptable.
It turns out that the autotuning burden can be greatly mitigated
by assuming orthogonality (that the parameters can be tuned independently)
and noticing some rules of thumb. The first rule of thumb is that the
optimal panel width, for that part of the matrix that does not
include the rightmost panel updates, is the very same panel
width already recorded for MAGMA 1.0. This panel width
exists in a lookup table at runtime and is a function of matrix
size, precision, and algorithm. The orthogonality assumption
allows one to fix this optimal panel width regardless of the
value of the other parameter settings. The second rule of
thumb is that the optimal number of cores for the panel
factorization is the number of cores on a single socket. The
third rule of thumb is that the number of cores for the
rightmost panel updates is the number of remaining cores
for large enough matrices and a linear function of the matrix
size otherwise. Experimenting with the number of cores for
panel factorizations and rightmost panel updates on different
architectures results in slight modifications to the second and
third rules of thumb, depending on the architecture, but the
end result is that the information exists in a lookup table
and is available at runtime. The fourth rule of thumb is that
IB is always 12. This number was observed over the course
of much hand tuning.
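A sketch of what the resulting runtime lookup might look like follows; the table entries below are placeholders for illustration, not the measured values from this work.

```c
/* Hypothetical per-architecture, per-precision tuning table.  Only OB is
 * swept by the autotuner; Q, P, NB, and IB come from the rules of thumb
 * and are read from a table like this at runtime. */
typedef struct { int min_n; int Q, P, NB, IB; } tune_entry_t;

static const tune_entry_t sgeqrf_table[] = {
    /* min_n   Q  P   NB  IB */
    {     0,   4, 4, 128, 12 },  /* small matrices: fewer update cores (rule 3) */
    {  4000,   4, 8, 128, 12 },  /* large matrices: all remaining cores         */
    {  8000,   6, 6, 192, 12 },  /* placeholder entry                           */
};

/* Pick the last entry whose threshold the matrix size meets. */
static tune_entry_t lookup_params(int n)
{
    tune_entry_t best = sgeqrf_table[0];
    for (int i = 0; i < (int)(sizeof sgeqrf_table / sizeof sgeqrf_table[0]); i++)
        if (n >= sgeqrf_table[i].min_n)
            best = sgeqrf_table[i];
    return best;
}
```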
Thus the only parameter that needs to be tuned is OB, the
panel width for the rightmost panel updates. An autotuner
was written that tests a number of values for OB for every
matrix size at every precision and takes two hours to com-
plete on a reasonably fast architecture. All tuning results are
