Anatomy of High-Performance Many-Threaded
Matrix Multiplication
Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond and Field G. Van Zee
Institute for Computational Engineering and Sciences and Department of Computer Science, The University of Texas at Austin, Austin, TX 78712. Email: tms,rvdg,field@cs.utexas.edu
Parallel Computing Lab, Intel Corporation, Santa Clara, CA 95054. Email: mikhail.smelyanskiy@intel.com
Leadership Computing Facility, Argonne National Lab, Argonne, IL 60439. Email: jhammond@alcf.anl.gov
Abstract—BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the “GotoBLAS approach” to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM as well as additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability. The resulting implementations deliver what we believe to be the best open source performance for these architectures, achieving both impressive performance and excellent scalability.

Index Terms—linear algebra, libraries, high-performance, matrix, BLAS, multicore
I. INTRODUCTION
High-performance implementation of matrix-matrix multiplication (GEMM) is both of great practical importance, since many computations in scientific computing can be cast in terms of this operation, and of pedagogical importance, since it is often used to illustrate how to attain high performance on a novel architecture. A few of the many noteworthy papers from the past include Agarwal et al. [1] (an early paper that showed how an implementation in a high level language—Fortran—can attain high performance), Bilmes et al. [2] (which introduced auto-tuning and code generation using the C programming language), Whaley and Dongarra [3] (which productized the ideas behind PHiPAC), Kågström et al. [4] (which showed that the level-3 BLAS operations can be implemented in terms of the general rank-k update (GEMM)), and Goto and van de Geijn [5] (which described what is currently accepted to be the most effective approach to implementation, which we will call the GotoBLAS approach).
Very recently, we introduced the BLAS-like Library Instantiation Software (BLIS) [6] which can be viewed as a systematic reimplementation of the GotoBLAS, but with a number of key insights that greatly reduce the effort for the library developer. The primary innovation is the insight that the inner kernel—the smallest unit of computation within the GotoBLAS GEMM implementation—can be further simplified into two loops around a micro-kernel. This means that the library developer needs only implement and optimize a routine¹ that implements the computation of C := AB + C where C is a small submatrix that fits in the registers of a target architecture.
In a second paper [7], we reported experiences regarding
portability and performance on a large number of current
processors. Most of that paper is dedicated to implementation
and performance on a single core. A brief demonstration of
how BLIS also supports parallelism was included in that paper,
but with few details.
The present paper describes in detail the opportunities for
parallelism exposed by the BLIS implementation of GEMM. It
focuses specifically on how this supports high performance and
scalability when targeting many-core architectures that require
more threads than cores if near-peak performance is to be
attained. Two architectures are examined: the PowerPC A2
processor with 16 cores that underlies IBM’s Blue Gene/Q
supercomputer, which supports four-way hyperthreading for a
total of 64 threads; and the Intel Xeon Phi processor with 60 cores², which also supports four-way hyperthreading for a total of
240 threads. It is demonstrated that excellent performance and
scalability can be achieved specifically because of the extra
parallelism that is exposed by the BLIS approach within the
inner kernel employed by the GotoBLAS approach.
It is also shown that when many threads are employed it is necessary to parallelize in multiple dimensions. This builds upon Marker et al. [8], which we believe was the first paper to look at 2D work decomposition for GEMM on multithreaded architectures. The paper additionally builds upon work that describes the vendor implementations for the PowerPC A2 [9] and the Xeon Phi [10].

¹This micro-kernel routine is usually written in assembly code, but may also be expressed in C with vector intrinsics.
²In theory, 61 cores can be used for computation. In practice, 60 cores are usually employed.

Fig. 1. Illustration of which parts of the memory hierarchy each block of A and B reside in during the execution of the micro-kernel.
BLIS wraps many of those insights up in a cleaner framework so that exploration of the algorithmic design space is, in our experience, simplified. We show performance to be competitive relative to that of Intel’s Math Kernel Library (MKL) and IBM’s Engineering and Scientific Subroutine Library (ESSL)³.

³We do not compare to OpenBLAS [11] as there is no implementation for either the PowerPC A2 or the Xeon Phi, to our knowledge. ATLAS does not support either architecture under consideration in this paper, so no comparison can be made.
II. BLIS
In our discussions in this paper, we focus on the special case C := AB + C, where A, B, and C are m × k, k × n, and m × n, respectively.⁴ It helps to be familiar with the GotoBLAS approach to implementing GEMM, as described in [5]. We will briefly review the BLIS approach for a single core implementation in this section, with the aid of Figure 1.

⁴We will also write this operation as C += AB.
Our description starts with the outer-most loop, indexed by j_c. This loop partitions C and B into (wide) column panels. Next, A and the current column panel of B are partitioned into column panels and row panels, respectively, so that the current column panel of C (of width n_c) is updated as a sequence of rank-k updates (with k = k_c), indexed by p_c. At this point, the GotoBLAS approach packs the current row panel of B into a contiguous buffer, B̃. If there is an L3 cache, the computation is arranged to try to keep B̃ in the L3 cache. The primary reason for the outer-most loop, indexed by j_c, is to limit the amount of workspace required for B̃, with a secondary reason to allow B̃ to remain in the L3 cache.⁵

⁵The primary advantage of constraining B̃ to the L3 cache is that it is cheaper, in terms of energy efficiency, to access memory in the L3 cache rather than in main memory.
Now, the current panel of A is partitioned into blocks, indexed by i_c, that are packed into a contiguous buffer, Ã. The block is sized to occupy a substantial part of the L2 cache, leaving enough space to ensure that other data does not evict the block. The GotoBLAS approach then implements the “block-panel” multiplication of ÃB̃ as its inner kernel, making this the basic unit of computation. It is here that the BLIS approach continues to mimic the GotoBLAS approach, except that it explicitly exposes two additional loops. In BLIS, these loops are coded portably in C, whereas in GotoBLAS they are hidden within the implementation of the inner kernel (which is oftentimes assembly-coded).
At this point, we have Ã in the L2 cache and B̃ in the L3 cache (or main memory). The next loop, indexed by j_r, now partitions B̃ into column “slivers” (micro-panels) of width n_r. At a typical point of the computation, one such sliver is in the L1 cache, being multiplied by Ã. Panel B̃ was packed in such a way that this sliver is stored contiguously, one row (of width n_r) at a time.

for j_c = 0, ..., n-1 in steps of n_c                      (5th loop, outer-most)
   for p_c = 0, ..., k-1 in steps of k_c                   (4th loop around micro-kernel)
      for i_c = 0, ..., m-1 in steps of m_c                (3rd loop around micro-kernel)
         for j_r = 0, ..., n_c-1 in steps of n_r           (2nd loop around micro-kernel)
            for i_r = 0, ..., m_c-1 in steps of m_r        (1st loop around micro-kernel)
               C(i_r:i_r+m_r-1, j_r:j_r+n_r-1) += ...      (micro-kernel)
            endfor
         endfor
      endfor
   endfor
endfor

Fig. 2. Illustration of the three inner-most loops. The loops indexed by i_r and j_r are the loops that were hidden inside the GotoBLAS inner kernel.
Finally, the inner-most loop, indexed by i_r, partitions Ã into row slivers of height m_r. Block Ã was packed in such a way that this sliver is stored contiguously, one column (of height m_r) at a time. The BLIS micro-kernel then multiplies the current sliver of Ã by the current sliver of B̃ to update the corresponding m_r × n_r block of C. This micro-kernel performs a sequence of rank-1 updates (outer products) with columns from the sliver of Ã and rows from the sliver of B̃.
A typical point in the computation is now captured by Figure 1. An m_r × n_r block of C is in the registers. A k_c × n_r sliver of B̃ is in the L1 cache. The m_r × k_c sliver of Ã is streamed from the L2 cache. And so forth. The key takeaway here is that the layering described in this section can be captured by the five nested loops around the micro-kernel in Figure 2.
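To make this layering concrete, the sketch below expresses the five loops around the micro-kernel in plain C for column-major matrices. It is an unoptimized illustration of the structure rather than BLIS code: the blocking values, the packing routines, and the scalar micro-kernel are simplified stand-ins, and for brevity it assumes that m and n are multiples of m_r and n_r, respectively.

#include <stdlib.h>

/* Illustrative blocking parameters (placeholders, not tuned for any machine). */
enum { MC = 96, KC = 256, NC = 1024, MR = 4, NR = 4 };

static int imin(int a, int b) { return a < b ? a : b; }

/* Micro-kernel: C(0:MR-1, 0:NR-1) += (sliver of A~) * (sliver of B~), computed
   as kc rank-1 updates. At holds kc columns of height MR, stored contiguously;
   Bt holds kc rows of width NR, stored contiguously. */
static void micro_kernel(int kc, const double *At, const double *Bt,
                         double *C, int ldc)
{
    double c[MR][NR] = {{0.0}};
    for (int p = 0; p < kc; p++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                c[i][j] += At[p * MR + i] * Bt[p * NR + j];
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[i + j * ldc] += c[i][j];
}

/* Pack a kc x nc row panel of B into B~: NR-wide slivers, each stored row by row. */
static void pack_B(int kc, int nc, const double *B, int ldb, double *Bt)
{
    for (int jr = 0; jr < nc; jr += NR)
        for (int p = 0; p < kc; p++)
            for (int j = 0; j < NR; j++)
                *Bt++ = B[p + (jr + j) * ldb];
}

/* Pack an mc x kc block of A into A~: MR-high slivers, each stored column by column. */
static void pack_A(int mc, int kc, const double *A, int lda, double *At)
{
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < MR; i++)
                *At++ = A[(ir + i) + p * lda];
}

/* C += A * B via the five loops around the micro-kernel (column-major storage). */
void gemm_sketch(int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    double *Bt = malloc(sizeof(double) * KC * NC);   /* packed row panel B~ */
    double *At = malloc(sizeof(double) * MC * KC);   /* packed block A~     */

    for (int jc = 0; jc < n; jc += NC) {                    /* 5th (outer) loop */
        int nc = imin(NC, n - jc);
        for (int pc = 0; pc < k; pc += KC) {                /* 4th loop         */
            int kc = imin(KC, k - pc);
            pack_B(kc, nc, &B[pc + jc * k], k, Bt);
            for (int ic = 0; ic < m; ic += MC) {            /* 3rd loop         */
                int mc = imin(MC, m - ic);
                pack_A(mc, kc, &A[ic + pc * m], m, At);
                for (int jr = 0; jr < nc; jr += NR)         /* 2nd loop         */
                    for (int ir = 0; ir < mc; ir += MR)     /* 1st loop         */
                        micro_kernel(kc, &At[ir * kc], &Bt[jr * kc],
                                     &C[(ic + ir) + (jc + jr) * m], m);
            }
        }
    }
    free(Bt);
    free(At);
}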
III. OPPORTUNITIES FOR PARALLELISM
We have now set the stage to discuss opportunities for
parallelism and when those opportunities may be advantageous.
There are two key insights in this section:
• In GotoBLAS, the inner kernel is the basic unit of computation and no parallelization is incorporated within that inner kernel⁶. The BLIS framework exposes two loops within that inner kernel, thus exposing two extra opportunities for parallelism, for a total of five.
• It is important to use a given memory layer wisely. This gives guidance as to which loop should be parallelized.

⁶It is, of course, possible that more recent implementations by Goto deviate from this. However, these implementations are proprietary.
A. Parallelism within the micro-kernel
The micro-kernel is typically implemented as a sequence of rank-1 updates of the m_r × n_r block of C that is accumulated in the registers. Introducing parallelism over the loop around these rank-1 updates is ill-advised for three reasons: (1) the unit of computation is small, making the overhead considerable, (2) the different threads would accumulate contributions to the block of C, requiring a reduction across threads that is typically costly, and (3) each thread does less computation for each update of the m_r × n_r block of C, so the amortization of the cost of the update is reduced.
This merely means that parallelizing the loop around the
rank-1 updates is not advisable. One could envision carefully
parallelizing the micro-kernel in other ways for a core that re-
quires hyperthreading in order to attain peak performance. But
that kind of parallelism can be described as some combination
of parallelizing the first and second loop around the micro-
kernel. We will revisit this topic later on.
The key for this paper is that the micro-kernel is a basic unit
of computation for BLIS. We focus on how to get parallelism
without having to touch that basic unit of computation.
B. Parallelizing the first loop around the micro-kernel (indexed by i_r).
Fig. 3. Left: the micro-kernel. Right: the first loop around the micro-kernel.
Let us consider the first of the three loops in Figure 2. If one parallelizes the first loop around the micro-kernel (indexed by i_r), different instances of the micro-kernel are assigned to different threads. Our objective is to optimally use fast memory resources. In this case, the different threads share the same sliver of B̃, which resides in the L1 cache.

Notice that regardless of the size of the matrices on which we operate, this loop has a fixed number of iterations, ⌈m_c/m_r⌉, since it loops over m_c in steps of m_r. Thus, the amount of parallelism that can be extracted from this loop is quite limited. Additionally, a sliver of B̃ is brought from the L3 cache into the L1 cache and then used during each iteration of this loop. When parallelized, less time is spent in this loop and thus the cost of bringing that sliver of B̃ into the L1 cache is amortized over less computation. Notice that the cost of bringing B̃ into the L1 cache may be overlapped by computation, so it may be completely or partially hidden. In this case, there is a minimum amount of computation required to hide the cost of bringing B̃ into the L1 cache. Thus, parallelizing is acceptable only when this loop has a large number of iterations. These two factors mean that this loop should be parallelized only when the ratio of m_c to m_r is large. Unfortunately, this is not usually the case, as m_c is usually on the order of a few hundred elements.
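For illustration only, the sketch below parallelizes this loop with an OpenMP pragma, reusing MR, micro_kernel, and the packed-buffer layout from the sketch at the end of Section II; it is not how BLIS itself expresses its thread decomposition. The limited iteration count, ⌈m_c/m_r⌉, is what caps the parallelism here.

/* First loop around the micro-kernel (indexed by ir), parallelized for
   illustration. At is the packed block A~, Bt the current NR-wide sliver of
   B~ (shared by all threads), and C points to the mc x NR panel of C updated
   in this jr iteration. Only ceil(mc/MR) iterations are available. */
static void loop1_ir_parallel(int mc, int kc, const double *At,
                              const double *Bt, double *C, int ldc)
{
    #pragma omp parallel for schedule(static)
    for (int ir = 0; ir < mc; ir += MR)
        micro_kernel(kc, &At[ir * kc], Bt, &C[ir], ldc);
}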
C. Parallelizing the second loop around the micro-kernel (indexed by j_r).
Fig. 4. The second loop around the micro-kernel.
Now consider the second of the loops in Figure 2. If one parallelizes the second loop around the micro-kernel (indexed by j_r), each thread will be assigned a different sliver of B̃, which resides in the L1 cache, and they will all share the same block of Ã, which resides in the L2 cache. Then, each thread will multiply the block of Ã with its own sliver of B̃.

Similar to the first loop around the micro-kernel, this loop has a fixed number of iterations, as it iterates over n_c in steps of n_r. The time spent in this loop amortizes the cost of packing the block of Ã from main memory into the L2 cache. Thus, for similar reasons as the first loop around the micro-kernel, this loop should be parallelized only if the ratio of n_c to n_r is large. Fortunately, this is almost always the case, as n_c is typically on the order of several thousand elements.
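The analogous sketch for this loop (again an OpenMP illustration reusing definitions from the serial sketch in Section II, not the BLIS mechanism) gives each thread its own n_r-wide slivers of B̃ while the packed block Ã is shared by all threads:

/* Second loop around the micro-kernel (indexed by jr), parallelized for
   illustration. Each thread handles its own NR-wide slivers of B~ and shares
   the packed block A~; C points to the current mc x nc panel of C. */
static void loop2_jr_parallel(int mc, int nc, int kc, const double *At,
                              const double *Bt, double *C, int ldc)
{
    #pragma omp parallel for schedule(static)
    for (int jr = 0; jr < nc; jr += NR)          /* ceil(nc/NR) iterations    */
        for (int ir = 0; ir < mc; ir += MR)      /* first loop remains serial */
            micro_kernel(kc, &At[ir * kc], &Bt[jr * kc],
                         &C[ir + jr * ldc], ldc);
}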
Consider the case where this loop is parallelized and each thread shares a single L2 cache. Here, one block Ã will be moved into the L2 cache, and there will be several slivers of B̃ which also require space in the cache. Thus, it is possible that either Ã or the slivers of B̃ will have to be resized so that all fit into the cache simultaneously. However, slivers of B̃ are small compared to the size of the L2 cache, so this will likely not be an issue.
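A back-of-the-envelope check makes this concrete; the blocking values, thread count, and cache size below are purely illustrative and are not taken from either architecture studied in this paper.

/* Rough L2 footprint when the jr loop is parallelized and the L2 cache is
   shared: one packed block A~ plus one B~ sliver per thread. All numbers are
   hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    const int  mc = 128, kc = 256, nr = 4;   /* hypothetical blocking values */
    const int  nthreads = 4;                 /* threads sharing the L2 cache */
    const long l2_bytes = 512 * 1024;        /* hypothetical 512 KB L2 cache */

    long a_block  = (long)mc * kc * sizeof(double);   /* packed A~ block     */
    long b_sliver = (long)kc * nr * sizeof(double);   /* one sliver of B~    */
    long total    = a_block + nthreads * b_sliver;

    printf("A~ = %ld KB, %d B~ slivers = %ld KB, total = %ld KB of %ld KB\n",
           a_block / 1024, nthreads, nthreads * b_sliver / 1024,
           total / 1024, l2_bytes / 1024);
    return 0;
}

With these example numbers the Ã block accounts for 256 KB while the four slivers together account for only 32 KB, consistent with the observation above.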
Now consider the case where the L2 cache is not shared, and this loop over n_c is parallelized. Each thread will pack part of Ã, and then use the entire block of Ã for its local computation. In the serial case of GEMM, the process of packing Ã moves it into a single L2 cache. In contrast, parallelizing this loop results in various parts of Ã being placed into different L2 caches. This is due to the fact that the packing of Ã is parallelized. Within the parallelized packing routine, each thread will pack a different part of Ã, and so that part of Ã will end up in that thread’s private L2 cache. A cache coherency protocol must then be relied upon to guarantee that the pieces of Ã are duplicated across the L2 caches, as needed. This occurs during the execution of the micro-kernel and may be overlapped with computation. Because this results in extra memory movements and relies on cache coherency, this may or may not be desirable depending on the cost of duplication among the caches. Notice that if the architecture does not provide cache coherency, the duplication of the pieces of Ã must be done manually.
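A sketch of such a parallelized packing routine appears below (illustration only; it follows the packed layout used by pack_A in the serial sketch of Section II and assumes m_c is a multiple of m_r). Each thread packs, and therefore first touches, a different set of m_r-row slivers of Ã, which is why the pieces of Ã end up in different private L2 caches.

/* Parallel packing of an mc x kc block of A into A~ (illustration only).
   The MR-row slivers are divided among the threads, so each thread packs a
   different part of A~ and pulls that part into its own L2 cache. */
static void pack_A_parallel(int mc, int kc, const double *A, int lda, double *At)
{
    #pragma omp parallel for schedule(static)
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < MR; i++)
                At[ir * kc + p * MR + i] = A[(ir + i) + p * lda];
}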

D. Parallelizing the third loop around the inner-kernel (indexed by i_c).
Fig. 5. The third loop around the micro-kernel (first loop around Goto’s
inner kernel).
Next, consider the third loop around the micro-kernel in Figure 2. If one parallelizes this first loop around what we call the macro-kernel (indexed by i_c), which corresponds to Goto’s inner kernel, each thread will be assigned a different block of Ã, which resides in the L2 cache, and they will all share the same row panel of B̃, which resides in the L3 cache or main memory. Subsequently, each thread will multiply its own block of Ã with the shared row panel of B̃.
Unlike the inner-most two loops around the micro-kernel, the number of iterations of this loop is not limited by the blocking sizes; rather, the number of iterations of this loop depends on the size of m. Notice that when m is less than the product of m_c and the degree of parallelization of the loop, blocks of Ã will be smaller than optimal and performance will suffer.
Now consider the case where there is a single, shared L2 cache. If this loop is parallelized, there must be multiple blocks of Ã in this cache. Thus, the size of each Ã must be reduced by a factor equal to the degree of parallelization of this loop. The size of Ã is m_c × k_c, so either or both of these may be reduced. Notice that if we choose to reduce m_c, parallelizing this loop is equivalent to parallelizing the first loop around the micro-kernel. If instead each thread has its own L2 cache, each block of Ã resides in its own cache, and thus it would not need to be resized.
Now consider the case where there are multiple L3 caches. If this loop is parallelized, each thread will pack a different part of the row panel of B̃ into its own L3 cache. Then a cache coherency protocol must be relied upon to place every portion of B̃ in each L3 cache. As before, if the architecture does not provide cache coherency, this duplication of the pieces of B̃ must be done manually.
E. Parallelizing the fourth loop around the inner-kernel (indexed by p_c).
Consider the fourth loop around the micro-kernel. If one parallelizes this second loop around the macro-kernel (indexed by p_c), each thread will be assigned a different block of Ã and a different block of B̃. Unlike in the previously discussed opportunities for parallelism, each thread will update the same block of C, potentially creating race conditions. Thus, parallelizing this loop either requires some sort of locking mechanism or the creation of copies of the block of C
(initialized to zero) so that all threads can update their own copy, which is then followed by a reduction of these partial results, as illustrated in Figure 6. This loop should only be parallelized under very special circumstances. An example would be when C is small so that (1) only by parallelizing this loop can a satisfactory level of parallelism be achieved and (2) reducing (summing) the results is cheap relative to the other costs of computation. It is for these reasons that so-called 3D (sometimes called 2.5D) distributed memory matrix multiplication algorithms [12], [13] choose this loop for parallelization (in addition to parallelizing one or more of the other loops).

Fig. 6. Parallelization of the p_c loop requires local copies of the block of C to be made, which are summed upon completion of the loop.
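The sketch below illustrates the copy-and-reduce scheme with OpenMP: each thread accumulates its share of the p_c (that is, k) dimension into a private, zero-initialized copy of C, and the copies are then summed. For brevity the per-thread work is written as a naive triple loop instead of the blocked macro-kernel; the point is the communication pattern, not the BLIS implementation.

#include <omp.h>
#include <stdlib.h>

/* Parallelizing the pc loop (illustration only): per-thread partial products
   into private copies of C, followed by a reduction. */
void gemm_pc_parallel(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    int nth = omp_get_max_threads();
    double *Cpart = calloc((size_t)nth * m * n, sizeof(double));  /* zeroed */

    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int kb = (k + nth - 1) / nth;                /* k-range of this thread */
        int p0 = t * kb;
        int p1 = p0 + kb < k ? p0 + kb : k;
        double *Ct = &Cpart[(size_t)t * m * n];      /* this thread's copy of C */

        for (int j = 0; j < n; j++)                  /* naive stand-in for the  */
            for (int p = p0; p < p1; p++)            /* macro-kernel over this  */
                for (int i = 0; i < m; i++)          /* k-range                 */
                    Ct[i + (size_t)j * m] += A[i + (size_t)p * m]
                                           * B[p + (size_t)j * k];
    }

    for (int t = 0; t < nth; t++)                    /* reduce partial results  */
        for (size_t i = 0; i < (size_t)m * n; i++)
            C[i] += Cpart[(size_t)t * m * n + i];

    free(Cpart);
}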
F. Parallelizing the outer-most loop (indexed by j_c).
Fig. 7. The fifth (outer) loop around the micro-kernel.
Finally, consider the fifth loop around the micro-kernel (the third loop around the macro-kernel, and the outer-most loop). If one parallelizes this loop, each thread will be assigned a different row panel of B̃, and each thread will share the whole matrix A, which resides in main memory.
Consider the case where there is a single L3 cache. Then the size of a panel of B̃ must be reduced so that multiple panels of B̃ will fit in the L3 cache. If n_c is reduced, then this is equivalent to parallelizing the 2nd loop around the micro-kernel, in terms of how the data is partitioned among threads. If instead each thread has its own L3 cache, then the size of B̃ will not have to be altered, as each panel of B̃ will reside in its own cache.
Parallelizing this loop thus may be a good idea on multi-socket systems where each CPU has a separate L3 cache. Additionally, such systems often have a non-uniform memory access (NUMA) design, and thus it is important to have a separate panel of B̃ for each NUMA node, with each panel residing in that node’s local memory.
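One common way to obtain such node-local panels is to let the threads of each NUMA node allocate and pack their own buffer, relying on the operating system’s first-touch page placement. The sketch below illustrates the idea under simplifying assumptions: one thread stands in for each node, pack_B and the blocking constants come from the serial sketch in Section II, and the computation itself is elided. It is not BLIS’s actual memory management.

#include <omp.h>
#include <stdlib.h>

/* Per-NUMA-node B~ panels via first-touch placement (illustration only).
   Each thread stands in for one NUMA node: it allocates and packs its own
   panel, so the pages backing that panel land in that node's local memory.
   Reuses pack_B, KC, and NC from the earlier sketch; the computation that
   would consume each panel is elided. */
void pack_B_per_node(int k, int n, const double *B)
{
    #pragma omp parallel
    {
        double *Bt_local = malloc(sizeof(double) * KC * NC);
        int nodes = omp_get_num_threads();
        int node  = omp_get_thread_num();

        for (int jc = node * NC; jc < n; jc += nodes * NC) {  /* jc loop, round-robin */
            int nc = n - jc < NC ? n - jc : NC;
            for (int pc = 0; pc < k; pc += KC) {              /* pc loop              */
                int kc = k - pc < KC ? k - pc : KC;
                pack_B(kc, nc, &B[pc + (size_t)jc * k], k, Bt_local);
                /* ... block A~ packing and macro-kernel calls would go here ... */
            }
        }
        free(Bt_local);
    }
}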
Notice that since threads parallelizing this loop do not share any packed buffers of Ã or B̃, parallelizing this loop is, from
