Anatomy of High-Performance Many-Threaded
Matrix Multiplication
Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond and Field G. Van Zee
Institute for Computational Engineering and Sciences and Department of Computer Science, The University of Texas at Austin, Austin, TX 78712. Email: tms,rvdg,field@cs.utexas.edu
Parallel Computing Lab, Intel Corporation, Santa Clara, CA 95054. Email: mikhail.smelyanskiy@intel.com
Leadership Computing Facility, Argonne National Lab, Argonne, IL 60439. Email: jhammond@alcf.anl.gov
Abstract—BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the “GotoBLAS approach” to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM as well as additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability. The resulting implementations deliver what we believe to be the best open source performance for these architectures, achieving both impressive performance and excellent scalability.

Index Terms—linear algebra, libraries, high-performance, matrix, BLAS, multicore
I. INTRODUCTION
High-performance implementation of matrix-matrix multiplication (GEMM) is both of great practical importance, since many computations in scientific computing can be cast in terms of this operation, and of pedagogical importance, since it is often used to illustrate how to attain high performance on a novel architecture. A few of the many noteworthy papers from the past include Agarwal et al. [1] (an early paper that showed how an implementation in a high level language—Fortran—can attain high performance), Bilmes et al. [2] (which introduced auto-tuning and code generation using the C programming language), Whaley and Dongarra [3] (which productized the ideas behind PHiPAC), Kågström et al. [4] (which showed that the level-3 BLAS operations can be implemented in terms of the general rank-k update (GEMM)), and Goto and van de Geijn [5] (which described what is currently accepted to be the most effective approach to implementation, which we will call the GotoBLAS approach).
Very recently, we introduced the BLAS-like Library Instantiation Software (BLIS) [6] which can be viewed as a systematic reimplementation of the GotoBLAS, but with a number of key insights that greatly reduce the effort for the library developer. The primary innovation is the insight that the inner kernel—the smallest unit of computation within the GotoBLAS GEMM implementation—can be further simplified into two loops around a micro-kernel. This means that the library developer needs only implement and optimize a routine¹ that implements the computation of C := AB + C where C is a small submatrix that fits in the registers of a target architecture.
In a second paper [7], we reported experiences regarding
portability and performance on a large number of current
processors. Most of that paper is dedicated to implementation
and performance on a single core. A brief demonstration of
how BLIS also supports parallelism was included in that paper,
but with few details.
The present paper describes in detail the opportunities for
parallelism exposed by the BLIS implementation of GEMM. It
focuses specifically on how this supports high performance and
scalability when targeting many-core architectures that require
more threads than cores if near-peak performance is to be
attained. Two architectures are examined: the PowerPC A2
processor with 16 cores that underlies IBM’s Blue Gene/Q
supercomputer, which supports four-way hyperthreading for a
total of 64 threads; and the Intel Xeon Phi processor with 60 cores², which also supports four-way hyperthreading for a total of
240 threads. It is demonstrated that excellent performance and
scalability can be achieved specifically because of the extra
parallelism that is exposed by the BLIS approach within the
inner kernel employed by the GotoBLAS approach.
It is also shown that when many threads are employed it is necessary to parallelize in multiple dimensions. This builds upon Marker et al. [8], which we believe was the first paper to look at 2D work decomposition for GEMM on multithreaded architectures. The paper additionally builds upon work that describes the vendor implementations for the PowerPC A2 [9] and the Xeon Phi [10].

¹This micro-kernel routine is usually written in assembly code, but may also be expressed in C with vector intrinsics.
²In theory, 61 cores can be used for computation. In practice, 60 cores are usually employed.

Fig. 1. Illustration of which parts of the memory hierarchy each block of A and B reside in during the execution of the micro-kernel.
BLIS wraps many of those insights up in a cleaner framework so that exploration of the algorithmic design space is, in our experience, simplified. We show performance to be competitive relative to that of Intel’s Math Kernel Library (MKL) and IBM’s Engineering and Scientific Subroutine Library (ESSL)³.

³We do not compare to OpenBLAS [11] as there is no implementation for either the PowerPC A2 or the Xeon Phi, to our knowledge. ATLAS does not support either architecture under consideration in this paper, so no comparison can be made.
II. BLIS
In our discussions in this paper, we focus on the special case C := AB + C, where A, B, and C are m × k, k × n, and m × n, respectively.⁴ It helps to be familiar with the GotoBLAS approach to implementing GEMM, as described in [5]. We will briefly review the BLIS approach for a single core implementation in this section, with the aid of Figure 1.

⁴We will also write this operation as C += AB.
Our description starts with the outer-most loop, indexed by j_c. This loop partitions C and B into (wide) column panels. Next, A and the current column panel of B are partitioned into column panels and row panels, respectively, so that the current column panel of C (of width n_c) is updated as a sequence of rank-k updates (with k = k_c), indexed by p_c. At this point, the GotoBLAS approach packs the current row panel of B into a contiguous buffer, B̃. If there is an L3 cache, the computation is arranged to try to keep B̃ in the L3 cache. The primary reason for the outer-most loop, indexed by j_c, is to limit the amount of workspace required for B̃, with a secondary reason to allow B̃ to remain in the L3 cache.⁵

⁵The primary advantage of constraining B̃ to the L3 cache is that it is cheaper, in terms of energy efficiency, to access memory in the L3 cache rather than in main memory.
Now, the current panel of A is partitioned into blocks, indexed by i_c, that are packed into a contiguous buffer, Ã. The block is sized to occupy a substantial part of the L2 cache, leaving enough space to ensure that other data does not evict the block. The GotoBLAS approach then implements the “block-panel” multiplication of ÃB̃ as its inner kernel, making this the basic unit of computation. It is here that the BLIS approach continues to mimic the GotoBLAS approach, except that it explicitly exposes two additional loops. In BLIS, these loops are coded portably in C, whereas in GotoBLAS they are hidden within the implementation of the inner kernel (which is oftentimes assembly-coded).
At this point, we have Ã in the L2 cache and B̃ in the L3 cache (or main memory). The next loop, indexed by j_r, now partitions B̃ into column “slivers” (micro-panels) of width n_r. At a typical point of the computation, one such sliver is in the L1 cache, being multiplied by Ã. Panel B̃ was packed in such a way that this sliver is stored contiguously, one row (of width n_r) at a time.

for j_c = 0, ..., n-1 in steps of n_c                      (5th loop, outer-most)
   for p_c = 0, ..., k-1 in steps of k_c                   (4th loop around micro-kernel)
      for i_c = 0, ..., m-1 in steps of m_c                (3rd loop around micro-kernel)
         for j_r = 0, ..., n_c-1 in steps of n_r           (2nd loop around micro-kernel)
            for i_r = 0, ..., m_c-1 in steps of m_r        (1st loop around micro-kernel)
               C(i_r:i_r+m_r-1, j_r:j_r+n_r-1) += ...      (micro-kernel)
            endfor
         endfor
      endfor
   endfor
endfor

Fig. 2. Illustration of the three inner-most loops. The loops indexed by i_r and j_r are the loops that were hidden inside the GotoBLAS inner kernel.
Finally, the inner-most loop, indexed by i_r, partitions Ã into row slivers of height m_r. Block Ã was packed in such a way that this sliver is stored contiguously, one column (of height m_r) at a time. The BLIS micro-kernel then multiplies the current sliver of Ã by the current sliver of B̃ to update the corresponding m_r × n_r block of C. This micro-kernel performs a sequence of rank-1 updates (outer products) with columns from the sliver of Ã and rows from the sliver of B̃.
A typical point in the computation is now captured by Figure 1. An m_r × n_r block of C is in the registers. A k_c × n_r sliver of B̃ is in the L1 cache. The m_r × k_c sliver of Ã is streamed from the L2 cache. And so forth. The key takeaway here is that the layering described in this section can be captured by the five nested loops around the micro-kernel in Figure 2.
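To make this layering concrete, the sketch below expresses the five loops around the micro-kernel in plain C for column-major matrices. It is an unoptimized illustration of the structure rather than BLIS code: the blocking values, the packing routines, and the scalar micro-kernel are simplified stand-ins, and for brevity it assumes that m and n are multiples of m_r and n_r, respectively.

#include <stdlib.h>

/* Illustrative blocking parameters (placeholders, not tuned for any machine). */
enum { MC = 96, KC = 256, NC = 1024, MR = 4, NR = 4 };

static int imin(int a, int b) { return a < b ? a : b; }

/* Micro-kernel: C(0:MR-1, 0:NR-1) += (sliver of A~) * (sliver of B~), computed
   as kc rank-1 updates. At holds kc columns of height MR, stored contiguously;
   Bt holds kc rows of width NR, stored contiguously. */
static void micro_kernel(int kc, const double *At, const double *Bt,
                         double *C, int ldc)
{
    double c[MR][NR] = {{0.0}};
    for (int p = 0; p < kc; p++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                c[i][j] += At[p * MR + i] * Bt[p * NR + j];
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[i + j * ldc] += c[i][j];
}

/* Pack a kc x nc row panel of B into B~: NR-wide slivers, each stored row by row. */
static void pack_B(int kc, int nc, const double *B, int ldb, double *Bt)
{
    for (int jr = 0; jr < nc; jr += NR)
        for (int p = 0; p < kc; p++)
            for (int j = 0; j < NR; j++)
                *Bt++ = B[p + (jr + j) * ldb];
}

/* Pack an mc x kc block of A into A~: MR-high slivers, each stored column by column. */
static void pack_A(int mc, int kc, const double *A, int lda, double *At)
{
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < MR; i++)
                *At++ = A[(ir + i) + p * lda];
}

/* C += A * B via the five loops around the micro-kernel (column-major storage). */
void gemm_sketch(int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    double *Bt = malloc(sizeof(double) * KC * NC);   /* packed row panel B~ */
    double *At = malloc(sizeof(double) * MC * KC);   /* packed block A~     */

    for (int jc = 0; jc < n; jc += NC) {                    /* 5th (outer) loop */
        int nc = imin(NC, n - jc);
        for (int pc = 0; pc < k; pc += KC) {                /* 4th loop         */
            int kc = imin(KC, k - pc);
            pack_B(kc, nc, &B[pc + jc * k], k, Bt);
            for (int ic = 0; ic < m; ic += MC) {            /* 3rd loop         */
                int mc = imin(MC, m - ic);
                pack_A(mc, kc, &A[ic + pc * m], m, At);
                for (int jr = 0; jr < nc; jr += NR)         /* 2nd loop         */
                    for (int ir = 0; ir < mc; ir += MR)     /* 1st loop         */
                        micro_kernel(kc, &At[ir * kc], &Bt[jr * kc],
                                     &C[(ic + ir) + (jc + jr) * m], m);
            }
        }
    }
    free(Bt);
    free(At);
}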
III. OPPORTUNITIES FOR PARALLELISM
We have now set the stage to discuss opportunities for
parallelism and when those opportunities may be advantageous.
There are two key insights in this section:
• In GotoBLAS, the inner kernel is the basic unit of computation and no parallelization is incorporated within that inner kernel⁶. The BLIS framework exposes two loops within that inner kernel, thus exposing two extra opportunities for parallelism, for a total of five.
• It is important to use a given memory layer wisely. This gives guidance as to which loop should be parallelized.

⁶It is, of course, possible that more recent implementations by Goto deviate from this. However, these implementations are proprietary.
A. Parallelism within the micro-kernel
The micro-kernel is typically implemented as a sequence of rank-1 updates of the m_r × n_r block of C that is accumulated in the registers. Introducing parallelism over the loop around these rank-1 updates is ill-advised for three reasons: (1) the unit of computation is small, making the overhead considerable, (2) the different threads would accumulate contributions to the block of C, requiring a reduction across threads that is typically costly, and (3) each thread does less computation for each update of the m_r × n_r block of C, so the amortization of the cost of the update is reduced.
This merely means that parallelizing the loop around the
rank-1 updates is not advisable. One could envision carefully
parallelizing the micro-kernel in other ways for a core that re-
quires hyperthreading in order to attain peak performance. But
that kind of parallelism can be described as some combination
of parallelizing the first and second loop around the micro-
kernel. We will revisit this topic later on.
The key for this paper is that the micro-kernel is a basic unit
of computation for BLIS. We focus on how to get parallelism
without having to touch that basic unit of computation.
B. Parallelizing the first loop around the micro-kernel (indexed by i_r).
Fig. 3. Left: the micro-kernel. Right: the first loop around the micro-kernel.
Let us consider the first of the three loops in Figure 2. If one parallelizes the first loop around the micro-kernel (indexed by i_r), different instances of the micro-kernel are assigned to different threads. Our objective is to optimally use fast memory resources. In this case, the different threads share the same sliver of B̃, which resides in the L1 cache.

Notice that regardless of the size of the matrices on which we operate, this loop has a fixed number of iterations, ⌈m_c/m_r⌉, since it loops over m_c in steps of m_r. Thus, the amount of parallelism that can be extracted from this loop is quite limited. Additionally, a sliver of B̃ is brought from the L3 cache into the L1 cache and then used during each iteration of this loop. When parallelized, less time is spent in this loop and thus the cost of bringing that sliver of B̃ into the L1 cache is amortized over less computation. Notice that the cost of bringing B̃ into the L1 cache may be overlapped by computation, so it may be completely or partially hidden. In this case, there is a minimum amount of computation required to hide the cost of bringing B̃ into the L1 cache. Thus, parallelizing is acceptable only when this loop has a large number of iterations. These two factors mean that this loop should be parallelized only when the ratio of m_c to m_r is large. Unfortunately, this is not usually the case, as m_c is usually on the order of a few hundred elements.
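For illustration only, the sketch below parallelizes this loop with an OpenMP pragma, reusing MR, micro_kernel, and the packed-buffer layout from the sketch at the end of Section II; it is not how BLIS itself expresses its thread decomposition. The limited iteration count, ⌈m_c/m_r⌉, is what caps the parallelism here.

/* First loop around the micro-kernel (indexed by ir), parallelized for
   illustration. At is the packed block A~, Bt the current NR-wide sliver of
   B~ (shared by all threads), and C points to the mc x NR panel of C updated
   in this jr iteration. Only ceil(mc/MR) iterations are available. */
static void loop1_ir_parallel(int mc, int kc, const double *At,
                              const double *Bt, double *C, int ldc)
{
    #pragma omp parallel for schedule(static)
    for (int ir = 0; ir < mc; ir += MR)
        micro_kernel(kc, &At[ir * kc], Bt, &C[ir], ldc);
}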
C. Parallelizing the second loop around the micro-kernel (indexed by j_r).
Fig. 4. The second loop around the micro-kernel.
Now consider the second of the loops in Figure 2. If one parallelizes the second loop around the micro-kernel (indexed by j_r), each thread will be assigned a different sliver of B̃, which resides in the L1 cache, and they will all share the same block of Ã, which resides in the L2 cache. Then, each thread will multiply the block of Ã with its own sliver of B̃.

Similar to the first loop around the micro-kernel, this loop has a fixed number of iterations, as it iterates over n_c in steps of n_r. The time spent in this loop amortizes the cost of packing the block of Ã from main memory into the L2 cache. Thus, for similar reasons as the first loop around the micro-kernel, this loop should be parallelized only if the ratio of n_c to n_r is large. Fortunately, this is almost always the case, as n_c is typically on the order of several thousand elements.
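The analogous sketch for this loop (again an OpenMP illustration reusing definitions from the serial sketch in Section II, not the BLIS mechanism) gives each thread its own n_r-wide slivers of B̃ while the packed block Ã is shared by all threads:

/* Second loop around the micro-kernel (indexed by jr), parallelized for
   illustration. Each thread handles its own NR-wide slivers of B~ and shares
   the packed block A~; C points to the current mc x nc panel of C. */
static void loop2_jr_parallel(int mc, int nc, int kc, const double *At,
                              const double *Bt, double *C, int ldc)
{
    #pragma omp parallel for schedule(static)
    for (int jr = 0; jr < nc; jr += NR)          /* ceil(nc/NR) iterations    */
        for (int ir = 0; ir < mc; ir += MR)      /* first loop remains serial */
            micro_kernel(kc, &At[ir * kc], &Bt[jr * kc],
                         &C[ir + jr * ldc], ldc);
}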
Consider the case where this loop is parallelized and each thread shares a single L2 cache. Here, one block Ã will be moved into the L2 cache, and there will be several slivers of B̃ which also require space in the cache. Thus, it is possible that either Ã or the slivers of B̃ will have to be resized so that all fit into the cache simultaneously. However, slivers of B̃ are small compared to the size of the L2 cache, so this will likely not be an issue.
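A back-of-the-envelope check makes this concrete; the blocking values, thread count, and cache size below are purely illustrative and are not taken from either architecture studied in this paper.

/* Rough L2 footprint when the jr loop is parallelized and the L2 cache is
   shared: one packed block A~ plus one B~ sliver per thread. All numbers are
   hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    const int  mc = 128, kc = 256, nr = 4;   /* hypothetical blocking values */
    const int  nthreads = 4;                 /* threads sharing the L2 cache */
    const long l2_bytes = 512 * 1024;        /* hypothetical 512 KB L2 cache */

    long a_block  = (long)mc * kc * sizeof(double);   /* packed A~ block     */
    long b_sliver = (long)kc * nr * sizeof(double);   /* one sliver of B~    */
    long total    = a_block + nthreads * b_sliver;

    printf("A~ = %ld KB, %d B~ slivers = %ld KB, total = %ld KB of %ld KB\n",
           a_block / 1024, nthreads, nthreads * b_sliver / 1024,
           total / 1024, l2_bytes / 1024);
    return 0;
}

With these example numbers the Ã block accounts for 256 KB while the four slivers together account for only 32 KB, consistent with the observation above.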
Now consider the case where the L2 cache is not shared, and this loop over n_c is parallelized. Each thread will pack part of Ã, and then use the entire block of Ã for its local computation. In the serial case of GEMM, the process of packing Ã moves it into a single L2 cache. In contrast, parallelizing this loop results in various parts of Ã being placed into different L2 caches. This is due to the fact that the packing of Ã is parallelized. Within the parallelized packing routine, each thread will pack a different part of Ã, and so that part of Ã will end up in that thread’s private L2 cache. A cache coherency protocol must then be relied upon to guarantee that the pieces of Ã are duplicated across the L2 caches, as needed. This occurs during the execution of the micro-kernel and may be overlapped with computation. Because this results in extra memory movements and relies on cache coherency, this may or may not be desirable depending on the cost of duplication among the caches. Notice that if the architecture does not provide cache coherency, the duplication of the pieces of Ã must be done manually.
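A sketch of such a parallelized packing routine appears below (illustration only; it follows the packed layout used by pack_A in the serial sketch of Section II and assumes m_c is a multiple of m_r). Each thread packs, and therefore first touches, a different set of m_r-row slivers of Ã, which is why the pieces of Ã end up in different private L2 caches.

/* Parallel packing of an mc x kc block of A into A~ (illustration only).
   The MR-row slivers are divided among the threads, so each thread packs a
   different part of A~ and pulls that part into its own L2 cache. */
static void pack_A_parallel(int mc, int kc, const double *A, int lda, double *At)
{
    #pragma omp parallel for schedule(static)
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < MR; i++)
                At[ir * kc + p * MR + i] = A[(ir + i) + p * lda];
}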

D. Parallelizing the third loop around the inner-kernel (indexed by i_c).
Fig. 5. The third loop around the micro-kernel (first loop around Goto’s
inner kernel).
Next, consider the third loop around the micro-kernel in Figure 2. If one parallelizes this first loop around what we call the macro-kernel (indexed by i_c), which corresponds to Goto’s inner kernel, each thread will be assigned a different block of Ã, which resides in the L2 cache, and they will all share the same row panel of B̃, which resides in the L3 cache or main memory. Subsequently, each thread will multiply its own block of Ã with the shared row panel of B̃.
Unlike the inner-most two loops around the micro-kernel, the number of iterations of this loop is not limited by the blocking sizes; rather, the number of iterations of this loop depends on the size of m. Notice that when m is less than the product of m_c and the degree of parallelization of the loop, blocks of Ã will be smaller than optimal and performance will suffer.
Now consider the case where there is a single, shared L2 cache. If this loop is parallelized, there must be multiple blocks of Ã in this cache. Thus, the size of each Ã must be reduced by a factor equal to the degree of parallelization of this loop. The size of Ã is m_c × k_c, so either or both of these may be reduced. Notice that if we choose to reduce m_c, parallelizing this loop is equivalent to parallelizing the first loop around the micro-kernel. If instead each thread has its own L2 cache, each block of Ã resides in its own cache, and thus it would not need to be resized.
Now consider the case where there are multiple L3 caches. If this loop is parallelized, each thread will pack a different part of the row panel of B̃ into its own L3 cache. Then a cache coherency protocol must be relied upon to place every portion of B̃ in each L3 cache. As before, if the architecture does not provide cache coherency, this duplication of the pieces of B̃ must be done manually.
E. Parallelizing the fourth loop around the inner-kernel (indexed by p_c).
Consider the fourth loop around the micro-kernel. If one parallelizes this second loop around the macro-kernel (indexed by p_c), each thread will be assigned a different block of Ã and a different block of B̃. Unlike in the previously discussed opportunities for parallelism, each thread will update the same block of C, potentially creating race conditions. Thus, parallelizing this loop either requires some sort of locking mechanism or the creation of copies of the block of C
(initialized to zero) so that all threads can update their own copy, which is then followed by a reduction of these partial results, as illustrated in Figure 6. This loop should only be parallelized under very special circumstances. An example would be when C is small so that (1) only by parallelizing this loop can a satisfactory level of parallelism be achieved and (2) reducing (summing) the results is cheap relative to the other costs of computation. It is for these reasons that so-called 3D (sometimes called 2.5D) distributed memory matrix multiplication algorithms [12], [13] choose this loop for parallelization (in addition to parallelizing one or more of the other loops).

Fig. 6. Parallelization of the p_c loop requires local copies of the block of C to be made, which are summed upon completion of the loop.
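The sketch below illustrates the copy-and-reduce scheme with OpenMP: each thread accumulates its share of the p_c (that is, k) dimension into a private, zero-initialized copy of C, and the copies are then summed. For brevity the per-thread work is written as a naive triple loop instead of the blocked macro-kernel; the point is the communication pattern, not the BLIS implementation.

#include <omp.h>
#include <stdlib.h>

/* Parallelizing the pc loop (illustration only): per-thread partial products
   into private copies of C, followed by a reduction. */
void gemm_pc_parallel(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    int nth = omp_get_max_threads();
    double *Cpart = calloc((size_t)nth * m * n, sizeof(double));  /* zeroed */

    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int kb = (k + nth - 1) / nth;                /* k-range of this thread */
        int p0 = t * kb;
        int p1 = p0 + kb < k ? p0 + kb : k;
        double *Ct = &Cpart[(size_t)t * m * n];      /* this thread's copy of C */

        for (int j = 0; j < n; j++)                  /* naive stand-in for the  */
            for (int p = p0; p < p1; p++)            /* macro-kernel over this  */
                for (int i = 0; i < m; i++)          /* k-range                 */
                    Ct[i + (size_t)j * m] += A[i + (size_t)p * m]
                                           * B[p + (size_t)j * k];
    }

    for (int t = 0; t < nth; t++)                    /* reduce partial results  */
        for (size_t i = 0; i < (size_t)m * n; i++)
            C[i] += Cpart[(size_t)t * m * n + i];

    free(Cpart);
}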
F. Parallelizing the outer-most loop (indexed by j_c).
Fig. 7. The fifth (outer) loop around the micro-kernel.
Finally, consider the fifth loop around the micro-kernel (the third loop around the macro-kernel, and the outer-most loop). If one parallelizes this loop, each thread will be assigned a different row panel of B̃, and each thread will share the whole matrix A, which resides in main memory.
Consider the case where there is a single L3 cache. Then the size of a panel of B̃ must be reduced so that multiple panels of B̃ will fit in the L3 cache. If n_c is reduced, then this is equivalent to parallelizing the 2nd loop around the micro-kernel, in terms of how the data is partitioned among threads. If instead each thread has its own L3 cache, then the size of B̃ will not have to be altered, as each panel of B̃ will reside in its own cache.
Parallelizing this loop thus may be a good idea on multi-socket systems where each CPU has a separate L3 cache. Additionally, such systems often have a non-uniform memory access (NUMA) design, and thus it is important to have a separate panel of B̃ for each NUMA node, with each panel residing in that node’s local memory.
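One common way to obtain such node-local panels is to let the threads of each NUMA node allocate and pack their own buffer, relying on the operating system’s first-touch page placement. The sketch below illustrates the idea under simplifying assumptions: one thread stands in for each node, pack_B and the blocking constants come from the serial sketch in Section II, and the computation itself is elided. It is not BLIS’s actual memory management.

#include <omp.h>
#include <stdlib.h>

/* Per-NUMA-node B~ panels via first-touch placement (illustration only).
   Each thread stands in for one NUMA node: it allocates and packs its own
   panel, so the pages backing that panel land in that node's local memory.
   Reuses pack_B, KC, and NC from the earlier sketch; the computation that
   would consume each panel is elided. */
void pack_B_per_node(int k, int n, const double *B)
{
    #pragma omp parallel
    {
        double *Bt_local = malloc(sizeof(double) * KC * NC);
        int nodes = omp_get_num_threads();
        int node  = omp_get_thread_num();

        for (int jc = node * NC; jc < n; jc += nodes * NC) {  /* jc loop, round-robin */
            int nc = n - jc < NC ? n - jc : NC;
            for (int pc = 0; pc < k; pc += KC) {              /* pc loop              */
                int kc = k - pc < KC ? k - pc : KC;
                pack_B(kc, nc, &B[pc + (size_t)jc * k], k, Bt_local);
                /* ... block A~ packing and macro-kernel calls would go here ... */
            }
        }
        free(Bt_local);
    }
}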
Notice that since threads parallelizing this loop do not share any packed buffers of Ã or B̃, parallelizing this loop is, from
