Anatomy of High-Performance Many-Threaded Matrix Multiplication
Frequently Asked Questions
Q2. What is the way to achieve peak performance?
To achieve peak performance, the instruction pipeline capable of executing floating-point operations must be executing a fused multiply-accumulate (FMA) instruction as often as possible.
Q3. What is the reason for parallelizing the inner loops instead of the outer ones?
More importantly, parallelizing the inner loops rather than the outer loops yields better spatial locality: each thread then works on one contiguous block of memory instead of several blocks that may not be contiguous.
Q4. How many threads can be used on the PowerPC A2?
Although the PowerPC A2 offers fewer threads than the Xeon Phi, its 64 hardware threads are still enough to require parallelizing multiple loops.
Q5. Why must the micro-kernel be designed with the Xeon Phi's hardware threads in mind?
Because of the highly parallel nature of the Intel Xeon Phi, the micro-kernel must be designed while keeping the parallelism gained from the core-sharing hardware threads in mind.
Q6. What is the cost of parallelizing the first loop around the micro-kernel?
When the first loop around the micro-kernel is parallelized, less time is spent in that loop, and thus the cost of bringing the corresponding sliver of B̃ into the L1 cache is amortized over less computation.
Q7. What is the way to perform a rank-k update?
When k is just slightly larger than a multiple of 240, an integer number of rank-k updates will be performed with the optimal blocksize kc, and one rank-k update will be performed with a smaller rank.
Q8. When is it worthwhile to parallelize a loop that requires a reduction?
An example would be when C is small so that (1) only by parallelizing this loop can a satisfactory level of parallelism be achieved and (2) reducing (summing) the results is cheap relative to the other costs of computation.
Q9. What is the cost of a rank-k update?
The rank-k update with a small value of k is expensive because in each micro-kernel call, an mr × nr block of C must be both read from and written to main memory.
Q10. Does parallelizing the jc loop offer an advantage on the Xeon Phi?
The jc loop: Since the Xeon Phi lacks an L3 cache, this loop provides no advantage over the jr loop for parallelizing in the n dimension.
Q11. What is the equivalent to parallelizing the 2nd loop around the micro-kernel?
If nc is reduced, then this is equivalent to parallelizing the 2nd loop around the micro-kernel, in terms of how the data is partitioned among threads.
Q12. What is the primary advantage of constraining B to the L3 cache?
The primary advantage of constraining B̃ to the L3 cache is that accessing memory in the L3 cache is cheaper, in terms of energy efficiency, than accessing main memory.
Q13. What is the advantage of parallelizing the jr loop?
Parallelizing the jr loop and synchronizing the four core-sharing hardware threads will reduce the bandwidth requirements of the micro-kernel.
Q14. Is the L1 cache large enough for the threads needed to reach peak performance?
A curiosity is that on both of these architectures the L1 cache is too small to support the multiple hardware threads that are required to attain near-peak performance.