Open Access Proceedings Article
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks


Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector
Multiplication Using Compressed Sparse Blocks
Aydın Buluç (aydin@cs.ucsb.edu)
Jeremy T. Fineman (jfineman@csail.mit.edu)
Matteo Frigo (matteo@cilk.com)
John R. Gilbert (gilbert@cs.ucsb.edu)
Charles E. Leiserson (cel@mit.edu)

Dept. of Computer Science, University of California, Santa Barbara, CA 93106
MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139
Cilk Arts, Inc., 55 Cambridge Street, Suite 200, Burlington, MA 01803
Abstract
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n×n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
Categories and Subject Descriptors
F.2.1 [Analysis of Algorithms and Problem Complexity]: Nu-
merical Algorithms and Problems—computations on matrices; G.4
[Mathematics of Computing]: Mathematical Software—parallel
and vector implementations
General Terms
Algorithms, Design, Experimentation, Performance, Theory
Keywords
Compressed sparse blocks, compressed sparse columns, com-
pressed sparse rows, matrix transpose, matrix-vector multiplica-
tion, multithreaded algorithm, parallelism, span, sparse matrix,
storage format, work.
This work was supported in part by the National Science Foundation
under Grants 0540248, 0615215, 0712243, 0822896, and 0709385, and by
MIT Lincoln Laboratory under contract 7000012980.
1 Introduction
When multiplying a large n ×n sparse matrix A having nnz nonze-
ros by a dense n-vector x, the memory bandwidth for reading A
can limit overall performance. Consequently, most algorithms to
compute Ax store A in a compressed format. One simple “tuple”
representation involves storing each nonzero of A as a triple con-
sisting of its row index, its column index, and the nonzero value
itself. This representation, however, requires storing 2nnz row and
column indices, in addition to the nonzeros. The current standard
storage format for sparse matrices in scientific computing, com-
pressed sparse rows (CSR) [32], is more efficient, because it stores
only n + nnz indices or pointers. This reduction in storage of CSR
compared with the tuple representation tends to result in faster se-
rial algorithms.
In the domain of parallel algorithms, however, CSR has its limitations. Although CSR lends itself to a simple parallel algorithm for computing the matrix-vector product Ax, this storage format does not admit an efficient parallel algorithm for computing the product Aᵀx, where Aᵀ denotes the transpose of the matrix A, or equivalently, for computing the product xᵀA of a row vector xᵀ by A. Although one could use compressed sparse columns (CSC) to compute Aᵀx, many applications, including iterative linear system solvers such as biconjugate gradients and quasi-minimal residual [32], require both Ax and Aᵀx. One could transpose A explicitly, but computing the transpose for either CSR or CSC formats is expensive. Moreover, since matrix-vector multiplication for sparse matrices is generally limited by memory bandwidth, it is desirable to find a storage format for which both Ax and Aᵀx can be computed in parallel without performing more than nnz fetches of nonzeros from the memory to compute either product.
This paper presents a new storage format called compressed sparse blocks (CSB) for representing sparse matrices. Like CSR and CSC, the CSB format requires only n + nnz words of storage for indices. Because CSB does not favor rows over columns or vice versa, it admits efficient parallel algorithms for computing either Ax or Aᵀx, as well as for computing Ax when A is symmetric and only half the matrix is actually stored.
Previous work on parallel sparse matrix-vector multiplication
has focused on reducing communication volume in a distributed-
memory setting, often by using graph or hypergraph partitioning
techniques to find good data distributions for particular matrices
( [7,38], for example). Good partitions generally exist for matrices
whose structures arise from numerical discretizations of partial dif-
ferential equations in two or three spatial dimensions. Our work, by
contrast, is motivated by multicore and manycore architectures, in

[Figure 1 plot: MFlops/sec (0–500) versus number of processors (1–8) for CSB_SpMV, CSB_SpMV_T, CSR_SpMV (Serial), CSR_SpMV_T (Serial), Star-P (y=Ax), and Star-P (y'=x'A).]
Figure 1: Average performance of Ax and Aᵀx operations on 13 different matrices from our benchmark test suite. CSB_SpMV and CSB_SpMV_T use compressed sparse blocks to perform Ax and Aᵀx, respectively. CSR_SpMV (Serial) and CSR_SpMV_T (Serial) use OSKI [39] and compressed sparse rows without any matrix-specific optimizations. Star-P (y=Ax) and Star-P (y'=x'A) use Star-P [34], a parallel code based on CSR. The experiments were run on a ccNUMA architecture powered by AMD Opteron 8214 (Santa Rosa) processors.
which parallelism and memory bandwidth are key resources. Our
algorithms are efficient in these measures for matrices with arbi-
trary nonzero structure.
Figure 1 presents an overall summary of achieved performance.
The serial CSR implementation uses plain OSKI [39] without any
matrix-specific optimizations. The graph shows the average perfor-
mance over all our test matrices except for the largest, which failed
to run on Star-P [34] due to memory constraints. The performance
is measured in Mflops (Millions of FLoating-point OPerationS) per
second. Both Ax and Aᵀx take 2·nnz flops. To measure performance, we divide this value by the time it takes for the computation to complete. Section 7 provides detailed performance results.
The remainder of this paper is organized as follows. Section 2 discusses the limitations of the CSR/CSC formats for parallelizing Ax and Aᵀx calculations. Section 3 describes the CSB format for sparse matrices. Section 4 presents the algorithms for computing Ax and Aᵀx using the CSB format, and Section 5 provides a theoretical analysis of their parallel performance. Section 6 describes the experimental setup we used, and Section 7 presents the results. Section 8 offers some concluding remarks.
2 Conventional storage formats
This section describes the CSR and CSC sparse-matrix storage formats and explores their limitations when it comes to computing both Ax and Aᵀx in parallel. We review the work/span formulation of parallelism and show that performing Ax with CSR (or equivalently Aᵀx with CSC) yields ample parallelism. We consider various strategies for performing Aᵀx in parallel with CSR (or equivalently Ax with CSC) and why they are problematic.
The compressed sparse row (CSR) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row. In one popular variant [14], CSR maintains one floating-point array val[nnz] and two integer arrays, col_ind[nnz] and row_ptr[n], to store the matrix A = (a_ij). The row_ptr array stores the index of each row in val. That is, if val[k] stores matrix element a_ij, then row_ptr[i] ≤ k < row_ptr[i+1]. The col_ind array stores the column indices of the elements in the val array. That
CSB_SPMV is shown below.

CSR_SPMV(A, x, y)
1  n ← A.rows
2  for i ← 0 to n−1 in parallel
3    do y[i] ← 0
4       for k ← A.row_ptr[i] to A.row_ptr[i+1] − 1
5         do y[i] ← y[i] + A.val[k] · x[A.col_ind[k]]

Figure 2: Parallel procedure for computing y ← Ax, where the n×n matrix A is stored in CSR format.
is, if val[k] stores matrix element a_ij, then col_ind[k] = j.
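As a concrete illustration of the CSR arrays and the multiplication of Figure 2, the following Python sketch uses the same val, col_ind, and row_ptr names from the text; the parallel loop of line 2 is rendered as an ordinary serial loop.

```python
def csr_spmv(n, val, col_ind, row_ptr, x):
    """Serial y = A*x for an n x n matrix A stored in CSR format."""
    y = [0.0] * n
    for i in range(n):                      # each row i (parallelizable, as in Figure 2)
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]  # no races: only y[i] is written
    return y

# Example matrix A = [[2, 0, 1],
#                     [0, 3, 0],
#                     [0, 4, 5]] in CSR form:
val     = [2.0, 1.0, 3.0, 4.0, 5.0]
col_ind = [0,   2,   1,   1,   2]
row_ptr = [0, 2, 3, 5]
```

Multiplying by the all-ones vector visits each nonzero exactly once, matching the Θ(nnz) work bound discussed below.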
The compressed sparse column (CSC) format is analogous to CSR, except that the nonzeros of each column, instead of row, are stored in contiguous memory locations. In other words, the CSC format for A is obtained by storing Aᵀ in CSR format.
The earliest written description of CSR that we have been able
to divine from the literature is an unnamed “scheme” presented in
Table 1 of the 1967 article [36] by Tinney and Walker, although
in 1963 Sato and Tinney [33] allude to what is probably CSR.
Markowitz’s seminal paper [28] on sparse Gaussian elimination
does not discuss data structures, but it is likely that Markowitz used
such a format as well. CSR and CSC have since become ubiquitous
in sparse matrix computation [13,16,17,21,23,32].
The following lemma states the well-known bound on space used by the index data in the CSR format (and hence the CSC format as well). By index data, we mean all data other than the nonzeros, that is, the row_ptr and col_ind arrays.
Lemma 1. The CSR format uses n lg nnz + nnz lg n bits of index data for an n×n matrix.
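Lemma 1 can be checked numerically against the tuple representation mentioned in the introduction; the helper names below and the ceiling applied to each logarithm are our own conventions, not the paper's.

```python
from math import ceil, log2

def csr_index_bits(n, nnz):
    # Lemma 1: n row pointers of lg(nnz) bits + nnz column indices of lg(n) bits
    return n * ceil(log2(nnz)) + nnz * ceil(log2(n))

def tuple_index_bits(n, nnz):
    # triple format: a full row index and column index per nonzero
    return 2 * nnz * ceil(log2(n))
```

For a matrix with n = 2^20 rows and nnz = 2^23 nonzeros, CSR's index data is well under half that of the tuple format, which is the storage reduction the text credits for faster serial algorithms.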
For a CSR matrix A, computing y ← Ax in parallel is straightforward, as shown in Figure 2. Procedure CSR_SPMV in the figure computes each element of the output array in parallel, and it does not suffer from race conditions, because each parallel iteration i writes to a single location y[i], which is not updated by any other iteration.
We shall measure the complexity of this code, and other codes in this paper, in terms of work and span [10, Ch. 27]:
• The work, denoted by T₁, is the running time on 1 processor.
• The span,¹ denoted by T∞, is the running time on an infinite number of processors.
The parallelism of the algorithm is T₁/T∞, which corresponds to the maximum possible speedup on any number of processors. Generally, if a machine has somewhat fewer processors than the parallelism of an application, a good scheduler should be able to achieve linear speedup. Thus, for a fixed amount of work, our goal is to achieve a sufficiently small span so that the parallelism exceeds the number of processors by a reasonable margin.
The work of CSR_SPMV is Θ(nnz), assuming, as we shall, that nnz ≥ n, because the body of the outer loop starting in line 2 executes for n iterations, and the body of the inner loop starting in line 4 executes for the number of nonzeros in the ith row, for a total of nnz times.
The span of CSR_SPMV depends on the maximum number nr of nonzeros in any row of the matrix A, since that number determines the worst-case time of any iteration of the loop in line 4. The n iterations of the parallel loop in line 2 contribute Θ(lg n) to the span, assuming that loops are implemented as binary recursion. Thus, the total span is Θ(nr + lg n).
The parallelism is therefore Θ(nnz/(nr + lg n)). In many common situations, we have nnz = Θ(n), which we will assume for estimation purposes. The maximum number nr of nonzeros in any
¹The literature also uses the terms depth [3] and critical-path length [4].

CSR_SPMV_T(A, x, y)
1  n ← A.cols
2  for i ← 0 to n−1
3    do y[i] ← 0
4  for i ← 0 to n−1
5    do for k ← A.row_ptr[i] to A.row_ptr[i+1] − 1
6         do y[A.col_ind[k]] ← y[A.col_ind[k]] + A.val[k] · x[i]

Figure 3: Serial procedure for computing y ← Aᵀx, where the n×n matrix A is stored in CSR format.
row can vary considerably, however, from a constant, if all rows have an average number of nonzeros, to n, if the matrix has a dense row. If nr = O(1), then the parallelism is Θ(nnz/lg n), which is quite high for a matrix with a billion nonzeros. In particular, if we ignore constants for the purpose of making a ballpark estimate, we have nnz/lg n ≈ 10⁹/(lg 10⁹) > 3×10⁷, which is much larger than any number of processors one is likely to encounter in the near future. If nr = Θ(n), however, as is the case when there is even a single dense row, we have parallelism Θ(nnz/n) = Θ(1), which limits scalability dramatically. Fortunately, we can parallelize the inner loop (line 4) using divide-and-conquer recursion to compute the sparse inner product in lg(nr) span without affecting the asymptotic work, thereby achieving parallelism Θ(nnz/lg n) in all cases.
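The ballpark estimate above can be reproduced directly; this is a plain transcription of the arithmetic, assuming nnz = n = 10⁹ as in the text.

```python
from math import log2

# Ballpark parallelism nnz / lg(n) for a matrix with a billion nonzeros
# (nnz = n = 10^9, nr = O(1)); the text claims this exceeds 3 x 10^7.
nnz = 10**9
parallelism = nnz / log2(nnz)
```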
Computing Aᵀx serially can be accomplished by simply interchanging the row and column indices [15], yielding the pseudocode shown in Figure 3. The work of procedure CSR_SPMV_T is Θ(nnz), the same as CSR_SPMV.
Parallelizing CSR_SPMV_T is not straightforward, however. We shall review several strategies to see why it is problematic.
One idea is to parallelize the loops in lines 2 and 5, but this strategy yields minimal scalability. First, the span of the procedure is Θ(n), due to the loop in line 4. Thus, the parallelism can be at most O(nnz/n), which is a small constant in most common situations. Second, in any practical system, the communication and synchronization overhead for executing a small loop in parallel is much larger than the execution time of the few operations executed in line 6.
Another idea is to execute the loop in line 4 in parallel. Unfortunately, this strategy introduces race conditions in the read/modify/write to y[A.col_ind[k]] in line 6.² These races can be addressed in two ways, neither of which is satisfactory.
The first solution involves locking column col_ind[k] or using some other form of atomic update.³ This solution is unsatisfactory because of the high overhead of the lock compared to the cost of the update. Moreover, if A contains a dense column, then the contention on the lock is Θ(n), which completely destroys any parallelism in the common case where nnz = Θ(n).
The second solution involves splitting the output array y into multiple arrays y_p in a way that avoids races, and then accumulating y ← Σ_p y_p at the end of the computation. For example, in a system with P processors (or threads), one could postulate that processor p only operates on array y_p, thereby avoiding any races. This solution is unsatisfactory because the work becomes Θ(nnz + Pn), where the last term comes from the need to initialize and accumulate P (dense) length-n arrays. Thus, the parallel execution time is Θ((nnz + Pn)/P) = Ω(n) no matter how many processors are available.
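This second workaround can be sketched as follows to make the extra Θ(Pn) work visible; the round-robin assignment of rows to the P private copies is our own simplification (any disjoint partition of rows among processors would do).

```python
def csr_spmv_t_split(n, val, col_ind, row_ptr, x, P):
    """y = A^T x using P private output copies (illustrates Theta(nnz + P*n) work)."""
    ys = [[0.0] * n for _ in range(P)]          # P length-n temporaries: Theta(P*n) work
    for i in range(n):                          # rows dealt to "processors" round-robin
        p = i % P
        for k in range(row_ptr[i], row_ptr[i + 1]):
            ys[p][col_ind[k]] += val[k] * x[i]  # processor p owns ys[p]: no races
    # final accumulation y = sum_p y_p: another Theta(P*n) work
    return [sum(ys[p][j] for p in range(P)) for j in range(n)]
```

Even for a serial run of this sketch, the initialization and accumulation loops touch P·n locations regardless of nnz, which is exactly why the parallel execution time cannot drop below Ω(n).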
²In fact, if nnz > n, then the “pigeonhole principle” guarantees that the program has at least one race condition.
³No mainstream hardware supports atomic update of floating-point quantities, however.

A third idea for parallelizing Aᵀx is to compute the transpose explicitly and then use CSR_SPMV. Unfortunately, parallel transposition of a sparse matrix in CSR format is costly and encounters exactly the same problems we are trying to avoid. Moreover, every element is accessed at least twice: once for the transpose, and once for the multiplication. Since the calculation of a matrix-vector product tends to be memory-bandwidth limited, this strategy is generally inferior to any strategy that accesses each element only once.
Finally, of course, we could store the matrix Aᵀ in CSR format, that is, storing A in CSC format, but then computing Ax becomes difficult.
To close this section, we should mention that if the matrix A is symmetric, so that only about half the nonzeros need be stored (for example, those on or above the diagonal), then computing Ax in parallel for CSR is also problematic. For this example, the elements below the diagonal are visited in an inconvenient order, as if they were stored in CSC format.
3 The CSB storage format
This section describes the CSB storage format for sparse matrices
and shows that it uses the same amount of storage space as the
CSR and CSC formats. We also compare CSB to other blocking
schemes.
For a given block-size parameter β, CSB partitions the n×n matrix A into n²/β² equal-sized β×β square blocks⁴

        ( A_00          A_01          ···   A_{0,n/β−1}       )
    A = ( A_10          A_11          ···   A_{1,n/β−1}       )
        (  ⋮             ⋮             ⋱        ⋮             )
        ( A_{n/β−1,0}   A_{n/β−1,1}   ···   A_{n/β−1,n/β−1}   ),

where the block A_ij is the β×β submatrix of A containing elements falling in rows iβ, iβ+1, ..., (i+1)β−1 and columns jβ, jβ+1, ..., (j+1)β−1 of A. For simplicity of presentation, we shall assume that β is an exact power of 2 and that it divides n; relaxing these assumptions is straightforward.
Many or most of the individual blocks A_ij are hypersparse [6], meaning that the ratio of nonzeros to matrix dimension is asymptotically 0. For example, if β = √n and nnz = cn, the average block has dimension √n and only c nonzeros. The space to store a block should therefore depend only on its nonzero count, not on its dimension.
CSB represents a block A_ij by compactly storing a triple for each nonzero, associating with the nonzero data element a row and column index. In contrast to the column index stored for each nonzero in CSR, the row and column indices lie within the submatrix A_ij, and hence require fewer bits. In particular, if β = √n, then each index into A_ij requires only half the bits of an index into A. Since these blocks are stored contiguously in memory, CSB uses an auxiliary array of pointers to locate the beginning of each block.
More specifically, CSB maintains a floating-point array val[nnz], and three integer arrays row_ind[nnz], col_ind[nnz], and blk_ptr[n²/β²]. We describe each of these arrays in turn.
The val array stores all the nonzeros of the matrix and is analogous to CSR's array of the same name. The difference is that CSR stores rows contiguously, whereas CSB stores blocks contiguously. Although each block must be contiguous, the ordering among blocks is flexible. Let f(i, j) be the bijection from pairs of block indices to integers in the range 0, 1, ..., n²/β²−1 that describes the ordering among blocks. That is, f(i, j) < f(i′, j′) if and only if A_ij appears before A_i′j′ in val. We discuss choices of ordering later in this section.

⁴The CSB format may be easily extended to nonsquare n×m matrices. In this case, the blocks remain as square β×β matrices, and there are nm/β² blocks.
The row_ind and col_ind arrays store the row and column indices, respectively, of the elements in the val array. These indices are relative to the block containing the particular element, not the entire matrix, and hence they range from 0 to β−1. That is, if val[k] stores the matrix element a_{iβ+r, jβ+c}, which is located in the rth row and cth column of the block A_ij, then row_ind[k] = r and col_ind[k] = c. As a practical matter, we can pack a corresponding pair of elements of row_ind and col_ind into a single integer word of 2 lg β bits so that they make a single array of length nnz, which is comparable to the storage needed by CSR for the col_ind array.
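The packing of an index pair into one 2 lg β-bit word can be sketched as follows; β = 16 is an arbitrary illustration here, and pack/unpack are hypothetical helper names, not identifiers from the paper.

```python
LG_BETA = 4            # beta = 16: within-block indices fit in lg(beta) = 4 bits each
BETA = 1 << LG_BETA

def pack(r, c):
    """Pack a (row, col) pair, each in [0, beta), into one 2*lg(beta)-bit word."""
    return (r << LG_BETA) | c

def unpack(w):
    """Recover the (row, col) pair from a packed word."""
    return w >> LG_BETA, w & (BETA - 1)
```

One such word per nonzero gives an index array of length nnz, matching the storage of CSR's col_ind array when β = √n.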
The blk_ptr array stores the index of each block in the val array, which is analogous to the row_ptr array for CSR. If val[k] stores a matrix element falling in block A_ij, then blk_ptr[f(i, j)] ≤ k < blk_ptr[f(i, j)+1].
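Putting the three index arrays together, a sketch of building CSB from coordinate-format triples might look like the following. The row-major choice f(i, j) = i·(n/β) + j is one ordering the text permits; the within-block ordering is left arbitrary here (the paper uses a Z-Morton order within blocks, which this sketch omits for brevity).

```python
def coo_to_csb(n, beta, triples):
    """Convert (i, j, a_ij) triples of an n x n matrix into CSB arrays,
    with blocks laid out in row-major order f(i, j) = i*(n//beta) + j."""
    nb = n // beta                                   # blocks per dimension
    # group nonzeros by block id (within-block order left arbitrary)
    triples = sorted(triples, key=lambda t: (t[0] // beta) * nb + (t[1] // beta))
    val, row_ind, col_ind = [], [], []
    blk_ptr = [0] * (nb * nb + 1)
    for i, j, a in triples:
        blk_ptr[(i // beta) * nb + (j // beta) + 1] += 1   # count per block
        val.append(a)
        row_ind.append(i % beta)                     # indices relative to the block
        col_ind.append(j % beta)
    for b in range(nb * nb):                         # prefix sums give block starts
        blk_ptr[b + 1] += blk_ptr[b]
    return val, row_ind, col_ind, blk_ptr
```

For a 4×4 matrix with β = 2, each nonzero lands in one of four 2×2 blocks, and blk_ptr delimits each block's contiguous segment of val, just as row_ptr delimits rows in CSR.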
The following lemma states the storage used for indices in the
CSB format.
Lemma 2. The CSB format uses (n²/β²) lg nnz + 2 nnz lg β bits of index data.
Proof. Since the val array contains nnz elements, referencing an element requires lg nnz bits, and hence the blk_ptr array uses (n²/β²) lg nnz bits of storage.
For each element in val, we use lg β bits to represent the row index and lg β bits to represent the column index, requiring a total of nnz lg β bits for each of row_ind and col_ind. Adding the space used by all three indexing arrays completes the proof.
To better understand the storage requirements of CSB, we present the following corollary for β = √n. In this case, both CSR (Lemma 1) and CSB use the same storage.
Corollary 3. The CSB format uses n lg nnz + nnz lg n bits of index data when β = √n.
Thus far, we have not addressed the ordering of elements within each block or the ordering of blocks. Within a block, we use a Z-Morton ordering [29], storing first all those elements in the top-left quadrant, then the top-right, bottom-left, and finally bottom-right quadrants, using the same layout recursively within each quadrant. In fact, these quadrants may be stored in any order, but the recursive ordering is necessary for our algorithm to achieve good parallelism within a block.
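The Z-Morton rank of a position within a β×β block is obtained by interleaving the bits of the row and column indices; sorting a block's nonzeros by this key yields the recursive quadrant order just described. This is a sketch with our own helper name.

```python
def morton_key(r, c, lg_beta):
    """Z-Morton rank of position (r, c) in a (2**lg_beta) x (2**lg_beta) block:
    interleave the bits of r and c, row bit first at each level."""
    key = 0
    for b in range(lg_beta - 1, -1, -1):       # from most significant bit down
        key = (key << 2) | (((r >> b) & 1) << 1) | ((c >> b) & 1)
    return key
```

For a 2×2 block the keys visit top-left, top-right, bottom-left, bottom-right in that order, and the same pattern repeats recursively inside each quadrant of larger blocks.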
The choice of storing the nonzeros within blocks in a recursive layout is opposite to the common wisdom for storing dense matrices [18]. Although most compilers and architectures favor conventional row/column ordering for optimal prefetching, the choice of layout within the block becomes less significant for sparse blocks, as they already do not take full advantage of such features. More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.
Our algorithm and analysis do not, however, require any particular ordering among blocks. A Z-Morton ordering (or any recursive ordering) seems desirable, as it should get better performance in practice by providing spatial locality, and it matches the ordering within a block. Computing the function f(i, j), however, is simpler for a row-major or column-major ordering among blocks.
Comparison with other blocking methods
A blocked variant of CSR, called BCSR, has been used for improving register reuse [24]. In BCSR, the sparse matrix is divided into small dense blocks that are stored in consecutive memory locations. Pointers are maintained to the first block on each row of blocks. BCSR storage is the converse of CSB storage, because BCSR stores a sparse collection of dense blocks, whereas CSB stores a dense collection of sparse blocks. We conjecture that it would be advantageous to apply BCSR-style register blocking to each individual sparse block of CSB.
Nishtala et al. [30] have proposed a data structure similar to CSB in the context of cache blocking. Our work differs from theirs in two ways. First, CSB is symmetric without favoring rows over columns. Second, our algorithms and analysis for CSB are designed for parallelism instead of cache performance. As shown in Section 5, CSB supports ample parallelism for algorithms computing Ax and Aᵀx, even on sparse and irregular matrices.
Blocking is also used in dense matrices. The Morton-hybrid layout [1,27], for example, uses a parameter equivalent to our parameter β for selecting the block size. Whereas in CSB we store elements in a Morton ordering within blocks and an arbitrary ordering among blocks, the Morton-hybrid layout stores elements in row-major order within blocks and a Morton ordering among blocks. The Morton-hybrid layout is designed to take advantage of hardware and compiler optimizations (within a block) while still exploiting the cache benefits of a recursive layout. Typically the block size is chosen to be 32×32, which is significantly smaller than the Θ(√n) block size we propose for CSB. The Morton-hybrid layout, however, considers only dense matrices, for which designing a matrix-vector multiplication algorithm with good parallelism is significantly easier.
4 Matrix-vector multiplication using CSB
This section describes a parallel algorithm for computing the sparse-matrix dense-vector product y ← Ax, where A is stored in CSB format. This algorithm can be used equally well for computing y ← Aᵀx by switching the roles of row and column. We first give an overview of the algorithm and then describe it in detail.
At a high level, the CSB_SPMV multiplication algorithm simply multiplies each “blockrow” by the vector x in parallel, where the ith blockrow is the row of blocks (A_i0 A_i1 ··· A_{i,n/β−1}). Since each blockrow multiplication writes to a different portion of the output vector, this part of the algorithm contains no races due to write conflicts.
If the nonzeros were guaranteed to be distributed evenly among blockrows, then the simple blockrow parallelism would yield an efficient algorithm with n/β-way parallelism by simply performing a serial multiplication for each blockrow. One cannot, in general, guarantee that the distribution of nonzeros will be so nice, however. In fact, sparse matrices in practice often include at least one dense row containing roughly n nonzeros, whereas the number of nonzeros is only nnz ≈ cn for some small constant c. Thus, performing a serial multiplication for each blockrow yields no better than c-way parallelism.
To make the algorithm robust to matrices of arbitrary nonzero structure, we must parallelize the blockrow multiplication when a blockrow contains “too many” nonzeros. This level of parallelization requires care to avoid races, however, because two blocks in the same blockrow write to the same region within the output vector. Specifically, when a blockrow contains Ω(β) nonzeros, we recursively divide it “in half,” yielding two subblockrows, each containing roughly half the nonzeros. Although each of these subblockrows can be multiplied in parallel, they may need to write to the same region of the output vector. To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the temporary vector into the output vector.
To facilitate fast subblockrow divisions, we first partition the
blockrow into “chunks” of consecutive blocks, each containing at

CSB_SPMV(A, x, y)
 1  for i ← 0 to n/β−1 in parallel   // For each blockrow.
 2    do Initialize a dynamic array R_i
 3       R_i[0] ← −1
 4       count ← 0                    // Count nonzeros in chunk.
 5       for j ← 0 to n/β−2
 6         do count ← count + nnz(A_ij)
 7            if count + nnz(A_{i,j+1}) > Θ(β)
 8              then // End the chunk, since the next block makes it too large.
 9                   append j to R_i  // Last block in chunk.
10                   count ← 0
11       append n/β−1 to R_i
12       CSB_BLOCKROWV(A, i, R_i, x, y[iβ..(i+1)β−1])

Figure 4: Pseudocode for the matrix-vector multiplication y ← Ax. The procedure CSB_BLOCKROWV (pseudocode for which can be found in Figure 5) as called here multiplies the blockrow by the vector x and writes the output into the appropriate region of the output vector y. The notation x[a..b] means the subarray of x starting at index a and ending at index b. The function nnz(A_ij) is shorthand for A.blk_ptr[f(i, j)+1] − A.blk_ptr[f(i, j)], which calculates the number of nonzeros in the block A_ij. For conciseness, we have overloaded the Θ(β) notation (in line 7) to mean “a constant times β”; any constant suffices for the analysis, and we use the constant 3 in our implementation.
most O(β) nonzeros (when possible) and Ω(β) nonzeros on average. The lower bound of Ω(β) will allow us to amortize the cost of writing to the length-β temporary vector against the nonzeros in the chunk. By dividing a blockrow “in half,” we mean assigning to each subblockrow roughly half the chunks.
Figure 4 gives the top-level algorithm, performing each blockrow vector multiplication in parallel. The for...in parallel do construct means that each iteration of the for loop may be executed in parallel with the others. For each loop iteration, we partition the blockrow into chunks in lines 2–11 and then call the blockrow multiplication in line 12. The array R_i stores the indices of the last block in each chunk; specifically, the kth chunk, for k > 0, includes blocks (A_{i,R_i[k−1]+1} A_{i,R_i[k−1]+2} ··· A_{i,R_i[k]}). A chunk consists of either a single block containing Ω(β) nonzeros, or it consists of many blocks containing O(β) nonzeros in total. To compute chunk boundaries, just iterate over blocks (in lines 5–10) until enough nonzeros are accrued.
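The chunking loop of lines 2–11 of Figure 4 can be sketched serially in Python as follows; block_nnz stands in for the nnz(A_ij) counts of one blockrow, and c·β plays the role of the overloaded Θ(β) threshold (c = 3 in the paper's implementation).

```python
def chunk_blockrow(block_nnz, beta, c=3):
    """Partition one blockrow into chunks (lines 2-11 of Figure 4).
    block_nnz[j] = number of nonzeros in block A_ij.  Returns R, where
    R[k] is the index of the last block of chunk k; chunk k covers
    blocks R[k-1]+1 .. R[k], with the sentinel R[0] = -1."""
    R = [-1]                       # sentinel: the first chunk starts at block 0
    count = 0
    for j in range(len(block_nnz) - 1):
        count += block_nnz[j]
        if count + block_nnz[j + 1] > c * beta:
            R.append(j)            # end the chunk: the next block makes it too large
            count = 0
    R.append(len(block_nnz) - 1)   # the last chunk ends at the final block
    return R
```

With β = 4 and counts [5, 5, 5, 20], the threshold of 12 splits the blockrow into chunks {0, 1}, {2}, and {3}, so the dense final block becomes a chunk of its own, as the text requires.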
Figure 5 gives the parallel algorithm CSB_BLOCKROWV for multiplying a blockrow by a vector, writing the result into the length-β vector y. In lines 22–29, the algorithm recursively divides the blockrow such that each half receives roughly the same number of chunks. We find the appropriate middles of the chunk array R and the input vector x in lines 22 and 23, respectively. We then allocate a length-β temporary vector z (line 24) and perform the recursive multiplications on each subblockrow in parallel (lines 25–27), having one of the recursive multiplications write its output to z. When these recursive multiplications complete, we merge the outputs into the vector y (lines 28–29).
The recursion bottoms out when the blockrow consists of a single chunk (lines 12–21). If this chunk contains many blocks, it is guaranteed to contain at most Θ(β) nonzeros, which is sufficiently sparse to perform the serial multiplication in line 20. If, on the other hand, the chunk is a single block, it may contain as many as β² ≈ n nonzeros. A serial multiplication here, therefore, would be the bottleneck in the algorithm. Instead, we perform the parallel block-vector multiplication CSB_BLOCKV in line 18.
If the blockrow recursion reaches a single block, we perform a
parallel multiplication of the block by the vector, given in Figure 6.
CSB_BLOCKROWV(A, i, R, x, y)
11  if R.length = 2   // The subblockrow is a single chunk.
12    then ℓ ← R[0] + 1   // Leftmost block in chunk.
13         r ← R[1]       // Rightmost block in chunk.
14         if ℓ = r
15           then // The chunk is a single (dense) block.
16                start ← A.blk_ptr[f(i, ℓ)]
17                end ← A.blk_ptr[f(i, ℓ)+1] − 1
18                CSB_BLOCKV(A, start, end, β, x, y)
19           else // The chunk is sparse.
20                multiply y ← (A_iℓ A_{i,ℓ+1} ··· A_ir) x serially
21         return
    // Since the blockrow is “dense,” split it in half.
22  mid ← ⌈R.length/2⌉ − 1   // Divide chunks in half.
    // Calculate the dividing point in the input vector x.
23  xmid ← β · (R[mid] − R[0])
24  allocate a length-β temporary vector z, initialized to 0
25  in parallel
26    do CSB_BLOCKROWV(A, i, R[0..mid], x[0..xmid−1], y)
27    do CSB_BLOCKROWV(A, i, R[mid..R.length−1], x[xmid..x.length−1], z)
28  for k ← 0 to β−1
29    do y[k] ← y[k] + z[k]

Figure 5: Pseudocode for the subblockrow vector product y ← (A_iℓ A_{i,ℓ+1} ··· A_ir) x. The in parallel do...do... construct indicates that all of the do code blocks may execute in parallel. The procedure CSB_BLOCKV (pseudocode for which can be found in Figure 6) calculates the product of the block and the vector in parallel.
The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array in lines 34–36.
To understand the pseudocode, consider the search for the divid-
ing point s2 between M00 ∪ M01 and M10 ∪ M11. For any recursively
chosen dim×dim matrix M, the column indices and row indices
of all elements have the same leading lg β − lg dim bits. Moreover,
for those elements in M00 ∪ M01, the next bit in the row index is a
0, whereas for those elements in M10 ∪ M11, the next bit in the
row index is 1. The algorithm does a binary search for the point at
which this bit flips. The cases for the dividing point between M00
and M01 or between M10 and M11 are similar, except that we focus on the
column index instead of the row index.
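This bit-flip search can be sketched as follows (a hypothetical toy version with invented names, not the paper's code): `interleave_bits` builds the Z-Morton key described in the text, and `split_top_bottom` binary-searches a Z-Morton-sorted run of row indices for the point where row-index bit lg dim − 1 flips from 0 to 1, assuming dim is a power of two.

```python
def interleave_bits(r, c, nbits):
    """Z-Morton key: interleave the bits of the row and column indices,
    taking the row bit as the more significant of each pair."""
    key = 0
    for b in range(nbits):
        key |= ((c >> b) & 1) << (2 * b)
        key |= ((r >> b) & 1) << (2 * b + 1)
    return key

def split_top_bottom(row_ind, start, end, dim):
    """Smallest index s in [start, end + 1) whose row index has bit
    lg(dim) - 1 set, i.e., the first element of M10 u M11."""
    bit = dim >> 1                  # mask for row-index bit lg(dim) - 1
    lo, hi = start, end + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if row_ind[mid] & bit:      # mid lies in M10 u M11
            hi = mid
        else:                       # mid lies in M00 u M01
            lo = mid + 1
    return lo
```

For a 4×4 submatrix holding the elements (0,0), (1,1), (0,3), (2,0), (3,3), the Z-Morton keys 0, 3, 5, 8, 15 are already sorted, and the top/bottom split falls just before (2,0).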
After dividing the matrix into quadrants, we execute the matrix
products involving matrices M00 and M11 in parallel (lines 37–39),
as they do not conflict on any outputs. After completing these prod-
ucts, we execute the other two matrix products in parallel (lines
40–42).⁵ This procedure resembles a standard parallel divide-and-
conquer matrix multiplication, except that our base case of serial
multiplication starts at a matrix containing Θ(dim) nonzeros (lines
29–32). Note that although we pass the full length-β arrays x and
y to each recursive call, the effective length of each array is halved
implicitly by partitioning M into quadrants. Passing the full arrays
is a technical detail required to properly compute array indices, as
the indices A.row_ind and A.col_ind store offsets within the block.
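A sequential toy version of this quadrant recursion might look as follows (a sketch under invented names, not the paper's CSB_BLOCKV). Unlike CSB, which locates the quadrant boundaries by binary search in the Z-Morton-sorted arrays, this sketch partitions an explicit triple list by scanning, and the two parallel-safe pairs simply run in order.

```python
def blockv(nnz, dim, x, y, r0=0, c0=0):
    """Toy y += M x for a dim x dim (sub)block given as (row, col, value)
    triples with absolute in-block offsets; dim is a power of two."""
    if dim == 1 or len(nnz) <= dim:        # base case: O(dim) nonzeros,
        for (r, c, v) in nnz:              # multiply serially
            y[r] += v * x[c]
        return
    h = dim // 2
    quad = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for (r, c, v) in nnz:                  # partition into quadrants
        quad[((r - r0) // h, (c - c0) // h)].append((r, c, v))
    # M00 and M11 write disjoint halves of y, so this pair is parallel-safe...
    blockv(quad[(0, 0)], h, x, y, r0, c0)
    blockv(quad[(1, 1)], h, x, y, r0 + h, c0 + h)
    # ...as is the pair M01 and M10, run after the first pair completes.
    blockv(quad[(0, 1)], h, x, y, r0, c0 + h)
    blockv(quad[(1, 0)], h, x, y, r0 + h, c0)
```

The pairing mirrors the text: the first pair touches only the top half of y (M00) and the bottom half (M11), so neither races with the other; the second pair likewise.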
The CSB_SPMV_T algorithm is identical to CSB_SPMV, ex-
cept that we operate over blockcolumns rather than blockrows.
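This symmetry can be seen in a hypothetical toy model (invented names, not the paper's code) where blocks are stored in a dict keyed by (blockrow, blockcolumn): computing Ax walks each block's triples with the blockrow selecting the output segment, while Aᵀx reuses the very same triples with the roles of the two coordinates swapped, i.e., it effectively walks blockcolumns.

```python
def spmv(blocks, nblocks, beta, x, transpose=False):
    """Toy y = A x (or y = A^T x) over an nblocks x nblocks grid of
    beta x beta blocks; `blocks` maps (blockrow, blockcol) to in-block
    (row_offset, col_offset, value) triples."""
    y = [0.0] * (nblocks * beta)
    for (bi, bj), triples in blocks.items():
        for (r, c, v) in triples:
            if transpose:
                # block (bi, bj) contributes as its transpose
                y[bj * beta + c] += v * x[bi * beta + r]
            else:
                y[bi * beta + r] += v * x[bj * beta + c]
    return y
```

Because the same block storage serves both products, no transposed copy of the matrix is needed, which is the point of the CSB format.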
⁵ The algorithm may instead do M00 and M10 in parallel followed by M01
and M11 in parallel without affecting the performance analysis. Presenting
the algorithm with two choices may yield better load balance.
References

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press.
Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM.
B. Stroustrup. The C++ Programming Language. Addison-Wesley.
Frequently Asked Questions (13)
Q1. What have the authors contributed in "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks" ?

This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector.

For each element in val, the authors use lg β bits to represent the row index and lg β bits to represent the column index, requiring a total of nnz lg β bits for each of row_ind and col_ind.

If the nonzeros were guaranteed to be distributed evenly among block rows, then the simple blockrow parallelism would yield an efficient algorithm with n/β-way parallelism by simply performing a serial multiplication for each blockrow. 

The Z-Morton ordering on nonzeros in each block is equivalent to first interleaving the bits of row_ind and col_ind, and then sorting the nonzeros using these bit-interleaved values as the keys. 

The current standard storage format for sparse matrices in scientific computing, compressed sparse rows (CSR) [32], is more efficient, because it stores only n + nnz indices or pointers. 

The compressed sparse row (CSR) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row. 

For CSB, the reported mean/max values are obtained by setting the block dimension β to be approximately √n, so that they are comparable with statistics from CSC.

In other words, if max(nnz(A_i)) < 2 · mean(nnz(A_i)), then the matrix is considered to have balanced blockrows and the optimization is applied.

The bitmasks are determined dynamically by the CSB constructor depending on the input matrix and the data type used for storing matrix indices. 

Converting to and from bit-interleaved integers, however, is expensive with current hardware support,6 which would be necessary for the serial base case in lines 29–32. 

This level of parallelization requires care to avoid races, however, because two blocks in the same blockrow write to the same region within the output vector. 

These indices are relative to the block containing the particular element, not the entire matrix, and hence they range from 0 to β−1. 

Although not all work-stealing schedulers are space efficient, those maintaining the busy-leaves property [5] (e.g., as used in the Cilk work-stealing scheduler [4]) are space efficient.