Open Access Proceedings Article
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks


Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector
Multiplication Using Compressed Sparse Blocks
Aydın Buluç (aydin@cs.ucsb.edu)
Jeremy T. Fineman (jfineman@csail.mit.edu)
Matteo Frigo (matteo@cilk.com)
John R. Gilbert (gilbert@cs.ucsb.edu)
Charles E. Leiserson (cel@mit.edu)

Dept. of Computer Science, University of California, Santa Barbara, CA 93106
MIT CSAIL, 32 Vassar Street, Cambridge, MA 02139
Cilk Arts, Inc., 55 Cambridge Street, Suite 200, Burlington, MA 01803
Abstract
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n×n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
Categories and Subject Descriptors
F.2.1 [Analysis of Algorithms and Problem Complexity]: Nu-
merical Algorithms and Problems—computations on matrices; G.4
[Mathematics of Computing]: Mathematical Software—parallel
and vector implementations
General Terms
Algorithms, Design, Experimentation, Performance, Theory
Keywords
Compressed sparse blocks, compressed sparse columns, com-
pressed sparse rows, matrix transpose, matrix-vector multiplica-
tion, multithreaded algorithm, parallelism, span, sparse matrix,
storage format, work.
This work was supported in part by the National Science Foundation
under Grants 0540248, 0615215, 0712243, 0822896, and 0709385, and by
MIT Lincoln Laboratory under contract 7000012980.
1 Introduction
When multiplying a large n ×n sparse matrix A having nnz nonze-
ros by a dense n-vector x, the memory bandwidth for reading A
can limit overall performance. Consequently, most algorithms to
compute Ax store A in a compressed format. One simple “tuple”
representation involves storing each nonzero of A as a triple con-
sisting of its row index, its column index, and the nonzero value
itself. This representation, however, requires storing 2nnz row and
column indices, in addition to the nonzeros. The current standard
storage format for sparse matrices in scientific computing, com-
pressed sparse rows (CSR) [32], is more efficient, because it stores
only n + nnz indices or pointers. This reduction in storage of CSR
compared with the tuple representation tends to result in faster se-
rial algorithms.
In the domain of parallel algorithms, however, CSR has its limitations. Although CSR lends itself to a simple parallel algorithm for computing the matrix-vector product Ax, this storage format does not admit an efficient parallel algorithm for computing the product Aᵀx, where Aᵀ denotes the transpose of the matrix A, or equivalently, for computing the product xᵀA of a row vector xᵀ by A. Although one could use compressed sparse columns (CSC) to compute Aᵀx, many applications, including iterative linear system solvers such as biconjugate gradients and quasi-minimal residual [32], require both Ax and Aᵀx. One could transpose A explicitly, but computing the transpose for either CSR or CSC formats is expensive. Moreover, since matrix-vector multiplication for sparse matrices is generally limited by memory bandwidth, it is desirable to find a storage format for which both Ax and Aᵀx can be computed in parallel without performing more than nnz fetches of nonzeros from the memory to compute either product.
This paper presents a new storage format called compressed sparse blocks (CSB) for representing sparse matrices. Like CSR and CSC, the CSB format requires only n + nnz words of storage for indices. Because CSB does not favor rows over columns or vice versa, it admits efficient parallel algorithms for computing either Ax or Aᵀx, as well as for computing Ax when A is symmetric and only half the matrix is actually stored.
Previous work on parallel sparse matrix-vector multiplication
has focused on reducing communication volume in a distributed-
memory setting, often by using graph or hypergraph partitioning
techniques to find good data distributions for particular matrices
( [7,38], for example). Good partitions generally exist for matrices
whose structures arise from numerical discretizations of partial dif-
ferential equations in two or three spatial dimensions. Our work, by
contrast, is motivated by multicore and manycore architectures, in

[Figure 1 plot: MFlops/sec (0–500) versus number of processors (1–8) for CSB_SpMV, CSB_SpMV_T, CSR_SpMV (Serial), CSR_SpMV_T (Serial), Star-P (y=Ax), and Star-P (y'=x'A).]
Figure 1: Average performance of Ax and Aᵀx operations on 13 different matrices from our benchmark test suite. CSB_SpMV and CSB_SpMV_T use compressed sparse blocks to perform Ax and Aᵀx, respectively. CSR_SpMV (Serial) and CSR_SpMV_T (Serial) use OSKI [39] and compressed sparse rows without any matrix-specific optimizations. Star-P (y=Ax) and Star-P (y'=x'A) use Star-P [34], a parallel code based on CSR. The experiments were run on a ccNUMA architecture powered by AMD Opteron 8214 (Santa Rosa) processors.
which parallelism and memory bandwidth are key resources. Our
algorithms are efficient in these measures for matrices with arbi-
trary nonzero structure.
Figure 1 presents an overall summary of achieved performance.
The serial CSR implementation uses plain OSKI [39] without any
matrix-specific optimizations. The graph shows the average perfor-
mance over all our test matrices except for the largest, which failed
to run on Star-P [34] due to memory constraints. The performance
is measured in Mflops (Millions of FLoating-point OPerationS) per
second. Both Ax and Aᵀx take 2·nnz flops. To measure performance, we divide this value by the time it takes for the computation to complete. Section 7 provides detailed performance results.
The remainder of this paper is organized as follows. Section 2 discusses the limitations of the CSR/CSC formats for parallelizing Ax and Aᵀx calculations. Section 3 describes the CSB format for sparse matrices. Section 4 presents the algorithms for computing Ax and Aᵀx using the CSB format, and Section 5 provides a theoretical analysis of their parallel performance. Section 6 describes the experimental setup we used, and Section 7 presents the results. Section 8 offers some concluding remarks.
2 Conventional storage formats
This section describes the CSR and CSC sparse-matrix storage formats and explores their limitations when it comes to computing both Ax and Aᵀx in parallel. We review the work/span formulation of parallelism and show that performing Ax with CSR (or equivalently Aᵀx with CSC) yields ample parallelism. We consider various strategies for performing Aᵀx in parallel with CSR (or equivalently Ax with CSC) and why they are problematic.
The compressed sparse row (CSR) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row. In one popular variant [14], CSR maintains one floating-point array val[nnz] and two integer arrays, col_ind[nnz] and row_ptr[n], to store the matrix A = (a_ij). The row_ptr array stores the index of each row in val. That is, if val[k] stores matrix element a_ij, then row_ptr[i] ≤ k < row_ptr[i+1]. The col_ind array stores the column indices of the elements in the val array. That
CSB_SPMV is shown below.

CSR_SPMV(A, x, y)
1  n ← A.rows
2  for i ← 0 to n−1 in parallel
3    do y[i] ← 0
4       for k ← A.row_ptr[i] to A.row_ptr[i+1] − 1
5         do y[i] ← y[i] + A.val[k] · x[A.col_ind[k]]

Figure 2: Parallel procedure for computing y ← Ax, where the n×n matrix A is stored in CSR format.
is, if val[k] stores matrix element a_ij, then col_ind[k] = j.
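As a concrete illustration of the CSR arrays and the multiplication of Figure 2, the following Python sketch uses the same val, col_ind, and row_ptr names from the text; the parallel loop of line 2 is rendered as an ordinary serial loop.

```python
def csr_spmv(n, val, col_ind, row_ptr, x):
    """Serial y = A*x for an n x n matrix A stored in CSR format."""
    y = [0.0] * n
    for i in range(n):                      # each row i (parallelizable, as in Figure 2)
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]  # no races: only y[i] is written
    return y

# Example matrix A = [[2, 0, 1],
#                     [0, 3, 0],
#                     [0, 4, 5]] in CSR form:
val     = [2.0, 1.0, 3.0, 4.0, 5.0]
col_ind = [0,   2,   1,   1,   2]
row_ptr = [0, 2, 3, 5]
```

Multiplying by the all-ones vector visits each nonzero exactly once, matching the Θ(nnz) work bound discussed below.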
The compressed sparse column (CSC) format is analogous to CSR, except that the nonzeros of each column, instead of row, are stored in contiguous memory locations. In other words, the CSC format for A is obtained by storing Aᵀ in CSR format.
The earliest written description of CSR that we have been able
to divine from the literature is an unnamed “scheme” presented in
Table 1 of the 1967 article [36] by Tinney and Walker, although
in 1963 Sato and Tinney [33] allude to what is probably CSR.
Markowitz’s seminal paper [28] on sparse Gaussian elimination
does not discuss data structures, but it is likely that Markowitz used
such a format as well. CSR and CSC have since become ubiquitous
in sparse matrix computation [13,16,17,21,23,32].
The following lemma states the well-known bound on space used by the index data in the CSR format (and hence the CSC format as well). By index data, we mean all data other than the nonzeros, that is, the row_ptr and col_ind arrays.
Lemma 1. The CSR format uses n lg nnz + nnz lg n bits of index data for an n×n matrix.
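Lemma 1 can be checked numerically against the tuple representation mentioned in the introduction; the helper names below and the ceiling applied to each logarithm are our own conventions, not the paper's.

```python
from math import ceil, log2

def csr_index_bits(n, nnz):
    # Lemma 1: n row pointers of lg(nnz) bits + nnz column indices of lg(n) bits
    return n * ceil(log2(nnz)) + nnz * ceil(log2(n))

def tuple_index_bits(n, nnz):
    # triple format: a full row index and column index per nonzero
    return 2 * nnz * ceil(log2(n))
```

For a matrix with n = 2^20 rows and nnz = 2^23 nonzeros, CSR's index data is well under half that of the tuple format, which is the storage reduction the text credits for faster serial algorithms.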
For a CSR matrix A, computing y ← Ax in parallel is straightforward, as shown in Figure 2. Procedure CSR_SPMV in the figure computes each element of the output array in parallel, and it does not suffer from race conditions, because each parallel iteration i writes to a single location y[i], which is not updated by any other iteration.
We shall measure the complexity of this code, and other codes in this paper, in terms of work and span [10, Ch. 27]:
• The work, denoted by T₁, is the running time on 1 processor.
• The span,¹ denoted by T∞, is the running time on an infinite number of processors.
The parallelism of the algorithm is T₁/T∞, which corresponds to the maximum possible speedup on any number of processors. Generally, if a machine has somewhat fewer processors than the parallelism of an application, a good scheduler should be able to achieve linear speedup. Thus, for a fixed amount of work, our goal is to achieve a sufficiently small span so that the parallelism exceeds the number of processors by a reasonable margin.
The work of CSR_SPMV is Θ(nnz), assuming, as we shall, that nnz ≥ n, because the body of the outer loop starting in line 2 executes for n iterations, and the body of the inner loop starting in line 4 executes for the number of nonzeros in the ith row, for a total of nnz times.
The span of CSR_SPMV depends on the maximum number nr of nonzeros in any row of the matrix A, since that number determines the worst-case time of any iteration of the loop in line 4. The n iterations of the parallel loop in line 2 contribute Θ(lg n) to the span, assuming that loops are implemented as binary recursion. Thus, the total span is Θ(nr + lg n).
The parallelism is therefore Θ(nnz/(nr + lg n)). In many common situations, we have nnz = Θ(n), which we will assume for estimation purposes. The maximum number nr of nonzeros in any
¹The literature also uses the terms depth [3] and critical-path length [4].

CSR_SPMV_T(A, x, y)
1  n ← A.cols
2  for i ← 0 to n−1
3    do y[i] ← 0
4  for i ← 0 to n−1
5    do for k ← A.row_ptr[i] to A.row_ptr[i+1] − 1
6         do y[A.col_ind[k]] ← y[A.col_ind[k]] + A.val[k] · x[i]

Figure 3: Serial procedure for computing y ← Aᵀx, where the n×n matrix A is stored in CSR format.
row can vary considerably, however, from a constant, if all rows have an average number of nonzeros, to n, if the matrix has a dense row. If nr = O(1), then the parallelism is Θ(nnz/lg n), which is quite high for a matrix with a billion nonzeros. In particular, if we ignore constants for the purpose of making a ballpark estimate, we have nnz/lg n ≈ 10⁹/(lg 10⁹) > 3×10⁷, which is much larger than any number of processors one is likely to encounter in the near future. If nr = Θ(n), however, as is the case when there is even a single dense row, we have parallelism Θ(nnz/n) = Θ(1), which limits scalability dramatically. Fortunately, we can parallelize the inner loop (line 4) using divide-and-conquer recursion to compute the sparse inner product in lg(nr) span without affecting the asymptotic work, thereby achieving parallelism Θ(nnz/lg n) in all cases.
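The ballpark estimate above can be reproduced directly; this is a plain transcription of the arithmetic, assuming nnz = n = 10⁹ as in the text.

```python
from math import log2

# Ballpark parallelism nnz / lg(n) for a matrix with a billion nonzeros
# (nnz = n = 10^9, nr = O(1)); the text claims this exceeds 3 x 10^7.
nnz = 10**9
parallelism = nnz / log2(nnz)
```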
Computing Aᵀx serially can be accomplished by simply interchanging the row and column indices [15], yielding the pseudocode shown in Figure 3. The work of procedure CSR_SPMV_T is Θ(nnz), the same as CSR_SPMV.
Parallelizing CSR_SPMV_T is not straightforward, however. We shall review several strategies to see why it is problematic.
One idea is to parallelize the loops in lines 2 and 5, but this strategy yields minimal scalability. First, the span of the procedure is Θ(n), due to the loop in line 4. Thus, the parallelism can be at most O(nnz/n), which is a small constant in most common situations. Second, in any practical system, the communication and synchronization overhead for executing a small loop in parallel is much larger than the execution time of the few operations executed in line 6.
Another idea is to execute the loop in line 4 in parallel. Unfortunately, this strategy introduces race conditions in the read/modify/write to y[A.col_ind[k]] in line 6.² These races can be addressed in two ways, neither of which is satisfactory.
The first solution involves locking column col_ind[k] or using some other form of atomic update.³ This solution is unsatisfactory because of the high overhead of the lock compared to the cost of the update. Moreover, if A contains a dense column, then the contention on the lock is Θ(n), which completely destroys any parallelism in the common case where nnz = Θ(n).
The second solution involves splitting the output array y into multiple arrays y_p in a way that avoids races, and then accumulating y ← Σ_p y_p at the end of the computation. For example, in a system with P processors (or threads), one could postulate that processor p only operates on array y_p, thereby avoiding any races. This solution is unsatisfactory because the work becomes Θ(nnz + Pn), where the last term comes from the need to initialize and accumulate P (dense) length-n arrays. Thus, the parallel execution time is Θ((nnz + Pn)/P) = Ω(n) no matter how many processors are available.
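This second workaround can be sketched as follows to make the extra Θ(Pn) work visible; the round-robin assignment of rows to the P private copies is our own simplification (any disjoint partition of rows among processors would do).

```python
def csr_spmv_t_split(n, val, col_ind, row_ptr, x, P):
    """y = A^T x using P private output copies (illustrates Theta(nnz + P*n) work)."""
    ys = [[0.0] * n for _ in range(P)]          # P length-n temporaries: Theta(P*n) work
    for i in range(n):                          # rows dealt to "processors" round-robin
        p = i % P
        for k in range(row_ptr[i], row_ptr[i + 1]):
            ys[p][col_ind[k]] += val[k] * x[i]  # processor p owns ys[p]: no races
    # final accumulation y = sum_p y_p: another Theta(P*n) work
    return [sum(ys[p][j] for p in range(P)) for j in range(n)]
```

Even for a serial run of this sketch, the initialization and accumulation loops touch P·n locations regardless of nnz, which is exactly why the parallel execution time cannot drop below Ω(n).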
²In fact, if nnz > n, then the “pigeonhole principle” guarantees that the program has at least one race condition.
³No mainstream hardware supports atomic update of floating-point quantities, however.

A third idea for parallelizing Aᵀx is to compute the transpose explicitly and then use CSR_SPMV. Unfortunately, parallel transposition of a sparse matrix in CSR format is costly and encounters exactly the same problems we are trying to avoid. Moreover, every element is accessed at least twice: once for the transpose, and once for the multiplication. Since the calculation of a matrix-vector product tends to be memory-bandwidth limited, this strategy is generally inferior to any strategy that accesses each element only once.
Finally, of course, we could store the matrix Aᵀ in CSR format, that is, storing A in CSC format, but then computing Ax becomes difficult.
To close this section, we should mention that if the matrix A is symmetric, so that only about half the nonzeros need be stored (for example, those on or above the diagonal), then computing Ax in parallel for CSR is also problematic. For this example, the elements below the diagonal are visited in an inconvenient order, as if they were stored in CSC format.
3 The CSB storage format
This section describes the CSB storage format for sparse matrices
and shows that it uses the same amount of storage space as the
CSR and CSC formats. We also compare CSB to other blocking
schemes.
For a given block-size parameter β, CSB partitions the n×n matrix A into n²/β² equal-sized β×β square blocks⁴

        ( A_00          A_01          ···   A_{0,n/β−1}       )
    A = ( A_10          A_11          ···   A_{1,n/β−1}       )
        (  ⋮             ⋮             ⋱        ⋮             )
        ( A_{n/β−1,0}   A_{n/β−1,1}   ···   A_{n/β−1,n/β−1}   ),

where the block A_ij is the β×β submatrix of A containing elements falling in rows iβ, iβ+1, ..., (i+1)β−1 and columns jβ, jβ+1, ..., (j+1)β−1 of A. For simplicity of presentation, we shall assume that β is an exact power of 2 and that it divides n; relaxing these assumptions is straightforward.
Many or most of the individual blocks A_ij are hypersparse [6], meaning that the ratio of nonzeros to matrix dimension is asymptotically 0. For example, if β = √n and nnz = cn, the average block has dimension √n and only c nonzeros. The space to store a block should therefore depend only on its nonzero count, not on its dimension.
CSB represents a block A_ij by compactly storing a triple for each nonzero, associating with the nonzero data element a row and column index. In contrast to the column index stored for each nonzero in CSR, the row and column indices lie within the submatrix A_ij, and hence require fewer bits. In particular, if β = √n, then each index into A_ij requires only half the bits of an index into A. Since these blocks are stored contiguously in memory, CSB uses an auxiliary array of pointers to locate the beginning of each block.
More specifically, CSB maintains a floating-point array val[nnz], and three integer arrays row_ind[nnz], col_ind[nnz], and blk_ptr[n²/β²]. We describe each of these arrays in turn.
The val array stores all the nonzeros of the matrix and is analogous to CSR's array of the same name. The difference is that CSR stores rows contiguously, whereas CSB stores blocks contiguously. Although each block must be contiguous, the ordering among blocks is flexible. Let f(i, j) be the bijection from pairs of block indices to integers in the range 0, 1, ..., n²/β²−1 that describes the ordering among blocks. That is, f(i, j) < f(i′, j′) if and only if A_ij appears before A_i′j′ in val. We discuss choices of ordering later in this section.

⁴The CSB format may be easily extended to nonsquare n×m matrices. In this case, the blocks remain as square β×β matrices, and there are nm/β² blocks.
The row_ind and col_ind arrays store the row and column indices, respectively, of the elements in the val array. These indices are relative to the block containing the particular element, not the entire matrix, and hence they range from 0 to β−1. That is, if val[k] stores the matrix element a_{iβ+r, jβ+c}, which is located in the rth row and cth column of the block A_ij, then row_ind[k] = r and col_ind[k] = c. As a practical matter, we can pack a corresponding pair of elements of row_ind and col_ind into a single integer word of 2 lg β bits so that they make a single array of length nnz, which is comparable to the storage needed by CSR for the col_ind array.
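The packing of an index pair into one 2 lg β-bit word can be sketched as follows; β = 16 is an arbitrary illustration here, and pack/unpack are hypothetical helper names, not identifiers from the paper.

```python
LG_BETA = 4            # beta = 16: within-block indices fit in lg(beta) = 4 bits each
BETA = 1 << LG_BETA

def pack(r, c):
    """Pack a (row, col) pair, each in [0, beta), into one 2*lg(beta)-bit word."""
    return (r << LG_BETA) | c

def unpack(w):
    """Recover the (row, col) pair from a packed word."""
    return w >> LG_BETA, w & (BETA - 1)
```

One such word per nonzero gives an index array of length nnz, matching the storage of CSR's col_ind array when β = √n.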
The blk_ptr array stores the index of each block in the val array, which is analogous to the row_ptr array for CSR. If val[k] stores a matrix element falling in block A_ij, then blk_ptr[f(i, j)] ≤ k < blk_ptr[f(i, j)+1].
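Putting the three index arrays together, a sketch of building CSB from coordinate-format triples might look like the following. The row-major choice f(i, j) = i·(n/β) + j is one ordering the text permits; the within-block ordering is left arbitrary here (the paper uses a Z-Morton order within blocks, which this sketch omits for brevity).

```python
def coo_to_csb(n, beta, triples):
    """Convert (i, j, a_ij) triples of an n x n matrix into CSB arrays,
    with blocks laid out in row-major order f(i, j) = i*(n//beta) + j."""
    nb = n // beta                                   # blocks per dimension
    # group nonzeros by block id (within-block order left arbitrary)
    triples = sorted(triples, key=lambda t: (t[0] // beta) * nb + (t[1] // beta))
    val, row_ind, col_ind = [], [], []
    blk_ptr = [0] * (nb * nb + 1)
    for i, j, a in triples:
        blk_ptr[(i // beta) * nb + (j // beta) + 1] += 1   # count per block
        val.append(a)
        row_ind.append(i % beta)                     # indices relative to the block
        col_ind.append(j % beta)
    for b in range(nb * nb):                         # prefix sums give block starts
        blk_ptr[b + 1] += blk_ptr[b]
    return val, row_ind, col_ind, blk_ptr
```

For a 4×4 matrix with β = 2, each nonzero lands in one of four 2×2 blocks, and blk_ptr delimits each block's contiguous segment of val, just as row_ptr delimits rows in CSR.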
The following lemma states the storage used for indices in the
CSB format.
Lemma 2. The CSB format uses (n²/β²) lg nnz + 2 nnz lg β bits of index data.
Proof. Since the val array contains nnz elements, referencing an element requires lg nnz bits, and hence the blk_ptr array uses (n²/β²) lg nnz bits of storage.
For each element in val, we use lg β bits to represent the row index and lg β bits to represent the column index, requiring a total of nnz lg β bits for each of row_ind and col_ind. Adding the space used by all three indexing arrays completes the proof.
To better understand the storage requirements of CSB, we present the following corollary for β = √n. In this case, both CSR (Lemma 1) and CSB use the same storage.
Corollary 3. The CSB format uses n lg nnz + nnz lg n bits of index data when β = √n.
Thus far, we have not addressed the ordering of elements within each block or the ordering of blocks. Within a block, we use a Z-Morton ordering [29], storing first all those elements in the top-left quadrant, then the top-right, bottom-left, and finally bottom-right quadrants, using the same layout recursively within each quadrant. In fact, these quadrants may be stored in any order, but the recursive ordering is necessary for our algorithm to achieve good parallelism within a block.
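The Z-Morton rank of a position within a β×β block is obtained by interleaving the bits of the row and column indices; sorting a block's nonzeros by this key yields the recursive quadrant order just described. This is a sketch with our own helper name.

```python
def morton_key(r, c, lg_beta):
    """Z-Morton rank of position (r, c) in a (2**lg_beta) x (2**lg_beta) block:
    interleave the bits of r and c, row bit first at each level."""
    key = 0
    for b in range(lg_beta - 1, -1, -1):       # from most significant bit down
        key = (key << 2) | (((r >> b) & 1) << 1) | ((c >> b) & 1)
    return key
```

For a 2×2 block the keys visit top-left, top-right, bottom-left, bottom-right in that order, and the same pattern repeats recursively inside each quadrant of larger blocks.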
The choice of storing the nonzeros within blocks in a recursive layout is opposite to the common wisdom for storing dense matrices [18]. Although most compilers and architectures favor conventional row/column ordering for optimal prefetching, the choice of layout within the block becomes less significant for sparse blocks, as they already do not take full advantage of such features. More importantly, a recursive ordering allows us to efficiently determine the four quadrants of a block using binary search, which is crucial for parallelizing individual blocks.
Our algorithm and analysis do not, however, require any particular ordering among blocks. A Z-Morton ordering (or any recursive ordering) seems desirable, as it should get better performance in practice by providing spatial locality, and it matches the ordering within a block. Computing the function f(i, j), however, is simpler for a row-major or column-major ordering among blocks.
Comparison with other blocking methods
A blocked variant of CSR, called BCSR, has been used for improving register reuse [24]. In BCSR, the sparse matrix is divided into small dense blocks that are stored in consecutive memory locations. Pointers are maintained to the first block on each row of blocks. BCSR storage is the converse of CSB storage, because BCSR stores a sparse collection of dense blocks, whereas CSB stores a dense collection of sparse blocks. We conjecture that it would be advantageous to apply BCSR-style register blocking to each individual sparse block of CSB.
Nishtala et al. [30] have proposed a data structure similar to CSB in the context of cache blocking. Our work differs from theirs in two ways. First, CSB is symmetric without favoring rows over columns. Second, our algorithms and analysis for CSB are designed for parallelism instead of cache performance. As shown in Section 5, CSB supports ample parallelism for algorithms computing Ax and Aᵀx, even on sparse and irregular matrices.
Blocking is also used in dense matrices. The Morton-hybrid layout [1,27], for example, uses a parameter equivalent to our parameter β for selecting the block size. Whereas in CSB we store elements in a Morton ordering within blocks and an arbitrary ordering among blocks, the Morton-hybrid layout stores elements in row-major order within blocks and a Morton ordering among blocks. The Morton-hybrid layout is designed to take advantage of hardware and compiler optimizations (within a block) while still exploiting the cache benefits of a recursive layout. Typically the block size is chosen to be 32×32, which is significantly smaller than the Θ(√n) block size we propose for CSB. The Morton-hybrid layout, however, considers only dense matrices, for which designing a matrix-vector multiplication algorithm with good parallelism is significantly easier.
4 Matrix-vector multiplication using CSB
This section describes a parallel algorithm for computing the sparse-matrix dense-vector product y ← Ax, where A is stored in CSB format. This algorithm can be used equally well for computing y ← Aᵀx by switching the roles of row and column. We first give an overview of the algorithm and then describe it in detail.
At a high level, the CSB_SPMV multiplication algorithm simply multiplies each “blockrow” by the vector x in parallel, where the ith blockrow is the row of blocks (A_i0 A_i1 ··· A_{i,n/β−1}). Since each blockrow multiplication writes to a different portion of the output vector, this part of the algorithm contains no races due to write conflicts.
If the nonzeros were guaranteed to be distributed evenly among blockrows, then the simple blockrow parallelism would yield an efficient algorithm with n/β-way parallelism by simply performing a serial multiplication for each blockrow. One cannot, in general, guarantee that the distribution of nonzeros will be so nice, however. In fact, sparse matrices in practice often include at least one dense row containing roughly n nonzeros, whereas the number of nonzeros is only nnz ≈ cn for some small constant c. Thus, performing a serial multiplication for each blockrow yields no better than c-way parallelism.
To make the algorithm robust to matrices of arbitrary nonzero structure, we must parallelize the blockrow multiplication when a blockrow contains “too many” nonzeros. This level of parallelization requires care to avoid races, however, because two blocks in the same blockrow write to the same region within the output vector. Specifically, when a blockrow contains Ω(β) nonzeros, we recursively divide it “in half,” yielding two subblockrows, each containing roughly half the nonzeros. Although each of these subblockrows can be multiplied in parallel, they may need to write to the same region of the output vector. To avoid the races that might arise due to write conflicts between the subblockrows, we allocate a temporary vector to store the result of one of the subblockrows and allow the other subblockrow to use the output vector. After both subblockrow multiplications complete, we serially add the temporary vector into the output vector.
To facilitate fast subblockrow divisions, we first partition the
blockrow into “chunks” of consecutive blocks, each containing at

CSB_SPMV(A, x, y)
 1  for i ← 0 to n/β−1 in parallel   // For each blockrow.
 2    do Initialize a dynamic array R_i
 3       R_i[0] ← −1
 4       count ← 0                    // Count nonzeros in chunk.
 5       for j ← 0 to n/β−2
 6         do count ← count + nnz(A_ij)
 7            if count + nnz(A_{i,j+1}) > Θ(β)
 8              then // End the chunk, since the next block makes it too large.
 9                   append j to R_i  // Last block in chunk.
10                   count ← 0
11       append n/β−1 to R_i
12       CSB_BLOCKROWV(A, i, R_i, x, y[iβ..(i+1)β−1])

Figure 4: Pseudocode for the matrix-vector multiplication y ← Ax. The procedure CSB_BLOCKROWV (pseudocode for which can be found in Figure 5) as called here multiplies the blockrow by the vector x and writes the output into the appropriate region of the output vector y. The notation x[a..b] means the subarray of x starting at index a and ending at index b. The function nnz(A_ij) is shorthand for A.blk_ptr[f(i, j)+1] − A.blk_ptr[f(i, j)], which calculates the number of nonzeros in the block A_ij. For conciseness, we have overloaded the Θ(β) notation (in line 7) to mean “a constant times β”; any constant suffices for the analysis, and we use the constant 3 in our implementation.
most O(β) nonzeros (when possible) and Ω(β) nonzeros on average. The lower bound of Ω(β) will allow us to amortize the cost of writing to the length-β temporary vector against the nonzeros in the chunk. By dividing a blockrow “in half,” we mean assigning to each subblockrow roughly half the chunks.
Figure 4 gives the top-level algorithm, performing each blockrow vector multiplication in parallel. The for...in parallel do construct means that each iteration of the for loop may be executed in parallel with the others. For each loop iteration, we partition the blockrow into chunks in lines 2–11 and then call the blockrow multiplication in line 12. The array R_i stores the indices of the last block in each chunk; specifically, the kth chunk, for k > 0, includes blocks (A_{i,R_i[k−1]+1} A_{i,R_i[k−1]+2} ··· A_{i,R_i[k]}). A chunk consists of either a single block containing Ω(β) nonzeros, or it consists of many blocks containing O(β) nonzeros in total. To compute chunk boundaries, just iterate over blocks (in lines 5–10) until enough nonzeros are accrued.
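The chunking loop of lines 2–11 of Figure 4 can be sketched serially in Python as follows; block_nnz stands in for the nnz(A_ij) counts of one blockrow, and c·β plays the role of the overloaded Θ(β) threshold (c = 3 in the paper's implementation).

```python
def chunk_blockrow(block_nnz, beta, c=3):
    """Partition one blockrow into chunks (lines 2-11 of Figure 4).
    block_nnz[j] = number of nonzeros in block A_ij.  Returns R, where
    R[k] is the index of the last block of chunk k; chunk k covers
    blocks R[k-1]+1 .. R[k], with the sentinel R[0] = -1."""
    R = [-1]                       # sentinel: the first chunk starts at block 0
    count = 0
    for j in range(len(block_nnz) - 1):
        count += block_nnz[j]
        if count + block_nnz[j + 1] > c * beta:
            R.append(j)            # end the chunk: the next block makes it too large
            count = 0
    R.append(len(block_nnz) - 1)   # the last chunk ends at the final block
    return R
```

With β = 4 and counts [5, 5, 5, 20], the threshold of 12 splits the blockrow into chunks {0, 1}, {2}, and {3}, so the dense final block becomes a chunk of its own, as the text requires.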
Figure 5 gives the parallel algorithm CSB_BLOCKROWV for multiplying a blockrow by a vector, writing the result into the length-β vector y. In lines 22–29, the algorithm recursively divides the blockrow such that each half receives roughly the same number of chunks. We find the appropriate middles of the chunk array R and the input vector x in lines 22 and 23, respectively. We then allocate a length-β temporary vector z (line 24) and perform the recursive multiplications on each subblockrow in parallel (lines 25–27), having one of the recursive multiplications write its output to z. When these recursive multiplications complete, we merge the outputs into the vector y (lines 28–29).
The recursion bottoms out when the blockrow consists of a single chunk (lines 12–21). If this chunk contains many blocks, it is guaranteed to contain at most Θ(β) nonzeros, which is sufficiently sparse to perform the serial multiplication in line 20. If, on the other hand, the chunk is a single block, it may contain as many as β² ≈ n nonzeros. A serial multiplication here, therefore, would be the bottleneck in the algorithm. Instead, we perform the parallel block-vector multiplication CSB_BLOCKV in line 18.
If the blockrow recursion reaches a single block, we perform a
parallel multiplication of the block by the vector, given in Figure 6.
CSB_BLOCKROWV(A, i, R, x, y)
11  if R.length = 2   // The subblockrow is a single chunk.
12    then ℓ ← R[0] + 1   // Leftmost block in chunk.
13         r ← R[1]       // Rightmost block in chunk.
14         if ℓ = r
15           then // The chunk is a single (dense) block.
16                start ← A.blk_ptr[f(i, ℓ)]
17                end ← A.blk_ptr[f(i, ℓ)+1] − 1
18                CSB_BLOCKV(A, start, end, β, x, y)
19           else // The chunk is sparse.
20                multiply y ← (A_iℓ A_{i,ℓ+1} ··· A_ir) x serially
21         return
    // Since the blockrow is “dense,” split it in half.
22  mid ← ⌈R.length/2⌉ − 1   // Divide chunks in half.
    // Calculate the dividing point in the input vector x.
23  xmid ← β · (R[mid] − R[0])
24  allocate a length-β temporary vector z, initialized to 0
25  in parallel
26    do CSB_BLOCKROWV(A, i, R[0..mid], x[0..xmid−1], y)
27    do CSB_BLOCKROWV(A, i, R[mid..R.length−1], x[xmid..x.length−1], z)
28  for k ← 0 to β−1
29    do y[k] ← y[k] + z[k]

Figure 5: Pseudocode for the subblockrow vector product y ← (A_iℓ A_{i,ℓ+1} ··· A_ir) x. The in parallel do...do... construct indicates that all of the do code blocks may execute in parallel. The procedure CSB_BLOCKV (pseudocode for which can be found in Figure 6) calculates the product of the block and the vector in parallel.
The block-vector multiplication proceeds by recursively dividing the (sub)block M into quadrants M00, M01, M10, and M11, each of which is conveniently stored contiguously in the Z-Morton-ordered val, row_ind, and col_ind arrays between indices start and end. We perform binary searches to find the appropriate dividing points in the array in lines 34–36.
To understand the pseudocode, consider the search for the divid-
ing point s2 between M00 ∪ M01 and M10 ∪ M11. For any recursively
chosen dim×dim matrix M, the column indices and row indices
of all elements have the same leading lg β − lg dim bits. Moreover,
for those elements in M00 ∪ M01, the next bit in the row index is a
0, whereas for those elements in M10 ∪ M11, the next bit in the
row index is 1. The algorithm does a binary search for the point at
which this bit flips. The cases for the dividing point between M00
and M01 or between M10 and M11 are similar, except that we focus on the
column index instead of the row index.
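This bit-flip search can be sketched as follows (a hypothetical toy version with invented names, not the paper's code): `interleave_bits` builds the Z-Morton key described in the text, and `split_top_bottom` binary-searches a Z-Morton-sorted run of row indices for the point where row-index bit lg dim − 1 flips from 0 to 1, assuming dim is a power of two.

```python
def interleave_bits(r, c, nbits):
    """Z-Morton key: interleave the bits of the row and column indices,
    taking the row bit as the more significant of each pair."""
    key = 0
    for b in range(nbits):
        key |= ((c >> b) & 1) << (2 * b)
        key |= ((r >> b) & 1) << (2 * b + 1)
    return key

def split_top_bottom(row_ind, start, end, dim):
    """Smallest index s in [start, end + 1) whose row index has bit
    lg(dim) - 1 set, i.e., the first element of M10 u M11."""
    bit = dim >> 1                  # mask for row-index bit lg(dim) - 1
    lo, hi = start, end + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if row_ind[mid] & bit:      # mid lies in M10 u M11
            hi = mid
        else:                       # mid lies in M00 u M01
            lo = mid + 1
    return lo
```

For a 4×4 submatrix holding the elements (0,0), (1,1), (0,3), (2,0), (3,3), the Z-Morton keys 0, 3, 5, 8, 15 are already sorted, and the top/bottom split falls just before (2,0).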
After dividing the matrix into quadrants, we execute the matrix
products involving matrices M00 and M11 in parallel (lines 37–39),
as they do not conflict on any outputs. After completing these prod-
ucts, we execute the other two matrix products in parallel (lines
40–42).⁵ This procedure resembles a standard parallel divide-and-
conquer matrix multiplication, except that our base case of serial
multiplication starts at a matrix containing Θ(dim) nonzeros (lines
29–32). Note that although we pass the full length-β arrays x and
y to each recursive call, the effective length of each array is halved
implicitly by partitioning M into quadrants. Passing the full arrays
is a technical detail required to properly compute array indices, as
the indices A.row_ind and A.col_ind store offsets within the block.
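A sequential toy version of this quadrant recursion might look as follows (a sketch under invented names, not the paper's CSB_BLOCKV). Unlike CSB, which locates the quadrant boundaries by binary search in the Z-Morton-sorted arrays, this sketch partitions an explicit triple list by scanning, and the two parallel-safe pairs simply run in order.

```python
def blockv(nnz, dim, x, y, r0=0, c0=0):
    """Toy y += M x for a dim x dim (sub)block given as (row, col, value)
    triples with absolute in-block offsets; dim is a power of two."""
    if dim == 1 or len(nnz) <= dim:        # base case: O(dim) nonzeros,
        for (r, c, v) in nnz:              # multiply serially
            y[r] += v * x[c]
        return
    h = dim // 2
    quad = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for (r, c, v) in nnz:                  # partition into quadrants
        quad[((r - r0) // h, (c - c0) // h)].append((r, c, v))
    # M00 and M11 write disjoint halves of y, so this pair is parallel-safe...
    blockv(quad[(0, 0)], h, x, y, r0, c0)
    blockv(quad[(1, 1)], h, x, y, r0 + h, c0 + h)
    # ...as is the pair M01 and M10, run after the first pair completes.
    blockv(quad[(0, 1)], h, x, y, r0, c0 + h)
    blockv(quad[(1, 0)], h, x, y, r0 + h, c0)
```

The pairing mirrors the text: the first pair touches only the top half of y (M00) and the bottom half (M11), so neither races with the other; the second pair likewise.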
The CSB_SPMV_T algorithm is identical to CSB_SPMV, ex-
cept that we operate over blockcolumns rather than blockrows.
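This symmetry can be seen in a hypothetical toy model (invented names, not the paper's code) where blocks are stored in a dict keyed by (blockrow, blockcolumn): computing Ax walks each block's triples with the blockrow selecting the output segment, while Aᵀx reuses the very same triples with the roles of the two coordinates swapped, i.e., it effectively walks blockcolumns.

```python
def spmv(blocks, nblocks, beta, x, transpose=False):
    """Toy y = A x (or y = A^T x) over an nblocks x nblocks grid of
    beta x beta blocks; `blocks` maps (blockrow, blockcol) to in-block
    (row_offset, col_offset, value) triples."""
    y = [0.0] * (nblocks * beta)
    for (bi, bj), triples in blocks.items():
        for (r, c, v) in triples:
            if transpose:
                # block (bi, bj) contributes as its transpose
                y[bj * beta + c] += v * x[bi * beta + r]
            else:
                y[bi * beta + r] += v * x[bj * beta + c]
    return y
```

Because the same block storage serves both products, no transposed copy of the matrix is needed, which is the point of the CSB format.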
⁵ The algorithm may instead do M00 and M10 in parallel followed by M01
and M11 in parallel without affecting the performance analysis. Presenting
the algorithm with two choices may yield better load balance.
References

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press.
Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM.
B. Stroustrup. The C++ Programming Language. Addison-Wesley.
Frequently Asked Questions (13)
Q1. What have the authors contributed in "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks" ?

This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector.

For each element in val, the authors use lg β bits to represent the row index and lg β bits to represent the column index, requiring a total of nnz lg β bits for each of row_ind and col_ind.

If the nonzeros were guaranteed to be distributed evenly among block rows, then the simple blockrow parallelism would yield an efficient algorithm with n/β-way parallelism by simply performing a serial multiplication for each blockrow. 

The Z-Morton ordering on nonzeros in each block is equivalent to first interleaving the bits of row_ind and col_ind, and then sorting the nonzeros using these bit-interleaved values as the keys. 

The current standard storage format for sparse matrices in scientific computing, compressed sparse rows (CSR) [32], is more efficient, because it stores only n + nnz indices or pointers. 

The compressed sparse row (CSR) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row. 

For CSB, the reported mean/max values are obtained by setting the block dimension β to be approximately √n, so that they are comparable with statistics from CSC.

In other words, if max(nnz(A_i)) < 2 · mean(nnz(A_i)), then the matrix is considered to have balanced blockrows and the optimization is applied.

The bitmasks are determined dynamically by the CSB constructor depending on the input matrix and the data type used for storing matrix indices. 

Converting to and from bit-interleaved integers, however, is expensive with current hardware support,6 which would be necessary for the serial base case in lines 29–32. 

This level of parallelization requires care to avoid races, however, because two blocks in the same blockrow write to the same region within the output vector. 

These indices are relative to the block containing the particular element, not the entire matrix, and hence they range from 0 to β−1. 

Although not all work-stealing schedulers are space efficient, those maintaining the busy-leaves property [5] (e.g., as used in the Cilk work-stealing scheduler [4]) are space efficient.