Proceedings ArticleDOI

Parallelization of reordering algorithms for bandwidth and wavefront reduction

TL;DR: This paper presents the first parallel implementations of two widely used reordering algorithms, Reverse Cuthill-McKee (RCM) and Sloan, achieving average speedups of 5.56X over sequential HSL-RCM and 2.88X over sequential HSL-Sloan on 16 cores.
Abstract: Many sparse matrix computations can be speeded up if the matrix is first reordered. Reordering was originally developed for direct methods but it has recently become popular for improving the cache locality of parallel iterative solvers since reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on the average than a state-of-the-art sequential implementation of RCM in the HSL library. Sloan is significantly more constrained than RCM, but our parallel implementation achieves a speedup of 2.88X on the average over sequential HSL-Sloan. Reordering the matrix using our parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.

Summary (4 min read)

Introduction

  • The authors present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan.
  • Reordering the matrix using their parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
  • In all these applications, reordering is performed sequentially even if the sparse matrix computation is performed in parallel.
  • Since reordering strategies like RCM and Sloan are heuristics, the authors allow their parallel implementation of a reordering algorithm to produce a reordering that may differ slightly from the one produced by the sequential algorithm, if this improves parallel performance.

II. BACKGROUND

  • The bandwidth of a matrix is the maximum row width, where the width of row i is defined as the difference between the column index of the first and the last non-zero elements in row i.
  • If the matrix is symmetric, then the bandwidth is the semi-bandwidth of the matrix, which is the maximum distance from the diagonal.
  • While few matrices are banded by default, in many cases the bandwidth of a matrix can be reduced by permuting or renumbering its rows and columns.
  • Reducing bandwidth is usually applied as a preprocessing step for sparse matrix-vector multiplication [39] and some preconditioners for iterative methods [7].
  • Then, the authors describe breadth-first search (BFS) (Section II-B), an important subroutine to both algorithms.

A. Galois Programming Model

  • The Galois system is a library and runtime system to facilitate parallelization of irregular algorithms, in particular, graph algorithms [15, 27].
  • The Galois system adds two concepts to a traditional sequential programming model: ordered and unordered set iterators.
  • New elements added to the set during iteration will be processed before the loop finishes.
  • An ordered set iterator is like an unordered set iterator with an additional restriction: the serialization of iterations must be consistent with a user-defined ordering relation on iterations.
  • In the following sections, the authors introduce algorithms using ordered set iterators for simplicity but their parallelizations in Section III reformulate the algorithms into unordered forms for performance.

E. Choosing Source Nodes

  • Empirical data shows that the quality of reordering algorithms is highly influenced by the nodes chosen as the source for BFS and RCM or the source and end for Sloan [14, 41, 43].
  • For this work, the authors use the algorithm described by Kumfert [29] that computes a pair of nodes that lie on the pseudo-diameter, called pseudo-peripheral nodes.
  • The diameter of a graph is the maximum distance between any two nodes.
  • Pseudo-peripheral nodes, therefore, are a pair of nodes that are “far away” from each other in the graph.
  • For BFS and RCM reordering, the authors pick one element from the pair to be the source.

A. BFS

  • As mentioned in Section II-B, a possible parallel implementation is the ordered BFS algorithm (see Algorithm 1).
  • The authors describe how to take the output of the unordered BFS algorithm and generate, in parallel, a permutation by choosing one that is consistent with the levels generated by the unordered BFS.
  • In their algorithm, the authors compute the histograms locally for each thread and then sum them together.
  • Calculating prefix sums in parallel is well-known.
  • Nodes are then placed in the permutation array by dividing the nodes between threads and setting each node’s position in the permutation to the next free position for the node’s level.

IV. EXPERIMENTAL RESULTS

  • The authors compare their leveled and unordered RCM and Sloan algorithms with well-known third-party implementations from the HSL mathematical software library [23].
  • Section IV-C evaluates both execution time and reordering quality across a suite of sparse matrices.
  • The selected matrices are shown in Table I.
  • All matrices are square and have only one strongly connected component.

A. Methodology

  • The authors evaluate four parallel reordering algorithms: BFS in Algorithm 4 using the unordered BFS described in Section II-B; two RCM algorithms, the leveled RCM in Section III-B1 and the unordered RCM in Section III-B2; and Sloan in Section III-C. They compare against HSL, a collection of Fortran codes for large-scale scientific computations.
  • To compute the source nodes for reordering, the authors use the algorithm by Kumfert [29] described in Section II-E.
  • The authors noticed that the source nodes produced by the different pseudo-diameter implementations are usually different.
  • Each node of the cluster runs CentOS Linux, version 6.3, and consists of two Xeon processors with 8 cores each, operating at 2.7GHz.

B. Reordering Quality

  • Tables II and III show the bandwidth and root mean square (RMS) of the wavefront, respectively.
  • For each metric, Tables II and III show the initial matrix and the values after applying the permutation.
  • Table II shows that HSL’s sequential RCM and their parallel RCM produce very similar bandwidth numbers.
  • The BFS reordering generally produces worse bandwidth, and the standard deviation shows that the variation due to the non-deterministic results is usually not significant.
  • Table III shows RMS wavefront for their Sloan implementation when running in parallel.

C. Reordering Performance

  • The authors compare the execution times of different reordering algorithms.
  • The other matrices have fewer nonzeros per row (less than 12 in all cases).
  • Thus, one reason for the better performance of unordered RCM is that the heuristics used in RCM naturally lead to BFS with a large number of levels, which favors unordered traversal over ordered ones.
  • The “only reordering” column in Table IV shows how much time is spent in each HSL program excluding the computation of these nodes, and Figure 2 shows the speedup of the different algorithms when only the execution time of the reordering algorithm itself is considered.
  • The speedups with the pseudo-diameter computation are 4.12X, 3.89X and 2.74X respectively.

D. End-to-end Performance

  • The ultimate goal of reordering is to improve the performance of subsequent matrix operations.
  • Therefore, in this section, the authors show how local reordering will affect the local part of the SpMV computation.
  • Table VI shows the times to run a hundred (100) SpMV iterations with matrices obtained by the different reorderings, using the implementation in the PETSc library and utilizing the 16 cores of a cluster node.
  • Using their parallel RCM reordering, the time to perform 100 SpMV iterations is reduced by 1.5X compared to performing the SpMV iterations without reordering, and it is reduced by 2X compared to using HSL RCM for reordering.
  • To quantify this improvement in cache reuse, the authors measured the number of cache misses for SpMV using the matrices obtained with the different reorderings.

VI. CONCLUSIONS

  • Many sparse matrix computations benefit from reordering the sparse matrix before performing the computation.
  • To their knowledge, these are the first such parallel implementations.
  • Since both these reorderings are heuristics, the authors allow the parallel implementations to produce slightly different reorderings than those produced by the sequential implementations of these algorithms.
  • Since this can affect the SpMV time, the impact on the overall time to solution has to be determined experimentally.
  • Reordering the matrix using their parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.


Parallelization of Reordering Algorithms
for Bandwidth and Wavefront Reduction
Konstantinos I. Karantasis, Andrew Lenharth, Donald Nguyen, María J. Garzarán, Keshav Pingali
Department of Computer Science,
University of Illinois at Urbana-Champaign
{kik, garzaran}@illinois.edu
Institute for Computational Engineering and Sciences and
Department of Computer Science,
University of Texas at Austin
lenharth@ices.utexas.edu, {ddn, pingali}@cs.utexas.edu
Abstract—Many sparse matrix computations can be speeded
up if the matrix is first reordered. Reordering was originally
developed for direct methods but it has recently become popular
for improving the cache locality of parallel iterative solvers
since reordering the matrix to reduce bandwidth and wavefront
can improve the locality of reference of sparse matrix-vector
multiplication (SpMV), the key kernel in iterative solvers.
In this paper, we present the first parallel implementations of
two widely used reordering algorithms: Reverse Cuthill-McKee
(RCM) and Sloan. On 16 cores of the Stampede supercomputer,
our parallel RCM is 5.56 times faster on the average than a
state-of-the-art sequential implementation of RCM in the HSL
library. Sloan is significantly more constrained than RCM, but
our parallel implementation achieves a speedup of 2.88X on the
average over sequential HSL-Sloan. Reordering the matrix using
our parallel RCM and then performing 100 SpMV iterations is
twice as fast as using HSL-RCM and then performing the SpMV
iterations; it is also 1.5 times faster than performing the SpMV
iterations without reordering the matrix.
I. INTRODUCTION
Many sparse matrix computations can be speeded up by
reordering the sparse matrix before performing the matrix
computations. The classical examples are direct methods for
solving sparse linear systems: in a sequential implementation,
reordering the matrix before factorization can reduce fill,
thereby reducing the space and time required to perform the
factorization [14, 16]. In parallel implementations of direct
methods, reordering can reduce the length of the critical
path through the program, thereby improving parallel perfor-
mance [11]. Reordering can also reduce the amount of storage
required to represent the matrix when certain formats like
banded and skyline representations are used [16].
In most applications, computing the optimal reordering is
NP-complete; for example, Papadimitriou showed that reorder-
ing to minimize bandwidth is NP-complete [38]. Therefore,
heuristic reordering algorithms are used in practice, such
as minimum-degree, Cuthill-McKee (CM), Reverse Cuthill-
McKee (RCM), Sloan, and nested dissection orderings [12, 16,
26, 33, 43]. Some of these reordering methods are described
in more detail in Section II.
More recently, reordering has become popular even in the
context of iterative sparse solvers where problems like mini-
mizing fill do not arise. The key computation in an iterative
sparse solver is sparse matrix-vector multiplication (SpMV)
(say y = Ax). If the matrix is stored in compressed row-
storage (CRS) and the SpMV computation is performed by
rows, the accesses to y and A enjoy excellent locality, but the
accesses to x may not. One way to improve the locality of
accesses to the elements of x is to reorder the sparse matrix
A using a bandwidth-reducing ordering (RCM is popular). In
this context, the purpose of a reordering technique like RCM
is to promote cache reuse while a single core (or a set of
cores cooperatively sharing some level of cache) performs the
computation.
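
For concreteness, here is a minimal CRS SpMV kernel (an illustrative sketch, not the PETSc kernel used in the experiments of this paper). The accesses to y and to the matrix arrays are sequential, while x is accessed through the column indices; that is the access stream bandwidth reduction tries to localize:

// Illustrative CRS SpMV kernel: y = A*x.
// rowptr has n+1 entries; col/val hold each row's nonzeros contiguously.
void spmv_crs(int n, const int* rowptr, const int* col,
              const double* val, const double* x, double* y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            sum += val[k] * x[col[k]];  // x is accessed through col[k];
                                        // its locality depends on the ordering
        y[i] = sum;
    }
}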
When a large cluster is used to perform the SpMV computa-
tion, the matrix is either assembled in a distributed manner [17]
or a partitioner like ParMetis is used to partition the matrix
among the hosts, and each partition is then reordered using a
technique like RCM to promote locality. This use of reordering
was first popularized in the Gordon Bell prize-winning paper
at SC1999 by Gropp et al. [2, 20]. Local reordering with
algorithms like RCM is now used widely in parallel iterative
solvers.
In all these applications, reordering is performed sequen-
tially even if the sparse matrix computation is performed in
parallel. In the traditional use-case of reordering for sparse
direct methods, this is reasonable because reordering takes
a very small fraction of the overall execution time, which
is dominated by the cost of numerical factorization [16].
However, when reordering is used to improve cache locality
for sparse iterative methods, reordering time can be roughly
the same order of magnitude as the SpMV computation.
Therefore, it is useful to parallelize the reordering algorithms
too.
Parallelizing reordering algorithms like RCM or Sloan is
very challenging because, although there is parallelism in these
algorithms, the parallelism does not fall into the simple data-
parallelism pattern that is supported by existing frameworks
such as OpenMP and MPI. In fact, these algorithms belong to

a complex class of algorithms known as irregular algorithms
in the compiler literature [40]. These algorithms exhibit a
complex pattern of parallelism known as amorphous data-
parallelism, which must be found and exploited at runtime
because dependences between parallel computations are func-
tions of runtime data values. These complexities have deterred
previous researchers from attempting to parallelize reordering
algorithms like RCM.
In this paper, we describe the first ever parallelizations of
two popular reordering algorithms, Reverse Cuthill-McKee
(RCM) [12, 16] and Sloan [43]. Both are based on performing
traversals over the graph representation of a sparse matrix. We
use the Galois system to reduce the programming burden [15].
Since reordering strategies like RCM and Sloan are heuristics,
we allow our parallel implementation of a reordering algorithm
to produce a reordering that may differ slightly from the one
produced by the sequential algorithm, if this improves parallel
performance. Since this may impact the SpMV performance,
this flexibility has to be studied experimentally to determine
the impact of parallel reordering on the overall execution time.
Because we are interested only in reordering matrices within
a single host to improve the cache performance of SpMV,
we restrict ourselves to single host studies. Our parallel
RCM implementation obtains similar quality results as those
from a state-of-the-art sequential implementation in the HSL
library [23] and achieves speedups ranging from 3.18X to
8.92X on 16 cores with an average improvement of 5.56X
across a suite of sparse matrices. Reordering the matrix using
our parallel RCM and then performing 100 SpMV iterations
is twice as fast as using HSL-RCM and then performing the
SpMV iterations; it is also 1.5 times faster than performing
the SpMV iterations without reordering the matrix.
The rest of this paper is organized as follows. Section II
presents the background on the reordering mechanisms and
Galois, the programming model that we use to implement
our parallel algorithms. Section III describes the parallel
implementations of BFS, RCM, and Sloan. Section IV presents
the experimental results. Section V discusses the related work,
and Section VI concludes.
II. BACKGROUND
The bandwidth of a matrix is the maximum row width,
where the width of row i is defined as the difference between
the column index of the first and the last non-zero elements
in row i. If the matrix is symmetric, then the bandwidth is the
semi-bandwidth of the matrix, which is the maximum distance
from the diagonal. In a matrix with bandwidth k, all the non-
zeros are at most k entries from the diagonal.
Reordering mechanisms can be applied to reduce the band-
width of a matrix. While few matrices are banded by default,
in many cases the bandwidth of a matrix can be reduced by
permuting or renumbering its rows and columns. The original
structure of the matrix can be recovered by inverting the
permutation.
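
To make this concrete, the sketch below applies a symmetric permutation A' = P*A*P^T to a matrix in CRS form, where perm[i] is the new index of old row/column i; applying the inverse permutation recovers the original structure. This is an illustrative sketch, not code from the systems evaluated in this paper:

#include <algorithm>
#include <utility>
#include <vector>

// Apply a symmetric permutation to a CRS matrix; perm[i] is the new
// index of old row/column i. Illustrative sketch only.
void permute_crs(int n,
                 const std::vector<int>& rowptr, const std::vector<int>& col,
                 const std::vector<double>& val, const std::vector<int>& perm,
                 std::vector<int>& rowptr2, std::vector<int>& col2,
                 std::vector<double>& val2) {
    std::vector<int> inv(n);                  // inv[p] = old index of new row p
    for (int i = 0; i < n; ++i) inv[perm[i]] = i;

    rowptr2.assign(n + 1, 0);                 // row lengths are preserved
    for (int p = 0; p < n; ++p)
        rowptr2[p + 1] = rowptr2[p] + (rowptr[inv[p] + 1] - rowptr[inv[p]]);

    col2.resize(col.size());
    val2.resize(val.size());
    for (int p = 0; p < n; ++p) {
        int i = inv[p];
        std::vector<std::pair<int, double>> row;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            row.emplace_back(perm[col[k]], val[k]);  // renumber columns too
        std::sort(row.begin(), row.end());           // keep columns sorted
        int out = rowptr2[p];
        for (const auto& e : row) {
            col2[out] = e.first;
            val2[out] = e.second;
            ++out;
        }
    }
}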
Reducing bandwidth is usually applied as a preprocessing
step for sparse matrix-vector multiplication [39] and some
preconditioners for iterative methods [7]. Additionally, gather-
ing or clustering non-zeros around the diagonal can be useful
in graph visualization [24, 36]. However, for direct methods
such as Cholesky or LU factorization, bandwidth reduction
may actually hurt performance since it can increase fill. Some
reordering strategies for direct methods try to reduce the
wavefront of the matrix, where the wavefront for an index
i of a matrix is defined as the set of rows that have non-zeros
in the submatrix consisting of the first i columns of the matrix
and rows i to N [28].
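
Both metrics can be computed in one sweep over a CRS matrix. The sketch below is an illustrative reading of the definitions above, not code from this paper's evaluation; it assumes a structurally symmetric matrix in which every row has an entry in the lower triangle (e.g., a nonzero diagonal). Under that assumption, row j belongs to the wavefront at every index i with low(j) <= i <= j, where low(j) is the smallest column index in row j, so a difference array yields all wavefront sizes:

#include <algorithm>
#include <cmath>
#include <vector>

// Semi-bandwidth and RMS wavefront of a structurally symmetric CRS matrix.
// Assumes every row j has a nonzero with column index <= j.
void matrix_metrics(int n, const std::vector<int>& rowptr,
                    const std::vector<int>& col,
                    int& bandwidth, double& rms_wavefront) {
    bandwidth = 0;
    std::vector<int> diff(n + 1, 0);          // difference array for wf(i)
    for (int j = 0; j < n; ++j) {
        int low = j;
        for (int k = rowptr[j]; k < rowptr[j + 1]; ++k)
            low = std::min(low, col[k]);
        bandwidth = std::max(bandwidth, j - low);  // max distance from diagonal
        diff[low] += 1;                       // row j is active for i in [low, j]
        diff[j + 1] -= 1;
    }
    double sum_sq = 0.0;
    int wf = 0;
    for (int i = 0; i < n; ++i) {
        wf += diff[i];                        // wf(i) = |{ j : low(j) <= i <= j }|
        sum_sq += double(wf) * double(wf);
    }
    rms_wavefront = std::sqrt(sum_sq / n);
}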
In this paper, we consider parallelizations of two popular
reordering algorithms, Reverse Cuthill-McKee (RCM) [12],
which attempts to reduce the bandwidth of symmetric matri-
ces, and Sloan [43, 44], which tries to minimize the wavefront.
Next, we describe Galois, the library and runtime system that
we use to implement our algorithms (Section II-A). Then, we
describe breadth-first search (BFS) (Section II-B), an impor-
tant subroutine to both algorithms. In the subsequent sections,
we describe the sequential versions of RCM (Section II-C)
and Sloan (Section II-D).
A. Galois Programming Model
The Galois system is a library and runtime system to
facilitate parallelization of irregular algorithms, in particular,
graph algorithms [15, 27]. The Galois system adds two con-
cepts to a traditional sequential programming model: ordered
and unordered set iterators. Set iterators are indicated by the
foreach construct in our figures. Set iterators in Galois differ
from conventional set iterators in that new elements can be
added to a set while it is being iterated over. Apart from that,
an unordered set iterator behaves like an iterator in traditional
programming languages except that there is no specific order
that elements of a set are traversed and that iterations can add
new elements to the set during traversal. The new elements
will be processed before the loop finishes.
An unordered set iteration can be parallelized by process-
ing multiple iterations at once and checking if concurrent
iterations access the same data. Conflicting iterations can be
rolled back and tried again later. This process will generate
some serialization of iterations, which is a correct execution
according to the definition of the unordered set iterator. There
are various optimizations that the Galois system applies to
reduce the overhead of speculative parallelization, and in
practice, it is able to achieve good speedup for many irregular
algorithms [40].
When iterations of an unordered loop statically have no
dependences on each other nor are any new elements added,
the loop is equivalent to parallel DOALL.
An ordered set iterator is like an unordered set iterator with
an additional restriction: the serialization of iterations must be
consistent with a user-defined ordering relation on iterations.
An ordered set iterator is a generalization of thread-level
speculation or DOACROSS loops [37]. In those approaches,
the ordering relation is the sequential iteration order of a
loop. An ordered set iterator, on the other hand, allows
for user-defined ordering. Supporting the general case can

Algorithm 1 Ordered BFS Algorithm
1  Graph G = input() // Read in graph
2  Worklist wl = { source }
3  source.level = 0, nextId = 0
4  P[nextId++] = source
5  foreach (Node n: wl) orderedby (n.level) {
6    for (Node v: G.neighbors(n)) {
7      if (v.level > n.level + 1) {
8        v.level = n.level + 1
9        P[nextId++] = v
10       wl.push(v)
11 } } }
have significant runtime and synchronization overhead. One
important optimization is to break down ordered set iterators
into sequences of unordered set iterators when possible. In the
following sections, we introduce algorithms using ordered set
iterators for simplicity but our parallelizations in Section III
reformulate the algorithms into unordered forms for perfor-
mance.
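
To make the iterator semantics concrete, the following self-contained sketch models an unordered set iterator sequentially: the loop body may push new elements, and every pushed element is processed before the loop terminates, in no specified order. This models only the semantics; it is not the Galois API, whose parallel implementation adds the speculation and conflict detection described above:

#include <deque>
#include <iostream>

// Sequential model of an unordered set iterator: the body may add new
// elements, and every added element is processed before the loop ends.
// The processing order is unspecified; FIFO is used here for simplicity.
template <typename T, typename Body>
void for_each_unordered(std::deque<T> worklist, Body body) {
    while (!worklist.empty()) {
        T item = worklist.front();
        worklist.pop_front();
        body(item, [&](T v) { worklist.push_back(v); });  // the "push" hook
    }
}

int main() {
    // Count down from each seed; the body pushes new work as it runs.
    for_each_unordered<int>({3, 5}, [](int n, auto push) {
        std::cout << n << ' ';
        if (n > 0) push(n - 1);
    });
}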
B. Breadth-First Search
Breadth-first search (BFS) is a common primitive in reorder-
ing algorithms, where it is used either as a subroutine or as
an algorithmic skeleton for more complicated graph traversals.
Algorithm 1 shows the algorithm as it would be formulated
in the Galois system using an ordered set iterator. We call
this the ordered BFS algorithm. The body of the outer loop is
the operator. It is applied to nodes in a worklist. Initially, the
worklist contains a single element, the source node from which
to start the BFS. Each node has a level value that indicates
its distance from the source. Level values are initialized with
a value that is greater than any other level value. Worklist
elements are processed in ascending level order. The BFS
parent of a node n is the neighbor of n that first updated
the level of n. The source node of the BFS has no parent.
BFS itself can be used as a reordering algorithm, in which
case, the permutation it generates is simply the order that
nodes are visited. In the algorithm, this is recorded in the
array P .
The ordering relation in Algorithm 1 has two important
properties: (i) level values are generated and processed in
monotonically increasing order, and (ii) a node with level
l adds only nodes with level l + 1 to the worklist. This order
produces a work-efficient BFS implementation but constrains
parallelism because, although all the nodes at level l can
be processed simultaneously, no node at level l + 1 can be
processed until all the nodes at level l are done. Barriers are
often used to synchronize threads between levels.
BFS can also be formulated as a fixed-point algorithm, in
which operations can be done in any order corresponding to a
chaotic relaxation scheme [9], rather than the order specified
on line 5. This unordered parallelization of BFS can be derived
by noticing that the BFS level of a node is a local minimum
in the graph, i.e., the level of a node (except for the root)
is one more than the minimum level of its neighbors. The
body of the loop, sometimes called a node relaxation in the
literature, will only lower the distance on a node as new paths
Algorithm 2 Serial Cuthill-McKee Algorithm
1  Graph G = input() // Read in graph
2  List cur = { source }, next
3  source.level = 0, nextId = 0, index = 0
4  P[nextId++] = source
5  while (!cur.empty()) {
6    for (Node n: cur) {
7      for (Node v: G.neighbors(n)) {
8        if (v.level > n.level + 1) {
9          v.level = n.level + 1
10         next.push(v)
11     } }
12     sort(next[index:end])
13     index += next.size()
14   }
15
16   for (Node n: next)
17     P[nextId++] = n
18   cur.clear()
19   swap(cur, next)
20 }
are discovered from the source to the node. Applying node
relaxations eventually reaches a fixed-point with correct BFS
levels regardless of the order in which relaxations are applied.
We call this implementation unordered BFS, and the pseudo-
code for unordered BFS is the same as the one in Algorithm 1,
but without the orderedby clause in line 5.
A difference between the ordered algorithm and the un-
ordered one is that with unordered BFS a node can be added
to the wl set (line 10) several times, and thus the level
of a node may be updated multiple times, but correct final
level values can only be determined after all the iterations
have completed. For that reason, with unordered BFS it is
not possible to incrementally produce the permutation array
(line 9) in Algorithm 1. If the permutation array is needed, a
separate pass must be performed with the final level values.
Unordered BFS is used as a common building block in our
parallel reordering algorithms because it scales better than the
ordered algorithm, which must ensure that nodes at level l + 1
are processed only after all nodes of level l. On the other
hand, the unordered algorithm can lead to processing a node
more than once. The extra iterations can be reduced by careful
scheduling, which attempts to follow the scheduling of the
ordered algorithm, but unlike the ordered algorithm, allows
for some deviation in exchange for reduced synchronization.
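
A minimal shared-memory rendition of the node-relaxation idea is sketched below: levels are lowered with an atomic compare-and-swap, and a node is re-pushed whenever its level drops. This is an illustration of the fixed-point formulation, not the Galois implementation used in this paper; it uses a single worklist drained by one thread, whereas a parallel run would substitute a concurrent worklist:

#include <atomic>
#include <climits>
#include <vector>

// Chaotic-relaxation BFS: levels converge to shortest distances regardless
// of the order in which relaxations are applied.
void unordered_bfs(const std::vector<std::vector<int>>& adj, int source,
                   std::vector<std::atomic<int>>& level) {
    for (auto& l : level) l.store(INT_MAX);
    level[source].store(0);
    std::vector<int> worklist = {source};
    while (!worklist.empty()) {
        int n = worklist.back();
        worklist.pop_back();
        int ln = level[n].load();
        for (int v : adj[n]) {
            int lv = level[v].load();
            while (ln + 1 < lv) {               // path through n is shorter
                if (level[v].compare_exchange_weak(lv, ln + 1)) {
                    worklist.push_back(v);      // level dropped: revisit v
                    break;
                }
                // CAS failure reloaded lv; the loop re-tests the condition
            }
        }
    }
}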
C. Cuthill-McKee
The Cuthill-McKee (CM) [12] algorithm uses a refinement
of BFS to produce a permutation. In addition to ordering
nodes by level, nodes with the same level are visited in
order of their BFS parent’s placement in the previous level
(choosing the earliest parent in case multiple neighbors are
in the previous level) and in ascending degree order (or
descending degree order for Reverse Cuthill-McKee) for nodes
with the same earliest BFS parent. Reverse Cuthill-McKee
(RCM) uses the opposite ordering relation so nodes are visited
in descending degree order. Since it has been proven that
RCM produces permutations of better or equal quality to the
original Cuthill-McKee [34], we will mainly be concerned
with Reverse Cuthill-McKee (RCM) in this paper, although

[Figure omitted: a BFS tree rooted at a source node with two children (nodes 1 and 2), whose children in turn are nodes 3-6.]
Fig. 1. Example of Cuthill-McKee
similar techniques can be applied for Cuthill-McKee.
Since RCM only adds ordering constraints to BFS, the
traversals and permutations generated can be viewed as choos-
ing a specific traversal out of the many possible BFS traversals.
Algorithm 2 gives a serial implementation. Lines 5 to 11
implement a serial BFS. However, RCM requires sorting the
nodes by degree (line 12) and inserts nodes in the permutation
array as well as in the worklist for the next level in that order
(lines 16 to 17).
There are some issues in parallelizing RCM compared to the
simpler parallelization of BFS. Consider the graph in Figure 1.
After executing the first iteration of the serial RCM algorithm,
the worklist contains nodes 1 and 2. In BFS, these nodes can
be processed in any order, as long as none of the successors
of nodes 1 and 2, namely nodes 3, 4, 5 and 6, are processed
before any of their parents (i.e., nodes 1 and 2). However,
in RCM, node 1 should be processed before node 2 because
it has smaller degree, and since processing a node also adds
nodes to be processed in the next level, the children of node
1 (nodes 3 and 4) should be processed before the children of
node 2 (nodes 5 and 6) in the next level.
D. Sloan algorithm
The Sloan algorithm [43, 44] is a reordering algorithm that
attempts to reduce the wavefront of a graph. It is based on
a prioritized graph traversal; nodes are visited in descending
priority order. For a given start and end node e, the priority
of a node i is the weighted sum: P (i) = W
1
incr(i) +
W
2
dist(i, e) where incr(i) is the increase of the size of
the wavefront if node i were chosen next and dist(i, e) is
the BFS distance from node i to the end node. The value of
dist(i, e) is fixed for a given graph and nodes i and e, but
the value of incr(i) changes dynamically as the permutation
is constructed.
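
In code, the priority is a simple two-term weighted sum: dist(i, e) is precomputed by one BFS from the end node, while incr(i) must be kept up to date as nodes are numbered. The weights below are illustrative placeholders (Sloan's paper and HSL use specific weight pairs). Note the negative weight on incr: a node that would grow the wavefront gets a lower priority:

// Sloan priority of node i; larger is better. dist_i_to_end comes from a
// BFS rooted at the end node; incr_i is maintained dynamically.
// W1 and W2 are placeholder weights for illustration only.
const int W1 = 1, W2 = 2;

int sloan_priority(int incr_i, int dist_i_to_end) {
    return -W1 * incr_i + W2 * dist_i_to_end;
}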
Algorithm 3 gives a sketch of the Sloan algorithm and shows
how incr(i) is updated. At each step, a node in the graph
can be in one of these four states: (i) numbered; (ii) active,
a non-numbered node that is a neighbor of a numbered node;
(iii) preactive, a non-numbered and non-active node that is a
neighbor of an active node; and (iv) inactive, all other nodes.
Initially the source node is preactive and all other nodes are
inactive. The algorithm iterates through all the nodes in the
graph and at each step it chooses among the active or preactive
Algorithm 3 Sloan Algorithm
1  Graph G = input() // Read in graph
2  Worklist wl = { source }
3  nextId = 0
4  P[nextId++] = source
5  BFS(G, end) // Compute distances from end node
6
7  foreach (Node n: wl) orderedby (n.priority) {
8    for (Node v: G.neighbors(n)) {
9      if (n.status == preactive
10        && (v.status == inactive
11        || v.status == preactive)) {
12       update(v.priority)
13       v.status = active
14       updateFarNeighbors(v, wl)
15     }
16     else if (n.status == preactive
17        && v.status == active) {
18       update(v.priority)
19     }
20     else if (n.status == active
21        && v.status == preactive) {
22       update(v.priority)
23       v.status = active
24       updateFarNeighbors(v, wl)
25     }
26   }
27   P[nextId++] = n
28   n.status = numbered
29 }
30
31 void updateFarNeighbors(Node v, Worklist wl) {
32   for (Node u: G.neighbors(v)) {
33     if (u.status == inactive) {
34       u.status = preactive
35       wl.push(u)
36     }
37     update(u.priority)
38   }
39 }
nodes the one that maximizes the priority. New priorities are
assigned to the neighbors and the neighbors of the neighbors
of the node selected. Details about how to compute the new
priority of a node are not discussed as they are not important
for the parallelization of the algorithm, but they can be found
in [43, 44].
E. Choosing Source Nodes
Empirical data shows that the quality of reordering algo-
rithms is highly influenced by the nodes chosen as the source
for BFS and RCM or the source and end for Sloan [14, 41, 43].
For this work, we use the algorithm described by Kumfert [29]
that computes a pair of nodes that lie on the pseudo-diameter,
called pseudo-peripheral nodes. The diameter of a graph is
the maximum distance between any two nodes. The pseudo-
diameter is a lower bound on the diameter that is simpler
to compute. Pseudo-peripheral nodes, therefore, are a pair of
nodes that are “far away” from each other in the graph. For
BFS and RCM reordering, we pick one element from the pair
to be the source.
The main computation in computing the pseudo-diameter
is performing multiple breadth-first searches, which can be
parallelized using the ordered or unordered algorithm. In the
rest of the paper, we refer to the procedure for selecting source
nodes as the pseudo-diameter computation.
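
A common way to approximate a pseudo-peripheral pair, in the spirit of the Gibbs-Poole-Stockmeyer heuristic (Kumfert's variant [29] differs in details), is the alternating-BFS iteration sketched below: BFS from a node, jump to a farthest node of minimum degree, and repeat while the eccentricity keeps growing. This is an illustrative sketch, not the exact procedure used in this paper:

#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// BFS from src: fills level[] with distances, returns the maximum level.
static int bfs_levels(const std::vector<std::vector<int>>& adj, int src,
                      std::vector<int>& level) {
    std::fill(level.begin(), level.end(), -1);
    std::queue<int> q;
    level[src] = 0;
    q.push(src);
    int maxl = 0;
    while (!q.empty()) {
        int n = q.front(); q.pop();
        for (int v : adj[n])
            if (level[v] < 0) {
                level[v] = level[n] + 1;
                maxl = std::max(maxl, level[v]);
                q.push(v);
            }
    }
    return maxl;
}

// Heuristic pseudo-peripheral pair: alternate BFS between endpoints until
// the eccentricity stops increasing. Returns a (source, end) pair.
std::pair<int, int> pseudo_diameter(const std::vector<std::vector<int>>& adj,
                                    int start) {
    std::vector<int> level(adj.size());
    int src = start, ecc = bfs_levels(adj, src, level);
    for (;;) {
        int best = -1;  // among nodes in the last level, pick minimum degree
        for (int v = 0; v < (int)adj.size(); ++v)
            if (level[v] == ecc &&
                (best < 0 || adj[v].size() < adj[best].size()))
                best = v;
        int ecc2 = bfs_levels(adj, best, level);
        if (ecc2 <= ecc) return {src, best};  // converged
        src = best;                           // otherwise restart from 'best'
        ecc = ecc2;
    }
}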

III. PARALLEL REORDERING
In this section, we describe parallel implementations of
BFS reordering (Section III-A), Reverse Cuthill-McKee (Sec-
tion III-B), and Sloan (Section III-C).
A. BFS
A simple reordering algorithm is to use the traversal order
generated by BFS. As mentioned in Section II-B, a possible
parallel implementation is the ordered BFS algorithm (see
Algorithm 1). Another implementation is to use the unordered
BFS algorithm described in Section II-B. The unordered algo-
rithm has the benefit of not requiring serialization of iterations
like Algorithm 1, but it cannot be used directly as a reordering
algorithm because it does not generate a permutation.
Instead, it only generates the final BFS levels for each node.
In this section, we describe how to take the output of
the unordered BFS algorithm and generate, in parallel, a
permutation by choosing one that is consistent with the levels
generated by the unordered BFS.
Algorithm 4 gives the general structure. There are four
major steps:
1) Compute the levels for each node using the unordered
BFS algorithm.
2) Count the number of nodes at each level. In our algo-
rithm, we compute the histograms locally for each thread
and then sum them together.
3) Compute the prefix sum of the final histogram. The
prefix sum gives the beginning offset in the final per-
mutation array for nodes in any given level. Calculating
prefix sums in parallel is well-known; a sketch follows this list.
4) Finally, place nodes in the permutation array. This is
done by dividing the nodes between threads and setting
each node’s position in the permutation to the next free
position for the node’s level. The next free position
is computed from the current value in the prefix sum
array for the level, and it is reserved by atomically
incrementing that value. The value of sums(l) is the next
location in the permutation array to be used by a node
of level l.
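
Step 3 relies on a parallel prefix sum, for which the standard two-phase blocked formulation sketched below suffices: each thread sums its block, a short sequential pass turns the block totals into block offsets, and each thread then scans its own block. This generic std::thread sketch is for illustration; it is not the implementation evaluated in this paper:

#include <algorithm>
#include <thread>
#include <vector>

// Two-phase blocked parallel inclusive prefix sum.
std::vector<long> parallel_prefix_sum(const std::vector<long>& in,
                                      int nthreads) {
    size_t n = in.size();
    std::vector<long> out(n), block_sum(nthreads, 0);
    size_t chunk = (n + nthreads - 1) / nthreads;

    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)        // phase 1: per-block totals
        ts.emplace_back([&, t] {
            size_t lo = t * chunk, hi = std::min(n, lo + chunk);
            long s = 0;
            for (size_t i = lo; i < hi; ++i) s += in[i];
            block_sum[t] = s;
        });
    for (auto& th : ts) th.join();
    ts.clear();

    std::vector<long> offset(nthreads, 0);    // phase 2: sequential scan of
    for (int t = 1; t < nthreads; ++t)        // the (few) block totals
        offset[t] = offset[t - 1] + block_sum[t - 1];

    for (int t = 0; t < nthreads; ++t)        // phase 3: per-block scan
        ts.emplace_back([&, t] {
            size_t lo = t * chunk, hi = std::min(n, lo + chunk);
            long s = offset[t];
            for (size_t i = lo; i < hi; ++i) { s += in[i]; out[i] = s; }
        });
    for (auto& th : ts) th.join();
    return out;
}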
This algorithm is non-deterministic. It will always produce
a permutation that is consistent with a BFS ordering of nodes,
but the specific serialization will depend on the number of
threads, the layout of the graph in memory (e.g. the iteration
order over the graph), and the interleaving of the atomic
increments.
B. Reverse Cuthill-McKee
The RCM algorithm, like BFS, depends on the construction
of a level structure to produce a permutation, but it is more
challenging to parallelize because it places additional restric-
tions on valid permutations. We present two parallelizations
of RCM, which differ in how they surmount this challenge. The
first algorithm uses an incremental approach; at each major
step of the algorithm, the permutation for the nodes seen so far
is calculated and does not change in subsequent steps. We call
this the leveled RCM algorithm. The second algorithm uses an
Algorithm 4 BFS Reordering
1  Graph G = input() // Read in graph
2  UnorderedBFS(G, source)
3  counts = count(G)
4  sums[1:] = prefixSum(counts)
5  sums[0] = 0
6  place(G, sums)
7
8  Array count(Graph G) {
9    foreach (Node n: G) {
10     ++local_count[thread_id][n.level];
11     local_max[thread_id] =
12       max(local_max[thread_id], n.level)
13   }
14   for (int id: threads) {
15     max_level = max(max_level, local_max[id])
16   }
17   for (int l: 0:max_level) {
18     for (int id: threads) {
19       counts[l] += local_count[id][l]
20     }
21   }
22   return counts
23 }
24
25 void place(Graph G, Array sums) {
26   foreach (Node n: G) {
27     slot = atomic_inc(sums[n.level]) - 1
28     P[slot] = n
29   }
30 }
a posteriori approach that builds an RCM-valid permutation
right after a complete level structure is computed. We call this
the unordered RCM algorithm.
1) Leveled Reverse Cuthill-McKee: The leveled algorithm
follows the general structure of the serial RCM Algorithm 2.
It proceeds in iterations, where each iteration expands the
frontier of the graph to the next level. The leveled algorithm
parallelizes the processing of nodes per level. Each iteration
consists of a sequence of steps, where each step executes in
parallel with an implicit barrier between steps, and where each
step executes as an unordered set iterator. When an iteration
completes, nodes in a level have been conclusively added
to the RCM permutation array. The algorithm is shown in
Algorithm 5. The major steps are as follows.
1) The expansion step takes a list of nodes to expand in
parallel. These parent nodes examine their neighbors to
find their child nodes. As in BFS, when a neighbor is
accessed for the first time, its distance from the source
is recorded. Such node is considered a child node in this
iteration. In addition, the appropriate parent is recorded
for each child. Out of all the possible parents, the parent
recorded is the one closest to the source node in the
permutation array.
2) The reduction step computes the number of children for
each parent.
3) A prefix sum step computes for each child an index in
the permutation array according to each child’s parent.
4) The placement step performs the actual placement of
children in the permutation array. Every child node is
placed at the designated range of indices determined for
each parent during the previous step. The sequence of
these ranges respects the RCM ordering of parents. The
placement of children ends by sorting the children of

Citations
Proceedings ArticleDOI
23 May 2016
TL;DR: This paper presents a first algorithm for just-in-time parallel reordering, named Rabbit Order, which reduces end-to-end runtime by achieving high locality and fast reordering at the same time through two approaches.
Abstract: Ahead-of-time data layout optimization by vertex reordering is a widely used technique to improve memory access locality in graph analysis. While reordered graphs yield better analysis performance, the existing reordering algorithms use significant amounts of computation time to provide efficient vertex ordering, hence, they fail to reduce end-to-end processing time. This paper presents a first algorithm for just-in-time parallel reordering, named Rabbit Order. It reduces end-to-end runtime by achieving high locality and fast reordering at the same time through two approaches. The first approach is hierarchical community-based ordering, which exploits the locality derived from hierarchical community structures in real-world graphs. Our ordering fully leverages low-latency cache levels by mapping hierarchical communities into hierarchical caches. The second approach is parallel incremental aggregation, which improves the runtime efficiency of reordering by decreasing the number of vertices to be processed. In addition, this approach utilizes lightweight atomic operations for concurrency control to avoid locking overheads and achieve high scalability. Our experiments show that Rabbit Order significantly outperforms state-of-the-art reordering algorithms.

85 citations

Proceedings ArticleDOI
01 Sep 2018
TL;DR: This work identifies lightweight re ordering techniques that improve performance even after accounting for the overhead of reordering, and addresses a major impediment to the general adoption of these reordering techniques - input-dependent speedups – by linking the speedup from lightweight reordering to structural properties of the input graph.
Abstract: Graph processing applications are notorious for exhibiting poor cache locality due to an irregular memory access pattern. However, prior work on graph reordering has observed that the structural properties of real-world input graphs can be exploited to improve locality of graph applications. While sophisticated graph reordering techniques are effective at reducing the graph application runtime, the reordering step imposes significant overheads leading to a net increase in end-to-end execution time. The high overhead of sophisticated reordering techniques renders them inapplicable in many important use cases wherein the input graph is processed only a few times and, hence, cannot amortize the overhead of reordering. In this work, we identify lightweight reordering techniques that improve performance even after accounting for the overhead of reordering. We first conduct a detailed performance evaluation of these lightweight reordering techniques across a range of applications to identify the characteristics of applications that benefit the most from lightweight reordering. Next, we address a major impediment to the general adoption of these reordering techniques - input-dependent speedups – by linking the speedup from lightweight reordering to structural properties of the input graph. We leverage the structure dependence of speedup to propose a low-overhead mechanism to determine whether a given input graph would benefit from reordering. Using our selective lightweight reordering, we show maximum end-to-end speedup of up to 1.75x and never cause a slowdown beyond 0.1%.

57 citations


Cites methods from "Parallelization of reordering algor..."

  • ...[19] proposed a parallel implementations of common graph reordering techniques – Reverse Cuthill-McKee (RCM) and Sloan – to reduce reordering overheads....


  • ...Several reordering techniques exist in prior work [14], [15], [4], [11], [16], [17], [18], [19]....


Proceedings ArticleDOI
18 Oct 2021
TL;DR: I-GCN as discussed by the authors proposes a new online graph restructuring algorithm referred to as islandization, which finds clusters of nodes with strong internal but weak external connections and prunes 38% of aggregation operations.
Abstract: Graph Convolutional Networks (GCNs) have drawn tremendous attention in the past three years. Compared with other deep learning modalities, high-performance hardware acceleration of GCNs is as critical but even more challenging. The hurdles arise from the poor data locality and redundant computation due to the large size, high sparsity, and irregular non-zero distribution of real-world graphs. In this paper we propose a novel hardware accelerator for GCN inference, called I-GCN, that significantly improves data locality and reduces unnecessary computation. The mechanism is a new online graph restructuring algorithm we refer to as islandization. The proposed algorithm finds clusters of nodes with strong internal but weak external connections. The islandization process yields two major benefits. First, by processing islands rather than individual nodes, there is better on-chip data reuse and fewer off-chip memory accesses. Second, there is less redundant computation as aggregation for common/shared neighbors in an island can be reused. The parallel search, identification, and leverage of graph islands are all handled purely in hardware at runtime working in an incremental pipeline. This is done without any preprocessing of the graph data or adjustment of the GCN model structure. Experimental results show that I-GCN can significantly reduce off-chip accesses and prune 38% of aggregation operations, leading to performance speedups over CPUs, GPUs, the prior art GCN accelerators of 5549 ×, 403 ×, and 5.7 × on average, respectively.

41 citations

Journal ArticleDOI
01 Mar 2020
TL;DR: A lightweight offline graph reordering algorithm, HALO (Harmonic Locality Ordering), is proposed that can be used as a pre-processing step for static graphs and specifically aims to cover large directed real world graphs in addition to undirected graphs whereas prior methods only account for the latter.
Abstract: Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much larger host memory transparently as a part of a feature known as unified virtual memory. While accessing host memory over an interconnect is understandably slower, the problem space has not been sufficiently explored in the context of a challenging workload with low computational intensity and an irregular data access pattern such as graph traversal. We analyse the performance of breadth first search (BFS) for several large graphs in the context of unified memory and identify the key factors that contribute to slowdowns. Next, we propose a lightweight offline graph reordering algorithm, HALO (Harmonic Locality Ordering), that can be used as a pre-processing step for static graphs. HALO yields speedups of 1.5x-1.9x over baseline in subsequent traversals. Our method specifically aims to cover large directed real world graphs in addition to undirected graphs whereas prior methods only account for the latter. Additionally, we demonstrate ties between the locality ordering problem and graph compression and show that prior methods from graph compression such as recursive graph bisection can be suitably adapted to this problem.

40 citations


Additional excerpts

  • ...Parallelising RCM itself is also challenging [40]....


Book ChapterDOI
07 Jun 2016
TL;DR: In this article, the authors investigate the use of bandwidth and wavefront reduction algorithms to determine a static BDD variable ordering, which reduces the size of BDDs arising in symbolic reachability.
Abstract: We investigate the use of bandwidth and wavefront reduction algorithms to determine a static BDD variable ordering. The aim is to reduce the size of BDDs arising in symbolic reachability. Previous work showed that minimizing the weighted event span of the variable dependency graph yields small BDDs. The bandwidth and wavefront of symmetric matrices are well studied metrics, used in sparse matrix solvers, and many bandwidth and wavefront reduction algorithms are readily available in libraries like Boost and ViennaCL. In this paper, we transform the dependency matrix to a symmetric matrix and apply various bandwidth and wavefront reduction algorithms, measuring their influence on the weighted event span. We show that Sloan's algorithm, executed on the total graph of the dependency matrix, yields a variable order with minimal event span. We demonstrate this on a large benchmark of Petri nets, Dve, Promela, B, and mcrl2 models. As a result, good static variable orders can now be determined in milliseconds by using standard sparse matrix solvers.

28 citations

References
Journal ArticleDOI
TL;DR: The University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications, is described and a new multilevel coarsening scheme is proposed to facilitate this task.
Abstract: We describe the University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications The Collection is widely used by the numerical linear algebra community for the development and performance evaluation of sparse matrix algorithms It allows for robust and repeatable experiments: robust because performance results with artificially generated matrices can be misleading, and repeatable because matrices are curated and made publicly available in many formats Its matrices cover a wide spectrum of domains, include those arising from problems with underlying 2D or 3D geometry (as structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, and other discretizations) and those that typically do not have such geometry (optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, and other networks and graphs) We provide software for accessing and managing the Collection, from MATLAB™, Mathematica™, Fortran, and C, as well as an online search capability Graph visualization of the matrices is provided, and a new multilevel coarsening scheme is proposed to facilitate this task

3,456 citations


"Parallelization of reordering algor..." refers methods in this paper

  • ...To evaluate our algorithms, we selected a set of ten symmetric matrices, each with more than seven million non-zeros, from the University of Florida Sparse Matrix Collection [13]....


Journal ArticleDOI

1,796 citations


"Parallelization of reordering algor..." refers background or methods in this paper

  • ...is dominated by the cost of numerical factorization [16]....


  • ...The classical examples are direct methods for solving sparse linear systems: in a sequential implementation, reordering the matrix before factorization can reduce fill, thereby reducing the space and time required to perform the factorization [14, 16]....


  • ...Therefore, heuristic reordering algorithms are used in practice, such as minimum-degree, Cuthill-McKee (CM), Reverse CuthillMcKee (RCM), Sloan, and nested dissection orderings [12, 16, 26, 33, 43]....


  • ...required to represent the matrix when certain formats like banded and skyline representations are used [16]....


  • ...In this paper, we describe the first ever parallelizations of two popular reordering algorithms, Reverse Cuthill-McKee (RCM) [12, 16] and Sloan [43]....


Proceedings ArticleDOI
26 Aug 1969
TL;DR: A direct method of obtaining an automatic nodal numbering scheme to ensure that the corresponding coefficient matrix will have a narrow bandwidth is presented.
Abstract: The finite element displacement method of analyzing structures involves the solution of large systems of linear algebraic equations with sparse, structured, symmetric coefficient matrices. There is a direct correspondence between the structure of the coefficient matrix, called the stiffness matrix in this case, and the structure of the spatial network delineating the element layout. For the efficient solution of these systems of equations, it is desirable to have an automatic nodal numbering (or renumbering) scheme to ensure that the corresponding coefficient matrix will have a narrow bandwidth. This is the problem considered by R. Rosen1. A direct method of obtaining such a numbering scheme is presented. In addition several methods are reviewed and compared.

1,518 citations


"Parallelization of reordering algor..." refers methods in this paper

  • ...The Cuthill-McKee (CM) [12] algorithm uses a refinement of BFS to produce a permutation....


  • ...Therefore, heuristic reordering algorithms are used in practice, such as minimum-degree, Cuthill-McKee (CM), Reverse CuthillMcKee (RCM), Sloan, and nested dissection orderings [12, 16, 26, 33, 43]....


  • ...In this paper, we describe the first ever parallelizations of two popular reordering algorithms, Reverse Cuthill-McKee (RCM) [12, 16] and Sloan [43]....


  • ...In this paper, we consider parallelizations of two popular reordering algorithms, Reverse Cuthill-McKee (RCM) [12], which attempts to reduce the bandwidth of symmetric matrices, and Sloan [43, 44], which tries to minimize the wavefront....


Journal ArticleDOI
Michele Benzi
TL;DR: This article surveys preconditioning techniques for the iterative solution of large linear systems, with a focus on algebraic methods suitable for general sparse matrices, including progress in incomplete factorization methods, sparse approximate inverses, reorderings, parallelization issues, and block and multilevel extensions.

1,219 citations


"Parallelization of reordering algor..." refers methods in this paper

  • ...Reducing bandwidth is usually applied as a preprocessing step for sparse matrix-vector multiplication [39] and some preconditioners for iterative methods [7]....


Book
01 Jan 2002
TL;DR: The user guide and reference manual for the Boost Graph Library (BGL), covering graph concepts, generic graph algorithms, graph classes and adaptors, and generic programming in C++.
Abstract: Foreword. Preface. I User Guide. 1. Introduction. Some Graph Terminology. Graph Concepts. Vertex and Edge Descriptors. Property Maps. Graph Traversa. Graph Construction and Modification Algorithm Visitors. Graph Classes and Adaptors. Graph Classes. Graph Adaptors. Generic Graph Algorithms. The Topological Sort Generic Algorithm. The Depth-First Search Generic Algorithm. 2.Generic Programming in C++. Introduction. Polymorphism in Object-Oriented Programming. Polymorphism in Generic Programming. Comparison of GP and OOP. Generic Programming and the STL. Concepts and Models. Sets of Requirements. Example: InputIterator. Associated Types and Traits Classes. Associated Types Needed in Function Template. Typedefs Nested in Classes. Definition of a Traits Class. Partial Specialization. Tag Dispatching. Concept Checking. Concept-Checking Classes. Concept Archetypes. The Boost Namespace. Classes. Koenig Lookup. Named Function Parameters. 3. A BGL Tutorial. File Dependencies. Graph Setup. Compilation Order. Topological Sort via DFS. Marking Vertices Using External Properties. Accessing Adjacent Vertices. Traversing All the Vertices. Cyclic Dependencies. Toward a Generic DFS: Visitors. Graph Setup: Internal Properties. Compilation Time. A Generic Topological Sort and DFS. Parallel Compilation Time. Summary. 4. Basic Graph Algorithms. Breadth-First Search. Definitions. Six Degrees of Kevin Bacon. Depth-First Search. Definitions. Finding Loops in Program-Control-Flow Graphs. 5. Shortest-Paths Problems. Definitions. Internet Routing. Bellman-Ford and Distance Vector Routing. Dijkstra and Link-State Routing. 6. Minimum-Spanning-Tree Problem. Definitions. Telephone Network Planning. Kruskal's Algorithm. Prim's Algorithm. 7. Connected Components. Definitions. Connected Components and Internet Connectivity. Strongly Connected Components and Web Page Links. 8. Maximum Flow. Definitions. Edge Connectivity. 9. Implicit Graphs: A Knight's Tour. Knight's Jumps as a Graph. Backtracking Graph Search. Warnsdorff's Heuristic. 10. Interfacing with Other Graph Libraries. Using BGL Topological Sort with a LEDA Graph. Using BGL Topological Sort with a SGB Graph. Implementing Graph Adaptors. 11. Performance Guidelines. Graph Class Comparisons. The Results and Discussion. Conclusion. II Reference Manual. 12. BGL Concepts. Graph Traversal Concepts. Undirected Graphs. Graph. IncidenceGraph. BidirectionalGraph. AdjacencyGraph. VertexListGraph. EdgeListGraph. AdjacencyMatrix. Graph Modification Concepts. VertexMutableGraph. EdgeMutableGraph. MutableIncidenceGraph. MutableBidirectionalGraph. MutableEdgeListGraph. PropertyGraph. VertexMutablePropertyGraph. EdgeMutablePropertyGraph. Visitor Concepts. BFSVisitor. DFSVisitor. DijkstraVisitor. BellmanFordVisitor. 13. BGL Algorithms. Overview. Basic Algorithms. breadth_first_search. breadth_first_visit. depth_first_search. depth_first_visit. topological_sort. Shortest-Path Algorithms. dijkstra_shortest_paths. bellman_ford_shortest_paths. johnson_all_pairs_shortest_paths. Minimum-Spanning-Tree Algorithms. kruskal_minimum_spanning_tree. prim_minimum_spanning_tree. Static Connected Components. connected_components. strong_components. Incremental Connected Components. initialize_incremental_components. incremental_components. same_component. component_index. Maximum-Flow Algorithms. edmunds_karp_max_flow. push_relabel_max_flow. 14. BGL Classes. Graph Classes. adjacency_list. adjacency_matrix. Auxiliary Classes. graph_traits. adjacency_list_traits. adjacency_matrix_traits. 
property_map. property. Graph Adaptors. edge_list. reverse_graph. filtered_graph. SGB GraphPointer. LEDA GRAPH . std::vector . 15. Property Map Library. Property Map Concepts. ReadablePropertyMap. WritablePropertyMap. ReadWritePropertyMap. LvaluePropertyMap. Property Map Classes. property_traits. iterator_property_map. Property Tags. Creating Your Own Property Maps. Property Maps for Stanford GraphBase. A Property Map Implemented with std::map. 16 Auxiliary Concepts, Classes, and Functions. Buffer. ColorValue. MultiPassInputIterator. Monoid. mutable queue. Disjoint Sets. disjoint_sets. find_with_path_halving. find_with_full_path_compression. tie. graph_property_iter_range. Bibliography. Index. 0201729148T12172001

724 citations


"Parallelization of reordering algor..." refers methods in this paper

  • ...Alternative implementations of these algorithms can be found in the Boost Graph Library [42]....


Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "Parallelization of reordering algorithms for bandwidth and wavefront reduction" ?

In this paper, the authors present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan.