# Parallelization of reordering algorithms for bandwidth and wavefront reduction

## Summary

### Introduction

- The authors present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan.
- Reordering the matrix using their parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
- In existing applications, reordering is typically performed sequentially even when the sparse matrix computation itself is performed in parallel.
- Since reordering strategies like RCM and Sloan are heuristics, the authors allow their parallel implementation of a reordering algorithm to produce a reordering that may differ slightly from the one produced by the sequential algorithm, if this improves parallel performance.

### II. BACKGROUND

- The bandwidth of a matrix is the maximum row width, where the width of row i is the difference between the column indices of the first and last non-zero elements in row i.
- If the matrix is symmetric, it suffices to consider the semi-bandwidth: the maximum distance of a non-zero element from the diagonal.
- While few matrices are banded by default, in many cases the bandwidth of a matrix can be reduced by permuting or renumbering its rows and columns.
- Reducing bandwidth is usually applied as a preprocessing step for sparse matrix-vector multiplication [39] and some preconditioners for iterative methods [7].
- The authors then describe breadth-first search (BFS) (Section II-B), an important subroutine of both algorithms.
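
The bandwidth definition above, and the effect of renumbering rows and columns, can be illustrated with a minimal sketch (the graph and permutation here are hypothetical examples, not from the paper):

```python
# Minimal sketch: semi-bandwidth of a symmetric matrix given as an
# adjacency list, and the effect of renumbering rows/columns.

def bandwidth(adj):
    """Semi-bandwidth: the maximum |i - j| over nonzero entries (i, j)."""
    return max((abs(i - j) for i, nbrs in enumerate(adj) for j in nbrs),
               default=0)

def permute(adj, perm):
    """Relabel nodes so that old node v becomes new node perm[v]."""
    new_adj = [[] for _ in adj]
    for v, nbrs in enumerate(adj):
        new_adj[perm[v]] = sorted(perm[u] for u in nbrs)
    return new_adj

# A path graph with scrambled labels: edges 0-3, 3-1, 1-4, 4-2.
adj = [[3], [3, 4], [4], [0, 1], [1, 2]]
perm = [0, 2, 4, 1, 3]          # renumber the path consecutively

print(bandwidth(adj))                 # 3 before renumbering
print(bandwidth(permute(adj, perm)))  # 1 after renumbering
```

Renumbering the path consecutively makes every non-zero adjacent to the diagonal, which is exactly what RCM-style orderings aim for on general graphs.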

### A. Galois Programming Model

- The Galois system is a library and runtime system to facilitate parallelization of irregular algorithms, in particular, graph algorithms [15, 27].
- The Galois system adds two concepts to a traditional sequential programming model: ordered and unordered set iterators.
- An unordered set iterator allows new elements to be added to the set while it is being iterated over; these new elements are processed before the loop finishes.
- An ordered set iterator is like an unordered set iterator with an additional restriction: the serialization of iterations must be consistent with a user-defined ordering relation on iterations.
- In the following sections, the authors introduce algorithms using ordered set iterators for simplicity but their parallelizations in Section III reformulate the algorithms into unordered forms for performance.
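
The unordered set iterator pattern described above can be sketched in plain Python (this is an illustration of the concept, not the Galois API; the graph and function names are hypothetical):

```python
from collections import deque

def for_each_unordered(initial, operator):
    """Process a worklist whose iterations may generate new work.
    `operator(item, push)` may call `push(new_item)` to add work;
    new items are processed before the loop finishes."""
    wl = deque(initial)
    while wl:
        item = wl.popleft()      # the runtime may pick any valid order
        operator(item, wl.append)

# Example: reachability from node 0 on a small hypothetical graph.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
reached = set()

def visit(v, push):
    if v not in reached:
        reached.add(v)
        for u in adj[v]:
            push(u)              # generate new work items

for_each_unordered([0], visit)
print(reached)
```

An ordered set iterator would additionally require the processing order to respect a user-defined ordering relation on the items.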

### B. Breadth-First Search

- In the BFS algorithm, the body of the outer loop is the operator.
- Level values are initialized with a value that is greater than any other level value.
- This order produces a work-efficient BFS implementation but constrains parallelism because, although all the nodes at level l can be processed simultaneously, no node at level l + 1 can be processed until all the nodes at level l are done.
- Barriers are often used to synchronize threads between levels.
- A difference between the ordered algorithm and the unordered one is that with unordered BFS a node can be added to the wl set (line 10) several times, and thus the level of a node may be updated multiple times, but correct final level values can only be determined after all the iterations have completed.
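
A minimal sketch of the unordered BFS idea (hypothetical graph; the real implementation uses the Galois worklist machinery):

```python
from collections import deque

INF = float("inf")

def unordered_bfs(adj, src):
    """Unordered BFS sketch: a node may enter the worklist several times
    and its level may be lowered repeatedly; final levels are correct
    only after the worklist drains."""
    level = [INF] * len(adj)     # initialized greater than any real level
    level[src] = 0
    wl = deque([src])
    while wl:
        v = wl.popleft()
        for u in adj[v]:
            if level[v] + 1 < level[u]:
                level[u] = level[v] + 1
                wl.append(u)     # u may be pushed multiple times
    return level

# Hypothetical graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
print(unordered_bfs([[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]], 0))
# [0, 1, 1, 2, 3]
```

Because level values only ever decrease, the loop terminates, and no barrier between levels is needed.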

### E. Choosing Source Nodes

- Empirical data shows that the quality of reordering algorithms is highly influenced by the nodes chosen as the source for BFS and RCM or the source and end for Sloan [14, 41, 43].
- For this work, the authors use the algorithm described by Kumfert [29] that computes a pair of nodes that lie on the pseudo-diameter, called pseudo-peripheral nodes.
- The diameter of a graph is the maximum distance between any two nodes; the pseudo-diameter approximates it.
- Pseudo-peripheral nodes, therefore, are a pair of nodes that are “far away” from each other in the graph.
- For BFS and RCM reordering, the authors pick one element from the pair to be the source.
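
The alternating-BFS heuristic for finding a pseudo-peripheral pair can be sketched as follows. This is a simplified version in the spirit of such algorithms; the paper uses Kumfert's variant [29], whose details differ, and the tie-breaking rule here is an assumption:

```python
def pseudo_peripheral_pair(adj):
    """Sketch: BFS from the current source, jump to a farthest node of
    minimum degree, and stop when the eccentricity (BFS-tree depth)
    no longer grows."""
    def bfs_levels(src):
        level = {src: 0}
        frontier = [src]
        while frontier:
            nxt = []
            for v in frontier:
                for u in adj[v]:
                    if u not in level:
                        level[u] = level[v] + 1
                        nxt.append(u)
            frontier = nxt
        return level

    src = 0                          # arbitrary starting node
    level = bfs_levels(src)
    ecc = max(level.values())
    while True:
        far = [v for v, l in level.items() if l == ecc]
        end = min(far, key=lambda v: len(adj[v]))  # min-degree tie-break
        level_end = bfs_levels(end)
        ecc_end = max(level_end.values())
        if ecc_end <= ecc:           # no improvement: accept the pair
            return src, end
        src, level, ecc = end, level_end, ecc_end

# Path graph 0-1-2-3-4: the two endpoints are peripheral.
print(sorted(pseudo_peripheral_pair([[1], [0, 2], [1, 3], [2, 4], [3]])))
# [0, 4]
```

The eccentricity strictly increases on every retained jump, so the loop terminates after at most diameter-many BFS passes.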

### A. BFS

- As mentioned in Section II-B, a possible parallel implementation is the ordered BFS algorithm (see Algorithm 1).
- The authors describe how to take the output of the unordered BFS algorithm and generate, in parallel, a permutation that is consistent with the levels computed by the unordered BFS.
- In their algorithm, the authors compute the histograms locally for each thread and then sum them together.
- Calculating prefix sums in parallel is a well-known technique.
- This is done by dividing the nodes between threads and setting each node’s position in the permutation to the next free position for the node’s level.
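
The histogram-plus-prefix-sum construction described above can be sketched sequentially, simulating each thread's work as a chunk of nodes (function name and chunking are illustrative assumptions; the real implementation runs the chunks in parallel):

```python
from itertools import accumulate

def levels_to_permutation(level, num_chunks=4):
    """Sketch: per-chunk level histograms ("per-thread" in the real
    implementation), prefix sums over the combined counts to find where
    each level's block of positions starts, then each chunk claims
    disjoint slots within each level."""
    n = len(level)
    nlevels = max(level) + 1
    bounds = [n * c // num_chunks for c in range(num_chunks + 1)]
    # Step 1: per-chunk histograms of level counts.
    hists = []
    for c in range(num_chunks):
        h = [0] * nlevels
        for v in range(bounds[c], bounds[c + 1]):
            h[level[v]] += 1
        hists.append(h)
    # Step 2: combine histograms; prefix sums give each level's start.
    total = [sum(h[l] for h in hists) for l in range(nlevels)]
    start = [0] + list(accumulate(total))[:-1]
    # Step 3: each chunk writes its nodes into its own slot range.
    perm = [0] * n
    for c in range(num_chunks):
        cursor = [start[l] + sum(hists[d][l] for d in range(c))
                  for l in range(nlevels)]
        for v in range(bounds[c], bounds[c + 1]):
            perm[v] = cursor[level[v]]
            cursor[level[v]] += 1
    return perm

print(levels_to_permutation([0, 1, 1, 2, 2, 1]))  # [0, 1, 2, 4, 5, 3]
```

Because every chunk's slots within a level are disjoint, step 3 needs no synchronization between threads.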

### IV. EXPERIMENTAL RESULTS

- The authors compare their leveled and unordered RCM and Sloan algorithms against well-known third-party implementations from the HSL mathematical software library [23].
- Section IV-C evaluates both execution time and reordering quality across a suite of sparse matrices.
- The selected matrices are shown in Table I.
- All matrices are square and have only one connected component.

### A. Methodology

- The authors evaluate four parallel reordering algorithms: BFS in Algorithm 4 using the unordered BFS described in Section II-B; two RCM algorithms, the leveled RCM in Section III-B1 and the unordered RCM in Section III-B2; and Sloan in Section III-C. The baseline implementations come from HSL, a collection of Fortran codes for large-scale scientific computations.
- To compute the source nodes for reordering, the authors use the algorithm by Kumfert [29] described in Section II-E.
- The authors note that the source nodes produced by different implementations are usually different.
- Each node of the cluster runs CentOS Linux 6.3 and has two 8-core Xeon processors operating at 2.7 GHz.

### B. Reordering Quality

- Tables II and III show the bandwidth and root mean square (RMS) of the wavefront, respectively.
- For each metric, Tables II and III show the value for the initial matrix and the values after applying the permutations.
- Table II shows that HSL’s sequential RCM and their parallel RCM produce very similar bandwidth numbers.
- The BFS reordering generally produces worse bandwidth, and the standard deviation shows that the variation due to non-deterministic results is usually not significant.
- Table III shows RMS wavefront for their Sloan implementation when running in parallel.

### C. Reordering Performance

- The authors compare the execution times of different reordering algorithms.
- The other matrices have fewer nonzeros per row (fewer than 12 in all cases).
- Thus, one reason for the better performance of unordered RCM is that the heuristics used in RCM naturally lead to BFS with a large number of levels, which favors unordered traversal over ordered ones.
- The “only reordering” column in Table IV shows how much time is spent in each HSL program excluding the computation of these nodes, and Figure 2 shows the speedup of the different algorithms when only the execution time of the reordering algorithm itself is considered.
- The speedups with the pseudo-diameter computation are 4.12X, 3.89X and 2.74X respectively.

### D. End-to-end Performance

- The ultimate goal of reordering is to improve the performance of subsequent matrix operations.
- Therefore, in this section, the authors show how local reordering will affect the local part of the SpMV computation.
- Table VI shows the times to run 100 SpMV iterations with matrices obtained by the different reorderings, using the implementation in the PETSc library and utilizing the 16 cores of a cluster node.
- Using their parallel RCM reordering, the time to perform 100 SpMV iterations is reduced by 1.5X compared to performing the SpMV iterations without reordering, and it is reduced by 2X compared to using HSL RCM for reordering.
- To quantify this improvement in cache reuse, the authors measured the number of cache misses for SpMV using the matrices obtained with the different reorderings.
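
A minimal CSR sparse matrix-vector multiply (plain Python, not the PETSc implementation the paper measures) illustrates why reordering helps: for row i, the entries of x that are touched lie within the matrix bandwidth of i, so a smaller bandwidth improves cache reuse on x:

```python
def spmv_csr(rowptr, colidx, vals, x):
    """y = A @ x for a matrix A in compressed sparse row (CSR) format."""
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colidx[k]]  # x accesses stay near index i
    return y

# Tridiagonal 3x3 example: [[2,1,0],[1,2,1],[0,1,2]] @ [1,1,1]
rowptr = [0, 2, 5, 7]
colidx = [0, 1, 0, 1, 2, 1, 2]
vals   = [2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0]
print(spmv_csr(rowptr, colidx, vals, [1.0, 1.0, 1.0]))  # [3.0, 4.0, 3.0]
```

With a banded ordering, the column indices in each row cluster around the diagonal, so successive rows reuse the same cache lines of x.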

### VI. CONCLUSIONS

- Many sparse matrix computations benefit from reordering the sparse matrix before performing the computation.
- To their knowledge, these are the first such parallel implementations.
- Since both these reorderings are heuristics, the authors allow the parallel implementations to produce slightly different reorderings than those produced by the sequential implementations of these algorithms.
- Since this can affect the SpMV time, the impact on the overall time to solution has to be determined experimentally.
- Reordering the matrix using their parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
