
Parallel QR Factorization of Block-Tridiagonal Matrices

02 Nov 2020 · SIAM Journal on Scientific Computing · Vol. 42, Iss. 6 (IRIT - Institut de recherche en informatique de Toulouse)
TL;DR: This work proposes a matrix permutation approach based on the Nested Dissection method which improves parallelism at the cost of additional computations and storage, and provides a detailed analysis of the approach showing that this extra cost is bounded.
Abstract: In this work, we deal with the QR factorization of block-tridiagonal matrices, where the blocks are dense and rectangular. This work is motivated by a novel method for computing geodesics over Riemannian manifolds. If blocks are reduced sequentially along the diagonal, only limited parallelism is available. We propose a matrix permutation approach based on the Nested Dissection method which improves parallelism at the cost of additional computations and storage. We provide a detailed analysis of the approach showing that this extra cost is bounded. Finally, we present an implementation for shared memory systems relying on task parallelism and the use of a runtime system. Experimental results support the conclusions of our analysis and show that the proposed approach leads to good performance and scalability.

Summary (2 min read)

2. Algorithm.

  • The authors' description and analysis rely on the theory of sparse matrix factorizations.
  • The necessary theoretical background is briefly described in the next section; the authors refer the reader to the cited documents for a detailed discussion of this topic.

2.1. Preliminaries.

  • Nested Dissection applies this procedure recursively until the subgraphs reach a given minimum size.
  • The resulting tree of separators matches the assembly tree (a minimal sketch of the recursion follows below).
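A minimal sketch of this recursion is given below, assuming the graph is stored as an adjacency dict. `find_separator` is a naive placeholder (practical codes compute small, balanced separators with partitioners such as METIS or SCOTCH), and none of the names come from the paper.

```python
# Illustrative sketch of recursive nested dissection (not the paper's code).
# `graph` maps each node to the set of its neighbors.

def find_separator(graph):
    """Naive placeholder: pick one middle vertex as the separator."""
    nodes = sorted(graph)
    return {nodes[len(nodes) // 2]}

def connected_components(graph, excluded):
    """Components of the graph once the separator vertices are removed."""
    comps, seen = [], set(excluded)
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(graph[u] - seen)
        comps.append(comp)
    return comps

def nested_dissection(graph, min_size=4):
    """Return an elimination order: subgraph nodes first, separator last."""
    if len(graph) <= min_size:        # recursion stops at small subgraphs
        return sorted(graph)
    sep = find_separator(graph)
    order = []
    for comp in connected_components(graph, sep):
        sub = {u: graph[u] & comp for u in comp}   # restrict edges to the part
        order += nested_dissection(sub, min_size)
    return order + sorted(sep)        # separators are eliminated last

# On a path graph 1-2-...-9 this yields [1, 2, 3, 4, 6, 7, 8, 9, 5].
```

Because each separator is eliminated only after the subgraphs it splits, the tree of separators produced by the recursion is exactly the tree along which elimination proceeds, i.e., the assembly tree.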

2.3. Node elimination.

  • This process closely resembles the multifrontal QR method [2, 13, 10].
  • It must be noted, however, that in the multifrontal method, frontal matrices are explicitly assembled by copying coefficients into dense matrix data structures which are allocated and zero-initialized beforehand; these copies can be extremely costly due to the large size of the blocks and the heavy use of indirect addressing.
  • In their approach, instead, the frontal matrices need not be formed explicitly: the constituent blocks can be accessed directly through pointers (see Section 3 for further details).
  • Additionally, the multifrontal method can only take advantage of the zeroes in the bottom-left part of frontal matrices (referred to as "Strategy 3" in the work of Amestoy, Duff, and Puglisi [2]), whereas, through the use of variable pivoting, their approach avoids more of the unnecessary computations. The contrast between the two storage schemes is sketched below.
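The storage contrast drawn in these bullets can be made concrete with a small numpy sketch; the arrays and names below are illustrative stand-ins, not the paper's data structures.

```python
import numpy as np

m, n = 6, 4
A_kk = np.random.rand(m, n)   # a diagonal block of the original matrix
A_k1 = np.random.rand(m, n)   # an off-diagonal block

# Multifrontal-style explicit assembly: allocate a zero-initialized frontal
# matrix, then copy the blocks in (in general sparse codes this copy also
# goes through indirect addressing, making it even more expensive).
front = np.zeros((m, 2 * n))
front[:, :n] = A_kk
front[:, n:] = A_k1

# Pointer-based alternative: the "front" is only a tuple of references to
# the existing blocks; no allocation, zero-initialization, or copy happens.
front_by_reference = (A_kk, A_k1)
```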

2.4. Complexity.

  • By the same token, it is possible to compute the amount of memory (in number of coefficients) consumed at each node type, reported in the bottom part of Table 2; this is made up of the blocks coming from the original matrix (underlined in the table) plus the fill-in generated during the processing of the node.
  • Summing up the values in Table 2 for all the nodes of the tree leads to the overall cost of the factorization (the aggregation pattern is sketched after this list).
  • In the resulting formula, C can be replaced with either F or M, leading to, respectively, the overall flop count or the overall memory consumption.
  • Finding an optimal pivotal sequence is an extremely challenging task due to its combinatorial nature.
  • Figure 7 shows the fraction of the total floating-point operations performed within chain or leaf nodes, demonstrating that, even with matrices of relatively small size, it is possible to achieve high levels of parallelism (by increasing l) without incurring an excessive volume of communications.
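The cost aggregation described above follows a simple pattern: evaluate a per-node cost and sum it over the assembly tree. In the sketch below the cost functions are hypothetical stand-ins for the Table 2 formulas (only the dense Householder QR count 2n^2(m − n/3) used inside the placeholder is standard); swapping in a memory function gives the overall memory consumption the same way.

```python
# Hypothetical per-node costs standing in for Table 2; not the paper's formulas.

def flops_of(node_type, m, n):
    """Placeholder per-node flop count, built on the standard dense
    Householder QR cost of ~2*n^2*(m - n/3) for an m-by-n block."""
    rows = 2 * m if node_type == "separator" else m   # assumption for illustration
    return 2 * n * n * (rows - n / 3)

def total_cost(tree, cost_of, m, n):
    """Sum a per-node cost over every node of the assembly tree."""
    return sum(cost_of(node["type"], m, n) for node in tree)

# A tiny assembly tree: leaf and chain nodes feeding one separator node.
tree = [{"type": "leaf"}, {"type": "leaf"},
        {"type": "chain"}, {"type": "chain"},
        {"type": "separator"}]
print(total_cost(tree, flops_of, m=512, n=256))
```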

4. Experimental results.

  • As discussed above, some concurrency is available within each branch or node of the tree; this parallelism involves communications because multiple processes share the same data.
  • On shared memory systems, these communications take the form of memory traffic and synchronizations.
  • On distributed memory systems, these communications amount to transferring data through the comparatively slow network interconnect and are therefore much more penalizing.
  • The use of nested dissection introduces embarrassing parallelism: each process may work on a different branch without communicating with the others (sketched below).
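As a rough illustration of this branch-level independence, the sketch below reduces several branches concurrently on a shared memory machine; `reduce_branch` is a stand-in kernel (here, a plain QR of each block), not the paper's task-based factorization.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def reduce_branch(blocks):
    """Stand-in for reducing one branch: take the R factor of each block."""
    return [np.linalg.qr(b, mode="r") for b in blocks]

# Four branches owning disjoint blocks: no worker ever touches another's data.
branches = [[np.random.rand(8, 4) for _ in range(3)] for _ in range(4)]
with ThreadPoolExecutor() as pool:
    reduced = list(pool.map(reduce_branch, branches))
```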


Citation (APA): Buttari, A., Hauberg, S., & Kodsi, C. (2020). Parallel QR factorization of block-tridiagonal matrices. SIAM Journal on Scientific Computing, 42(6), C313-C334. https://doi.org/10.1137/19M1306166 (peer-reviewed version, DTU Orbit)

PARALLEL QR FACTORIZATION OF BLOCK-TRIDIAGONAL MATRICES

A. BUTTARI, SØREN HAUBERG, AND COSTY KODSI
Université de Toulouse, CNRS-IRIT · DTU, Department of Applied Mathematics and Computer Science

Abstract. In this work, we deal with the QR factorization of block-tridiagonal matrices, where the blocks are dense and rectangular. This work is motivated by a novel method for computing geodesics over Riemannian manifolds. If blocks are reduced sequentially along the diagonal, only limited parallelism is available. We propose a matrix permutation approach based on the Nested Dissection method which improves parallelism at the cost of additional computations and storage. We show how operations can be arranged to keep this extra cost as low as possible. We provide a detailed analysis of the approach showing that this extra cost is bounded. Finally, we present an implementation for shared memory systems relying on task parallelism and the use of a runtime system. Experimental results support the conclusions of our analysis and show that the proposed approach leads to good performance and scalability.

Key words. QR factorization, nested dissection, task-based parallelism

AMS subject classifications. 68W10, 68W40, 65F05, 65F20
1. Introduction. In this work, we deal with the solution of linear least squares problems where the system matrix is block-tridiagonal. By this we mean that our matrices have a block structure with c block-rows and c block-columns, and along block-row k we only have nonzeroes in block-columns k − 1 to k + 1, as depicted in Figure 1. The blocks are assumed to be dense and rectangular of size m × n with m > n.

[Fig. 1. A block-tridiagonal matrix: c block-rows and block-columns of dense m × n blocks.]

In a LAPACK-like QR factorization where sub-diagonal coefficients are eliminated by means of Householder reflections column-by-column in the natural order, only limited parallelism is available because of the lack of update operations (i.e., application of the Householder reflections to the trailing submatrix) stemming from the potentially narrow bandwidth, and because each elimination step depends on the previous one. Even the use of tiled [11] or communication-avoiding [14] approaches can barely improve the situation, especially when c is high and m as well as n is small. By choosing a different order for reducing the matrix columns, which essentially amounts to permuting the matrix columns, it is possible to achieve better parallelism; this, however, generates additional fill-in, i.e., coefficients that are zero in the original matrix and are turned into nonzeroes by the factorization, which is responsible for an increase in the operation count and the storage. We propose an approach that computes this permutation by applying the Nested Dissection method to the compressed matrix graph. This allows us to identify groups of blocks that can be reduced independently; this first phase is followed by a reduction step that deals with the blocks that are at the interface of the groups. Although the structure and values of the R factor only depend on column permutations, the structure of the Q factor and, consequently, the overall cost of the factorization greatly depend on the order in which coefficients are annihilated within each column or, more generally, on row permutations. Taking this into account, we propose an approach based on variable pivoting that reduces the operational complexity by avoiding unnecessary computations, especially in the reduction phase, which involves communications. We provide a detailed analysis of the operational and memory cost showing that the overhead depends on how many nested dissection levels are used but is bounded by a modest constant. Finally, we present a parallel implementation for shared memory systems. It is based on the use of task parallelism, where the whole workload is represented in the form of a Directed Acyclic Graph (DAG) of tasks which is automatically created and scheduled by a runtime system. This implementation is achieved using features of the qr_mumps software [1], which relies on the StarPU runtime system [5]. Experimental results obtained with this implementation show that the proposed method achieves good performance and scalability on a 36-core system and has the potential to perform even better on large-scale distributed memory systems.
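To make the structure of Figure 1 concrete, the following snippet builds such a matrix with scipy; the sizes are arbitrary and the code is purely illustrative, not taken from the paper.

```python
import numpy as np
from scipy.sparse import bmat

# c block-rows/columns of dense m-by-n blocks (m > n), nonzero only in
# block-columns k-1 .. k+1, as in Figure 1.
c, m, n = 5, 6, 4
blocks = [[None] * c for _ in range(c)]
for k in range(c):
    for j in (k - 1, k, k + 1):
        if 0 <= j < c:
            blocks[k][j] = np.random.rand(m, n)

A = bmat(blocks).toarray()   # a (c*m) x (c*n) overdetermined system matrix
print(A.shape)               # (30, 20)
```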
1.1. Motivation. One of the key challenges in machine learning (ML) is to learn (i.e., estimate) a representation of data that is suitable for a given task [6]. Learned representations for supervised ML tasks, such as classification and regression, are optimal only for specific tasks. In contrast, representations for unsupervised ML tasks, such as data exploration, are concerned with the underlying structure governing the observed data.

It has been shown in [19, 4] that an effective approach for unsupervised ML involves treating the learned representation as a Riemannian manifold. An unfortunate implication, however, is that vector space operations are no longer applicable. Instead, Riemannian counterparts, such as the logarithm and exponential maps, take the place of subtraction and addition operations. Also, geodesics, in their role as generalizations of line segments in Euclidean space, become prominent. Just as a line segment is the shortest path connecting two distinct points, a geodesic on a Riemannian manifold is a smooth parametric curve that is a local length minimizer connecting any two distinct points. Additionally, a point moving along a geodesic does not experience acceleration in the tangent direction, i.e., the velocity has a constant magnitude.

Consider now an n-dimensional manifold embedded in m-dimensional Euclidean space (with m > n) described by the parametric map

    y : R^n → R^m.

A rather simple and intuitive strategy for determining a geodesic given any two distinct points on a Riemannian manifold requires working with the discretized version of a smooth parametric curve. It is then possible to model the curve as a series of connected linear elastic springs. A spring force f ∈ R^m is related to the end points y(x_i), y(x_j) ∈ R^m of a curve segment by f = K(y(x_j) − y(x_i)), in which x ∈ R^n and K is the identity matrix representing the stiffness. Even though the springs are linear in nature, the parameterization introduces potential nonlinearity. Imposing boundary conditions (constituting the given points) on the assembled block-tridiagonal (total) stiffness matrix as well as the force vector, a system of equations can be solved for the unknown curve points, yielding the stable equilibrium of the spring assemblage and the geodesic. If a system of nonlinear equations has to be tackled, which is the working assumption, Newton's method or a variation of Newton's method is a popular choice. In each iteration of Newton's method, the solution of an overdetermined system of linear equations featuring a full-rank Jacobian is sought as a trial step. It is possible to multiply both sides of the overdetermined system by the transpose of the Jacobian for symmetry purposes in preparation for solution. Since the Jacobian is likely to be poorly conditioned, this is not desirable as it has the effect of squaring the condition number of the problem. Alternatively, a solution methodology involving QR factorization of the Jacobian does not unnecessarily degrade the conditioning of the problem.
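The conditioning argument can be checked numerically. In the sketch below, J is a synthetic ill-conditioned matrix standing in for the Jacobian (it does not come from the geodesic problem): the condition number of J^T J is roughly the square of that of J, while the QR route never forms J^T J at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 100 x 20 matrix with singular values spanning six orders of magnitude.
U, _ = np.linalg.qr(rng.standard_normal((100, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
s = np.logspace(0, -6, 20)
J = U @ np.diag(s) @ V.T
b = rng.standard_normal(100)

print(np.linalg.cond(J))        # ~1e6
print(np.linalg.cond(J.T @ J))  # ~1e12: the condition number is squared

# QR-based least-squares solve: J = QR, then solve R x = Q^T b.
Q, R = np.linalg.qr(J)
x = np.linalg.solve(R, Q.T @ b)
```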
1.2. Related work. Our work can be related to the factorization of banded, (block) bi- or tridiagonal matrices. Most of the approaches presented in the literature draw parallelism from cyclic reduction [22], wrap-around [20], partitioning methods, or minor variants thereof; these techniques amount to identifying independent sets of computations that allow for reducing multiple (block) columns at once, at the cost of additional computations due to fill-in. The method we propose in this work relies on the use of nested dissection (see Sections 2.1 and 2.2) and, essentially, leads to permutations that are equivalent to those computed with cyclic reduction or wrap-around; because nested dissection relies on graph theory, we believe it provides a clearer framework that allows for a better understanding and analysis of parallelism and fill-in.

Early work on the QR factorization of square tridiagonal systems by Givens rotations was proposed by Sameh and Kuck [28]. Computations are organized in two stages. In the first, diagonal blocks are reduced independently; this introduces fill-in at the interface of contiguous diagonal blocks, which is reduced in the second stage. These ideas eventually led to the definition of the SPIKE algorithm [27, 25] for the LU factorization of banded matrices. This approach was also used by Berry and Sameh [7] to compute the parallel LU factorization of block-tridiagonal, diagonally dominant (i.e., not requiring pivoting) matrices.

Arbenz and Hegland [3] propose a method for the stable solution of (periodic) square banded linear systems. The factorization is achieved in two phases, assuming that the original matrix is permuted into standard form (i.e., lower-banded periodic) through a circular shift; this permutation introduces some unnecessary fill-in. In the first phase, a partitioning of the matrix is computed (which amounts to a dissection step) and the block-columns associated with the resulting parts are reduced independently. This leads to a periodic block-bidiagonal system which is reduced in parallel using cyclic reduction. They discuss both the LU and QR factorizations, although they mostly focus on the former. Note that the QR factorization of square, periodic block-bidiagonal systems through cyclic reduction is also addressed by Hegland and Osborne [21].

None of the aforementioned approaches consider the case of overdetermined systems; this has non-trivial implications because when a block-column is reduced, some coefficients along the corresponding pivotal block-row remain which have to be reduced. Also, all of these methods only draw parallelism from the matrix permutation or partitioning, whereas our approach exploits multiple levels of parallelism, as described in Section 3.

All of the methods referenced, including ours, are, in essence, specialized multifrontal methods [15]. Algebraic multifrontal methods for the QR factorization of sparse matrices have been proposed in the literature by Amestoy, Duff, and Puglisi [2], Davis [13], and Buttari [10], for instance. These methods and the corresponding tools are designed for generic sparse matrices; although they can obviously be used for solving block-tridiagonal systems, they may not be the best suited tools. Indeed, unlike common sparse matrices, our block-tridiagonal systems may be relatively dense in the beginning (multiple thousands of coefficients per row) and fill up moderately during the factorization (see Section 2.4). As a result, algebraic sparse multifrontal solvers may suffer from excessive overhead due to the handling of the coefficients of the original matrix based on indirect addressing. Clearly, these methods could be extended to efficiently handle block matrices; this is the object of future work. Additionally, multifrontal methods involve the explicit assembly of frontal matrices, which can be avoided in our case due to the simple, regular structure of the data (see Section 2.3). Finally, in sparse multifrontal solvers, frontal matrices are typically reduced using a LAPACK-like factorization, which does not fully take into account their possibly sparse nature, as explained in Section 2.3.
2. Algorithm. In this section we describe our approach to parallelize the QR factorization of a block-tridiagonal matrix. Our description and analysis will rely on the theory of sparse matrix factorizations. The necessary theoretical background is briefly described in the next section; we refer the reader to the cited documents for a detailed discussion of this topic.

2.1. Preliminaries. As is commonly done in the literature [12, 10], our symbolic analysis of the QR factorization of a sparse matrix relies on the equivalence between the R factor of a real matrix A and the Cholesky factor of the normal equation matrix B = A^T A, whenever the Strong Hall property holds. As a result of this equivalence, the QR factorization of a sparse matrix A follows the same computational pattern as the Cholesky factorization of B. This can be modeled using the adjacency graph of B, defined as a graph G(B) = (V, E) whose vertex set V = {1, 2, ..., cn} includes the unknowns associated with the rows and columns of B, and whose edge set E = {(i, j) : b_ij ≠ 0} includes an edge for each nonzero coefficient of B. In our case, the symmetry of B implies that if (i, j) is in E, then so is (j, i), and therefore we will only represent one of these edges; such a graph is called undirected.

It is possible to use the adjacency graph to model the Cholesky factorization of B. Specifically, we can build a sequence of graphs G(B) = G_0(B), G_1(B), ..., G_{cn−1}(B), called elimination graphs, such that G_k(B) is the adjacency graph of the trailing submatrix after elimination of variables 1, 2, ..., k. Elimination graph G_k(B) is built from G_{k−1}(B) by removing node k as well as all its incident edges and by updating the connectivity of the remaining nodes: for any two nodes i and j that are neighbors of k, we have to add an edge connecting them if it does not already exist. One such edge, called a fill edge, models the occurrence of a new fill-in coefficient.

A sparse factorization can achieve better parallelism than a dense one because sparsity implies that the elimination of one unknown does not affect all the uneliminated […]
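The elimination-graph update just described fits in a few lines; the graph encoding below is chosen for illustration and is not the paper's.

```python
from itertools import combinations

def eliminate(graph, k):
    """Remove node k, connect its neighbors pairwise, and report fill edges."""
    neighbors = graph[k]
    fill = []
    for i, j in combinations(sorted(neighbors), 2):
        if j not in graph[i]:            # edge (i, j) absent: a fill edge
            graph[i].add(j)
            graph[j].add(i)
            fill.append((i, j))
    for u in neighbors:                  # drop k and all its incident edges
        graph[u].discard(k)
    del graph[k]
    return graph, fill

# Adjacency graph of a tiny symmetric B (undirected, as in the text above).
G = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}
G, fill = eliminate(G, 1)
print(fill)                              # [(2, 3)]: eliminating 1 creates one fill edge
```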

Citations

  • A fast solution method for banded linear systems that transforms the original system into an equivalent one with an almost block-triangular coefficient matrix and constructs a preconditioner based on this formulation (1 citation).
  • A study of the partitioning of large symmetric matrix data for QR factorization using Givens rotations, with a parallel MPI/CUDA implementation; cites the present work among parallel factorization methods (LU, Cholesky, Householder) for multicore architectures and GPUs (1 citation).

References

  • Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828. (Cited as background for learning data representations suitable for a given task.)
  • Duff, I. S., Erisman, A. M., & Reid, J. K. Direct Methods for Sparse Matrices. Oxford University Press. (Cited as a reference for multifrontal methods.)
  • Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. SIAM. (Cited for the symbolic analysis of sparse QR via the Cholesky factorization of the normal equation matrix.)
  • Rose, D. J., Tarjan, R. E., & Lueker, G. S. (1976). Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing, 5(2), 266-283. (Cited for the theorem that eliminating separator nodes last prevents fill-in between the separated subgraphs.)
  • Augonnet, C., Thibault, S., Namyst, R., & Wacrenier, P.-A. (2011). StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2), 187-198. (Cited as the runtime system underlying the implementation.)
