
Journal of Research of the National Bureau of Standards
Vol. 49, No. 6, December 1952 Research Paper 2379
Methods of Conjugate Gradients for Solving Linear Systems¹

Magnus R. Hestenes² and Eduard Stiefel³

An iterative algorithm is given for solving a system Ax = k of n linear equations in n unknowns. The solution is given in n steps. It is shown that this method is a special case of a very general method which also includes Gaussian elimination. These general algorithms are essentially algorithms for finding an n-dimensional ellipsoid. Connections are made with the theory of orthogonal polynomials and continued fractions.
1. Introduction
One of the major problems in machine computations is to find an effective method of solving a
system of n simultaneous equations in n unknowns,
particularly if n is large. There is, of course, no
best method for all problems because the goodness
of a method depends to some extent upon the
particular system to be solved. In judging the
goodness of a method for machine computations, one
should bear in mind that criteria for a good machine
method may be different from those for a hand
method. By a hand method, we shall mean one
in which a desk calculator may be used. By a
machine method, we shall mean one in which
sequence-controlled machines are used.
A machine method should have the following
properties:
(1) The method should be simple, composed of a
repetition of elementary routines requiring a minimum of storage space.
(2) The method should insure rapid convergence
if the number of steps required for the solution is
infinite. A method which—if no rounding-off errors
occur—will yield the solution in a finite number of
steps is to be preferred.
(3) The procedure should be stable with respect
to rounding-off errors. If needed, a subroutine
should be available to insure this stability. It
should be possible to diminish rounding-off errors
by a repetition of the same routine, starting with
the previous result as the new estimate of the
solution.
(4) Each step should give information about the
solution and should yield a new and better estimate
than the previous one.
(5) As many of the original data as possible should
be used during each step of the routine. Special
properties of the given linear system—such as having
many vanishing coefficients—should be preserved.
(For example, in the Gauss elimination special
properties of this type may be destroyed.)
In our opinion there are two methods that best fit these criteria, namely, (a) the Gauss elimination method; (b) the conjugate gradient method presented in the present monograph.

¹ This work was performed on a National Bureau of Standards contract with the University of California at Los Angeles, and was sponsored (in part) by the Office of Naval Research, United States Navy.
² National Bureau of Standards and University of California at Los Angeles.
³ University of California at Los Angeles, and Eidgenossische Technische Hochschule, Zurich, Switzerland.
There are many variations of the elimination
method, just as there are many variations of the
conjugate gradient method here presented. In the
present paper it will be shown that both methods
are special cases of a method that we call the method
of conjugate directions. This enables one to compare the two methods from a theoretical point of
view.
In our opinion, the conjugate gradient method is
superior to the elimination method as a machine
method. Our reasons can be stated as follows:
(a) Like the Gauss elimination method, the method
of conjugate gradients gives the solution in n steps if
no rounding-off error occurs.
(b) The conjugate gradient method is simpler to
code and requires less storage space.
(c) The given matrix is unaltered during the process, so that a maximum of the original data is used.
The advantage of having many zeros in the matrix
is preserved. The method is, therefore, especially
suited to handle linear systems arising from difference
equations approximating boundary value problems.
(d) At each step an estimate of the solution is
given, which is an improvement over the one given in
the preceding step.
(e) At any step one can start anew by a very
simple device, keeping the estimate last obtained as
the initial estimate.
In the present paper, the conjugate gradient routines are developed for the symmetric and nonsymmetric cases. The principal results are described
in section 3. For most of the theoretical considerations, we restrict ourselves to the positive definite
symmetric case. No generality is lost thereby. We
deal only with real matrices. The extension to
complex matrices is simple.
The method of conjugate gradients was developed
independently by E. Stiefel of the Institute of Applied
Mathematics at Zurich and by M. R. Hestenes with
the cooperation of J. B. Rosser, G. Forsythe, and
L. Paige of the Institute for Numerical Analysis,
National Bureau of Standards. The present account
was prepared jointly by M. R. Hestenes and E.
Stiefel during the latter's stay at the National Bureau
of Standards. The first papers on this method were
409

given by E. Stiefel
4
and by M. R. Hestenes.
5
Reports
on this method were given by E. Stiefel
6
and J. B.
Rosser
7
at a Symposium
8
on August 23—25, 1951.
Recently, C. Lanczos
9
developed a closely related
routine based on his earlier paper on eigenvalue
problem.
10
Examples and numerical tests of the
method have been by R. Hayes, U. Hochstrasser,
and M. Stein.
2. Notations and Terminology
Throughout the following pages we shall be concerned with the problem of solving a system of linear equations

    a_11 x_1 + a_12 x_2 + . . . + a_1n x_n = k_1
    a_21 x_1 + a_22 x_2 + . . . + a_2n x_n = k_2
    . . . . . . . . . . . . . . . . . . . . . . .          (2:1)
    a_n1 x_1 + a_n2 x_2 + . . . + a_nn x_n = k_n

These equations will be written in the vector form Ax = k. Here A is the matrix of coefficients (a_ij), x and k are the vectors (x_1, . . ., x_n) and (k_1, . . ., k_n). It is assumed that A is nonsingular. Its inverse A^{-1} therefore exists. We denote the transpose of A by A*.
Given two vectors x = (x_1, . . ., x_n) and y = (y_1, . . ., y_n), their sum x + y is the vector (x_1 + y_1, . . ., x_n + y_n), and ax is the vector (ax_1, . . ., ax_n), where a is a scalar. The sum (x, y) = x_1 y_1 + x_2 y_2 + . . . + x_n y_n is their scalar product. The length of x will be denoted by

    |x| = (x_1^2 + . . . + x_n^2)^{1/2} = (x, x)^{1/2}.

The Cauchy-Schwarz inequality states that for all x, y

    (x, y)^2 ≤ (x, x)(y, y)   or   |(x, y)| ≤ |x| |y|.          (2:2)
The matrix A and its transpose A* satisfy the relation

    (Ax, y) = (x, A*y).

If a_ij = a_ji, that is, if A = A*, then A is said to be symmetric. A matrix A is said to be positive definite in case (x, Ax) > 0 whenever x ≠ 0. If (x, Ax) ≥ 0 for all x, then A is said to be nonnegative.
⁴ E. Stiefel, Ueber einige Methoden der Relaxationsrechnung, Z. angew. Math. Physik 3 (1952).
⁵ M. R. Hestenes, Iterative methods for solving linear equations, NAML Report 52-9, National Bureau of Standards (1951).
⁶ E. Stiefel, Some special methods of relaxation techniques, to appear in the Proceedings of the symposium (see footnote 8).
⁷ J. B. Rosser, Rapidly converging iterative methods for solving linear equations, to appear in the Proceedings of the symposium (see footnote 8).
⁸ Symposium on simultaneous linear equations and the determination of eigenvalues, Institute for Numerical Analysis, National Bureau of Standards, held on the campus of the University of California at Los Angeles (August 23-25, 1951).
⁹ C. Lanczos, Solution of systems of linear equations by minimized iterations, NAML Report 52-13, National Bureau of Standards (1951).
¹⁰ C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, J. Research NBS 45, 255 (1950) RP2133; Proceedings of Second Symposium on Large-Scale Digital Calculating Machinery, Cambridge, Mass., 1951, pages 164-206.
If A is symmetric, then two vectors x and y are said to be conjugate or A-orthogonal if the relation (x, Ay) = (Ax, y) = 0 holds. This is an extension of the orthogonality relation (x, y) = 0.
By an eigenvalue of a matrix A is meant a number λ such that Ay = λy has a solution y ≠ 0, and y is called a corresponding eigenvector.
Unless otherwise expressly stated the matrix A, with which we are concerned, will be assumed to be symmetric and positive definite. Clearly no loss of generality is caused thereby from a theoretical point of view, because the system Ax = k is equivalent to the system Bx = l, where B = A*A, l = A*k. From a numerical point of view, the two systems are different, because of rounding-off errors that occur in forming the product A*A. Our applications to the nonsymmetric case do not involve the computation of A*A.
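As an illustration added in this transcription (it is not part of the original paper), the short numpy sketch below forms the symmetrized system B = A*A, l = A*k for a small nonsymmetric test matrix and checks that its solution agrees with that of Ax = k; it also evaluates an inner product (u, Bv), whose vanishing would express the A-orthogonality defined above. All names in the sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a small nonsingular test matrix
k = rng.standard_normal(n)

# Symmetrized system Bx = l with B = A*A (A* = transpose), l = A*k.
B = A.T @ A
l = A.T @ k

x_direct = np.linalg.solve(A, k)
x_normal = np.linalg.solve(B, l)
print(np.allclose(x_direct, x_normal))   # True: the two systems are equivalent

# Conjugacy (A-orthogonality) of two vectors u, v with respect to the symmetric
# positive definite matrix B: the inner product (u, Bv).
u, v = np.eye(n)[0], np.eye(n)[1]
print(abs(u @ B @ v))                    # zero would mean u and v are conjugate
```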
In the sequel we shall not have occasion to refer to a particular coordinate of a vector. Accordingly we may use subscripts to distinguish vectors instead of components. Thus x_0 will denote the vector (x_01, . . ., x_0n) and x_i the vector (x_i1, . . ., x_in). In case a symbol is to be interpreted as a component, we shall call attention to this fact unless the interpretation is evident from the context.
The solution of the system Ax = k will be denoted by h; that is, Ah = k. If x is an estimate of h, the difference r = k - Ax will be called the residual of x as an estimate of h. The quantity |r|^2 will be called the squared residual. The vector h - x will be called the error vector of x, as an estimate of h.
3. Method of Conjugate Gradients (cg-Method)
The present section will be devoted to a description
of a method of solving a system of linear equations
Ax=k. This method will be called the conjugate
gradient method or, more briefly, the cg-method, for
reasons which will unfold from the theory developed
in later sections. For the moment, we shall limit
ourselves to collecting in one place the basic formulas
upon which the method is based and to describing
briefly how these formulas are used.
The cg-method is an iterative method which terminates in at most n steps if no rounding-off errors are encountered. Starting with an initial estimate x_0 of the solution h, one determines successively new estimates x_0, x_1, x_2, . . . of h, the estimate x_i being closer to h than x_{i-1}. At each step the residual r_i = k - Ax_i is computed. Normally this vector can be used as a measure of the "goodness" of the estimate x_i. However, this measure is not a reliable one because, as will be seen in section 18, it is possible to construct cases in which the squared residual |r_i|^2 increases at each step (except for the last) while the length of the error vector |h - x_i| decreases monotonically. If no rounding-off error is encountered, one will reach an estimate x_m (m ≤ n) at which r_m = 0. This estimate is the desired solution h. Normally, m = n. However, since rounding-off errors always occur except under very unusual circumstances, the estimate x_n in general will not be the solution h but will be a good approximation of h. If the residual r_n is too large, one may continue with the iteration to obtain better estimates of h. Our experience indicates that frequently x_{n+1} is considerably better than x_n. One should not continue too far beyond x_n but should start anew with the last estimate obtained as the initial estimate, so as to diminish the effects of rounding-off errors. As a matter of fact one can start anew at any step one chooses. This flexibility is one of the principal advantages of the method.
In case the matrix A is symmetric and positive definite, the following formulas are used in the conjugate gradient method:
    p_0 = r_0 = k - Ax_0    (x_0 arbitrary),                   (3:1a)

    a_i = |r_i|^2 / (p_i, Ap_i),                               (3:1b)

    x_{i+1} = x_i + a_i p_i,                                   (3:1c)

    r_{i+1} = r_i - a_i Ap_i,                                  (3:1d)

    b_i = |r_{i+1}|^2 / |r_i|^2,                               (3:1e)

    p_{i+1} = r_{i+1} + b_i p_i.                               (3:1f)

In place of the formulas (3:1b) and (3:1e) one may use

    a_i = (p_i, r_i) / (p_i, Ap_i),                            (3:2a)

    b_i = -(r_{i+1}, Ap_i) / (p_i, Ap_i).                      (3:2b)

Although these formulas are slightly more complicated than those given in (3:1), they have the advantage that scale factors (introduced to increase accuracy) are more easily changed during the course of the computation.
The conjugate gradient method (cg-method) is given by the following steps:
Initial step: Select an estimate x_0 of h and compute the residual r_0 and the direction p_0 by formulas (3:1a).
General routine: Having determined the estimate x_i of h, the residual r_i, and the direction p_i, compute x_{i+1}, r_{i+1}, and p_{i+1} by formulas (3:1b), . . ., (3:1f) successively.
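The following numpy sketch, added in this transcription and not part of the original paper, carries out the routine exactly as written, applying formulas (3:1a) to (3:1f) in turn; the variable names mirror the notation of the text, and the variants (3:2a), (3:2b) could be substituted for (3:1b) and (3:1e) without other changes.

```python
import numpy as np

def cg(A, k, x0, tol=1e-12):
    """Conjugate gradient routine following formulas (3:1a)-(3:1f).

    A is assumed symmetric and positive definite; x0 is the initial estimate.
    """
    x = x0.astype(float)
    r = k - A @ x           # (3:1a): r_0 = k - A x_0
    p = r.copy()            # (3:1a): p_0 = r_0
    for _ in range(len(k)):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)         # (3:1b)
        x = x + a * p                  # (3:1c)
        r_new = r - a * Ap             # (3:1d)
        b = (r_new @ r_new) / (r @ r)  # (3:1e)
        p = r_new + b * p              # (3:1f)
        r = r_new
        if np.linalg.norm(r) < tol:    # stop once the residual is negligible
            break
    return x

# Small positive definite test problem.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M.T @ M + 5 * np.eye(5)
k = rng.standard_normal(5)
x = cg(A, k, np.zeros(5))
print(np.allclose(A @ x, k))   # True: the solution is reached in at most n steps
```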
As will be seen in section 5, the residuals r_0, r_1, . . . are mutually orthogonal, and the direction vectors p_0, p_1, . . . are mutually conjugate, that is,

    (r_i, r_j) = 0,    (p_i, Ap_j) = 0    (i ≠ j).             (3:3)
These relations can be used as checks.
Once one has obtained the set of n mutually conjugate vectors p_0, . . ., p_{n-1}, the solution of

    Ax = k'                                                    (3:4)

can be obtained by the formula

    x = Σ_{i=0}^{n-1} [(p_i, k') / (p_i, Ap_i)] p_i.           (3:5)

It follows that, if we denote by p_ij the jth component of p_i, then

    Σ_{i=0}^{n-1} p_ij p_ik / (p_i, Ap_i)

is the element in the jth row and kth column of the inverse A^{-1} of A.
There are two objections to the use of formula
(3:5).
First, contrary to the procedure of the
general routine (3:1), this would require the storage
of the vectors p_0, p_1, . . . . This is impractical,
, p\, . . . . This is impractical,
particularly in large systems. Second, the results
obtained by this method are much more influenced
by rounding-off errors than those obtained by the
step-by-step routine (3:1).
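To make formula (3:5) concrete, the sketch below (an addition of this transcription) takes the eigenvectors of a small symmetric matrix as a ready-made conjugate set p_0, . . ., p_{n-1}, recovers the solution of Ax = k' from (3:5), and checks that the associated expression Σ_i p_ij p_ik/(p_i, Ap_i) reproduces the entries of A^{-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)        # symmetric positive definite
kp = rng.standard_normal(n)        # the vector k' of formula (3:4)

# Eigenvectors of a symmetric matrix are mutually conjugate, so they serve as
# a convenient set p_0, ..., p_{n-1} for checking (3:5).
_, P = np.linalg.eigh(A)
p = [P[:, i] for i in range(n)]

# Formula (3:5): x = sum_i (p_i, k')/(p_i, A p_i) p_i solves A x = k'.
x = sum((pi @ kp) / (pi @ A @ pi) * pi for pi in p)
print(np.allclose(A @ x, kp))      # True

# The matrix with entries sum_i p_ij p_ik/(p_i, A p_i) is the inverse of A.
A_inv = sum(np.outer(pi, pi) / (pi @ A @ pi) for pi in p)
print(np.allclose(A_inv, np.linalg.inv(A)))   # True
```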
In the cg-method the error vector h - x is diminished in length at each step. The quantity f(x) = (h - x, A(h - x)), called the error function, is also diminished at each step. But the squared residual |r|^2 = |k - Ax|^2 normally oscillates and may even increase. There
is a modification of the cg-method where all three
quantities diminish at each step. This modification
is given in section 7. It has an advantage and a
disadvantage. Its disadvantage is that the error
vector in each step is longer than in the original
method. Moreover, the computation is complicated,
since it is a routine superimposed upon the original
one.
However, in the special case where the given
linear equation system arises from a difference
approximation of a boundary-value problem, it can
be shown that the estimates are smoother in the
modified method than in the original. This may be
an advantage if the desired solution is to be differentiated afterwards.
Concurrently with the solution of a given linear system, characteristic roots of its matrix may be obtained: compute the values of the polynomials R_0, R_1, . . . and P_0, P_1, . . . at λ by the iteration

    R_0 = P_0 = 1,
    R_{i+1} = R_i - a_i λ P_i,                                 (3:6)
    P_{i+1} = R_{i+1} + b_i P_i.

The last polynomial R_m(λ) is a factor of the characteristic polynomial of A and coincides with it when m = n. The characteristic roots, which are the zeros of R_m(λ), can be found by Newton's method without actually computing the polynomial R_m(λ) itself. One uses the formulas

    λ_{k+1} = λ_k - R_m(λ_k) / R'_m(λ_k),                      (3:7)

where R_m(λ_k), R'_m(λ_k) are determined by the iteration (3:6) and

    R'_0 = P'_0 = 0,
    R'_{i+1} = R'_i - λ a_i P'_i - a_i P_i,
    P'_{i+1} = R'_{i+1} + b_i P'_i,

with λ = λ_k. In this connection, it is of interest to observe that if m = n, the determinant of A is given by the formula

    det(A) = 1 / (a_0 a_1 . . . a_{n-1}).
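The sketch below, added in this transcription, records the scalars a_i, b_i produced by the cg routine and reuses them in the iteration (3:6) and the Newton step (3:7) as reconstructed above; since parts of (3:6), (3:7) and the determinant formula were recovered from a damaged passage, the sketch should be read as a plausible illustration rather than a transcript of the authors' procedure.

```python
import numpy as np

def cg_coefficients(A, k, x0):
    """Run the cg-method (3:1) and record the scalars a_i, b_i."""
    x, r = x0.astype(float), k - A @ x0
    p = r.copy()
    a_list, b_list = [], []
    for _ in range(len(k)):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)
        x = x + a * p
        r_new = r - a * Ap
        b = (r_new @ r_new) / (r @ r)
        a_list.append(a)
        b_list.append(b)
        r, p = r_new, r_new + b * p
    return np.array(a_list), np.array(b_list)

def R_and_derivative(lam, a, b):
    """Evaluate R_m(lam) and R'_m(lam) by the iteration (3:6) and its derivative."""
    R, P, dR, dP = 1.0, 1.0, 0.0, 0.0
    for ai, bi in zip(a, b):
        R_next = R - ai * lam * P
        dR_next = dR - lam * ai * dP - ai * P
        P, dP = R_next + bi * P, dR_next + bi * dP
        R, dR = R_next, dR_next
    return R, dR

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)
a, b = cg_coefficients(A, k, np.zeros(4))

# det(A) = 1/(a_0 a_1 ... a_{n-1}) when m = n.
print(np.isclose(np.linalg.det(A), 1.0 / np.prod(a)))

# Newton's method (3:7), started above the largest root, finds the largest eigenvalue.
lam = np.linalg.norm(A, 1)          # an upper bound for the eigenvalues
for _ in range(50):
    R, dR = R_and_derivative(lam, a, b)
    lam = lam - R / dR
print(np.isclose(lam, np.linalg.eigvalsh(A).max()))
```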
The cg-method can be extended to the case in which A is a general nonsymmetric and nonsingular matrix. In this case one replaces eq (3:1) by the set

    r_0 = k - Ax_0,    p_0 = A*r_0,
    a_i = |A*r_i|^2 / |Ap_i|^2,
    x_{i+1} = x_i + a_i p_i,                                   (3:8)
    r_{i+1} = r_i - a_i Ap_i,
    b_i = |A*r_{i+1}|^2 / |A*r_i|^2,
    p_{i+1} = A*r_{i+1} + b_i p_i.
This system is discussed in section 10.
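A sketch of the nonsymmetric extension, added in this transcription and based on the set (3:8) as reconstructed above (cg applied, in effect, to the system A*Ax = A*k without forming A*A explicitly):

```python
import numpy as np

def cg_nonsymmetric(A, k, x0, steps=None):
    """Sketch of the extension (3:8) for a general nonsingular A."""
    x = x0.astype(float)
    r = k - A @ x
    s = A.T @ r              # A* r_0
    p = s.copy()             # p_0 = A* r_0
    for _ in range(steps or len(k)):
        Ap = A @ p
        a = (s @ s) / (Ap @ Ap)      # a_i = |A*r_i|^2 / |Ap_i|^2
        x = x + a * p
        r = r - a * Ap
        s_new = A.T @ r
        b = (s_new @ s_new) / (s @ s)
        p = s_new + b * p            # p_{i+1} = A*r_{i+1} + b_i p_i
        s = s_new
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # nonsymmetric, nonsingular
k = rng.standard_normal(5)
x = cg_nonsymmetric(A, k, np.zeros(5))
print(np.allclose(A @ x, k))    # True after n steps, apart from rounding
```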
4. Method of Conjugate Directions (cd-Method)¹
The cg-method can be considered as a special case of a general method, which we shall call the method of conjugate directions or more briefly the cd-method. In this method, the vectors p_0, p_1, . . . are selected to be mutually conjugate but have no further restrictions. It consists of the following routine:
Initial step. Select an estimate x_0 of h (the solution), compute the residual r_0 = k - Ax_0, and choose a direction p_0.
General routine. Having obtained the estimate x_i of h, the residual r_i = k - Ax_i and the direction p_i, compute the new estimate x_{i+1} and its residual r_{i+1} by the formulas

    a_i = (p_i, r_i) / (p_i, Ap_i),                            (4:1a)

    x_{i+1} = x_i + a_i p_i,                                   (4:1b)

    r_{i+1} = r_i - a_i Ap_i.                                  (4:1c)
1
This method was presented from a different point of view by Fox, Huskey, and
Wilkinson on p. 149 of a paper entitled "Notes on the solution of algebraic linear
simultaneous equations," Quarterly Journal of Mechanics and Applied Mathe-
matics c. 2, 149-173 (1948).
Next select a direction p_{i+1} conjugate to p_0, . . ., p_i, that is, such that

    (p_{i+1}, Ap_j) = 0,    j = 0, 1, . . ., i.                (4:2)
In a sense the cd-method is not precise, in that no formulas are given for the computation of the directions p_0, p_1, . . . . Various formulas can be given, each leading to a special method. The formula (3:1f) leads to the cg-method. It will be seen in section 12 that the case in which the p's are obtained by an A-orthogonalization of the basic vectors (1,0, . . ., 0), (0,1,0, . . .), . . . leads essentially to the Gauss elimination method.
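The following sketch, added in this transcription, builds such a conjugate set by A-orthogonalizing the unit vectors (a Gram-Schmidt process in the inner product (x, Ay)) and then runs the routine (4:1); it illustrates that the directions may be fixed in advance, independently of the residuals.

```python
import numpy as np

def a_orthogonalize_basis(A):
    """A-orthogonalize the unit vectors e_1, ..., e_n, giving mutually
    conjugate directions p_0, ..., p_{n-1}."""
    n = A.shape[0]
    directions = []
    for j in range(n):
        p = np.eye(n)[j]
        for q in directions:
            p = p - (q @ A @ p) / (q @ A @ q) * q   # remove the component conjugate to q
        directions.append(p)
    return directions

def cd_method(A, k, x0, directions):
    """The cd-method routine (4:1) with a prescribed conjugate set."""
    x = x0.astype(float)
    r = k - A @ x
    for p in directions:
        a = (p @ r) / (p @ A @ p)   # (4:1a)
        x = x + a * p               # (4:1b)
        r = r - a * (A @ p)         # (4:1c)
    return x

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)
ps = a_orthogonalize_basis(A)
x = cd_method(A, k, np.zeros(4), ps)
print(np.allclose(A @ x, k))        # True: an n-step method, as Theorem 4:2 states
```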
The basic properties of the cd-method are given by
the following theorems.
Theorem 4:1. The direction vectors p_0, p_1, . . . are mutually conjugate. The residual vector r_i is orthogonal to p_0, p_1, . . ., p_{i-1}. The inner product of p_i with each of the residuals r_0, r_1, . . ., r_i is the same. That is,

    (p_i, Ap_j) = 0    (i ≠ j),                                (4:3a)

    (p_i, r_j) = 0     (i < j),                                (4:3b)

    (p_i, r_0) = (p_i, r_1) = . . . = (p_i, r_i).              (4:3c)

Moreover, the scalar a_i can be given by the formula

    a_i = (p_i, r_0) / (p_i, Ap_i)                             (4:4)

in place of (4:1a).
Equation (4:3a) follows from (4:2). Using (4:1c), we find that

    (p_j, r_{k+1}) = (p_j, r_k) - a_k (p_j, Ap_k).

If j = k we have, by (4:1a), (p_k, r_{k+1}) = 0. Moreover, by (4:3a),

    (p_j, r_{k+1}) = (p_j, r_k)    (j ≠ k).

Equations (4:3b) and (4:3c) follow from these relations. The formula (4:4) follows from (4:3c) and (4:1a).
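A small numerical check of Theorem 4:1, added in this transcription: a cg run on a random positive definite system is recorded, and the relations (4:3a)-(4:3c) are evaluated directly; all printed quantities are at rounding-error level.

```python
import numpy as np

# Run the cg-method on a small problem, keeping the p_i and r_i so that the
# relations (4:3a)-(4:3c) and (4:4) of Theorem 4:1 can be checked numerically.
rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)

x = np.zeros(4)
r = k - A @ x
p = r.copy()
ps, rs = [p.copy()], [r.copy()]
for _ in range(4):
    Ap = A @ p
    a = (r @ r) / (p @ Ap)
    x = x + a * p
    r = r - a * Ap
    b = (r @ r) / (rs[-1] @ rs[-1])
    p = r + b * p
    ps.append(p.copy())
    rs.append(r.copy())

# (4:3a): mutual conjugacy of the directions.
print(max(abs(ps[i] @ A @ ps[j]) for i in range(4) for j in range(4) if i != j))
# (4:3b): r_j is orthogonal to the earlier directions p_i, i < j.
print(max(abs(ps[i] @ rs[j]) for j in range(1, 4) for i in range(j)))
# (4:3c)/(4:4): (p_i, r_0) = (p_i, r_i), so a_i = (p_i, r_0)/(p_i, Ap_i).
print(max(abs(ps[i] @ rs[0] - ps[i] @ rs[i]) for i in range(4)))
```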
As a consequence of (4:4) the estimates x_1, x_2, . . . of h can be computed without computing the residuals r_0, r_1, . . ., provided that the choice of the direction vectors p_0, p_1, . . . is independent of these residuals.
Theorem 4:2. The cd-method is an m-step method (m ≤ n) in the sense that at the mth step the estimate x_m is the desired solution h.
For let m be the first integer such that y_0 = h - x_0 is in the subspace spanned by p_0, . . ., p_{m-1}. Clearly, m ≤ n, since the vectors p_0, p_1, . . . are linearly independent. We may, accordingly, choose scalars α_0, . . ., α_{m-1} such that

    y_0 = α_0 p_0 + . . . + α_{m-1} p_{m-1}.

Hence,

    h = x_0 + α_0 p_0 + . . . + α_{m-1} p_{m-1}.

Moreover,

    r_0 = k - Ax_0 = A(h - x_0) = α_0 Ap_0 + . . . + α_{m-1} Ap_{m-1}.

Computing the inner product (p_i, r_0) we find by (4:3a) and (4:4) that α_i = a_i, and hence that h = x_m, as was to be proved.
The cd-method can be looked upon as a relaxation method. In order to establish this result, we introduce the function

    f(x) = (h - x, A(h - x)) = (x, Ax) - 2(k, x) + (h, k).     (4:5)

Clearly, f(x) ≥ 0 and f(x) = 0 if, and only if, x = h. The function f(x) can be used as a measure of the "goodness" of x as an estimate of h. Since it plays an important role in our considerations, it will be referred to as the error function. If p is a direction vector, we have the useful relation

    f(x + ap) = f(x) - 2a(p, r) + a^2 (p, Ap),                 (4:6)

where r = k - Ax = A(h - x), as one readily verifies by substitution. Considered as a function of a, the function f(x + ap) has a minimum value at a = â, where

    â = (p, r) / (p, Ap).                                      (4:7)

This minimum value differs from f(x) by the quantity

    f(x) - f(x + âp) = â^2 (p, Ap) = (p, r)^2 / (p, Ap).       (4:8)
Comparing (4:7) with (4:1a), we obtain the first two
sentences of the following result:
Theorem 4:3. The point x_i minimizes f(x) on the line x = x_{i-1} + a p_{i-1}. At the i-th step the error f(x_{i-1}) is relaxed by the amount

    f(x_{i-1}) - f(x_i) = (p_{i-1}, r_{i-1})^2 / (p_{i-1}, Ap_{i-1}).     (4:9)

In fact, the point x_i minimizes f(x) on the i-dimensional plane P_i of points

    x = x_0 + α_0 p_0 + . . . + α_{i-1} p_{i-1},               (4:10)

where α_0, . . ., α_{i-1} are parameters. This plane contains the points x_0, x_1, . . ., x_i.
In view of this result the cd-method is a method of relaxation of the error function f(x). An iteration of the routine may accordingly be referred to as a relaxation.
In order to prove the third sentence of the theorem observe that at the point (4:10)

    f(x) = f(x_0) - Σ_{j=0}^{i-1} [2α_j (p_j, r_0) - α_j^2 (p_j, Ap_j)].

At the minimum point we have

    ∂f/∂α_j = -2(p_j, r_0) + 2α_j (p_j, Ap_j) = 0,

and hence α_j = a_j, by (4:4). The minimum point is accordingly the point x_i, as was to be proved.
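The relaxation identities (4:7) and (4:8), on which Theorem 4:3 rests, can be checked directly; the sketch below is an addition of this transcription, using an arbitrary estimate x and direction p.

```python
import numpy as np

# A numerical check of (4:7)-(4:8): stepping from x along a direction p with the
# optimal a lowers the error function f by exactly (p, r)^2/(p, Ap).
rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)
h = np.linalg.solve(A, k)               # the true solution, used only to evaluate f

def f(x):                               # error function (4:5)
    return (h - x) @ A @ (h - x)

x = rng.standard_normal(4)              # an arbitrary estimate
p = rng.standard_normal(4)              # an arbitrary direction
r = k - A @ x
a_hat = (p @ r) / (p @ A @ p)           # (4:7)

drop = f(x) - f(x + a_hat * p)
print(np.isclose(drop, (p @ r) ** 2 / (p @ A @ p)))   # (4:8) holds
# And a_hat is the minimizer along the line: nearby values of a give a larger f.
print(f(x + (a_hat + 0.1) * p) > f(x + a_hat * p))
```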
Geometrically, the equation f(x) = const. defines an ellipsoid of dimension n-1. The point at which f(x) has a minimum is the center of the ellipsoid and is the solution of Ax = k. The i-dimensional plane P_i, described in the last theorem, cuts the ellipsoid f(x) = f(x_0) in an ellipsoid E_i of dimension i-1, unless E_i is the point x_0 itself. (In the cg-method, E_i is never degenerate, unless x_0 = h.) The point x_i is the center of E_i. Hence we have the corollary:
Corollary 1. The point x_i is the center of the (i-1)-dimensional ellipsoid in which the i-dimensional plane P_i intersects the (n-1)-dimensional ellipsoid f(x) = f(x_0).
Although the function f(x) is the fundamental error function which decreases at each step of the relaxation, one is unable to compute f(x) without knowing the solution h we are seeking. In order to obtain an estimate of the magnitude of f(x) we may use the following:
Theorem 4:4. The error vector y = h - x, the residual r = k - Ax, and the error function f(x) satisfy the relations

    f(x) = (y, Ay) = (r, y) = (r, A^{-1}r),                    (4:11)

    μ(y) |y|^2 = f(x) ≤ |y| |r|,                               (4:12)

where μ(z) is the Rayleigh quotient

    μ(z) = (z, Az) / (z, z).

The Rayleigh quotient of the error vector y does not exceed that of the residual r, that is,

    μ(y) ≤ μ(r).                                               (4:13)

Moreover,

    |r|^2 / μ(r) ≤ f(x) ≤ |r|^2 / μ(y).                        (4:14)
The proof of this result is based on the Schwarzian quotients

    (z, Az)/(z, z) ≤ (Az, Az)/(z, Az) ≤ (Az, A^2 z)/(Az, Az).  (4:15)

The first of these follows from the inequality of Schwarz

    (p, q)^2 ≤ (p, p)(q, q)                                    (4:16)

by choosing p = z, q = Az. The second is obtained by selecting p = Bz, q = B^3 z, where B^2 = A.
In order to prove theorem 4:4 recall that if we set y = h - x, then

    r = k - Ax = A(h - x) = Ay,    f(x) = (y, Ay)

by (4:5). Using the inequalities (4:15) with z = y, we see that

    μ(y) = (y, Ay)/(y, y) ≤ (Ay, Ay)/(y, Ay) = |r|^2/f(x) ≤ (Ay, A^2 y)/(Ay, Ay) = (r, Ar)/(r, r) = μ(r).
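A numerical illustration of Theorem 4:4, added in this transcription; it checks the identities (4:11), the inequality (4:13), and the chain of Schwarzian quotients (4:15) with z = y.

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((5, 5))
A = M.T @ M + 5 * np.eye(5)
k = rng.standard_normal(5)
h = np.linalg.solve(A, k)

x = rng.standard_normal(5)     # an arbitrary estimate of h
y = h - x                      # error vector
r = k - A @ x                  # residual

mu = lambda z: (z @ A @ z) / (z @ z)   # Rayleigh quotient

f = y @ A @ y
# (4:11): f(x) = (y, Ay) = (r, y) = (r, A^{-1} r).
print(np.isclose(f, r @ y), np.isclose(f, r @ np.linalg.solve(A, r)))
# (4:13): the Rayleigh quotient of y does not exceed that of r.
print(mu(y) <= mu(r))
# (4:15) with z = y: (y,Ay)/(y,y) <= (Ay,Ay)/(y,Ay) <= (Ay,A^2 y)/(Ay,Ay).
Ay = A @ y
print(mu(y) <= (Ay @ Ay) / (y @ Ay) <= (Ay @ A @ Ay) / (Ay @ Ay))
```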