
Journal of Research of the National Bureau of Standards
Vol. 49, No. 6, December 1952 Research Paper 2379
Methods of Conjugate Gradients for Solving Linear Systems¹

Magnus R. Hestenes² and Eduard Stiefel³

An iterative algorithm is given for solving a system Ax = k of n linear equations in n unknowns. The solution is given in n steps. It is shown that this method is a special case of a very general method which also includes Gaussian elimination. These general algorithms are essentially algorithms for finding an n-dimensional ellipsoid. Connections are made with the theory of orthogonal polynomials and continued fractions.
1. Introduction
One of the major problems in machine computations is to find an effective method of solving a
system of n simultaneous equations in n unknowns,
particularly if n is large. There is, of course, no
best method for all problems because the goodness
of a method depends to some extent upon the
particular system to be solved. In judging the
goodness of a method for machine computations, one
should bear in mind that criteria for a good machine
method may be different from those for a hand
method. By a hand method, we shall mean one
in which a desk calculator may be used. By a
machine method, we shall mean one in which
sequence-controlled machines are used.
A machine method should have the following
properties:
(1) The method should be simple, composed of a
repetition of elementary routines requiring a minimum of storage space.
(2) The method should insure rapid convergence
if the number of steps required for the solution is
infinite. A method which—if no rounding-off errors
occur—will yield the solution in a finite number of
steps is to be preferred.
(3) The procedure should be stable with respect
to rounding-off errors. If needed, a subroutine
should be available to insure this stability. It
should be possible to diminish rounding-off errors
by a repetition of the same routine, starting with
the previous result as the new estimate of the
solution.
(4) Each step should give information about the
solution and should yield a new and better estimate
than the previous one.
(5) As many of the original data as possible should
be used during each step of the routine. Special
properties of the given linear system—such as having
many vanishing coefficients—should be preserved.
(For example, in the Gauss elimination special
properties of this type may be destroyed.)
In our opinion there are two methods that best fit these criteria, namely, (a) the Gauss elimination method; (b) the conjugate gradient method presented in the present monograph.

¹ This work was performed on a National Bureau of Standards contract with the University of California at Los Angeles, and was sponsored (in part) by the Office of Naval Research, United States Navy.
² National Bureau of Standards and University of California at Los Angeles.
³ University of California at Los Angeles, and Eidgenossische Technische Hochschule, Zurich, Switzerland.
There are many variations of the elimination
method, just as there are many variations of the
conjugate gradient method here presented. In the
present paper it will be shown that both methods
are special cases of a method that we call the method
of conjugate directions. This enables one to compare the two methods from a theoretical point of
view.
In our opinion, the conjugate gradient method is
superior to the elimination method as a machine
method. Our reasons can be stated as follows:
(a) Like the Gauss elimination method, the method
of conjugate gradients gives the solution in n steps if
no rounding-off error occurs.
(b) The conjugate gradient method is simpler to
code and requires less storage space.
(c) The given matrix is unaltered during the process, so that a maximum of the original data is used.
The advantage of having many zeros in the matrix
is preserved. The method is, therefore, especially
suited to handle linear systems arising from difference
equations approximating boundary value problems.
(d) At each step an estimate of the solution is
given, which is an improvement over the one given in
the preceding step.
(e) At any step one can start anew by a very
simple device, keeping the estimate last obtained as
the initial estimate.
In the present paper, the conjugate gradient routines are developed for the symmetric and nonsymmetric cases. The principal results are described
in section 3. For most of the theoretical considerations, we restrict ourselves to the positive definite
symmetric case. No generality is lost thereby. We
deal only with real matrices. The extension to
complex matrices is simple.
The method of conjugate gradients was developed
independently by E. Stiefel of the Institute of Applied
Mathematics at Zurich and by M. R. Hestenes with
the cooperation of J. B. Rosser, G. Forsythe, and
L. Paige of the Institute for Numerical Analysis,
National Bureau of Standards. The present account
was prepared jointly by M. R. Hestenes and E.
Stiefel during the latter's stay at the National Bureau
of Standards. The first papers on this method were
409

given by E. Stiefel
4
and by M. R. Hestenes.
5
Reports
on this method were given by E. Stiefel
6
and J. B.
Rosser
7
at a Symposium
8
on August 23—25, 1951.
Recently, C. Lanczos
9
developed a closely related
routine based on his earlier paper on eigenvalue
problem.
10
Examples and numerical tests of the
method have been by R. Hayes, U. Hochstrasser,
and M. Stein.
2. Notations and Terminology
Throughout the following pages we shall be concerned with the problem of solving a system of linear equations

    a_11 x_1 + a_12 x_2 + . . . + a_1n x_n = k_1
    a_21 x_1 + a_22 x_2 + . . . + a_2n x_n = k_2
    . . . . . . . . . . . . . . . . . . . . . . .          (2:1)
    a_n1 x_1 + a_n2 x_2 + . . . + a_nn x_n = k_n

These equations will be written in the vector form Ax = k. Here A is the matrix of coefficients (a_ij), x and k are the vectors (x_1, . . ., x_n) and (k_1, . . ., k_n). It is assumed that A is nonsingular. Its inverse A^{-1} therefore exists. We denote the transpose of A by A*.
Given two vectors x = (x_1, . . ., x_n) and y = (y_1, . . ., y_n), their sum x + y is the vector (x_1 + y_1, . . ., x_n + y_n), and ax is the vector (ax_1, . . ., ax_n), where a is a scalar. The sum (x, y) = x_1 y_1 + x_2 y_2 + . . . + x_n y_n is their scalar product. The length of x will be denoted by

    |x| = (x_1^2 + . . . + x_n^2)^{1/2} = (x, x)^{1/2}.

The Cauchy-Schwarz inequality states that for all x, y

    (x, y)^2 ≤ (x, x)(y, y)   or   |(x, y)| ≤ |x| |y|.          (2:2)
The matrix A and its transpose A* satisfy the relation

    (Ax, y) = (x, A*y).

If a_ij = a_ji, that is, if A = A*, then A is said to be symmetric. A matrix A is said to be positive definite in case (x, Ax) > 0 whenever x ≠ 0. If (x, Ax) ≥ 0 for all x, then A is said to be nonnegative.
⁴ E. Stiefel, Ueber einige Methoden der Relaxationsrechnung, Z. angew. Math. Physik 3 (1952).
⁵ M. R. Hestenes, Iterative methods for solving linear equations, NAML Report 52-9, National Bureau of Standards (1951).
⁶ E. Stiefel, Some special methods of relaxation techniques, to appear in the Proceedings of the symposium (see footnote 8).
⁷ J. B. Rosser, Rapidly converging iterative methods for solving linear equations, to appear in the Proceedings of the symposium (see footnote 8).
⁸ Symposium on simultaneous linear equations and the determination of eigenvalues, Institute for Numerical Analysis, National Bureau of Standards, held on the campus of the University of California at Los Angeles (August 23-25, 1951).
⁹ C. Lanczos, Solution of systems of linear equations by minimized iterations, NAML Report 52-13, National Bureau of Standards (1951).
¹⁰ C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, J. Research NBS 45, 255 (1950) RP2133; Proceedings of Second Symposium on Large-Scale Digital Calculating Machinery, Cambridge, Mass., 1951, pages 164-206.
If A is symmetric, then two vectors x and y are said to be conjugate or A-orthogonal if the relation (x, Ay) = (Ax, y) = 0 holds. This is an extension of the orthogonality relation (x, y) = 0.
By an eigenvalue of a matrix A is meant a number λ such that Ay = λy has a solution y ≠ 0, and y is called a corresponding eigenvector.
Unless otherwise expressly stated the matrix A, with which we are concerned, will be assumed to be symmetric and positive definite. Clearly no loss of generality is caused thereby from a theoretical point of view, because the system Ax = k is equivalent to the system Bx = l, where B = A*A, l = A*k. From a numerical point of view, the two systems are different, because of rounding-off errors that occur in forming the product A*A. Our applications to the nonsymmetric case do not involve the computation of A*A.
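As an illustration added in this transcription (it is not part of the original paper), the short numpy sketch below forms the symmetrized system B = A*A, l = A*k for a small nonsymmetric test matrix and checks that its solution agrees with that of Ax = k; it also evaluates an inner product (u, Bv), whose vanishing would express the A-orthogonality defined above. All names in the sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a small nonsingular test matrix
k = rng.standard_normal(n)

# Symmetrized system Bx = l with B = A*A (A* = transpose), l = A*k.
B = A.T @ A
l = A.T @ k

x_direct = np.linalg.solve(A, k)
x_normal = np.linalg.solve(B, l)
print(np.allclose(x_direct, x_normal))   # True: the two systems are equivalent

# Conjugacy (A-orthogonality) of two vectors u, v with respect to the symmetric
# positive definite matrix B: the inner product (u, Bv).
u, v = np.eye(n)[0], np.eye(n)[1]
print(abs(u @ B @ v))                    # zero would mean u and v are conjugate
```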
In the sequel we shall not have occasion to refer to a particular coordinate of a vector. Accordingly we may use subscripts to distinguish vectors instead of components. Thus x_0 will denote the vector (x_01, . . ., x_0n) and x_i the vector (x_i1, . . ., x_in). In case a symbol is to be interpreted as a component, we shall call attention to this fact unless the interpretation is evident from the context.
The solution of the system Ax = k will be denoted by h; that is, Ah = k. If x is an estimate of h, the difference r = k - Ax will be called the residual of x as an estimate of h. The quantity |r|^2 will be called the squared residual. The vector h - x will be called the error vector of x, as an estimate of h.
3. Method of Conjugate Gradients (cg-Method)
The present section will be devoted to a description
of a method of solving a system of linear equations
Ax=k. This method will be called the conjugate
gradient method or, more briefly, the cg-method, for
reasons which will unfold from the theory developed
in later sections. For the moment, we shall limit
ourselves to collecting in one place the basic formulas
upon which the method is based and to describing
briefly how these formulas are used.
The cg-method is an iterative method which terminates in at most n steps if no rounding-off errors are encountered. Starting with an initial estimate x_0 of the solution h, one determines successively new estimates x_0, x_1, x_2, . . . of h, the estimate x_i being closer to h than x_{i-1}. At each step the residual r_i = k - Ax_i is computed. Normally this vector can be used as a measure of the "goodness" of the estimate x_i. However, this measure is not a reliable one because, as will be seen in section 18, it is possible to construct cases in which the squared residual |r_i|^2 increases at each step (except for the last) while the length of the error vector |h - x_i| decreases monotonically. If no rounding-off error is encountered, one will reach an estimate x_m (m ≤ n) at which r_m = 0. This estimate is the desired solution h. Normally, m = n. However, since rounding-off errors always occur except under very unusual circumstances, the estimate x_n in general will not be the solution h but will be a good approximation of h. If the residual r_n is too large, one may continue with the iteration to obtain better estimates of h. Our experience indicates that frequently x_{n+1} is considerably better than x_n. One should not continue too far beyond x_n but should start anew with the last estimate obtained as the initial estimate, so as to diminish the effects of rounding-off errors. As a matter of fact one can start anew at any step one chooses. This flexibility is one of the principal advantages of the method.
In case the matrix A is symmetric and positive definite, the following formulas are used in the conjugate gradient method:
    p_0 = r_0 = k - Ax_0    (x_0 arbitrary),                   (3:1a)

    a_i = |r_i|^2 / (p_i, Ap_i),                               (3:1b)

    x_{i+1} = x_i + a_i p_i,                                   (3:1c)

    r_{i+1} = r_i - a_i Ap_i,                                  (3:1d)

    b_i = |r_{i+1}|^2 / |r_i|^2,                               (3:1e)

    p_{i+1} = r_{i+1} + b_i p_i.                               (3:1f)

In place of the formulas (3:1b) and (3:1e) one may use

    a_i = (p_i, r_i) / (p_i, Ap_i),                            (3:2a)

    b_i = -(r_{i+1}, Ap_i) / (p_i, Ap_i).                      (3:2b)

Although these formulas are slightly more complicated than those given in (3:1), they have the advantage that scale factors (introduced to increase accuracy) are more easily changed during the course of the computation.
The conjugate gradient method (cg-method) is given by the following steps:
Initial step: Select an estimate x_0 of h and compute the residual r_0 and the direction p_0 by formulas (3:1a).
General routine: Having determined the estimate x_i of h, the residual r_i, and the direction p_i, compute x_{i+1}, r_{i+1}, and p_{i+1} by formulas (3:1b), . . ., (3:1f) successively.
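The following numpy sketch, added in this transcription and not part of the original paper, carries out the routine exactly as written, applying formulas (3:1a) to (3:1f) in turn; the variable names mirror the notation of the text, and the variants (3:2a), (3:2b) could be substituted for (3:1b) and (3:1e) without other changes.

```python
import numpy as np

def cg(A, k, x0, tol=1e-12):
    """Conjugate gradient routine following formulas (3:1a)-(3:1f).

    A is assumed symmetric and positive definite; x0 is the initial estimate.
    """
    x = x0.astype(float)
    r = k - A @ x           # (3:1a): r_0 = k - A x_0
    p = r.copy()            # (3:1a): p_0 = r_0
    for _ in range(len(k)):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)         # (3:1b)
        x = x + a * p                  # (3:1c)
        r_new = r - a * Ap             # (3:1d)
        b = (r_new @ r_new) / (r @ r)  # (3:1e)
        p = r_new + b * p              # (3:1f)
        r = r_new
        if np.linalg.norm(r) < tol:    # stop once the residual is negligible
            break
    return x

# Small positive definite test problem.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M.T @ M + 5 * np.eye(5)
k = rng.standard_normal(5)
x = cg(A, k, np.zeros(5))
print(np.allclose(A @ x, k))   # True: the solution is reached in at most n steps
```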
As will be seen in section 5, the residuals r_0, r_1, . . . are mutually orthogonal, and the direction vectors p_0, p_1, . . . are mutually conjugate, that is,

    (r_i, r_j) = 0,    (p_i, Ap_j) = 0    (i ≠ j).             (3:3)
These relations can be used as checks.
Once one has obtained the set of n mutually conjugate vectors p_0, . . ., p_{n-1}, the solution of

    Ax = k'                                                    (3:4)

can be obtained by the formula

    x = Σ_{i=0}^{n-1} [(p_i, k') / (p_i, Ap_i)] p_i.           (3:5)

It follows that, if we denote by p_ij the jth component of p_i, then

    Σ_{i=0}^{n-1} p_ij p_ik / (p_i, Ap_i)

is the element in the jth row and kth column of the inverse A^{-1} of A.
There are two objections to the use of formula
(3:5).
First, contrary to the procedure of the
general routine (3:1), this would require the storage
of the vectors p_0, p_1, . . . . This is impractical,
, p\, . . . . This is impractical,
particularly in large systems. Second, the results
obtained by this method are much more influenced
by rounding-off errors than those obtained by the
step-by-step routine (3:1).
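To make formula (3:5) concrete, the sketch below (an addition of this transcription) takes the eigenvectors of a small symmetric matrix as a ready-made conjugate set p_0, . . ., p_{n-1}, recovers the solution of Ax = k' from (3:5), and checks that the associated expression Σ_i p_ij p_ik/(p_i, Ap_i) reproduces the entries of A^{-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)        # symmetric positive definite
kp = rng.standard_normal(n)        # the vector k' of formula (3:4)

# Eigenvectors of a symmetric matrix are mutually conjugate, so they serve as
# a convenient set p_0, ..., p_{n-1} for checking (3:5).
_, P = np.linalg.eigh(A)
p = [P[:, i] for i in range(n)]

# Formula (3:5): x = sum_i (p_i, k')/(p_i, A p_i) p_i solves A x = k'.
x = sum((pi @ kp) / (pi @ A @ pi) * pi for pi in p)
print(np.allclose(A @ x, kp))      # True

# The matrix with entries sum_i p_ij p_ik/(p_i, A p_i) is the inverse of A.
A_inv = sum(np.outer(pi, pi) / (pi @ A @ pi) for pi in p)
print(np.allclose(A_inv, np.linalg.inv(A)))   # True
```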
In the cg-method the error vector h - x is diminished in length at each step. The quantity f(x) = (h - x, A(h - x)), called the error function, is also diminished at each step. But the squared residual |r|^2 = |k - Ax|^2 normally oscillates and may even increase. There
is a modification of the cg-method where all three
quantities diminish at each step. This modification
is given in section 7. It has an advantage and a
disadvantage. Its disadvantage is that the error
vector in each step is longer than in the original
method. Moreover, the computation is complicated,
since it is a routine superimposed upon the original
one.
However, in the special case where the given
linear equation system arises from a difference
approximation of a boundary-value problem, it can
be shown that the estimates are smoother in the
modified method than in the original. This may be
an advantage if the desired solution is to be differentiated afterwards.
Concurrently with the solution of a given linear system, characteristic roots of its matrix may be obtained: compute the values of the polynomials R_0, R_1, . . . and P_0, P_1, . . . at λ by the iteration

    R_0 = P_0 = 1,
    R_{i+1} = R_i - a_i λ P_i,                                 (3:6)
    P_{i+1} = R_{i+1} + b_i P_i.

The last polynomial R_m(λ) is a factor of the characteristic polynomial of A and coincides with it when m = n. The characteristic roots, which are the zeros of R_m(λ), can be found by Newton's method without actually computing the polynomial R_m(λ) itself. One uses the formulas

    λ_{k+1} = λ_k - R_m(λ_k) / R'_m(λ_k),                      (3:7)

where R_m(λ_k), R'_m(λ_k) are determined by the iteration (3:6) and

    R'_0 = P'_0 = 0,
    R'_{i+1} = R'_i - λ a_i P'_i - a_i P_i,
    P'_{i+1} = R'_{i+1} + b_i P'_i,

with λ = λ_k. In this connection, it is of interest to observe that if m = n, the determinant of A is given by the formula

    det(A) = 1 / (a_0 a_1 . . . a_{n-1}).
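The sketch below, added in this transcription, records the scalars a_i, b_i produced by the cg routine and reuses them in the iteration (3:6) and the Newton step (3:7) as reconstructed above; since parts of (3:6), (3:7) and the determinant formula were recovered from a damaged passage, the sketch should be read as a plausible illustration rather than a transcript of the authors' procedure.

```python
import numpy as np

def cg_coefficients(A, k, x0):
    """Run the cg-method (3:1) and record the scalars a_i, b_i."""
    x, r = x0.astype(float), k - A @ x0
    p = r.copy()
    a_list, b_list = [], []
    for _ in range(len(k)):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)
        x = x + a * p
        r_new = r - a * Ap
        b = (r_new @ r_new) / (r @ r)
        a_list.append(a)
        b_list.append(b)
        r, p = r_new, r_new + b * p
    return np.array(a_list), np.array(b_list)

def R_and_derivative(lam, a, b):
    """Evaluate R_m(lam) and R'_m(lam) by the iteration (3:6) and its derivative."""
    R, P, dR, dP = 1.0, 1.0, 0.0, 0.0
    for ai, bi in zip(a, b):
        R_next = R - ai * lam * P
        dR_next = dR - lam * ai * dP - ai * P
        P, dP = R_next + bi * P, dR_next + bi * dP
        R, dR = R_next, dR_next
    return R, dR

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)
a, b = cg_coefficients(A, k, np.zeros(4))

# det(A) = 1/(a_0 a_1 ... a_{n-1}) when m = n.
print(np.isclose(np.linalg.det(A), 1.0 / np.prod(a)))

# Newton's method (3:7), started above the largest root, finds the largest eigenvalue.
lam = np.linalg.norm(A, 1)          # an upper bound for the eigenvalues
for _ in range(50):
    R, dR = R_and_derivative(lam, a, b)
    lam = lam - R / dR
print(np.isclose(lam, np.linalg.eigvalsh(A).max()))
```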
The cg-method can be extended to the case in which A is a general nonsymmetric and nonsingular matrix. In this case one replaces eq (3:1) by the set

    r_0 = k - Ax_0,    p_0 = A*r_0,
    a_i = |A*r_i|^2 / |Ap_i|^2,
    x_{i+1} = x_i + a_i p_i,                                   (3:8)
    r_{i+1} = r_i - a_i Ap_i,
    b_i = |A*r_{i+1}|^2 / |A*r_i|^2,
    p_{i+1} = A*r_{i+1} + b_i p_i.
This system is discussed in section 10.
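A sketch of the nonsymmetric extension, added in this transcription and based on the set (3:8) as reconstructed above (cg applied, in effect, to the system A*Ax = A*k without forming A*A explicitly):

```python
import numpy as np

def cg_nonsymmetric(A, k, x0, steps=None):
    """Sketch of the extension (3:8) for a general nonsingular A."""
    x = x0.astype(float)
    r = k - A @ x
    s = A.T @ r              # A* r_0
    p = s.copy()             # p_0 = A* r_0
    for _ in range(steps or len(k)):
        Ap = A @ p
        a = (s @ s) / (Ap @ Ap)      # a_i = |A*r_i|^2 / |Ap_i|^2
        x = x + a * p
        r = r - a * Ap
        s_new = A.T @ r
        b = (s_new @ s_new) / (s @ s)
        p = s_new + b * p            # p_{i+1} = A*r_{i+1} + b_i p_i
        s = s_new
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # nonsymmetric, nonsingular
k = rng.standard_normal(5)
x = cg_nonsymmetric(A, k, np.zeros(5))
print(np.allclose(A @ x, k))    # True after n steps, apart from rounding
```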
4. Method of Conjugate Directions (cd-Method)¹
The cg-method can be considered as a special case of a general method, which we shall call the method of conjugate directions or more briefly the cd-method. In this method, the vectors p_0, p_1, . . . are selected to be mutually conjugate but have no further restrictions. It consists of the following routine:
Initial step. Select an estimate x_0 of h (the solution), compute the residual r_0 = k - Ax_0, and choose a direction p_0.
General routine. Having obtained the estimate x_i of h, the residual r_i = k - Ax_i and the direction p_i, compute the new estimate x_{i+1} and its residual r_{i+1} by the formulas

    a_i = (p_i, r_i) / (p_i, Ap_i),                            (4:1a)

    x_{i+1} = x_i + a_i p_i,                                   (4:1b)

    r_{i+1} = r_i - a_i Ap_i.                                  (4:1c)
1
This method was presented from a different point of view by Fox, Huskey, and
Wilkinson on p. 149 of a paper entitled "Notes on the solution of algebraic linear
simultaneous equations," Quarterly Journal of Mechanics and Applied Mathe-
matics c. 2, 149-173 (1948).
Next select a direction p_{i+1} conjugate to p_0, . . ., p_i, that is, such that

    (p_{i+1}, Ap_j) = 0,    j = 0, 1, . . ., i.                (4:2)
In a sense the cd-method is not precise, in that no formulas are given for the computation of the directions p_0, p_1, . . . . Various formulas can be given, each leading to a special method. The formula (3:1f) leads to the cg-method. It will be seen in section 12 that the case in which the p's are obtained by an A-orthogonalization of the basic vectors (1,0, . . ., 0), (0,1,0, . . .), . . . leads essentially to the Gauss elimination method.
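The following sketch, added in this transcription, builds such a conjugate set by A-orthogonalizing the unit vectors (a Gram-Schmidt process in the inner product (x, Ay)) and then runs the routine (4:1); it illustrates that the directions may be fixed in advance, independently of the residuals.

```python
import numpy as np

def a_orthogonalize_basis(A):
    """A-orthogonalize the unit vectors e_1, ..., e_n, giving mutually
    conjugate directions p_0, ..., p_{n-1}."""
    n = A.shape[0]
    directions = []
    for j in range(n):
        p = np.eye(n)[j]
        for q in directions:
            p = p - (q @ A @ p) / (q @ A @ q) * q   # remove the component conjugate to q
        directions.append(p)
    return directions

def cd_method(A, k, x0, directions):
    """The cd-method routine (4:1) with a prescribed conjugate set."""
    x = x0.astype(float)
    r = k - A @ x
    for p in directions:
        a = (p @ r) / (p @ A @ p)   # (4:1a)
        x = x + a * p               # (4:1b)
        r = r - a * (A @ p)         # (4:1c)
    return x

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)
ps = a_orthogonalize_basis(A)
x = cd_method(A, k, np.zeros(4), ps)
print(np.allclose(A @ x, k))        # True: an n-step method, as Theorem 4:2 states
```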
The basic properties of the cd-method are given by
the following theorems.
Theorem 4:1. The direction vectors p_0, p_1, . . . are mutually conjugate. The residual vector r_i is orthogonal to p_0, p_1, . . ., p_{i-1}. The inner product of p_i with each of the residuals r_0, r_1, . . ., r_i is the same. That is,

    (p_i, Ap_j) = 0    (i ≠ j),                                (4:3a)

    (p_i, r_j) = 0     (i < j),                                (4:3b)

    (p_i, r_0) = (p_i, r_1) = . . . = (p_i, r_i).              (4:3c)

Moreover, the scalar a_i can be given by the formula

    a_i = (p_i, r_0) / (p_i, Ap_i)                             (4:4)

in place of (4:1a).
Equation (4:3a) follows from (4:2). Using (4:1c), we find that

    (p_j, r_{k+1}) = (p_j, r_k) - a_k (p_j, Ap_k).

If j = k we have, by (4:1a), (p_k, r_{k+1}) = 0. Moreover, by (4:3a),

    (p_j, r_{k+1}) = (p_j, r_k)    (j ≠ k).

Equations (4:3b) and (4:3c) follow from these relations. The formula (4:4) follows from (4:3c) and (4:1a).
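A small numerical check of Theorem 4:1, added in this transcription: a cg run on a random positive definite system is recorded, and the relations (4:3a)-(4:3c) are evaluated directly; all printed quantities are at rounding-error level.

```python
import numpy as np

# Run the cg-method on a small problem, keeping the p_i and r_i so that the
# relations (4:3a)-(4:3c) and (4:4) of Theorem 4:1 can be checked numerically.
rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)

x = np.zeros(4)
r = k - A @ x
p = r.copy()
ps, rs = [p.copy()], [r.copy()]
for _ in range(4):
    Ap = A @ p
    a = (r @ r) / (p @ Ap)
    x = x + a * p
    r = r - a * Ap
    b = (r @ r) / (rs[-1] @ rs[-1])
    p = r + b * p
    ps.append(p.copy())
    rs.append(r.copy())

# (4:3a): mutual conjugacy of the directions.
print(max(abs(ps[i] @ A @ ps[j]) for i in range(4) for j in range(4) if i != j))
# (4:3b): r_j is orthogonal to the earlier directions p_i, i < j.
print(max(abs(ps[i] @ rs[j]) for j in range(1, 4) for i in range(j)))
# (4:3c)/(4:4): (p_i, r_0) = (p_i, r_i), so a_i = (p_i, r_0)/(p_i, Ap_i).
print(max(abs(ps[i] @ rs[0] - ps[i] @ rs[i]) for i in range(4)))
```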
As a consequence of (4:4) the estimates x_1, x_2, . . . of h can be computed without computing the residuals r_0, r_1, . . ., provided that the choice of the direction vectors p_0, p_1, . . . is independent of these residuals.
Theorem 4:2. The cd-method is an m-step method (m ≤ n) in the sense that at the mth step the estimate x_m is the desired solution h.
For let m be the first integer such that y_0 = h - x_0 is in the subspace spanned by p_0, . . ., p_{m-1}. Clearly, m ≤ n, since the vectors p_0, p_1, . . . are linearly independent. We may, accordingly, choose scalars α_0, . . ., α_{m-1} such that

    y_0 = α_0 p_0 + . . . + α_{m-1} p_{m-1}.

Hence,

    h = x_0 + α_0 p_0 + . . . + α_{m-1} p_{m-1}.

Moreover,

    r_0 = k - Ax_0 = A(h - x_0) = α_0 Ap_0 + . . . + α_{m-1} Ap_{m-1}.

Computing the inner product (p_i, r_0) we find by (4:3a) and (4:4) that α_i = a_i, and hence that h = x_m, as was to be proved.
The cd-method can be looked upon as a relaxation method. In order to establish this result, we introduce the function

    f(x) = (h - x, A(h - x)) = (x, Ax) - 2(k, x) + (h, k).     (4:5)

Clearly, f(x) ≥ 0 and f(x) = 0 if, and only if, x = h. The function f(x) can be used as a measure of the "goodness" of x as an estimate of h. Since it plays an important role in our considerations, it will be referred to as the error function. If p is a direction vector, we have the useful relation

    f(x + ap) = f(x) - 2a(p, r) + a^2 (p, Ap),                 (4:6)

where r = k - Ax = A(h - x), as one readily verifies by substitution. Considered as a function of a, the function f(x + ap) has a minimum value at a = â, where

    â = (p, r) / (p, Ap).                                      (4:7)

This minimum value differs from f(x) by the quantity

    f(x) - f(x + âp) = â^2 (p, Ap) = (p, r)^2 / (p, Ap).       (4:8)
Comparing (4:7) with (4:1a), we obtain the first two
sentences of the following result:
Theorem 4:3. The point x_i minimizes f(x) on the line x = x_{i-1} + a p_{i-1}. At the i-th step the error f(x_{i-1}) is relaxed by the amount

    f(x_{i-1}) - f(x_i) = (p_{i-1}, r_{i-1})^2 / (p_{i-1}, Ap_{i-1}).     (4:9)

In fact, the point x_i minimizes f(x) on the i-dimensional plane P_i of points

    x = x_0 + α_0 p_0 + . . . + α_{i-1} p_{i-1},               (4:10)

where α_0, . . ., α_{i-1} are parameters. This plane contains the points x_0, x_1, . . ., x_i.
In view of this result the cd-method is a method of relaxation of the error function f(x). An iteration of the routine may accordingly be referred to as a relaxation.
In order to prove the third sentence of the theorem observe that at the point (4:10)

    f(x) = f(x_0) - Σ_{j=0}^{i-1} [2α_j (p_j, r_0) - α_j^2 (p_j, Ap_j)].

At the minimum point we have

    ∂f/∂α_j = -2(p_j, r_0) + 2α_j (p_j, Ap_j) = 0,

and hence α_j = a_j, by (4:4). The minimum point is accordingly the point x_i, as was to be proved.
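The relaxation identities (4:7) and (4:8), on which Theorem 4:3 rests, can be checked directly; the sketch below is an addition of this transcription, using an arbitrary estimate x and direction p.

```python
import numpy as np

# A numerical check of (4:7)-(4:8): stepping from x along a direction p with the
# optimal a lowers the error function f by exactly (p, r)^2/(p, Ap).
rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)
k = rng.standard_normal(4)
h = np.linalg.solve(A, k)               # the true solution, used only to evaluate f

def f(x):                               # error function (4:5)
    return (h - x) @ A @ (h - x)

x = rng.standard_normal(4)              # an arbitrary estimate
p = rng.standard_normal(4)              # an arbitrary direction
r = k - A @ x
a_hat = (p @ r) / (p @ A @ p)           # (4:7)

drop = f(x) - f(x + a_hat * p)
print(np.isclose(drop, (p @ r) ** 2 / (p @ A @ p)))   # (4:8) holds
# And a_hat is the minimizer along the line: nearby values of a give a larger f.
print(f(x + (a_hat + 0.1) * p) > f(x + a_hat * p))
```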
Geometrically, the equation f(x) = const. defines an ellipsoid of dimension n-1. The point at which f(x) has a minimum is the center of the ellipsoid and is the solution of Ax = k. The i-dimensional plane P_i, described in the last theorem, cuts the ellipsoid f(x) = f(x_0) in an ellipsoid E_i of dimension i-1, unless E_i is the point x_0 itself. (In the cg-method, E_i is never degenerate, unless x_0 = h.) The point x_i is the center of E_i. Hence we have the corollary:
Corollary 1. The point x_i is the center of the (i-1)-dimensional ellipsoid in which the i-dimensional plane P_i intersects the (n-1)-dimensional ellipsoid f(x) = f(x_0).
Although the function f(x) is the fundamental error function which decreases at each step of the relaxation, one is unable to compute f(x) without knowing the solution h we are seeking. In order to obtain an estimate of the magnitude of f(x) we may use the following:
Theorem 4:4. The error vector y = h - x, the residual r = k - Ax, and the error function f(x) satisfy the relations

    f(x) = (y, Ay) = (r, y) = (r, A^{-1}r),                    (4:11)

    μ(y) |y|^2 = f(x) ≤ |y| |r|,                               (4:12)

where μ(z) is the Rayleigh quotient

    μ(z) = (z, Az) / (z, z).

The Rayleigh quotient of the error vector y does not exceed that of the residual r, that is,

    μ(y) ≤ μ(r).                                               (4:13)

Moreover,

    |r|^2 / μ(r) ≤ f(x) ≤ |r|^2 / μ(y).                        (4:14)
The proof of this result is based on the Schwarzian quotients

    (z, Az)/(z, z) ≤ (Az, Az)/(z, Az) ≤ (Az, A^2 z)/(Az, Az).  (4:15)

The first of these follows from the inequality of Schwarz

    (p, q)^2 ≤ (p, p)(q, q)                                    (4:16)

by choosing p = z, q = Az. The second is obtained by selecting p = Bz, q = B^3 z, where B^2 = A.
In order to prove theorem 4:4 recall that if we set y = h - x, then

    r = k - Ax = A(h - x) = Ay,    f(x) = (y, Ay)

by (4:5). Using the inequalities (4:15) with z = y, we see that

    μ(y) = (y, Ay)/(y, y) ≤ (Ay, Ay)/(y, Ay) = |r|^2/f(x) ≤ (Ay, A^2 y)/(Ay, Ay) = (r, Ar)/(r, r) = μ(r).
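A numerical illustration of Theorem 4:4, added in this transcription; it checks the identities (4:11), the inequality (4:13), and the chain of Schwarzian quotients (4:15) with z = y.

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((5, 5))
A = M.T @ M + 5 * np.eye(5)
k = rng.standard_normal(5)
h = np.linalg.solve(A, k)

x = rng.standard_normal(5)     # an arbitrary estimate of h
y = h - x                      # error vector
r = k - A @ x                  # residual

mu = lambda z: (z @ A @ z) / (z @ z)   # Rayleigh quotient

f = y @ A @ y
# (4:11): f(x) = (y, Ay) = (r, y) = (r, A^{-1} r).
print(np.isclose(f, r @ y), np.isclose(f, r @ np.linalg.solve(A, r)))
# (4:13): the Rayleigh quotient of y does not exceed that of r.
print(mu(y) <= mu(r))
# (4:15) with z = y: (y,Ay)/(y,y) <= (Ay,Ay)/(y,Ay) <= (Ay,A^2 y)/(Ay,Ay).
Ay = A @ y
print(mu(y) <= (Ay @ Ay) / (y @ Ay) <= (Ay @ A @ Ay) / (Ay @ Ay))
```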