The Parallel Evaluation of General Arithmetic Expressions
RICHARD P. BRENT
Australian National University, Canberra, Australia
ABSTRACT. It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log2 n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.
KEY WORDS AND PHRASES: arithmetic expressions, compilation of arithmetic expressions, computational complexity, general arithmetic expressions, numerical stability, parallel computation, code optimization

CR CATEGORIES: 4.12, 5.11, 5.25
1. Introduction
The question of how quickly arithmetic expressions can be evaluated on a computer with
several independent arithmetic processors is of theoretical and practical interest. In this
paper we determine the answer to within a constant multiplicative factor (see Corollary
2 in Section 4). All our proofs are constructive, and reasonably efficient algorithms for
compiling expressions for subsequent execution on a parallel computer may be derived
from our proofs. These algorithms compare favorably with those given in [1, 2].
We assume that a number of processors are available and that each can perform an
arithmetic operation (addition, multiplication, and sometimes division) in unit time.
The time required for accessing data, storing results, communicating between processors,
etc., is ignored. Also, the effect of rounding errors is neglected, except in Section 5. The
results hold for exact arithmetic with expressions over any commutative field.
Several special cases have been considered previously. For example, Maruyama [14] and Munro and Paterson [19] have shown that polynomials of degree n can be evaluated in time log2 n + O((log2 n)^(1/2)) if sufficiently many processors are available, and Brent [3] has shown that this is true for expressions of the form a0 + x1(a1 + x2(a2 + ... (a_{n-1} + a_n x_n) ... )). Baer and Bovet [1] and Muraoka [20] considered expressions with n distinct variables and operations of addition and multiplication over a commutative ring. It has recently been shown in [5] that such expressions can be evaluated in time 2.465 log2 n if sufficiently many processors are available. (For results that apply if a fixed number of processors is available, see Section 5.) Kuck and Maruyama [12] have shown that continued fractions of the form

    b0 + a1/(b1 + a2/( ... (b_{n-1} + a_n/b_n) ... ))

can be evaluated in time 2 log2 n + O(1). Kuck [10], Maruyama [15], and Muraoka [20] have considered expressions with a limited depth of parenthesis nesting and/or a limited number of divisions. See also [6, 8, 9, 13, 18] and the references given there.
Copyright © 1974, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.
Author's address: Computer Centre, Australian National University, P.O. Box 4, Canberra, A.C.T. 2600, Australia.
Journal of the Association for Computing Machinery, Vol. 21,
No. 2, April 1974, pp. 201-206.

Our results (Corollary 1 and Theorem 2) show that parallelism may be used to speed up the evaluation of large arithmetic expressions. Knuth [7] has shown that most expressions which occur in real FORTRAN programs have only a small number of operands. Nevertheless, our results (or the method used to obtain them) may ultimately be of practical value, for Kuck [11] has shown that an optimizing compiler for a parallel machine might generate large expressions when compiling programs like those studied by Knuth [7].
In this paper we assume commutativity, but Maruyama [16] has recently extended
some of our results to expressions over noncommutative rings (e.g. rings of matrices).
2. Notation and Assumptions
We consider well-formed arithmetic expressions with the operations addition ("+"), multiplication ("*"), and division ("/"); any level of parenthesis nesting; and distinct indeterminates (or "atoms") x1, x2, ... over a commutative field. We neglect the subtraction operation because expressions containing it can easily be transformed into equivalent expressions with "+", "*", "/" and (at most) some unary minus signs acting on atoms, e.g.

    a - (b + c/(d - e) - f) = a + ((-b) + c/((-d) + e) + f).
The restriction to expressions with distinct atoms means that we do not consider expressions such as

    a + x(b + x(c + x)),  a + 1/(b + 1/(c + 1/d)),  and  x^100.

However, our results give upper bounds on the time required to evaluate such expressions, because they apply to the more general expressions

    a + x1(b + x2(c + x3)),  a + u1/(b + u2/(c + u3/d)),  and  x1 x2 ... x100,

respectively. For further discussion and examples, see [5].
If E is an arithmetic expression then |E| denotes the number of atoms (relabeled if necessary to become distinct) in E. If T is a parse tree for E then |T| = |E| is the number of terminal nodes of T. If |T| > 1 we write T = LθR, where L and R are the maximal proper subtrees of T and θ is the operation at the root. A subexpression of E is the expression corresponding to a subtree (not necessarily proper) of a parse tree for E.

If r is a real number then ⌈r⌉ denotes the integer satisfying r ≤ ⌈r⌉ < r + 1.
3. Main Theorem
Theorem 1 states slightly more than we use subsequently, but the stronger statement is necessary so that the result may be proved by induction. The most interesting consequences of the theorem are stated in Corollaries 1 and 2 (Section 4).

We first state, without proof, a trivial but useful lemma.

LEMMA 1. If 1 ≤ m ≤ n and T is a binary tree with |T| = n, then there is a subtree X1 = L1θ1R1 of T such that |X1| ≥ m, |L1| < m, and |R1| < m. Also, if x is one of the terminal nodes of T, there is a subtree X2 = L2θ2R2 of T such that |X2| ≥ m and either
(1) x is a terminal node of L2 and |L2| < m, or
(2) x is a terminal node of R2 and |R2| < m.
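The first half of Lemma 1 is constructive: start at the root and repeatedly descend into a child that still has at least m leaves; since leaf counts only shrink on the way down, the walk stops at a node with at least m leaves whose children are both below m. A small Python sketch (tuple-encoded trees, an encoding assumed here for illustration; it requires m ≥ 2 so that the node found is internal):

```python
def leaves(t):
    """Number of terminal nodes of a tuple-encoded binary tree."""
    return 1 if isinstance(t, str) else leaves(t[1]) + leaves(t[2])

def split_subtree(t, m):
    """Find a subtree X = (op, L, R) with |X| >= m, |L| < m, |R| < m.

    Requires 2 <= m <= leaves(t).  The loop maintains leaves(t) >= m,
    so the first node both of whose children fall below m satisfies
    the lemma.
    """
    while True:
        _, l, r = t
        if leaves(l) >= m:
            t = l
        elif leaves(r) >= m:
            t = r
        else:
            return t

# ((x1 + x2) * (x3 + x4)) + x5 has n = 5 atoms; take m = ceil((n+1)/2) = 3.
tree = ('+', ('*', ('+', 'x1', 'x2'), ('+', 'x3', 'x4')), 'x5')
X = split_subtree(tree, 3)     # the '*' subtree: |X| = 4, |L| = |R| = 2
```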
THEOREM 1. Let E be any arithmetic expression with n (distinct) atoms and operations "+", "*", and "/" over a commutative field. Suppose that sufficiently many processors capable of performing "+" and "*" (but not necessarily "/") in unit time are available. Let P1(n) = 3(n - 1), P2(n) = max(0, 3n - 4), Q1(n) = max(0, 10n - 19), Q2(n) = max(0, 10n - 29), and

    k = n - 1               if n ≤ 2,
    k = ⌈4 log2(n - 1)⌉     if n ≥ 3.

Then (1) and (2) below hold:

(1) E = F/G, where F and G are expressions which can be evaluated simultaneously in time k - 2 with P1(n) processors and Q1(n) operations.

(2) If x is any atom of E, then E = (Ax + B)/(Cx + D), where A, B, C, and D are expressions which do not contain x and which can be evaluated simultaneously in time k with P2(n) processors and Q2(n) operations. (Note that some of A, ..., G may be identically 0 or 1.)
PROOF. By inspection, the result holds for n ≤ 4, so we assume that n = N ≥ 5 (so k ≥ 8). The proof is by induction on N. As inductive hypothesis we assume that parts (1) and (2) of the theorem hold for n < N.

We shall show that part (1) holds with n = N. Applying Lemma 1 with m = ⌈(n + 1)/2⌉ to a parse tree for E, we see that there is a subexpression X1 = L1θ1R1 of E such that

    |X1| ≥ (n + 1)/2,  |L1| ≤ n/2,  |R1| ≤ n/2,

and θ1 = "+", "*", or "/". From the definition of k, n ≤ 2^(k/4) + 1; so |L1| ≤ n/2 < 2^((k-4)/4) + 1, and similarly for R1. Thus, by part (1) of the inductive hypothesis, L1 = F1/G1 and R1 = F2/G2, where F1, G1, F2, and G2 can be evaluated simultaneously in time (k - 4) - 2 = k - 6 with P1(|L1|) + P1(|R1|) processors and Q1(|L1|) + Q1(|R1|) operations.

Now

    X1 = L1θ1R1 = (F1/G1) θ1 (F2/G2) = F3/G3,

where

    F3 = F1G2 + F2G1   if θ1 = "+",
    F3 = F1F2          if θ1 = "*",
    F3 = F1G2          if θ1 = "/";
and
    G3 = G1G2          if θ1 = "+" or "*",
    G3 = G1F2          if θ1 = "/".

Hence F3 and G3 can be evaluated in time k - 4.
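The case analysis for F3 and G3 is ordinary fraction arithmetic with the division deferred; a minimal sketch (Python, plain numbers standing in for the values of the subexpressions):

```python
def combine(op, f1, g1, f2, g2):
    """Given L = f1/g1 and R = f2/g2, return (f3, g3) with (L op R) = f3/g3.

    The products can all be formed in one parallel step and the sum (for
    the "+" case) in one more, which is why F3 and G3 are ready by time
    k - 4 when F1, G1, F2, G2 are ready at time k - 6.
    """
    if op == '+':
        return f1 * g2 + f2 * g1, g1 * g2
    if op == '*':
        return f1 * f2, g1 * g2
    if op == '/':
        return f1 * g2, g1 * f2
    raise ValueError(op)
```

For example, combine('+', 1, 2, 1, 3) returns (5, 6), i.e. 1/2 + 1/3 = 5/6; no division is performed until the final step E = F/G.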
Let E1 be the expression formed by replacing X1 by an atom in E. Since |E1| = n + 1 - |X1| ≤ (n + 1)/2 < 2^((k-4)/4) + 1, part (2) of the inductive hypothesis (applied to E1) gives E = (A1X1 + B1)/(C1X1 + D1), where A1, B1, C1, and D1 can be evaluated simultaneously in time k - 4 with P2(|E1|) processors and Q2(|E1|) operations. Since X1 = F3/G3, it follows that E = F/G, where F = A1F3 + B1G3 and G = C1F3 + D1G3 can be evaluated in time k - 2.

Consider the number of processors required to compute F and G as above. In the first k - 6 steps we compute F1, G1, F2, G2 and start computing A1, B1, C1, and D1, using P1(|L1|) + P1(|R1|) + P2(|E1|) processors. From time k - 6 to k - 4 we compute F3 and G3 and finish computing A1, B1, C1, and D1, using 2 + P2(|E1|) processors. Finally, from time k - 4 to k - 2 we compute F and G, using four processors. Thus, the number of processors required is

    max[P1(|L1|) + P1(|R1|) + P2(|E1|), 2 + P2(|E1|), 4]
      = max[3(|L1| + |R1| + |E1|) - 10, 3(|L1| + |R1|) - 6, 3|E1| - 2, 4]
      ≤ 3(n - 1)
      = P1(n),

as |L1| + |R1| + |E1| = n + 1, |L1| + |R1| ≤ n, |E1| ≤ (n + 1)/2, and n ≥ 2.

Now consider the number of operations required to compute F and G as above. Since 3 ≤ (n + 1)/2 ≤ |X1| = |L1| + |R1|, the definition of Q1 gives Q1(|L1|) + Q1(|R1|) ≤ 10(|L1| + |R1|) - 29. Thus, the number of operations is at most

    10 + Q1(|L1|) + Q1(|R1|) + Q2(|E1|)
      ≤ max[10(|L1| + |R1| + |E1|) - 48, 10(|L1| + |R1|) - 19]
      ≤ 10n - 19 = Q1(n),

so part (1) holds with n = N.
To complete the proof, we must show that part (2) holds with n = N. Let x be an atom of E. Applying the second half of Lemma 1 with m = ⌈(n + 1)/2⌉ to a parse tree for E, we see that there is a subexpression X2 = L2θ2R2 of E such that |X2| ≥ (n + 1)/2; θ2 = "+", "*", or "/"; and either x is an atom of L2 and |L2| ≤ n/2, or x is an atom of R2 and |R2| ≤ n/2. We shall suppose that x is an atom of L2. (The proof is similar if x is an atom of R2.)

Let E2 be the expression formed by replacing X2 by an atom in E. Thus |E2| = n + 1 - |X2| ≤ (n + 1)/2 < 2^((k-4)/4) + 1, and part (2) of the inductive hypothesis (applied to E2) gives E = (A2X2 + B2)/(C2X2 + D2), where A2, B2, C2, and D2 can be evaluated simultaneously in time k - 4 with P2(|E2|) processors and Q2(|E2|) operations.

Similarly, L2 = (A3x + B3)/(C3x + D3), where A3, B3, C3, and D3 can be evaluated in time k - 4 with P2(|L2|) processors and Q2(|L2|) operations. Also, since |R2| ≤ n - 1, part (1) of the inductive hypothesis shows that R2 = F4/G4, where F4 and G4 can be evaluated in time k - 2 with P1(|R2|) processors and Q1(|R2|) operations.

From X2 = L2θ2R2 and the above expressions for E, L2, and R2, we find that E = (Ax + B)/(Cx + D), where

    A = (A2C3)F4 + (A2A3 + B2C3)G4   if θ2 = "+",
    A = (A2A3)F4 + (B2C3)G4          if θ2 = "*",
    A = (A2A3)G4 + (B2C3)F4          if θ2 = "/",

and B, C, and D are given by similar expressions. Thus A, B, C, and D can be evaluated in time k.

The number of processors required to compute A, ..., D simultaneously in time k is at most

    max[P2(|E2|) + P2(|L2|) + P1(|R2|), 8 + P1(|R2|)]
      = max[3(|E2| + |L2| + |R2|) - 11, 3(|L2| + |R2|) - 7, 3(|E2| + |R2|) - 7, 3|R2| + 5].

Since |E2| + |L2| + |R2| = n + 1, |L2| + |R2| ≤ n, |E2| + |R2| ≤ n, and n > 1, the number of processors required is at most 3n - 4 = P2(n) provided 3|R2| + 5 ≤ 3n - 4, i.e. provided |R2| ≤ n - 3. If |R2| = n - 2 or n - 1, the expressions for A, B, C, and D simplify, and a straightforward examination of cases shows that P2(n) processors suffice.

Similarly, if |E2| > 2 and |L2| > 2, the number of operations required is at most 28 + Q2(|E2|) + Q2(|L2|) + Q1(|R2|) ≤ 10n - 30 < Q2(n). If |E2| ≤ 2 or |L2| ≤ 2 or both, the expressions for A, B, C, and D simplify, and Q2(n) operations suffice. This completes the proof of part (2), so the theorem follows by induction on N.
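The displayed coefficient A for the case θ2 = "+" can be checked by exact rational arithmetic. In the sketch below (not from the paper), the formulas for B, C, and D are my own expansion of the "similar expressions" left implicit in the text, so they are an assumption; the assertion verifies that evaluating E directly agrees with evaluating (Ax + B)/(Cx + D).

```python
from fractions import Fraction
import random

rng = random.Random(1)

def rand_val():
    # random positive rationals keep every denominator nonzero
    return Fraction(rng.randint(1, 9), rng.randint(1, 9))

A2, B2, C2, D2 = (rand_val() for _ in range(4))
A3, B3, C3, D3 = (rand_val() for _ in range(4))
F4, G4, x = (rand_val() for _ in range(3))

# Direct evaluation for the case theta2 = "+":
# L2 = (A3*x + B3)/(C3*x + D3),  R2 = F4/G4,  X2 = L2 + R2,
# E = (A2*X2 + B2)/(C2*X2 + D2).
L2 = (A3 * x + B3) / (C3 * x + D3)
X2 = L2 + F4 / G4
E_direct = (A2 * X2 + B2) / (C2 * X2 + D2)

# Coefficients of E = (A*x + B)/(C*x + D).  A is the paper's formula;
# B, C, D follow the same pattern (hypothetical expansion).
A = (A2 * C3) * F4 + (A2 * A3 + B2 * C3) * G4
B = (A2 * D3) * F4 + (A2 * B3 + B2 * D3) * G4
C = (C2 * C3) * F4 + (C2 * A3 + D2 * C3) * G4
D = (C2 * D3) * F4 + (C2 * B3 + D2 * D3) * G4
E_coeff = (A * x + B) / (C * x + D)

assert E_direct == E_coeff
```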
4. Consequences of Theorem 1

We need the following lemma, which is of some independent interest.

LEMMA 2. If a computation C can be performed in time t with q operations and sufficiently many processors which perform arithmetic operations in unit time, then C can be performed in time t + (q - t)/p with p such processors.

PROOF. Suppose that s_i operations are performed at step i, for i = 1, 2, ..., t. Thus sum_{i=1}^{t} s_i = q. Using p processors, we can simulate step i in time ⌈s_i/p⌉. Hence, the computation C can be performed with p processors in time

    sum_{i=1}^{t} ⌈s_i/p⌉ ≤ (1 - 1/p)t + (1/p) sum_{i=1}^{t} s_i = t + (q - t)/p.
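The proof of Lemma 2 is directly executable: each step that performed s_i operations with unbounded processors becomes ⌈s_i/p⌉ steps on p processors. A small check with hypothetical step widths:

```python
import math

def simulated_time(step_ops, p):
    """Time to run a computation on p processors when step_ops[i]
    operations were performed at step i of the unbounded-processor run."""
    return sum(math.ceil(s / p) for s in step_ops)

# A hypothetical computation with t = 4 steps and q = 18 operations.
steps = [8, 6, 3, 1]
t, q, p = len(steps), sum(steps), 3
bound = t + (q - t) / p        # Lemma 2's bound: 4 + 14/3
```

Here simulated_time(steps, 3) = 3 + 2 + 1 + 1 = 7, within the bound 4 + 14/3.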
COROLLARY 1. Let E be as in Theorem 1 and suppose that p processors which can perform addition, multiplication, and division in unit time are available. Then E can be evaluated in time 4 log2 n + 10(n - 1)/p.

PROOF. Suppose that n ≥ 3, for otherwise the result is trivial. By Theorem 1, E = F/G, where F and G can be evaluated in time ⌈4 log2(n - 1)⌉ - 2 < 4 log2 n - 1 with less than 10(n - 1) operations. Applying Lemma 2 with t = ⌈4 log2(n - 1)⌉ - 2 and q = 10(n - 1), we see that F and G can be evaluated in time 4 log2 n - 1 + 10(n - 1)/p with p processors. Finally, E = F/G can be evaluated in one more unit of time. (Note that only one division is performed, so the result is easily modified if a division takes longer than an addition or multiplication.)

COROLLARY 2. Let τ(n, p) be the maximum time required to evaluate arithmetic expressions with n atoms, using p processors which can perform arithmetic operations in unit time. Let φ(n, p) = max(log2 n, (n - 1)/p). Then, for all n ≥ 1 and p ≥ 1, φ(n, p) ≤ τ(n, p) ≤ 14 φ(n, p).

PROOF. Consider the expression x1 + x2 + ... + xn. By a fan-in argument, its evaluation requires time at least log2 n. Also, at least n - 1 operations must be performed, so p processors require time at least (n - 1)/p. Hence, the lower bound on τ(n, p) is established. The upper bound follows from Corollary 1.
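The upper bound is immediate arithmetic: with φ(n, p) = max(log2 n, (n - 1)/p), Corollary 1's bound satisfies 4 log2 n + 10(n - 1)/p ≤ 4φ + 10φ = 14φ. A brute-force sanity check of this inequality (illustration only):

```python
import math

def phi(n, p):
    return max(math.log2(n), (n - 1) / p)

def corollary1_bound(n, p):
    return 4 * math.log2(n) + 10 * (n - 1) / p

# Each term of the bound is dominated by a multiple of phi, so the
# sum never exceeds 14 * phi.
for n in range(2, 200):
    for p in (1, 2, 7, 50, 1000):
        assert corollary1_bound(n, p) <= 14 * phi(n, p)
```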
5. Concluding Remarks

Corollary 2 establishes the complexity of parallel evaluation of general arithmetic expressions to within a constant factor. The constant 14 can doubtless be reduced by more refined arguments, and the lower bound for τ(n, p) can be improved slightly (see [5]).

The proof of Theorem 1 simplifies, and the constants can be reduced, if division is excluded. Corresponding to Corollary 1 we have the following, which is slightly weaker than Theorems 1 and 2 of [5] if p ≥ n, but much stronger if p is of order n or less.

THEOREM 2. Let E be any arithmetic expression with n (distinct) atoms and operations "+" and "*" over a commutative ring. If p processors which can perform "+" and "*" in unit time are available, then E can be evaluated in time 4 log2 n + 2(n - 1)/p.

A proof of Theorem 2 is given in [4], where we also show that, for real expressions and approximate arithmetic, the evaluation of E in the time given by Theorem 2 is numerically stable (in the sense that the computed result can be obtained by making small relative changes in the values assigned to the atoms and then performing exact arithmetic). Unfortunately, this result does not extend to expressions with division, and examples found by a program of Miller [17] show that the algorithm implied by the proof of Theorem 1 is not always numerically stable. Hence, it is an open question whether general arithmetic expressions can be evaluated stably in the time given by Corollary 1.
Acknowledgments. David Kuck and Kiyoshi Maruyama made several stimulating suggestions, without which this paper might not have been written. Webb Miller kindly verified the numerical instability mentioned above, and a referee's comments were useful in clarifying the proof of Theorem 1.
REFERENCES

1. BAER, J. L., AND BOVET, D. P. Compilation of arithmetic expressions for parallel computations. Proc. IFIP Congr. 1968, North-Holland Pub. Co., Amsterdam, pp. 340-346.
2. BEATTY, J. C. An axiomatic approach to code optimization for expressions. J. ACM 19, 4 (Oct. 1972), 613-640.
3. BRENT, R. P. On the addition of binary numbers. IEEE Trans. Comput. C-19 (Aug. 1970), 758-759.
4. BRENT, R. P. The parallel evaluation of arithmetic expressions in logarithmic time. Proc. Symposium on Complexity of Sequential and Parallel Numerical Algorithms (Carnegie-Mellon U., Pittsburgh, Pa., May 1973), Academic Press, New York, 1973, pp. 83-102.
5. BRENT, R. P., KUCK, D. J., AND MARUYAMA, K. M. The parallel evaluation of arithmetic expressions without division. IEEE Trans. Comput. C-22 (May 1973), 532-534.
6. HOBBS, L. C. (Ed.) Parallel Processor Systems, Technologies and Applications. Spartan Books, New York, 1970.
7. KNUTH, D. E. An empirical study of FORTRAN programs. Software: Practice and Experience 1 (April 1971), 105-133.
8. KOGGE, P. M. Parallel algorithms for the efficient solution of recurrence problems; the numerical stability of parallel algorithms for solving recurrence problems; and minimal parallelism in the solution of recurrence problems. Stanford Electronics Lab. Reps. 43-45, Sept. 1972.
9. KOGGE, P. M., AND STONE, H. S. A parallel algorithm for the efficient solution of a general class of recurrence equations. Rep. CS-72-298, Comput. Sci. Dep., Stanford U., Stanford, Calif., March 1972.
10. KUCK, D. J. Evaluating arithmetic expressions of n atoms and k divisions in a (log2 n + 2 log2 k) + c steps. Manuscript, March 1973.
