Optimal Bounds for Decision Problems
on the CRCW PRAM
PAUL BEAME
University of Washington, Seattle, Washington
AND
JOHAN HASTAD
Royal Institute of Technology, Stockholm, Sweden
Abstract. Optimal $\Omega(\log n/\log\log n)$ lower bounds on the time for CRCW PRAMs with polynomially
bounded numbers of processors or memory cells to compute parity and a number of related problems
are proven. A strict time hierarchy of explicit Boolean functions of $n$ bits on such machines that holds
up to $O(\log n/\log\log n)$ time is also exhibited. That is, for every time bound $T$ within this range a
function is exhibited that can be easily computed using polynomial resources in time $T$ but requires
more than polynomial resources to be computed in time $T - 1$. Finally, it is shown that almost all
Boolean functions of $n$ bits require $\log n - \log\log n + \Omega(1)$ time when the number of processors is at
most polynomial in $n$. The bounds do not place restrictions on the uniformity of the algorithms nor on
the instruction sets of the machines.
Categories and Subject Descriptors: F.1.2 [Computation by Abstract Devices]: Modes of Computation-parallelism; F.1.3 [Computation by Abstract Devices]: Complexity Classes-complexity hierarchies, relations among complexity measures; F.2.3 [Analysis of Algorithms and Problem Complexity]: Tradeoffs among complexity classes
General Terms: Theory, Verification
Additional Key Words and Phrases: Concurrent-write, lower bounds, parallel random-access machines,
parity, sorting
1. Introduction
One of the most widely used models of parallel computation is the parallel random
access machine (PRAM). In this model any processor can access any memory
location at a given time-step. The most powerful form of the PRAM, the CRCW
PRAM, in which both concurrent read and concurrent write accesses are allowed,
has received particular attention both from designers of algorithms and from those
studying the limitations of parallel machine computation. Despite the significant
interest, the only nontrivial lower bounds for decision problems on CRCW PRAMs
that do not have drastic restrictions placed on either their processor and memory
resources or on the instruction sets of their processors are due independently to
Beame [3] and to Li and Yesha [13]. The lower bounds are for parity and related
problems and are far from optimal. In both of these bounds no restriction is placed
on the instruction set of the processors, no limitation is placed on how much
information a single memory location may store, and the resources allowed are
only polynomially bounded. We call a machine with these properties an abstract
or ideal PRAM.
The work of P. Beame was supported by a University of Toronto Open Fellowship and by National
Science Foundation grant PYI-25800. The work of J. Hastad was supported by an IBM Postdoctoral
Fellowship and in part by NSF grant DCR MCS-85-09905.
This research was done while P. Beame was at the University of Toronto and while both authors were
at the Massachusetts Institute of Technology.
Authors' present addresses: P. Beame, Computer Science Department, FR-35, University of Washington,
Seattle, Washington 98195; J. Hastad, Royal Institute of Technology, Stockholm, S-100-44, Sweden.
Permission to copy without fee all or part of this material is granted provided that the copies are not
made or distributed for direct commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by permission of the Association for
Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1989 ACM 0004-5411/89/0700-0643 $01.50
Journal of the Association for Computing Machinery, Vol. 36, No. 3, July 1989, pp. 643-670.
In this very general setting we prove the first optimal bound for any nontrivial
decision problem on the CRCW PRAM by showing a time lower bound of
$\Omega(\log n/\log\log n)$ for parity that matches the known upper bound. This lower
bound holds even when only one of the two resources, processors or memory
cells, is bounded by a polynomial in the input size. Because parity
constant-depth reduces to a large number of problems, this $\Omega(\log n/\log\log n)$-time
lower bound for the CRCW PRAM applies to a wide variety of interesting functions
that include sorting or adding $n$ bits, as well as multiplying two $n$-bit integers.
Also, by looking at the so-called "Sipser" functions, which are defined by
circuits, we obtain a very sharp time hierarchy for CRCW PRAMs with polynomially
bounded resources. That is, for every time bound $T(n)$ at most
$\log n/(3\log\log n) - \Theta(\log n/(\log\log n)^2)$, we exhibit a family of functions that is
computable in time $T$ with $n$ processors and memory cells, but that cannot be computed even
one step faster by any machine with a polynomial bound on the number of
processors, even with no bound on the number of memory cells. A similar separation
holds for machines with a polynomial bound on the number of memory cells, even
without a bound on the number of processors.
The proofs of both these results follow lines similar to the proofs in [2] and [3]
and involve new lemmas that generalize the key lemmas used in Hastad's
unbounded fan-in circuit lower bounds [10] and [11].
We also prove a tight $\Theta(\log n)$ lower bound on the time to compute almost
all $n$-bit Boolean functions on CRCW PRAMs with polynomial numbers of
processors.
A preliminary version of these results appeared in [5]. Many of these results also
form a part of the first author’s Ph.D. dissertation [4].
2. History of the Problem
Much of the lower bound work for CRCW PRAMs has been based on their close
relationship to unbounded fan-in circuits. These were defined by Furst et al. [9]
largely as a tool for trying to get an oracle to separate the polynomial-time hierarchy
from PSPACE. Stockmeyer and Vishkin [15] showed that simple CRCW PRAMs
can simulate unbounded fan-in circuits with essentially the same number of
processors as the circuit size and the same time as the circuit depth. In fact, by
restricting the instruction set of the CRCW PRAM to a limited set that includes
addition, comparison, indirect addressing, and a few related instructions,
Stockmeyer and Vishkin also showed that unbounded fan-in circuits can easily simulate
restricted CRCW PRAMs. The size of the resulting circuit is polynomial in the
number of processors multiplied by the time, and its depth is only a constant factor
larger than the time. Using the latter result and an $\Omega(\log^* n)$ lower bound of Furst et
al. [9] on the depth of polynomial-size unbounded fan-in circuits computing parity,
Stockmeyer and Vishkin [15] obtained lower bounds for this restricted form
of CRCW PRAM.
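To make the gate-simulation idea concrete, here is a minimal Python sketch (our own illustration under simplifying assumptions, not Stockmeyer and Vishkin's construction) of how a single CRCW step can evaluate an unbounded fan-in OR gate: every processor that reads a 1 writes 1 into an output cell that starts at 0. The function names `crcw_or` and `crcw_and` are hypothetical, and the processors of one step are simulated sequentially.
```python
from typing import List

def crcw_or(bits: List[int]) -> int:
    """One simulated CRCW PRAM step computing the OR of the input bits.

    Processor i reads input cell i; if it sees a 1, it writes 1 into the
    output cell. All writers write the same value, so the step is legal
    even under the common-write rule.
    """
    output_cell = 0  # non-input cells start at 0
    writes = [1 for x in bits if x == 1]  # value written by each active processor
    if writes:
        assert len(set(writes)) == 1, "common-write: all writers must agree"
        output_cell = writes[0]
    return output_cell

def crcw_and(bits: List[int]) -> int:
    """AND by the same pattern: processors that read a 0 write 1 into a
    'zero seen' cell, and whoever reads that cell takes its complement."""
    zero_seen = crcw_or([1 - x for x in bits])
    return 1 - zero_seen

print(crcw_or([0, 0, 1, 0]), crcw_and([1, 0, 1]))  # 1 0
```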
Because disjunctive normal-form formulas are unbounded fan-in circuits of
depth two, it follows that all Boolean functions may be computed in two steps using
exponential resources on the CRCW PRAM. However, it is not reasonable to use
exponentially many processors and memory cells. With polynomial resource
bounds, CRCW PRAMs can compute any function with formula size $n^{O(1)}$ in time
$O(\log n/\log\log n)$, using an algorithm based on an upper bound of size $n^{O(1)}$ and
depth $O(\log n/\log\log n)$ for unbounded fan-in circuits given by Chandra et al. [7].
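For parity, the $O(\log n/\log\log n)$ bound can be illustrated with a simple tree of fan-in $b \approx \log_2 n$. The Python below is a simplified stand-in (our assumption, not the Chandra et al. construction, which covers every function of polynomial formula size): each level replaces every block of $b$ values by its parity, which a depth-two DNF with $2^{b-1}$ clauses computes, so each level would cost a constant number of CRCW steps with polynomially many processors, and the number of levels is $\lceil \log n/\log b \rceil$. The name `parity_tree` is hypothetical, and the evaluation here is sequential.
```python
import math
from typing import List, Tuple

def parity_tree(bits: List[int], b: int) -> Tuple[int, int]:
    """Parity via a tree of fan-in b; returns (parity, number of levels).

    Each level replaces every block of at most b values by the block's parity.
    A block's parity is a depth-2 DNF with 2**(b - 1) clauses, so with
    b ~ log2(n) every level needs only polynomially many gates.
    """
    if b < 2:
        raise ValueError("fan-in b must be at least 2")
    values = list(bits)
    levels = 0
    while len(values) > 1:
        values = [sum(values[i:i + b]) % 2 for i in range(0, len(values), b)]
        levels += 1
    return (values[0] if values else 0), levels

n = 2 ** 16
b = int(math.log2(n))  # fan-in 16
result, levels = parity_tree([1] * n, b)
print(result, levels)  # 0 4  (log n / log log n = 16 / 4)
```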
Since Stockmeyer and Vishkin's paper, the lower bounds for unbounded fan-in
circuits have been significantly improved. Ajtai, extending [1], and L. Babai (private
communication) derived $\Omega(\sqrt{\log n})$ depth lower bounds for polynomial-size circuits
computing parity. Yao [16] markedly improved these results by showing truly
exponential size lower bounds for circuits of constant depth, but this improvement
did not increase the depth lower bound beyond $\Omega(\sqrt{\log n})$. Finally, Hastad [10],
using some techniques similar to those used by Yao, obtained an $\Omega(\log n/\log\log n)$
depth lower bound for polynomial-size circuits computing parity, which matches
the bound from the algorithm of Chandra et al. However, the CRCW PRAM lower
bounds that follow from Stockmeyer and Vishkin's simulation are still not entirely
satisfactory, since the bounds rely in an essential way on the specific restriction
placed on the instruction set. Some operations that are prohibited in this model
seem to be perfectly reasonable ones.
Abstract CRCW PRAMs can be shown to be much more powerful than these
restricted machines: because of their equivalence with unbounded fan-in circuits,
restricted CRCW PRAMs with polynomially many processors require exponential
time to compute almost all Boolean functions, whereas an abstract PRAM takes only
$O(\log n)$ time without even using its power of concurrent reads or writes.
Nevertheless, for certain specific functions we shall see that, by using direct
techniques, lower bounds as strong as those derived for these restricted CRCW
machines can be obtained for the most powerful model of CRCW PRAM.
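The $O(\log n)$ upper bound for an abstract PRAM can be sketched by repeated doubling: in each round, half of the remaining processors append a neighbouring block of inputs to their own, so after $\lceil \log_2 n \rceil$ rounds one processor knows the whole input and can output $f(x)$. This Python sketch is a hypothetical illustration that assumes unbounded cell contents and free local computation, as in the abstract model; each cell is read and written by at most one processor per round, so no concurrency is used.
```python
from typing import Callable, List, Tuple

def abstract_pram_eval(bits: List[int],
                       f: Callable[[Tuple[int, ...]], int]) -> Tuple[int, int]:
    """Evaluate an arbitrary f on n bits with exclusive reads and writes only.

    Assumptions: a cell may hold any tuple, local computation is free, and
    processor j already knows x_j (one initial read step, not counted).
    Returns (f(x), rounds) with rounds = ceil(log2 n).
    """
    if not bits:
        raise ValueError("need at least one input bit")
    n = len(bits)
    cells = [(x,) for x in bits]  # cell j holds the block collected so far
    state = [(x,) for x in bits]  # processor j's state remembers that block
    stride, rounds = 1, 0
    while stride < n:
        # Processor j (a multiple of 2*stride) reads cell j + stride, which no
        # other processor reads or writes in this round, then writes cell j.
        for j in range(0, n, 2 * stride):
            if j + stride < n:
                state[j] = state[j] + cells[j + stride]
                cells[j] = state[j]
        stride *= 2
        rounds += 1
    # Processor 0 now knows the whole input and can write f(x) as the output.
    return f(state[0]), rounds

print(abstract_pram_eval([1, 0, 1, 1, 1], lambda t: sum(t) % 2))  # (0, 3)
```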
By applying and modifying the techniques of [9], Beame [2] derived the first
nontrivial lower bound that applies to the CRCW PRAM model described here.
He showed that any CRCW PRAM computing the parity function with $n^{O(1)}$
memory cells and an unbounded number of processors requires time Q(e).
Later, using the main lemma in [10], Beame [3] obtained the following: any CRCW
PRAM that computes the parity function with $n^{O(1)}$ processors (in fact with as
many as n- + processors for some $\delta > 0$) and unbounded memory requires time
$\Omega(\sqrt{\log n})$. With the same techniques, an a( &) lower bound is easily shown for
common-write CRCW PRAMs (for definitions, see Section 3) that have no bound
on the number of processors but have a bound of O(nz6*) on the number
of cells for some $\delta > 0$.
It was shown by B. Chor (private communication) and by Li and Yesha [13] that a
simulation of abstract CRCW PRAMs by unbounded fan-in circuits can be
combined directly with Hastad's circuit lower bound to obtain the $\Omega(\sqrt{\log n})$ lower
bound. However, this simulation does not yield the above lower bound for the
common-write model with an unbounded number of processors. The simulation
states that any CRCW PRAM solving a decision problem on $n$ Boolean inputs
using $p(n)$ processors and $T(n)$ time can be simulated by an unbounded fan-in
circuit of size $p(n)\,2^{2^{T(n)+O(1)}}$ and depth $O(T(n))$.
Beame [3] and Li and Yesha [13] have also independently shown optimal bounds
on the time needed by CRCW PRAMs to compute functions whose many-bit
output is required to appear in a single memory cell. However, as was noted in
[3], such an output requirement is somewhat artificial, and the lower bounds
disappear if each bit of the output is allowed to appear in a separate memory cell.
3. Definitions and Preliminaries
Definition. A CRCW PRAM is a shared memory machine with processors
$P_1, \dots, P_{p(n)}$, which communicate through memory cells $C_1, \dots, C_{c(n)}$. The
values of the input variables $x_1, \dots, x_n$ are initially stored in the first $n$ cells of
memory, $C_1, \dots, C_n$, respectively. Initially all cells other than the input cells
contain the value 0. The output of the machine is the value in the cell $C_1$ at
termination.
Before each step $t$, processor $P_i$ is in state $q_i^t$. At time step $t$, depending on $q_i^t$,
processor $P_i$ reads some cell $C_j$ of shared memory; then, depending on the contents
of $C_j$ and on $q_i^t$, it assumes a new state $q_i^{t+1}$ and, depending on this state, writes a value
$v = v(q_i^{t+1})$ into some cell.
When several processors are attempting to write into a single cell at the same
time step the one that succeeds will be the lowest numbered processor. (A CRCW
PRAM is defined to be a common-write machine if, whenever several processors
are attempting to write into the same cell at a given time step, they all try to write
the same value.)
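One step of the PRIORITY CRCW PRAM defined here can be simulated as follows. This is a sketch with hypothetical names (`Processor`, `priority_step`, `make_proc`); processors are 0-indexed, every read in a step sees the memory as it was before the step, and, as a convenience not in the definition, a processor may return `None` to skip writing (equivalent to writing into a private scratch cell).
```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

@dataclass
class Processor:
    read_addr: Callable[[Any], int]                    # state -> cell to read
    transition: Callable[[Any, Any], Any]              # (state, value read) -> new state
    write: Callable[[Any], Optional[Tuple[int, Any]]]  # new state -> (cell, value) or None

def priority_step(procs: List[Processor], states: List[Any],
                  memory: List[Any]) -> Tuple[List[Any], List[Any]]:
    """Simulate one PRIORITY CRCW step; returns (new states, new memory)."""
    # Read phase: every processor sees the memory as it was before the step.
    reads = [memory[p.read_addr(q)] for p, q in zip(procs, states)]
    new_states = [p.transition(q, v) for p, q, v in zip(procs, states, reads)]
    # Write phase: scan processors in index order; the first writer of a cell wins.
    new_memory = list(memory)
    written = set()
    for p, q in zip(procs, new_states):
        w = p.write(q)
        if w is not None and w[0] not in written:
            written.add(w[0])
            new_memory[w[0]] = w[1]
    return new_states, new_memory

def make_proc(i: int, out: int) -> Processor:
    # Hypothetical toy program: read input cell i and, if it holds 1,
    # try to write the tag 10 + i into cell `out`.
    return Processor(
        read_addr=lambda q: i,
        transition=lambda q, v: v,
        write=lambda q: (out, 10 + i) if q == 1 else None,
    )

procs = [make_proc(i, out=3) for i in range(3)]
states, memory = priority_step(procs, [None] * 3, [0, 1, 1, 0])
print(memory)  # [0, 1, 1, 11]: processors 1 and 2 both write cell 3; processor 1 wins
```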
The CRCW PRAM defined above has been called the PRIORITY CRCW
PRAM and is the most powerful version of CRCW PRAM normally considered.
Thus lower bounds for this model will apply to any standard model of CRCW
PRAM.
In studying the progress of CRCW PRAM computations, what is important is
the set of inputs which lead to a given value in a memory cell or a given state of a
processor at a particular time step. The computation then may be viewed as
operating not on actual values so much as on the partitions associated with them.
Definition. Let $M$ be a CRCW PRAM. For any processor $P_i$ the processor
partition, P(M, i, t), of the input set at time step $t$ is defined so that two inputs are
in the same equivalence class of P(M, i, t) if and only if they lead to the same state
of processor $P_i$ at the end of time step $t$.
For any cell $C_j$ the cell partition, C(M, j, t), of the input set at time $t$ is defined
so that two inputs are in the same equivalence class of C(M, j, t) if and only if they
lead to the same contents of cell $C_j$ at the end of time step $t$.
At time 0, the cell partitions for the first n memory cells have exactly two
equivalence classes, one consisting of those inputs for which the value of the
variable in the cell is 0, the other consisting of those inputs for which the value of
that variable is 1. Initially all other processor and cell partitions have only one
equivalence class consisting of all the inputs.
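For tiny $n$, these partitions can be computed by brute force: run the machine on every input and group inputs by the contents of a cell (processor partitions work the same way, keyed on states). The sketch below assumes a hypothetical toy machine, `or_machine_memory`, that computes the OR of its inputs into one extra cell in a single step; cells are 0-indexed, so cell 0 here is $C_1$.
```python
from collections import defaultdict
from itertools import product
from typing import Callable, Hashable, List, Tuple

Input = Tuple[int, ...]

def partition_by(key: Callable[[Input], Hashable], n: int) -> List[List[Input]]:
    """Split {0,1}^n into classes of inputs that give the same key value."""
    classes = defaultdict(list)
    for x in product((0, 1), repeat=n):
        classes[key(x)].append(x)
    return list(classes.values())

def or_machine_memory(x: Input, t: int) -> Tuple[int, ...]:
    """Hypothetical toy machine: cells 0..n-1 hold the input, cell n starts at 0;
    in step 1 every processor that reads a 1 writes 1 into cell n."""
    mem = list(x) + [0]
    if t >= 1 and any(x):
        mem[len(x)] = 1
    return tuple(mem)

def cell_partition(j: int, t: int, n: int) -> List[List[Input]]:
    # C(M, j, t): inputs are equivalent iff cell j has the same contents after step t
    return partition_by(lambda x: or_machine_memory(x, t)[j], n)

n = 3
print(len(cell_partition(0, 0, n)))  # 2: an input cell splits on its variable
print(len(cell_partition(n, 0, n)))  # 1: a non-input cell is constant 0
print(len(cell_partition(n, 1, n)))  # 2: {000} versus all other inputs
```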
We now look at a measure of the complexity of partitions that was used in [2]
and [3] to prove lower bounds for CRCW PRAMs.
Definition. Let $f$ be a Boolean function defined on a set $I \subseteq \{0,1\}^n$. A Boolean
formula $F$ represents $f$ on $I$ if the inputs $x \in I$ satisfy $F$ exactly when $f(x) = 1$. Let
the maximum clause length of a disjunctive normal form (DNF) formula $F$ be the maximum number of
literals in any clause of $F$. The (Boolean) degree of $f$ on $I$, $\delta(f)$, is the smallest
maximum clause length of all DNF formulas representing $f$ on $I$. We extend this definition to
sets of functions $\mathcal{F}$ by letting $\delta(\mathcal{F}) = \max_{f \in \mathcal{F}} \delta(f)$.
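For small $n$, $\delta(f)$ on $I$ can be computed by brute force. The sketch relies on one observation not spelled out above: a clause satisfied by $x$ fixes some coordinates $S$ to $x$'s values, and it may appear in a DNF representing $f$ on $I$ exactly when no $y \in I$ with $f(y) = 0$ agrees with $x$ on $S$. Taking a shortest such clause for each 1-point gives an optimal DNF, so $\delta(f)$ is the maximum, over 1-points $x$, of that shortest length. The helper names are hypothetical and indices are 0-based.
```python
from itertools import combinations, product
from typing import Callable, Iterable, Sequence, Tuple

Point = Tuple[int, ...]

def degree(f: Callable[[Point], int], I: Sequence[Point]) -> int:
    """Brute-force Boolean degree delta(f) of f on I (exponential time; tiny n only)."""
    zeros = [y for y in I if not f(y)]
    best = 0
    for x in I:
        if not f(x):
            continue
        n = len(x)
        # Shortest set S of coordinates such that the clause "agree with x on S"
        # accepts no 0-point of f inside I (S = all coordinates always works).
        k = next(
            k for k in range(n + 1)
            if any(all(any(y[i] != x[i] for i in S) for y in zeros)
                   for S in combinations(range(n), k))
        )
        best = max(best, k)
    return best

def degree_of_family(fs: Iterable[Callable[[Point], int]], I: Sequence[Point]) -> int:
    """delta(F) = max over f in F of delta(f); 0 for an empty family."""
    return max((degree(g, I) for g in fs), default=0)

def f(x: Point) -> int:
    # x1 x2 x3 + (not x1) x4 x5, written with 0-based indices
    return int((x[0] and x[1] and x[2]) or (not x[0] and x[3] and x[4]))

def parity(x: Point) -> int:
    return sum(x) % 2

cube = list(product((0, 1), repeat=5))
half = [x for x in cube if x[0] == x[1]]  # a proper subset I of {0,1}^5
print(degree(f, cube), degree(parity, cube), degree(parity, half))  # 3 5 3
```
The last value shows that the set $I$ matters: on inputs with $x_1 = x_2$, parity of five bits depends only on $x_3, x_4, x_5$, so its degree on that set drops to 3.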
The terminology of degree is derived from the standard way of writing a formula
with the Boolean $\vee$ as addition and the Boolean $\wedge$ as multiplication and then
viewing the resulting formula as a polynomial. This should not be confused with
the degree of a polynomial over the finite field of two elements, where exclusive-OR
rather than $\vee$ is the appropriate additive operation.
In the notation of many lower bound proofs for monotone formulas, we could
define the prime implicants and prime clauses of a Boolean function $f$. (Prime
clauses are essentially prime implicants of $\bar{f}$.) These have been described as
minterms and maxterms, respectively, in the notation used by Yao [16] or Hastad
[10]. Observe that the degree of a function and the length of its longest minterm
or maxterm may differ, because its longest minterm may be longer than the longest
clause in an optimal DNF formula representing it. Consider the function $f$ defined
by the DNF formula $x_1x_2x_3 + \bar{x}_1x_4x_5$. It has a minterm $x_2x_3x_4x_5$, which is longer
than $\delta(f)$.
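The example can be checked mechanically. This Python sketch (hypothetical helper names; exhaustive search, so tiny $n$ only) enumerates all clauses and keeps the prime implicants, that is, the minterms. For $f = x_1x_2x_3 + \bar{x}_1x_4x_5$ it finds $x_2x_3x_4x_5$ among them even though $\delta(f) = 3$. Coordinates are 0-indexed, so $x_2x_3x_4x_5$ appears as `{1: 1, 2: 1, 3: 1, 4: 1}`.
```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, ...]
Clause = Dict[int, int]  # 0-based coordinate -> required bit (a conjunction of literals)

def is_implicant(f: Callable[[Point], int], clause: Clause, n: int) -> bool:
    """True if every x in {0,1}^n satisfying the clause has f(x) = 1."""
    free = [i for i in range(n) if i not in clause]
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, b in clause.items():
            x[i] = b
        for i, b in zip(free, bits):
            x[i] = b
        if not f(tuple(x)):
            return False
    return True

def prime_implicants(f: Callable[[Point], int], n: int) -> List[Clause]:
    """All minterms: implicants from which no literal can be dropped."""
    result = []
    for spec in product((0, 1, None), repeat=n):
        clause = {i: b for i, b in enumerate(spec) if b is not None}
        if is_implicant(f, clause, n) and not any(
            is_implicant(f, {j: b for j, b in clause.items() if j != i}, n)
            for i in clause
        ):
            result.append(clause)
    return result

def f(x: Point) -> int:
    return int((x[0] and x[1] and x[2]) or (not x[0] and x[3] and x[4]))

pis = prime_implicants(f, 5)
print(max(pis, key=len))  # {1: 1, 2: 1, 3: 1, 4: 1}
```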
Definition. Let $A$ be a partition of a set $I \subseteq \{0,1\}^n$. Define the degree of $A$,
$\delta(A)$, to be $\delta(\mathcal{F})$ on $I$, where $\mathcal{F}$ is the set of characteristic functions of the
equivalence classes of $A$ in $I$.
The major proof technique of the lower bounds for parity on unbounded fan-in
circuits is the use of restrictions to set some of the input bits. Using restrictions
permits a simplified description of the results of computations but does not
drastically reduce the difficulty of the function being computed. The main idea
behind using them is that, although apparently complex operations like the OR of
n bits are computed in one step, by setting relatively few inputs to 0 or 1 the results
of these operations are simple. In the case of the OR of $n$ bits, setting a single input
to 1 makes it trivial.
Definition. A restriction $\pi$ on $K \subseteq \{1, \dots, n\}$ is a function $\pi: K \to \{0, 1, *\}$
where:
$\pi(i) = 1$ means $x_i$ is set to 1,
$\pi(i) = 0$ means $x_i$ is set to 0,
$\pi(i) = *$ means $x_i$ is unset.
We define the results of applying a restriction $\pi$ to a partition, $A|_\pi$, a function,
$f|_\pi$, a Boolean formula, $F|_\pi$, a circuit, $C|_\pi$, as well as to sets of these objects, $\mathcal{F}|_\pi$, etc., in
the natural way. If $\sigma$ and $\tau$ are restrictions, then $\sigma\tau$ is the restriction that results
from applying $\sigma$ first and then applying $\tau$. For any $K \subseteq \{1, \dots, n\}$ define $\mathrm{Proj}[K]$ to
be the set of restrictions that assign 0 or 1 exactly to the inputs in $K$.
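Restrictions, their composition $\sigma\tau$, and $\mathrm{Proj}[K]$ translate directly into code. In this Python sketch (hypothetical names; 0-based indices; a restricted function keeps its $n$ arguments and simply ignores the ones that $\pi$ sets), setting one input of OR to 1 makes it constant, whereas fixing an input of parity leaves the parity, or its complement, of the remaining inputs, in line with the remark that restrictions do not drastically reduce the difficulty of the function.
```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple, Union

Point = Tuple[int, ...]
Restriction = Dict[int, Union[int, str]]  # 0-based index -> 0, 1, or '*'

def restrict(f: Callable[[Point], int], pi: Restriction) -> Callable[[Point], int]:
    """f|_pi: variables set by pi are overwritten; the others are read from x."""
    def g(x: Point) -> int:
        y = list(x)
        for i, v in pi.items():
            if v != '*':
                y[i] = v
        return f(tuple(y))
    return g

def compose(sigma: Restriction, tau: Restriction) -> Restriction:
    """sigma tau: apply sigma first, then tau on the variables sigma left unset."""
    result = dict(sigma)
    for i, v in tau.items():
        if v == '*':
            continue
        if sigma.get(i, '*') != '*':
            raise ValueError(f"variable {i} is already set by sigma")
        result[i] = v
    return result

def proj(K: Iterable[int]) -> List[Restriction]:
    """Proj[K]: every restriction assigning 0 or 1 to exactly the indices in K."""
    ks = sorted(K)
    return [dict(zip(ks, bits)) for bits in product((0, 1), repeat=len(ks))]

def OR(x: Point) -> int:
    return int(any(x))

def parity(x: Point) -> int:
    return sum(x) % 2

cube = list(product((0, 1), repeat=4))
print(all(restrict(OR, {2: 1})(x) == 1 for x in cube))  # True
print(all(restrict(parity, {0: 1})(x) == (1 + sum(x[1:])) % 2 for x in cube))  # True
```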
Definition. If a circuit $D$ is $C|_\pi$ for some restriction $\pi$, then we say that $C$
contains $D$, and the gates of $C$ that remain undetermined in $D$ will be said to take
on the value $*$ in $C$ when $\pi$ is applied.
In several places we need the following simple observation.
LEMMA 3.1. Let $A$ be a partition of a set $I \subseteq \{0,1\}^n$. For every $K \subseteq \{1, \dots, n\}$,
there exists a restriction $\sigma \in \mathrm{Proj}[K]$ such that $\delta(A) \le |K| + \delta(A|_\sigma)$.
PROOF. For each $\sigma \in \mathrm{Proj}[K]$ let $\mathcal{F}_\sigma$ be a set of DNF formulas that represent
the characteristic functions of the equivalence classes in $A|_\sigma$ and that have
maximum clause length bounded by $\delta(A|_\sigma)$. To each clause of every formula
in $\mathcal{F}_\sigma$ append the conjunctive clause $C_\sigma$, which is true exactly on those inputs in
$\{0,1\}^n$ that agree with $\sigma$, to obtain a set of formulas $\mathcal{F}'_\sigma$. By construction, the