Superoptimizer: a look at the smallest program

01 Oct 1987, Vol. 22, Iss. 10, pp. 122-126

Superoptimizer -- A Look at the Smallest Program
Henry Massalin
Department of Computer Science
Columbia University
New York, NY 10027
Abstract
Given an instruction set, the superoptimizer finds the shortest
program to compute a function. Startling programs have been
generated, many of them engaging in convoluted bit-fiddling bearing
little resemblance to the source programs which defined the func-
tions. The key idea in the superoptimizer is a probabilistic test that
makes exhaustive searches practical for programs of useful size. The
search space is defined by the processor's instruction set, which may
include the whole set, but it is typically restricted to a subset. By
constraining the instructions and observing the effect on the output
program, one can gain insight into the design of instruction sets. In
addition, superoptimized programs may be used by peephole op-
timizers to improve the quality of generated code, or by assembly
language programmers to improve manually written code.
1. Introduction
The search for the optimal algorithm to compute a function is one of
the fundamental problems in computer science. In contrast to
theoretical studies of optimal algorithms, practical applications
motivated the design, implementation, and use of the superoptimizer.
Instead of proving upper or lower bounds for abstract algorithms, the
superoptimizer finds the shortest program in the program space
defined by the instruction set of commercial machines, such as the
Motorola 68000 or Intel 8086.
The functions to be optimized are specified with programs written
using the target machine's instruction set. Therefore, the input to the
superoptimizer is a machine language program. The output is
another program, which may be shorter. Since both programs run on
the same processor, with a well-defined environment, we can estab-
lish their equivalence.
A probabilistic test and a method for pruning the search tree make
the superoptimizer a practical tool for programs of limited size
(about 13 machine instructions).
In section 2, we describe an interesting example to illustrate the su-
peroptimizer approach. The design and algorithms used in the super-
optimizer are detailed in section 3. We discuss the applications and
limitations of the superoptimizer in section 4. In section 5, we com-
pare the superoptimizer with related work. The conclusion in section
6 is followed by a list of interesting minimal programs in appendix I.
2. An Interesting Example
We begin with an example to show what superoptimized code looks
like. The instruction set used here, as in most of the paper, is
Motorola's 68020 instruction set. Our example is the signum
function, defined by the following program:
signum (x)
int x;
{
    if (x > 0) return 1;
    else if (x < 0) return -1;
    else return 0;
}
This function compiles to 9 instructions occupying 18 bytes of
memory on the SUN-3 C compiler. Most programmers when asked
to write this function in assembly language would use comparison
instructions and conditional jumps to decide in what range the ar-
gument lies. Typically, this takes 8 68020 instructions, although
clever programmers can do it in 6.
It turns out that by exploiting various properties of two's comple-
ment arithmetic one can write signum in four instructions! This is
what superoptimizer found when fed the compiled machine code for
the signum function as input:
(x in d0)
add.l  d0,d0      | add d0 to itself
subx.l d1,d1      | subtract (d1 + Carry) from d1
negx.l d0         | put (0 - d0 - Carry) into d0
addx.l d1,d1      | add (d1 + Carry) to d1
(signum(x) in d1)   (4 instructions)
Like a typical superoptimized program, the logic is really con-
voluted. One of the first things that comes to mind is "where are the
conditional jumps?". As we will see later, many functions that
would normally be written with conditional jumps are optimized into
short programs without them. This can result in significant speedups
for certain pipelined machines that execute conditional jumps slowly.
Let us see how it works. The "add.l d0,d0" instruction doubles the
contents of register d0, but more importantly, the sign bit is now in
the carry flag. The "subx.l d1,d1" instruction computes "d1 - d1 -
carry --> d1". Regardless of the initial value of d1, d1-d1-carry is
-carry. Thus d1 is -1 if d0 was negative and 0 otherwise. Besides
negating, "negx.l d0" will set the carry flag if and only if d0 was
nonzero. Finally, "addx.l d1,d1" doubles d1 and adds the carry. Now
if d0 was negative, d1 is -1 and carry is set, so d1+d1+carry is -1; if
d0 was 0, d1 is 0 and carry is clear, so d1+d1+carry is 0; if d0 was
positive, d1 is 0 and carry is set, so d1+d1+carry is 1.
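To make the carry-flag reasoning concrete, here is a small C sketch (not from the paper) that emulates the four instructions, tracking the 68020's X (extend) flag by hand as described above, and checks the result against the usual definition of signum:

#include <assert.h>
#include <stdint.h>

static int32_t signum_68020(int32_t xin)
{
    uint32_t d0 = (uint32_t)xin;
    uint32_t d1 = 0x12345678u;      /* initial contents of d1 do not matter */
    unsigned x;                     /* the X (extend) flag */

    x  = d0 >> 31;                  /* add.l d0,d0 : X = old sign bit */
    d0 = d0 + d0;

    d1 = d1 - d1 - x;               /* subx.l d1,d1 : d1 = -X; X unchanged */

    {                               /* negx.l d0 : X set iff (d0 + X) was nonzero */
        unsigned borrow = (d0 != 0u) || (x != 0u);
        d0 = 0u - d0 - x;
        x  = borrow;
    }

    d1 = d1 + d1 + x;               /* addx.l d1,d1 */
    return (int32_t)d1;
}

int main(void)
{
    for (int32_t v = -1024; v <= 1024; v++)
        assert(signum_68020(v) == (v > 0) - (v < 0));
    return 0;
}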

3. Superoptimizer Internals
Superoptimizer takes a program written in machine language as the
input source. It finds the shortest program that computes the same
function as the source program by doing an exhaustive search over
all possible programs. The search space is defined by choosing a
subset of the machine's instruction set, and the op-codes of these
instructions are stored in a table. Superoptimizer consults this table
and generates all combinations of these instructions, first of length 1,
then of length 2, and so on. Each of these generated programs is
tested, and if found to match the function of the source program,
superoptimizer prints the program and halts.
Two methods are used to reduce the search time. The first is a fast
probabilistic test for determining the equivalence of two programs.
The second is a method for pruning the search space while
maintaining the guarantee of optimality. These two methods will
now be discussed, but first a boolean-logic equivalence test will be
explained, which was the first test procedure implemented, because
it finds use in the tree pruning method.
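To fix ideas, the generate-and-test enumeration just described can be pictured as the C sketch below. It is only an illustration; the type names, helper routines, and table layout are invented here rather than taken from the paper, and the pruning hook of section 3.3 is left as a comment.

/* Skeleton of the search: try all sequences of length 1, then 2, ... */
#define MAX_LEN 13          /* paper: programs of up to about 13 instructions */

typedef int Instr;          /* an index into the opcode table */
extern int  num_opcodes;    /* size of the chosen instruction subset */
extern int  passes_probabilistic_test(const Instr *prog, int len);
extern int  verify(const Instr *prog, int len);      /* slow, exact check */
extern void print_program(const Instr *prog, int len);

static int search(Instr *prog, int pos, int len)
{
    if (pos == len)
        return passes_probabilistic_test(prog, len) && verify(prog, len);

    for (Instr op = 0; op < num_opcodes; op++) {
        prog[pos] = op;
        /* pruning with the bit tables of section 3.3 would go here */
        if (search(prog, pos + 1, len))
            return 1;
    }
    return 0;
}

int superoptimize(void)
{
    Instr prog[MAX_LEN];
    for (int len = 1; len <= MAX_LEN; len++)     /* shortest length first */
        if (search(prog, 0, len)) {
            print_program(prog, len);
            return len;                          /* first hit is the optimum */
        }
    return -1;
}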
3.1. Boolean Test
The most important part of superoptimizer is the routine that deter-
mines whether two pieces of code compute the same function. The
first version of superoptimizer used what we call the boolean
program verifier. The idea was to express the function output in
terms of boolean-logic operations on the input argument. Once this
is done, two programs are equivalent if their boolean expressions
match minterm for minterm.
In practice, some instructions such as add and mul have boolean ex-
pressions with on the order of 2^31 minterms. Various methods had
been devised to reduce the memory requirements, but it took too
long to compute the boolean expressions for every program
generated. The initial version of superoptimizer tested about 40
programs per second, and this allowed programs of up to 3 instruc-
tions to be generated in reasonable time.
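The boolean verifier itself is not listed in the paper. The toy sketch below (with an artificially small 8-bit argument, so the whole truth table fits trivially) only conveys the flavor of comparing two programs' boolean behaviour exhaustively; with 32-bit operands the corresponding tables would have on the order of 2^31 minterms, which is why the probabilistic test of section 3.2 replaced this approach.

#include <stdint.h>
#include <string.h>

typedef uint8_t (*fn8)(uint8_t);

/* Tabulate each function over all 256 inputs and compare entry by entry. */
static int equivalent8(fn8 f, fn8 g)
{
    uint8_t tf[256], tg[256];
    for (int x = 0; x < 256; x++) {
        tf[x] = f((uint8_t)x);
        tg[x] = g((uint8_t)x);
    }
    return memcmp(tf, tg, sizeof tf) == 0;
}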
3.2. Probabilistic Test
The idea behind the probabilistic test is simple: run the machine
code for the program being tested a few times with some set of inputs
and check whether the outputs match those of the source program.
The idea here is that most programs will fail this simple test, and a
full program verification test will be done only for the few programs
that this test fails to catch. Running through a few carefully chosen
test vectors takes very little time. Currently, superoptimizer can test
50,000 programs per second and the exhaustive search approach
becomes practical.
The test vectors are chosen (manually) to maximize the probability
that a random program will fail on the first or second test. For
example, the test vectors for the signum function included -1000, 0
and 456 as the first three vectors. This quickly eliminates programs
that return the same answer regardless of argument, answers of the
same sign, as well as programs that return their argument. Following
these vectors, all the numbers from -1024 to 1024 were tested.
It was found in practice that a program has a very low probability of
passing this execution test and failing the boolean verification test.
This fact proves very useful since most programs of interest have
boolean expressions that are too large to fit in memory. We can
dispense with the boolean test and manually inspect the generated
programs for correctness, without having to analyze a large number
of wrong programs. This manual check is not difficult since the
programs are small (about 4 to 13 instructions). Currently, super-
optimizer runs without the boolean check, and the author has yet to
find an incorrect program.
One problem introduced by the probabilistic execution test is
machine dependency. The test works only if the instruction set being
searched can be executed on the machine running the superoptimizer.
In other words, if we wish to change the instruction set, we would
have to port the superoptimizer to the new machine. This port is not
too difficult since the current version of superoptimizer is rather
short (about 300 lines of 68020 assembly code); however, it does
require that one translate it into the target assembly code.
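As a sketch (the function types and the vector list are illustrative, echoing the signum vectors mentioned above), the probabilistic filter amounts to the following; the point is that a handful of carefully chosen inputs rejects almost every wrong candidate before any expensive check is attempted:

#include <stdint.h>

typedef int32_t (*fn32)(int32_t);

/* First vectors chosen to kill common wrong answers quickly (constant
   programs, sign-only programs, identity programs), then a dense sweep. */
static int probably_equivalent(fn32 candidate, fn32 reference)
{
    static const int32_t quick[] = { -1000, 0, 456 };
    for (unsigned i = 0; i < sizeof quick / sizeof quick[0]; i++)
        if (candidate(quick[i]) != reference(quick[i]))
            return 0;
    for (int32_t v = -1024; v <= 1024; v++)
        if (candidate(v) != reference(v))
            return 0;
    return 1;   /* survivor: hand it to the exact (or manual) check */
}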
3.3. Pruning
In order to further reduce the search time, we filter out instruction
sequences that are known not to occur in any optimal program. Any
sequence of instructions that has the same effect on the machine state
as a shorter sequence cannot be part of an optimal program, because
if it were, one could get a shorter program by substituting the shorter
sequence, and therefore the program was not optimal. Typical se-
quences include the obviously silly "move X,Y; move X,Y" and
"move X,Y; move Y,X", "and X,Y; move Z,Y" in which the MOVE
destroys the result of the AND, "and #0,X" which does the same
thing as "clr X", and "and X,Y; <any> Z,W; and.l X,Y" where the
second AND is superfluous.
This filtering is done with N-dimensional bit tables, where N is the
length of the longest sequence we wish to filter. Each instruction in
the sequence we wish to test indexes one dimension of the bit table,
and a lookup value of '1' causes the program to be rejected as non-
optimal (and also as incorrect, since it is the same as a shorter
program, and superoptimizer has already checked all shorter
programs).
There are two ways that these bit tables can be filled. A human can
tell the bit table maker program to exclude all "move X,Y; move
Y,X" sequences. The program then scans all instructions in all
dimensions of the bit matrix and sets the values accordingly. One
can also run superoptimizer with the boolean test, and have it find
the equivalences on its own.
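A two-instruction slice of such a table might look like the following sketch (the table size and helper names are invented for illustration): a bit indexed by two consecutive opcodes records that the pair is already known to be reducible, so any candidate containing it can be rejected without being executed.

#include <stdint.h>

#define NOPS 256                      /* illustrative opcode-table size */

/* prune2[a][b] != 0 means the pair (a; b) is equivalent to something
   shorter, so no optimal program contains it. */
static uint8_t prune2[NOPS][NOPS];

static void mark_redundant_pair(int a, int b) { prune2[a][b] = 1; }

/* Reject a candidate as soon as any adjacent pair is marked. */
static int contains_redundant_pair(const int *prog, int len)
{
    for (int i = 0; i + 1 < len; i++)
        if (prune2[prog[i]][prog[i + 1]])
            return 1;
    return 0;
}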
4. Applications and Limitations
4.1. Current Limitations
Even with the probabilistic test, the exhaustive search still grows ex-
ponentially with the number of instructions in the generated
program. The current version of superoptimizer has generated
programs 12 instructions long in several hours running time on a
16MHz 68020 computer. Therefore, the superoptimizer has limited
usefulness as a code generator for a compiler.
Another difficulty concerns pointers. A pointer can point anywhere
in memory and so to model a pointer in terms of boolean expressions
one needs to take all of memory into account. Even on a 256-byte
machine, there are 2^(2^(256*8)) possible minterms, and these are
just too many. We have explored the probabilistic test approach for
pointers, but the results have been inconclusive.
Currently, we have only the 68020 version of the superoptimizer run-
ning the probabilistic test, so the instruction sets are restricted to sub-
sets of the 68020 set. The machine-independent version of super-
optimizer is limited to very short programs.
4.2. Applications
Because of the pointer problem, superoptimizer works best when the
instruction set is constrained to register-register operations. Even so,
it can be used to analyze instruction sets. Some of the programs in
appendix I were tried on the Western Electric WE32000
microprocessor and in every case the resulting program was longer

than the 68020 programs. The reason for this was found to be the
lack of an add-with-carry instruction and the fact that the flags are
set according to the 32-bit result, even for byte-sized operands. The
National Semiconductor NS32032 was also found to suffer from flag
problems. Here the difficulty is that extra instructions are needed to
test the outcome of an operation because few instructions set the
flags.
Another use would be in the design of RISC architectures. One can
try various instruction sets simply by coding their function in terms
of boolean expressions and seeing what superoptimizer comes up
with. A particular instruction may be omitted if superoptimizer finds
a short equivalent sequence of other instructions.
The superoptimizer may be very useful in optimizing little tasks that
often confront a compiler. An example is finding the optimal
program that multiplies by a particular constant for use in accessing
arrays and such. Some examples of multiplication by constants can
be found in 1.6.
Another useful feature of superoptimizer is the identity tables con-
taining the equivalent program sequences found. These programs
may be extracted and used to increase the power of a conventional
peephole optimizer.
In practice, the best use of superoptimizer has been as an aid to the
assembly language programmer. An experienced programmer can
use superoptimizer to come up with nifty equivalent sequences for
small sections of his code, while retaining the overall logical flow
that makes a program maintainable. This method has been used by
the author (along with another program that optimizes code emulat-
ing state machines) to write the C library function printf in only 500
bytes.
5. Comparison with Related Work
The most commonly used optimization techniques are those that at-
tempt to improve the code that a compiler produces. Examples are
peephole optimizers and data-flow analysis. Peephole optimizers
[2] are table driven pattern matchers that operate on the assembly
language code produced by the compiler. Every time a sequence of
instructions is matched by one of the tables, a smaller and faster
replacement sequence is used.
Data-flow analysis [1] is a technique applied during the semantic and
code generation phases of the compilation process. It improves code
in several ways. First, it eliminates redundant computations
(common sub-expression elimination). Second, it moves expressions
within a loop whose values do not depend on the loop variable to
outside the loop (loop invariance). Third, (also in a loop) it converts
expressions of the form 'K * loop-index' into the equivalent arith-
metic progression 'TMP = TMP + K' (strength reduction).
These methods are general. They work regardless of the machine-
specific details such as the representation of an integer. However,
usually the result is not optimal in either space or speed. Super-
optimizer depends on the instruction set; however, the code is
guaranteed to be optimal in space and it does a very good job in
speed as well.
Krumme and Ackley [4] have written a code generator for the
DEC-10 computer that is based on exhaustive search. Their method
translates each interior node of an expression tree into several viable
instruction sequences. These sequences are then pieced together to
form a set of translations for the entire expression. This set is then
searched to find the cheapest alternative.
In their method, there is a one to one correspondence between the
instructions in the translation and the original expression. For ex-
ample, if there's an add in the expression, there will also be an add
somewhere in the generated code. Superoptimizer has a more global
view of the problem. It 'translates' one sequence of instructions into
another completely different sequence. On the other hand, super-
optimizer can't translate large programs.
The two approaches can be seen as complementing each other. Su-
peroptimizer can be used to prepare the code generation tables used
in Krumme and Ackley's method. Their method can also be incor-
porated into superoptimizer to increase the size of programs that can
be handled. Superoptimizer can generate several short equivalent
sequences for small fragments of the source program, and then
Krumme and Ackley's method would be used to piece these together
and find a short overall sequence.
Kessler [3] has written a code optimization tool, which translates se-
quences of instructions into one single instruction. The super-
optimizer can be seen as a more general tool with broader applica-
tions, since it can transform programs of many instructions to
another one of several instructions. However, Kessler's optimizer
works regardless of program size, and therefore can be easily used to
optimize compiled code. Another difference is that he uses template
matching, while superoptimizer relies on exhaustive search.
6. Conclusion
We have taken a practical approach to the search for the optimal
program. We have found that the shortest programs are surprising,
often containing sequences of instructions that one would not expect
to see side by side. The signum function is an example of this, and
the min and max functions given in section 1.3 contain a beautiful
combination of the logical and and the arithmetic add.
Exhaustive search is justified by these results, and a probabilistic test
allows programs of practical size to be produced. Although results
are limited to a dozen instructions, those found are already useful.
Many examples of these can be found in Appendix I.
One of the most interesting results is not the programs themselves,
but a better understanding of the interrelations between arithmetic
and logical instructions. Similar ideas seem to come up consistently
in the superoptimized programs. These include the sequence 'add.l
d1,d1; subx.l d1,d1' that extracts the sign of a number in the signum
and abs functions and the sequence 'sub.l d1,d0; and.l d2,d0; add.l
d1,d0' that selects one of two values depending on a third in the min
and max functions.
In the future, we hope to explore these ideas further, and compile a
list of useful arithmetic-logical idioms that can be concatenated to
form optimal or near-optimal programs.
Appendix I. More Interesting Results
1.1. SIGNUM Function
The signum function has been defined in section 2. Given the 68000
instruction set, four is the minimum number of instructions to com-
pute signum. Interestingly, three suffice on the 8086.
(x in ax)
cwd          (sign extends register ax into dx)
neg  ax
adc  dx,dx
(signum(x) in dx)

1.2. Absolute Value
Find the absolute value of a number, excluding conditional jumps
from the instruction set.
(x in d0)
move.l d0,d1
add.l  d1,d1
subx.l d1,d1
eor.l  d1,d0
sub.l  d1,d0
(abs(x) in d0)
Notice that although it is longer than the classical method (test;
jump-if-positive; negate), it has no jumps! This might actually be
faster than the classical method on some pipelined machines where
jumps are expensive.
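In C terms (a sketch, not from the paper), the five-instruction sequence computes the branch-free identity abs(x) = (x ^ m) - m, where m is an all-ones mask exactly when x is negative; the add.l/subx.l pair produces that mask, and eor.l/sub.l apply it:

#include <assert.h>
#include <stdint.h>

static int32_t abs_branchfree(int32_t x)
{
    /* d1 = -1 if x < 0, else 0   (add.l d1,d1 ; subx.l d1,d1) */
    int32_t m = -(int32_t)(((uint32_t)x) >> 31);
    /* eor.l d1,d0 ; sub.l d1,d0 */
    return (x ^ m) - m;
}

int main(void)
{
    for (int32_t v = -1024; v <= 1024; v++)
        assert(abs_branchfree(v) == (v < 0 ? -v : v));
    return 0;
}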
1.3. Max and Min
This program finds the maximum of the unsigned numbers in d0 and
d1 and returns the answer in d0. The comments on the right show
what's in the various registers during execution and are similar to the
boolean expression checker's method of analysis.

(d0=X, d1=Y)     | Flag,Reg  | If d1>d0    | If d1<=d0
sub.l  d1,d0     | (C,d0) =  | (1, X-Y)    | (0, X-Y)
subx.l d2,d2     | (C,d2) =  | (1, 11..11) | (0, 00..00)
or.l   d2,d0     | (C,d0) =  | (1, 11..11) | (0, X-Y)
addx.l d1,d0     | d0 =      | Y           | X
(d0 = max(X,Y))

This program finds the minimum of the unsigned numbers in d0 and
d1 and returns the answer in d0.

(d0=X, d1=Y)     | Flag,Reg  | If d1>d0    | If d1<=d0
sub.l  d1,d0     | (C,d0) =  | (1, X-Y)    | (0, X-Y)
subx.l d2,d2     | d2 =      | 111...111   | 000...000
and.l  d2,d0     | d0 =      | X-Y         | 0
add.l  d1,d0     | d0 =      | X           | Y
(d0 = min(X,Y))

Simultaneous min and max.

(d0=X, d1=Y)     | Flag,Reg  | If d1>d0    | If d1<=d0
sub.l  d1,d0     | (C,d0) =  | (1, X-Y)    | (0, X-Y)
subx.l d2,d2     | d2 =      | 111...111   | 000...000
and.l  d0,d2     | d2 =      | X-Y         | 0
eor.l  d2,d0     | d0 =      | 0           | X-Y
add.l  d1,d0     | d0 =      | Y           | X
add.l  d2,d1     | d1 =      | X           | Y
(d0 = max(X,Y), d1 = min(X,Y))
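The same selection trick reads naturally in C (a sketch for illustration, not the paper's code): the borrow from the unsigned subtraction yields an all-ones or all-zero mask, which then picks either X-Y or 0 to add back to Y.

#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;                      /* sub.l  d1,d0 */
    uint32_t mask = (x < y) ? 0xFFFFFFFFu : 0;  /* subx.l d2,d2 : borrow -> all ones */
    return (diff & mask) + y;                   /* and.l d2,d0 ; add.l d1,d0 */
}

static uint32_t max_u32(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;
    uint32_t mask = (x < y) ? 0xFFFFFFFFu : 0;
    /* or.l d2,d0 ; addx.l d1,d0 : (diff | mask) + y + borrow */
    return (diff | mask) + y + (x < y ? 1u : 0u);
}

int main(void)
{
    assert(min_u32(3, 10) == 3 && max_u32(3, 10) == 10);
    assert(min_u32(10, 3) == 3 && max_u32(10, 3) == 10);
    return 0;
}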
1.4. Logical Tests
Here are some logical tests that yield true/false answers. Sequences
such as these have immediate application in a compiler to improve
execution speed. Shown here are the tests for zero and non-zero.
Suitable for BASIC              Suitable for C, PASCAL

d0 =  0 if d0 == 0              d0 = 0 if d0 == 0
   = -1 if d0 != 0                 = 1 if d0 != 0

neg.l  d0                       neg.l  d0
subx.l d0,d0                    subx.l d0,d0
                                neg.l  d0

d0 = -1 if d0 == 0              d0 = 1 if d0 == 0
   =  0 if d0 != 0                 = 0 if d0 != 0

neg.l  d0                       neg.l  d0
subx.l d0,d0                    subx.l d0,d0
not.l  d0                       addq.l #1,d0

By prepending 'move.l A,d0; sub.l B,d0' to the above one can con-
struct tests for A == B and A != B.
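In C terms (a sketch for illustration), the neg/subx pair materializes the carry from "is d0 nonzero?" as a full-width mask, from which all four truth-value conventions above follow:

#include <assert.h>
#include <stdint.h>

/* neg.l d0 ; subx.l d0,d0 : all ones if d0 was nonzero, else zero */
static int32_t nonzero_mask(int32_t d0) { return d0 != 0 ? -1 : 0; }

int main(void)
{
    int32_t x = 42, z = 0;
    assert(nonzero_mask(x) == -1 && nonzero_mask(z) == 0);        /* BASIC-style x != 0 */
    assert(-nonzero_mask(x) == 1 && -nonzero_mask(z) == 0);       /* extra neg.l : C-style x != 0 */
    assert(~nonzero_mask(z) == -1 && ~nonzero_mask(x) == 0);      /* not.l : BASIC-style x == 0 */
    assert(nonzero_mask(z) + 1 == 1 && nonzero_mask(x) + 1 == 0); /* addq.l #1 : C-style x == 0 */
    return 0;
}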
1.5. Decimal to Binary
This piece converts an 8 digit BCD number stored in d0, one digit to a
nibble, to binary with the result also in d0. It is the longest sequence
ever generated by superoptimizer, and was actually done in three
sequences to multiply by 10. At first I had superoptimizer compute
the 2 digit BCD to binary conversion function '((d0 & 0xF0) >> 4) *
10 + (d0 & 0x0F)'. This came out surprisingly short:
(2 digit BCD number in d0)
move.b d0,d1
and.b  #$F0,d1
lsr.b  #3,d1
sub.b  d1,d0
sub.b  d1,d0
sub.b  d1,d0
(binary equivalent in d0)
What is actually being computed is

    ans = d0 - 3 * ((d0 & 0xF0) / 8)

Representing the contents of d0 as (H:L), where H is the upper nibble
and L is the lower nibble, we get

    d0 = 16*H + L,   d0 & 0xF0 = 16*H
    ans = (16*H + L) - 3 * (16*H / 8)
        = 16*H + L - 6*H
        = 10*H + L
which is the 2 digit BCD to binary function. Encouraged by this
result, superoptimizer was put to the task of computing first the 4
digit BCD to binary function and then the 8 digit BCD to binary
function. Here is the 8 digit converter:
(8 digit BCD number in d0)
move.l d0,d1              *
and.l  #$F0F0F0F0,d1      *
lsr.l  #3,d1              *
sub.l  d1,d0              *
sub.l  d1,d0              *
sub.l  d1,d0              *
move.l d0,d1              +
and.l  #$FF00FF00,d1      +
lsr.l  #1,d1              +
sub.l  d1,d0              +
lsr.l  #2,d1              +
sub.l  d1,d0              +
lsr.l  #3,d1              +
add.l  d1,d0              +
move.l d0,d1
swap   d1
mulu   #$D8F0,d1
sub.l  d1,d0
(binary equivalent in d0)
What is most amazing is the first section (marked by * alongside the
program). It looks exactly like the 2 digit BCD to binary function.
This section computes 4 simultaneous 2 digit BCD to binary func-
tions on adjacent pairs of nibbles and deposits the answer back into
the byte occupied by those nibbles. The second part (marked by +)
computes two simultaneous 2-byte base 100 to binary conversion
functions. Finally, the third part computes the function 'high-word-
of-d0 * 10000 + low-word-of-d0' to complete the conversion.
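The three stages translate directly into C (a sketch, not from the paper), which also makes it easy to check the constants $F0F0F0F0, $FF00FF00 and $D8F0 (= 65536 - 10000 = 55536):

#include <assert.h>
#include <stdint.h>

static uint32_t bcd8_to_binary(uint32_t d0)
{
    uint32_t d1;

    /* Stage 1 (*): per byte, 16*H+L -> 10*H+L (subtract 6*H) */
    d1 = (d0 & 0xF0F0F0F0u) >> 3;        /* 2*H in every byte */
    d0 -= d1; d0 -= d1; d0 -= d1;

    /* Stage 2 (+): per 16-bit half, 256*hi+lo -> 100*hi+lo (subtract 156*hi) */
    d1 = (d0 & 0xFF00FF00u) >> 1;        /* 128*hi */
    d0 -= d1;
    d1 >>= 2;                            /* 32*hi */
    d0 -= d1;
    d1 >>= 3;                            /* 4*hi */
    d0 += d1;

    /* Stage 3: 65536*hi16 + lo16 -> 10000*hi16 + lo16 (subtract 55536*hi16) */
    d1 = (d0 >> 16) * 0xD8F0u;
    return d0 - d1;
}

int main(void)
{
    assert(bcd8_to_binary(0x12345678u) == 12345678u);
    assert(bcd8_to_binary(0x99999999u) == 99999999u);
    assert(bcd8_to_binary(0x00000000u) == 0u);
    return 0;
}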
1.6. Multiplication by Constants
During a two week period, superoptimizer was used to find minimal
programs that multiply by constants. A sampling of these programs
is included in this section.
An interesting observation is that the average program size increases
as the multiplication constant increases, but it increases very slowly.
The average size of programs that multiply by small numbers (less
than 40) is 5 instructions, most programs that multiply by numbers in
the hundreds are 6 to 7 instructions long, and programs that multiply
by thousands are between 7 and 8 instructions long.
d0 *= 29              d0 *= 39
move.l d0,d1          move.l d0,d1
lsl.l  #4,d0          lsl.l  #2,d0
sub.l  d1,d0          add.l  d1,d0
add.l  d0,d0          lsl.l  #3,d0
sub.l  d1,d0          sub.l  d1,d0

d0 *= 156             d0 *= 625
move.l d0,d1          move.l d0,d1
lsl.l  #2,d1          lsl.l  #2,d0
add.l  d1,d0          add.l  d1,d0
lsl.l  #5,d0          lsl.l  #3,d0
sub.l  d1,d0          sub.l  d1,d0
                      lsl.l  #4,d0
                      add.l  d1,d0
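For example, the multiply-by-29 sequence corresponds to this C sketch (illustrative, not from the paper):

#include <assert.h>
#include <stdint.h>

static uint32_t times29(uint32_t x)
{
    uint32_t d0 = x, d1 = x;   /* move.l d0,d1 */
    d0 <<= 4;                  /* lsl.l #4,d0 : 16x */
    d0 -= d1;                  /* sub.l d1,d0 : 15x */
    d0 += d0;                  /* add.l d0,d0 : 30x */
    d0 -= d1;                  /* sub.l d1,d0 : 29x */
    return d0;
}

int main(void)
{
    for (uint32_t v = 0; v < 1000; v++)
        assert(times29(v) == 29u * v);
    return 0;
}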
1.7. Division by Constants
Division turns out to be difficult to optimize. A general divide by
constant that works for all 32-bit arguments is too long to realize any
time gain over the divide instruction, and is certainly not shorter.
Additionally, there do not seem to be any nifty arithmetic-logical
operations that simplify the process. The generated programs just
multiply by the reciprocal of the constant. Since we do an exhaus-
tive search, this negative result can be seen as a confirmation of the
inherent high cost of divisions for the instruction sets considered.
The following programs were generated in an attempt to gain insight
into binary to BCD algorithms, another area where superoptimizer
has had little success. Note that even with the restricted argument
range, these are much longer than the multiply programs.
d0 = trunc(d0/10) for d0 = 0..99
move.b d0,d1
add.b  d0,d0      | d0 = 10 * x
lsr.b  #1,d1      | d1 = .1 * x
add.b  d1,d0      | d0 = 10.1 * x
lsr.b  #3,d0      | d0 = .0101 * x
add.b  d1,d0      | d0 = .1101 * x
lsr.b  #3,d0      | d0 = .0001101 * x

d0 = trunc(d0/100) for d0 = 0..9999
move.w d0,d1
lsr.w  #1,d1      | d1 = .1 * x
add.w  d0,d0      | d0 = 10 * x
add.w  d0,d1      | d1 = 10.1 * x
lsr.w  #5,d0      | d0 = .0001 * x
add.w  d1,d0      | d0 = 10.1001 * x
lsr.w  #8,d1      | note: you can't lsr.w #10,d1
lsr.w  #2,d1      | d1 = .00000000101 * x
sub.w  d1,d0      | d0 = 10.10001111011 * x
lsr.w  #8,d0      | d0 = .0000001010001111011 * x
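The comments in these listings are binary fixed-point multipliers (so "10.1 * x" means 2.5x). The divide-by-10 routine, including its byte-width truncations, can be checked with a short C sketch (illustrative, not the paper's code):

#include <assert.h>
#include <stdint.h>

static uint8_t div10_0_99(uint8_t d0)
{
    uint8_t d1 = d0;            /* move.b d0,d1 */
    d0 = (uint8_t)(d0 + d0);    /* add.b d0,d0  : 10 * x (binary)   */
    d1 >>= 1;                   /* lsr.b #1,d1  : .1 * x            */
    d0 = (uint8_t)(d0 + d1);    /* add.b d1,d0  : 10.1 * x          */
    d0 >>= 3;                   /* lsr.b #3,d0  : .0101 * x         */
    d0 = (uint8_t)(d0 + d1);    /* add.b d1,d0  : .1101 * x         */
    d0 >>= 3;                   /* lsr.b #3,d0  : .0001101 * x      */
    return d0;
}

int main(void)
{
    for (unsigned x = 0; x <= 99; x++)
        assert(div10_0_99((uint8_t)x) == x / 10);
    return 0;
}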
References
[1] Aho, A.V., Sethi, R., and Ullman, J.D.
    Compilers: Principles, Techniques, and Tools.
    Addison-Wesley, 1986.
[2] Davidson, J.W. and Fraser, C.W.
    Automatic Generation of Peephole Optimizations.
    In Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction, pages 111-116.
    ACM/SIGPLAN, June, 1984.
[3] Kessler, P.B.
    Discovering Machine-Specific Code Improvements.
    In Proceedings of the ACM SIGPLAN '86 Symposium on Compiler Construction, pages 249-254.
    ACM/SIGPLAN, June, 1986.
[4] Krumme, D.W. and Ackley, D.H.
    A Practical Method for Code Generation Based on Exhaustive Search.
    In Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, pages 185-196.
    ACM/SIGPLAN, June, 1982.