Superoptimizer -- A Look at the Smallest Program
Henry Massalin
Department of Computer Science
Columbia University
New York, NY 10027
Abstract
Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the functions. The key idea in the superoptimizer is a probabilistic test that makes exhaustive searches practical for programs of useful size. The search space is defined by the processor's instruction set, which may include the whole set, but it is typically restricted to a subset. By constraining the instructions and observing the effect on the output program, one can gain insight into the design of instruction sets. In addition, superoptimized programs may be used by peephole optimizers to improve the quality of generated code, or by assembly language programmers to improve manually written code.
1. Introduction
The search for the optimal algorithm to compute a function is one of the fundamental problems in computer science. In contrast to theoretical studies of optimal algorithms, practical applications motivated the design, implementation, and use of the superoptimizer. Instead of proving upper or lower bounds for abstract algorithms, the superoptimizer finds the shortest program in the program space defined by the instruction set of commercial machines, such as the Motorola 68000 or Intel 8086.
The functions to be optimized are specified with programs written using the target machine's instruction set. Therefore, the input to the superoptimizer is a machine language program. The output is another program, which may be shorter. Since both programs run on the same processor, with a well-defined environment, we can establish their equivalence.
A probabilistic test and a method for pruning the search tree make the superoptimizer a practical tool for programs of limited size (about 13 machine instructions).
In section 2, we describe an interesting example to illustrate the superoptimizer approach. The design and algorithms used in the superoptimizer are detailed in section 3. We discuss the applications and limitations of the superoptimizer in section 4. In section 5, we compare the superoptimizer with related work. The conclusion in section 6 is followed by a list of interesting minimal programs in appendix I.
2. An Interesting Example
We begin with an example to show what superoptimized code looks like. The instruction set used here, as in most of the paper, is Motorola's 68020 instruction set. Our example is the signum function, defined by the following program:
signum(x)
int x;
{
    if (x > 0) return 1;
    else if (x < 0) return -1;
    else return 0;
}
This function compiles to 9 instructions occupying 18 bytes of memory on the SUN-3 C compiler. Most programmers, when asked to write this function in assembly language, would use comparison instructions and conditional jumps to decide in what range the argument lies. Typically, this takes 8 68020 instructions, although clever programmers can do it in 6.
It turns out that by exploiting various properties of two's complement arithmetic one can write signum in four instructions! This is what superoptimizer found when fed the compiled machine code for the signum function as input:
(x in d0)
add.l   d0,d0       | add d0 to itself
subx.l  d1,d1       | subtract (d1 + Carry) from d1
negx.l  d0          | put (0 - d0 - Carry) into d0
addx.l  d1,d1       | add (d1 + Carry) to d1
(signum(x) in d1)   (4 instructions)
Like a typical superoptimized program, the logic is really convoluted. One of the first things that comes to mind is "where are the conditional jumps?". As we will see later, many functions that would normally be written with conditional jumps are optimized into short programs without them. This can result in significant speedups for certain pipelined machines that execute conditional jumps slowly.
Let us see how it works. The "add.l d0,d0" instruction doubles the contents of register d0, but more importantly, the sign bit is now in the carry flag. The "subx.l d1,d1" instruction computes "d1 - d1 - carry --> d1". Regardless of the initial value of d1, d1 - d1 - carry is -carry. Thus d1 is -1 if d0 was negative and 0 otherwise. Besides negating, "negx.l d0" will set the carry flag if and only if d0 was nonzero. Finally, "addx.l d1,d1" doubles d1 and adds the carry. Now if d0 was negative, d1 is -1 and carry is set, so d1+d1+carry is -1; if d0 was 0, d1 is 0 and carry is clear, so d1+d1+carry is 0; if d0 was positive, d1 is 0 and carry is set, so d1+d1+carry is 1.
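This reasoning can be checked mechanically. The following C fragment is a sketch (not part of the original listing) that simulates the four instructions, with the variable flag standing in for the 68020 carry/extend bit that subx, negx, and addx consume:

#include <assert.h>
#include <stdint.h>

static int32_t signum4(int32_t x)
{
    uint32_t d0 = (uint32_t)x;
    uint32_t d1 = 0xDEADBEEF;      /* initial value of d1 is irrelevant */
    uint32_t flag, old;

    flag = d0 >> 31;               /* add.l d0,d0: carry = old sign bit */
    d0 += d0;
    d1 = d1 - d1 - flag;           /* subx.l d1,d1: d1 = -carry */
    old = flag;
    flag = (d0 != 0) || old;       /* negx.l d0: borrow iff (d0 + carry) != 0 */
    d0 = 0u - d0 - old;
    d1 = d1 + d1 + flag;           /* addx.l d1,d1 */
    return (int32_t)d1;
}

int main(void)
{
    for (int32_t x = -1024; x <= 1024; x++)
        assert(signum4(x) == (x > 0) - (x < 0));
    return 0;
}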

3. Superoptimizer Internals
Superoptimizer takes a program written in machine language as the input source. It finds the shortest program that computes the same function as the source program by doing an exhaustive search over all possible programs. The search space is defined by choosing a subset of the machine's instruction set, and the op-codes of these instructions are stored in a table. Superoptimizer consults this table and generates all combinations of these instructions, first of length 1, then of length 2, and so on. Each of these generated programs is tested, and if found to match the function of the source program, superoptimizer prints the program and halts.
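The overall shape of this generate-and-test loop is sketched below in C (an illustration only; the table size and the test hook are placeholder assumptions, not the author's code):

enum { MAX_LEN = 13, NUM_OPS = 32 };          /* assumed opcode-table size */

/* Placeholder for the equivalence tests of sections 3.1 and 3.2. */
static int test_program(const int *ops, int len) { (void)ops; (void)len; return 0; }

static int search(int out[MAX_LEN])
{
    for (int len = 1; len <= MAX_LEN; len++) {
        int ops[MAX_LEN] = {0};
        for (;;) {
            if (test_program(ops, len)) {
                for (int i = 0; i < len; i++) out[i] = ops[i];
                return len;                    /* first hit is the shortest */
            }
            int i = len - 1;                   /* odometer-style increment */
            while (i >= 0 && ++ops[i] == NUM_OPS) ops[i--] = 0;
            if (i < 0) break;                  /* this length is exhausted */
        }
    }
    return 0;                                  /* nothing found */
}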
Two methods are used to reduce the search time. The first is a fast probabilistic test for determining the equivalence of two programs. The second is a method for pruning the search space while maintaining the guarantee of optimality. These two methods will now be discussed, but first a boolean-logic equivalence test will be explained, which was the first test procedure implemented, because it finds use in the tree pruning method.
3.1. Boolean Test
The most important part of superoptimizer is the routine that determines whether two pieces of code compute the same function. The first version of superoptimizer used what we call the boolean program verifier. The idea was to express the function output in terms of boolean-logic operations on the input argument. Once this is done, two programs are equivalent if their boolean expressions match minterm for minterm.
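As a toy illustration (ours, not the paper's representation): for a machine with a single 8-bit input register, the truth table over all 256 inputs lists every minterm explicitly, so the comparison is simply a table comparison:

#include <stdint.h>

typedef uint8_t (*prog_fn)(uint8_t);

/* Two programs compute the same function iff their truth tables
 * agree on every input, i.e. match minterm for minterm. */
static int bool_equivalent(prog_fn f, prog_fn g)
{
    for (int x = 0; x < 256; x++)
        if (f((uint8_t)x) != g((uint8_t)x))
            return 0;
    return 1;
}

With 32-bit registers the table for each output bit becomes astronomically larger, which is the difficulty described next.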
In practice, some instructions such as add and mul have boolean expressions with on the order of 2^31 minterms. Various methods were devised to reduce the memory requirements, but it took too long to compute the boolean expressions for every program generated. The initial version of superoptimizer tested about 40 programs per second, and this allowed programs of up to 3 instructions to be generated in reasonable time.
3.2. Probabilistic Test
The idea behind the probabilistic test is simple: run the machine code for the program being tested a few times with some set of inputs and check whether the outputs match those of the source program. The idea here is that most programs will fail this simple test, and a full program verification test will be done only for the few programs that this test fails to catch. Running through a few carefully chosen test vectors takes very little time. Currently, superoptimizer can test 50,000 programs per second and the exhaustive search approach becomes practical.
The test vectors are chosen (manually) to maximize the probability that a random program will fail on the first or second test. For example, the test vectors for the signum function included -1000, 0 and 456 as the first three vectors. This quickly eliminates programs that return the same answer regardless of argument, answers of the same sign, as well as programs that return their argument. Following these vectors, all the numbers from -1024 to 1024 were tested.
It was found in practice that a program has a very low probability of passing this execution test and failing the boolean verification test. This fact proves very useful since most programs of interest have boolean expressions that are too large to fit in memory. We can dispense with the boolean test and manually inspect the generated programs for correctness, without having to analyze a large number of wrong programs. This manual check is not difficult since the programs are small (about 4 to 13 instructions). Currently, superoptimizer runs without the boolean check, and the author has yet to find an incorrect program.
One problem introduced by the probabilistic execution test is machine dependency. The test works only if the instruction set being searched can be executed on the machine running the superoptimizer. In other words, if we wish to change the instruction set, we would have to port the superoptimizer to the new machine. This port is not too difficult since the current version of superoptimizer is rather short (about 300 lines of 68020 assembly code); however, it does require that one translate it into the target assembly code.
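A sketch of this screening in C (run_candidate and run_source are hypothetical hooks standing for execution of the generated code and the source program; they are not part of the paper):

#include <stdint.h>

extern int32_t run_candidate(const int *ops, int len, int32_t arg);
extern int32_t run_source(int32_t arg);

static int passes_screen(const int *ops, int len)
{
    static const int32_t quick[3] = { -1000, 0, 456 };   /* signum's vectors */
    for (int i = 0; i < 3; i++)
        if (run_candidate(ops, len, quick[i]) != run_source(quick[i]))
            return 0;                                    /* cheap early reject */
    for (int32_t x = -1024; x <= 1024; x++)              /* fuller sweep */
        if (run_candidate(ops, len, x) != run_source(x))
            return 0;
    return 1;    /* survivor: verify fully or inspect by hand */
}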
3.3. Pruning
In order to further reduce the search time, we filter out instruction sequences that are known not to occur in any optimal program. Any sequence of instructions that has the same effect on the machine state as a shorter sequence cannot be part of an optimal program, because if it were, you could get a shorter program by substituting the shorter sequence, and therefore the program was not optimal. Typical sequences include the obviously silly "move X,Y; move X,Y" and "move X,Y; move Y,X", "and X,Y; move Z,Y" in which the MOVE destroys the result of the AND, "and #0,X" which does the same thing as "clr X", and "and X,Y; <any> Z,W; and.l X,Y" where the second AND is superfluous.
This filtering is done with N-dimensional bit tables, where N is the length of the longest sequence we wish to filter. Each instruction in the sequence we wish to test indexes one dimension of the bit table, and a lookup value of '1' causes the program to be rejected as non-optimal (and also as incorrect, since it is the same as a shorter program, and superoptimizer has already checked all shorter programs).
There are two ways that these bit tables can be filled. A human can tell the bit table maker program to exclude all "move X,Y; move Y,X" sequences. The program then scans all instructions in all dimensions of the bit matrix and sets the values accordingly. One can also run superoptimizer with the boolean test, and have it find the equivalences on its own.
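For N = 2 the table reduces to one bit per ordered pair of opcodes. A sketch, assuming a small dense opcode numbering (the layout is ours, not the author's):

#include <stdint.h>

enum { NOPS = 256 };                          /* assumed opcode count */
static uint8_t bad_pair[NOPS][NOPS / 8];      /* 1 = pair is never optimal */

static void forbid(int op1, int op2)
{
    bad_pair[op1][op2 >> 3] |= (uint8_t)(1u << (op2 & 7));
}

static int is_pruned(int op1, int op2)
{
    return (bad_pair[op1][op2 >> 3] >> (op2 & 7)) & 1;
}

The search loop consults is_pruned on each adjacent pair of opcodes before emitting a candidate, so whole subtrees of the search are skipped at once.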
4. Applications and Limitations
4.1. Current Limitations
Even with the probabilistic test, the exhaustive search still grows exponentially with the number of instructions in the generated program. The current version of superoptimizer has generated programs 12 instructions long in several hours running time on a 16MHz 68020 computer. Therefore, the superoptimizer has limited usefulness as a code generator for a compiler.
Another difficulty concerns pointers. A pointer can point anywhere in memory, and so to model a pointer in terms of boolean expressions one needs to take all of memory into account. Even on a 256-byte machine, there are 2^(2^(256*8)) possible minterms, and these are just too many. We have explored the probabilistic test approach for pointers, but the results have been inconclusive.
Currently, we have only the 68020 version of the superoptimizer running the probabilistic test, so the instruction sets are restricted to subsets of the 68020 set. The machine-independent version of superoptimizer is limited to very short programs.
4.2. Applications
Because of the pointer problem, superoptimizer works best when the instruction set is constrained to register-register operations. Even so, it can be used to analyze instruction sets. Some of the programs in appendix I were tried on the Western Electric WE32000 microprocessor, and in every case the resulting program was longer than the 68020 programs. The reason for this was found to be the lack of an add-with-carry instruction and the fact that the flags are set according to the 32-bit result, even for byte-sized operands. The National Semiconductor NS32032 was also found to suffer from flag problems. Here the difficulty is that extra instructions are needed to test the outcome of an operation because few instructions set the flags.
Another use would be in the design of RISC architectures. One can
try various instruction sets simply by coding their function in terms
of boolean expressions and seeing what superoptimizer comes up
with. A particular instruction may be omitted if superoptimizer finds
a short equivalent sequence of other instructions.
The superoptimizer may be very useful in optimizing little tasks that
often confront a compiler. An example is finding the optimal
program that multiplies by a particular constant for use in accessing
arrays and such. Some examples of multiplication by constants can be found in appendix I.6.
Another useful feature of superoptimizer is the identity tables containing the equivalent program sequences found. These programs may be extracted and used to increase the power of a conventional peephole optimizer.
In practice, the best use of superoptimizer has been as an aid to the assembly language programmer. An experienced programmer can use superoptimizer to come up with nifty equivalent sequences for small sections of his code, while retaining the overall logical flow that makes a program maintainable. This method has been used by the author (along with another program that optimizes code emulating state machines) to write the C library function printf in only 500 bytes.
5. Comparison with Related Work
The most commonly used optimization techniques are those that attempt to improve the code that a compiler produces. Examples are peephole optimizers and data-flow analysis. Peephole optimizers [2] are table-driven pattern matchers that operate on the assembly language code produced by the compiler. Every time a sequence of instructions is matched by one of the tables, a smaller and faster replacement sequence is used.
Data-flow analysis [1] is a technique applied during the semantic and code generation phases of the compilation process. It improves code in several ways. First, it eliminates redundant computations (common sub-expression elimination). Second, it moves expressions within a loop whose values do not depend on the loop variable to outside the loop (loop invariance). Third, (also in a loop) it converts expressions of the form 'K * loop-index' into the equivalent arithmetic progression 'TMP = TMP + K' (strength reduction).
These methods are general. They work regardless of machine-specific details such as the representation of an integer. However, usually the result is not optimal in either space or speed. Superoptimizer depends on the instruction set; however, the code is guaranteed to be optimal in space and it does a very good job in speed as well.
Krumme and Ackley [4] have written a code generator for the DEC-10 computer that is based on exhaustive search. Their method translates each interior node of an expression tree into several viable instruction sequences. These sequences are then pieced together to form a set of translations for the entire expression. This set is then searched to find the cheapest alternative.
In their method, there is a one-to-one correspondence between the instructions in the translation and the original expression. For example, if there's an add in the expression, there will also be an add somewhere in the generated code. Superoptimizer has a more global view of the problem. It 'translates' one sequence of instructions into another completely different sequence. On the other hand, superoptimizer can't translate large programs.
The two approaches can be seen as complementing each other. Superoptimizer can be used to prepare the code generation tables used in Krumme and Ackley's method. Their method can also be incorporated into superoptimizer to increase the size of programs that can be handled. Superoptimizer can generate several short equivalent sequences for small fragments of the source program, and then Krumme and Ackley's method would be used to piece these together and find a short overall sequence.
Kessler [3] has written a code optimization tool, which translates sequences of instructions into one single instruction. The superoptimizer can be seen as a more general tool with broader applications, since it can transform programs of many instructions into another one of several instructions. However, Kessler's optimizer works regardless of program size, and therefore can be easily used to optimize compiled code. Another difference is that he uses template matching, while superoptimizer relies on exhaustive search.
6. Conclusion
We have taken a practical approach to the search for the optimal program. We have found that the shortest programs are surprising, often containing sequences of instructions that one would not expect to see side by side. The signum function is an example of this, and the min and max functions given in appendix I.3 contain a beautiful combination of the logical and and the arithmetic add.
Exhaustive search is justified by these results, and a probabilistic test allows programs of practical size to be produced. Although results are limited to a dozen instructions, those found are already useful. Many examples of these can be found in appendix I.
One of the most interesting results is not the programs themselves, but a better understanding of the interrelations between arithmetic and logical instructions. Similar ideas seem to come up consistently in the superoptimized programs. These include the sequence 'add.l d1,d1; subx.l d1,d1' that extracts the sign of a number in the signum and abs functions, and the sequence 'sub.l d1,d0; and.l d2,d0; add.l d1,d0' that selects one of two values depending on a third in the min and max functions.
In the future, we hope to explore these ideas further, and compile a
list of useful arithmetic-logical idioms that can be concatenated to
form optimal or near-optimal programs.
Appendix I. More Interesting Results
I.1. Signum Function
The signum function has been defined in section 2. Given the 68000 instruction set, four is the minimum number of instructions to compute signum. Interestingly, three suffice on the 8086.
(x in ax)
cwd             (sign extends register ax into dx)
neg     ax
adc     dx,dx
(signum(x) in dx)

I.2. Absolute Value
Find the absolute value of a number, excluding conditional jumps from the instruction set.
(x in d0)
move.l  d0,d1
add.l   d1,d1
subx.l  d1,d1
eor.l   d1,d0
sub.l   d1,d0
(abs(x) in d0)
Notice that although it is longer than the classical method (test;
jump-if-positive; negate), it has no jumps! This might actually be
faster than the classical method on some pipelined machines where
jumps are expensive.
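The same trick reads naturally in C (a sketch; right-shifting a negative int is assumed to be arithmetic, as it is on the 68020 and virtually all compilers):

#include <stdint.h>

static int32_t abs_nojump(int32_t x)
{
    int32_t mask = x >> 31;       /* -1 if x < 0, else 0 (like subx.l d1,d1) */
    return (x ^ mask) - mask;     /* complement, then add 1, when negative */
}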
I.3. Max and Min
This program finds the maximum of the unsigned numbers in d0 and d1 and returns the answer in d0. The comments on the right show what's in the various registers during execution and are similar to the boolean expression checker's method of analysis.
(d0 = X, d1 = Y)    |Flag,Reg  |If d1 > d0   |If d1 <= d0
sub.l   d1,d0       |(C,d0) =  |(1, X-Y)     |(0, X-Y)
subx.l  d2,d2       |(C,d2) =  |(1, 11...11) |(0, 00...00)
or.l    d2,d0       |(C,d0) =  |(1, 11...11) |(0, X-Y)
addx.l  d1,d0       |d0 =      |Y            |X
(d0 = max(X, Y))
This program finds the minimum of the unsigned numbers in d0 and d1 and returns the answer in d0.
(d0 = X, d1 = Y)    |Flag,Reg  |If d1 > d0   |If d1 <= d0
sub.l   d1,d0       |(C,d0) =  |(1, X-Y)     |(0, X-Y)
subx.l  d2,d2       |d2 =      |111...111    |000...000
and.l   d2,d0       |d0 =      |X-Y          |0
add.l   d1,d0       |d0 =      |X            |Y
(d0 = min(X, Y))
Simultaneous min and max.
(d0 = X, d1 = Y)    |Flag,Reg  |If d1 > d0   |If d1 <= d0
sub.l   d1,d0       |(C,d0) =  |(1, X-Y)     |(0, X-Y)
subx.l  d2,d2       |d2 =      |111...111    |000...000
and.l   d0,d2       |d2 =      |X-Y          |0
eor.l   d2,d0       |d0 =      |0            |X-Y
add.l   d1,d0       |d0 =      |Y            |X
add.l   d2,d1       |d1 =      |X            |Y
(d0 = max(X, Y), d1 = min(X, Y))
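The max sequence translates into C as follows (a sketch for checking the table; the borrow out of the unsigned subtract plays the role of the 68020 carry/extend flag):

#include <stdint.h>

static uint32_t max_u32(uint32_t x, uint32_t y)   /* d0 = X, d1 = Y */
{
    uint32_t diff = x - y;                /* sub.l  d1,d0 */
    uint32_t borrow = (y > x);            /* carry out of the subtract */
    uint32_t mask = 0u - borrow;          /* subx.l d2,d2: all 1s or all 0s */
    return (diff | mask) + y + borrow;    /* or.l d2,d0; addx.l d1,d0 */
}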
I.4. Logical Tests
Here are some logical tests that yield true/false answers. Sequences such as these have immediate application in a compiler to improve execution speed. Shown here are the tests for zero and non-zero.
Suitable for BASIC            Suitable for C, PASCAL

d0 =  0 if d0 == 0            d0 = 0 if d0 == 0
   = -1 if d0 != 0               = 1 if d0 != 0

neg.l   d0                    neg.l   d0
subx.l  d0,d0                 subx.l  d0,d0
                              neg.l   d0

d0 = -1 if d0 == 0            d0 = 1 if d0 == 0
   =  0 if d0 != 0               = 0 if d0 != 0

neg.l   d0                    neg.l   d0
subx.l  d0,d0                 subx.l  d0,d0
not.l   d0                    addq.l  #1,d0
By prepending 'move.l A,d0; sub.l B,d0' to the above one can construct tests for A == B and A != B.
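In C, the neg/subx pair amounts to materializing the carry of a negation (a sketch of the BASIC convention):

#include <stdint.h>

static uint32_t nonzero_mask(uint32_t d0)
{
    uint32_t carry = (d0 != 0);    /* neg.l d0 sets carry iff d0 != 0 */
    return 0u - carry;             /* subx.l d0,d0 leaves -carry */
}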
I.5. Decimal to Binary
This piece converts an 8 digit BCD number stored in d0, one digit to a nibble, to binary, with the result also in d0. It is the longest sequence ever generated by superoptimizer, and was actually done in three sequences to multiply by 10. At first I had superoptimizer compute the 2 digit BCD to binary conversion function '((d0 & 0xF0) >> 4) * 10 + (d0 & 0x0F)'. This came out surprisingly short:
(2 digit BCD number in d0)
move.b  d0,d1
and.b   #$F0,d1
lsr.b   #3,d1
sub.b   d1,d0
sub.b   d1,d0
sub.b   d1,d0
(binary equivalent in d0)
What is actually being computed is

    ans = d0 - 3 * ((d0 & 0xF0) / 8)

Representing the contents of d0 as (H:L), where H is the upper nibble and L is the lower nibble, we get

    d0 = 16*H + L,   d0 & 0xF0 = 16*H
    ans = (16*H + L) - 3 * (16*H / 8)
        = 16*H + L - 6*H
        = 10*H + L
which is the 2 digit BCD to binary function. Encouraged by this
result, superoptimizer was put to the task of computing first the 4
digit BCD to binary function and then the 8 digit BCD to binary
function. Here is the 8 digit converter:
(8 digit BCD number in d0)
move.l  d0,d1             *
and.l   #$F0F0F0F0,d1     *
lsr.l   #3,d1             *
sub.l   d1,d0             *
sub.l   d1,d0             *
sub.l   d1,d0             *
move.l  d0,d1             +
and.l   #$FF00FF00,d1     +
lsr.l   #1,d1             +
sub.l   d1,d0             +
lsr.l   #2,d1             +
sub.l   d1,d0             +
lsr.l   #3,d1             +
add.l   d1,d0             +
move.l  d0,d1
swap    d1
mulu    #$D8F0,d1
sub.l   d1,d0
(binary equivalent in d0)
What is most amazing is the first section (marked by * alongside the program). It looks exactly like the 2 digit BCD to binary function. This section computes 4 simultaneous 2 digit BCD to binary functions on adjacent pairs of nibbles and deposits the answer back into the byte occupied by those nibbles. The second part (marked by +) computes two simultaneous 2-byte base 100 to binary conversion functions. Finally, the third part computes the function 'high-word-of-d0 * 10000 + low-word-of-d0' to complete the conversion.
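A C transcription of the whole converter (a sketch used to check the listing, assuming valid packed BCD input; note 0xD8F0 = 65536 - 10000):

#include <assert.h>
#include <stdint.h>

static uint32_t bcd_to_bin(uint32_t d0)
{
    uint32_t d1;
    d1 = (d0 & 0xF0F0F0F0u) >> 3;   /* stage *: per byte, d1 = 2*H          */
    d0 -= d1; d0 -= d1; d0 -= d1;   /* per byte, 16H+L - 6H = 10H+L         */
    d1 = (d0 & 0xFF00FF00u) >> 1;   /* stage +: per word, d1 = 128a         */
    d0 -= d1;                       /* 256a+b - 128a = 128a+b               */
    d1 >>= 2;                       /* 32a                                  */
    d0 -= d1;                       /* 96a+b                                */
    d1 >>= 3;                       /* 4a                                   */
    d0 += d1;                       /* 100a+b                               */
    d1 = d0 >> 16;                  /* stage 3: high word H (swap d1)       */
    return d0 - d1 * 0xD8F0u;       /* H*65536+L - H*55536 = H*10000+L      */
}

int main(void)
{
    assert(bcd_to_bin(0x12345678u) == 12345678u);
    assert(bcd_to_bin(0x99999999u) == 99999999u);
    return 0;
}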
I.6. Multiplication by Constants
During a two week period, superoptimizer was used to find minimal programs that multiply by constants. A sampling of these programs is included in this section.
An interesting observation is that the average program size increases as the multiplication constant increases, but it increases very slowly. The average size of programs that multiply by small numbers (less than 40) is 5 instructions, most programs that multiply by numbers in the hundreds are 6 to 7 instructions long, and programs that multiply by thousands are between 7 and 8 instructions long.
d0 *= 29                  d0 *= 39
move.l  d0,d1             move.l  d0,d1
lsl.l   #4,d0             lsl.l   #2,d0
sub.l   d1,d0             add.l   d1,d0
add.l   d0,d0             lsl.l   #3,d0
sub.l   d1,d0             sub.l   d1,d0

d0 *= 156                 d0 *= 625
move.l  d0,d1             move.l  d0,d1
lsl.l   #2,d1             lsl.l   #2,d0
add.l   d1,d0             add.l   d1,d0
lsl.l   #5,d0             lsl.l   #3,d0
sub.l   d1,d0             sub.l   d1,d0
                          lsl.l   #4,d0
                          add.l   d1,d0
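As a check, the multiply-by-29 column transcribes to C as follows (a sketch, not part of the original listing):

#include <assert.h>
#include <stdint.h>

static uint32_t mul29(uint32_t x)
{
    uint32_t d0 = x, d1 = x;    /* move.l d0,d1       */
    d0 <<= 4;                   /* lsl.l #4,d0: 16x   */
    d0 -= d1;                   /* sub.l d1,d0: 15x   */
    d0 += d0;                   /* add.l d0,d0: 30x   */
    d0 -= d1;                   /* sub.l d1,d0: 29x   */
    return d0;
}

int main(void)
{
    for (uint32_t x = 0; x < 1000; x++)
        assert(mul29(x) == 29 * x);
    return 0;
}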
I.7. Division by Constants
Division turns out to be difficult to optimize. A general divide by constant that works for all 32-bit arguments is too long to realize any time gain over the divide instruction, and is certainly not shorter. Additionally, there do not seem to be any nifty arithmetic-logical operations that simplify the process. The generated programs just multiply by the reciprocal of the constant. Since we do an exhaustive search, this negative result can be seen as a confirmation of the inherent high cost of division for the instruction sets considered.
The following programs were generated in an attempt to gain insight
into binary to BCD algorithms, another area where superoptimizer
has had little success. Note that even with the restricted argument
range, these are much longer than the multiply programs.
d0 = trunc(d0/10) for d0 = 0..99
move.b  d0,d1
add.b   d0,d0   | d0 = 10 * x
lsr.b   #1,d1   | d1 = .1 * x
add.b   d1,d0   | d0 = 10.1 * x
lsr.b   #3,d0   | d0 = .0101 * x
add.b   d1,d0   | d0 = .1101 * x
lsr.b   #3,d0   | d0 = .0001101 * x
d0 = trunc(d0/100) for d0 = 0..9999
move.w  d0,d1
lsr.w   #1,d1   | d1 = .1 * x
add.w   d0,d0   | d0 = 10 * x
add.w   d0,d1   | d1 = 10.1 * x
lsr.w   #5,d0   | d0 = .0001 * x
add.w   d1,d0   | d0 = 10.1001 * x
lsr.w   #8,d1   | note: you can't lsr.w #10,d1
lsr.w   #2,d1   | d1 = .00000000101 * x
sub.w   d1,d0   | d0 = 10.10001111011 * x
lsr.w   #8,d0   | d0 = .0000001010001111011 * x
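Both dividers can be checked exhaustively over their stated ranges; here is the divide-by-10 sequence in C (a sketch; the binary-point comments mirror the listing):

#include <assert.h>
#include <stdint.h>

static uint8_t div10(uint8_t x)
{
    uint8_t d0 = x, d1 = x;        /* move.b d0,d1                 */
    d0 = (uint8_t)(d0 + d0);       /* add.b d0,d0:  10 * x         */
    d1 >>= 1;                      /* lsr.b #1,d1:  .1 * x         */
    d0 = (uint8_t)(d0 + d1);       /* add.b d1,d0:  10.1 * x       */
    d0 >>= 3;                      /* lsr.b #3,d0:  .0101 * x      */
    d0 = (uint8_t)(d0 + d1);       /* add.b d1,d0:  .1101 * x      */
    d0 >>= 3;                      /* lsr.b #3,d0:  .0001101 * x   */
    return d0;
}

int main(void)
{
    for (int x = 0; x <= 99; x++)
        assert(div10((uint8_t)x) == x / 10);
    return 0;
}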
References
[1] Aho, A.V., Sethi, R., and Ullman, J.D. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[2] Davidson, J.W. and Fraser, C.W. Automatic Generation of Peephole Optimizations. In Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction, pages 111-116. ACM/SIGPLAN, June 1984.
[3] Kessler, P.B. Discovering Machine-Specific Code Improvements. In Proceedings of the ACM SIGPLAN '86 Symposium on Compiler Construction, pages 249-254. ACM/SIGPLAN, June 1986.
[4] Krumme, D.W. and Ackley, D.H. A Practical Method for Code Generation Based on Exhaustive Search. In Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, pages 185-196. ACM/SIGPLAN, June 1982.