Journal ArticleDOI

Automatic generation of fast optimizing code generators

01 Jun 1988 - Vol. 23, Iss. 7, pp. 79-84
TL;DR: A system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer; the two phases communicate the code to be generated by encoding it in the program counter, which obviates most inter-phase communication costs.
Abstract: This paper describes a system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer. A compiler for most of C uses this package. It emits code comparable to PCC's, but it runs over five times faster on preliminary benchmarks. This compiler also runs over twice as fast as a version of pcc2 with a hand-coded, VAX-specific code generator. The code generators are produced as follows. A programmer describes a naive code generator by means of a non-procedural specification. The programmer also prepares a machine description for a retargetable peephole optimizer [2]. These two systems are used together to compile a testbed, and the compiler records each peephole optimization as it is made. This record and the specification of the naive code generator are compiled into a fast, integrated code generator and optimizer. This production code generator then takes the place of the slower "training" version. The production code generator and optimizer are integrated to the point that the code to be generated is communicated from one to the other by encoding it in the program counter, which obviates most inter-phase communication costs. Interpretive peephole optimizers have been driven by traces from retargetable peephole optimizers [3] and integrated with interpretive code generators [4], but the current work is distinguished by the production of a hard-coded, optimizing code generator. Historically, retargetable code generators (i.e., those not largely rewritten for each new machine) have applied a fixed, compile-time interpreter to tables that have been automatically generated from formal specifications [5]. The code generators described below interpret no tables, which helps them run fast.

Summary (2 min read)

Introduction

  • This paper describes a system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer.
  • The code generators are produced as follows.
  • The programmer also prepares a machine description for a retargetable peephole optimizer [2].

Representation

  • Both the training and production code generators accept the same input - an "abstract syntax dag" built by the front end.
  • The front end has propagated types and folded them into the opcodes (e.g. the I prefix flags integer opcodes) so that the back end need not understand the front end's type system, which is typically more complex than the back end's.
  • On the VAX, for example, the subtree rooted at the ISUB above is ultimately replaced with the instruction subl3 _c,_r,r4, and the rest of the tree is replaced with clrl _up+4*7[r4].
  • The compiler has not yet accommodated full C, but the size of the table may be estimated.
  • The bindings for the pattern variables %0 and %1 are never stored in this node because they are available (after register assignment) in the children's vars fields.

Specifying the Code Generator

  • Here are a few lines from the specification that defines the intermediate code and the naive VAX code generator.
  • Opcodes GLOBAL and moval _%0,r%1 are leaves, and the remaining opcodes above are binary.
  • The presence of a second number indicates that a register must be allocated to hold the target instruction's result.
  • If the intermediate code uses a constant field - in the examples above, GLOBAL needs the name of a global variable and ILT needs a label number - the front end stores it in the appropriate pattern variable.
  • The automatically generated code generators do the rest.

The Training Code Generator

  • Initially, the code generator uses only those opcodes that appeared in the specification of the naive code generator, so the initial opcode list holds exactly the two columns from the specification.
  • This case analysis takes the form of an if-then-else chain that may edit the dag and jump off to the case that handles the new opcode.
  • The goto L37 above is really omitted.
  • This results in redundant assignments to the opcode field when rewrite re-encounters a multiply-referenced node that has been previously traversed and rewritten, but moving the assignment saves more than it sacrifices.
  • These arrays are needed only by the register allocator and output routine, which need to know where to store register names and how many children to traverse.

The Peephole Optimizer and Trace

  • The training routine combine is a retargetable peephole optimizer.
  • It then searches the machine description for an instruction with this combined effect.
  • If the value produced by an instruction is used several times, its cost is divided equally between its users.
  • A full review of this technique is beyond the scope of this paper, but Reference 2 elaborates.
  • The last line above reports that the result register of the new instruction is to be bound to %1.

The Production Code Generator

  • To produce the production system, the code generator generator accepts the trace above and the specification of the naive code generator.
  • It produces an optimizing code generator that is like the naive one presented above, except the opcode list is extended to include all the new instruction variants generated during training, optimizing case analysis is inserted at the head of each case that handles a target instruction, and the call on combine is omitted.
  • It uses b->vars[0] because %1 is the first pattern variable of b that requires local storage.
  • If no optimization applies, control falls off the chain of ifs into code that updates a->op and returns.
  • Case analysis like that above could be generated without training on a testbed.

Discussion

  • Two emerging compilers use the techniques above.
  • One uses a modified pcc as a front end and has largely complete back ends for the VAX and the MC68020.
  • The interface between its front end and generated code generators is somewhat less efficient than that shown above.
  • At present, this compiler runs in about 55% of the time taken by pcc.
  • In a typical run, rewrite currently takes less than 1% of the time taken by pcc.


Automatic Generation of Fast Optimizing Code Generators
Christopher W. Fraser
AT&T Bell Laboratories
Murray Hill, NJ 07974
Alan L. Wendt
Department of Computer Science
University of Arizona
Tucson, AZ 85721
Introduction
This paper describes a system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer. A compiler for most of C uses this package. It emits code comparable to PCC's, but it runs over five times faster on preliminary benchmarks. This compiler also runs over twice as fast as a version of pcc2 with a hand-coded, VAX-specific code generator.
The code generators are produced as follows. A programmer describes a naive code generator by means of a non-procedural specification. The programmer also prepares a machine description for a retargetable peephole optimizer [2]. These two systems are used together to compile a testbed, and the compiler records each peephole optimization as it is made. This record and the specification of the naive code generator are compiled into a fast, integrated code generator and optimizer. This production code generator then takes the place of the slower "training" version. The production code generator and optimizer are integrated to the point that the code to be generated is communicated from one to the other by encoding it in the program counter, which obviates most inter-phase communication costs.
Interpretive peephole optimizers have been driven by traces from retargetable peephole optimizers [3] and integrated with interpretive code generators [4], but the current work is distinguished by the production of a hard-coded, optimizing code generator. Historically, retargetable code generators (i.e., those not largely rewritten for each new machine) have applied a fixed, compile-time interpreter to tables that have been automatically generated from formal specifications [5]. The code generators described below interpret no tables, which helps them run fast.
Representation
Both the training and production code generators accept the same input - an "abstract syntax dag" built by the front end. They use dags rather than trees to accommodate source language features that implicitly reuse values (like C's auto-increment and augmented and multiple assignment) as well as front ends that eliminate common subexpressions as they create nodes. Front ends may confine themselves to trees if the source language permits and if common subexpression elimination is not desired.

The front end compiles, for example, the C statement up[r-c+7]=0 into a tree annotated with intermediate code:
ISET ---- ICONST 0
  |
IADD ---- GLOBAL up
  |
IMUL ---- ICONST 4
  |
IADD ---- ICONST 7
  |
ISUB ---- IDEREF ---- GLOBAL c
  |
IDEREF
  |
GLOBAL r
The front end has propagated types and folded them into the opcodes (e.g. the I prefix flags integer opcodes) so that the back end need not understand the front end's type system, which is typically more complex than the back end's. The front end has also exposed the multiplication implicit in array indexing, so it needs the sizes and alignments of the basic datatypes, but these are easily isolated in a small table.
The code generators rewrite dag nodes in place, replacing the intermediate code with naive and then optimized assembly code. In the example above, each node is first rewritten with a single instruction and then combined with one or more of its descendants via peephole optimization. On the VAX, for example, the subtree rooted at the ISUB above is ultimately replaced with the instruction subl3 _c,_r,r4, and the rest of the tree is replaced with clrl _up+4*7[r4]. That is, the final tree is:

clrl _up+4*7[r4]
  |
subl3 _c,_r,r4

The clrl occupies the node originally occupied by the ISET, and the subl3 occupies the node originally occupied by the ISUB. The actual register assignment for temporaries (like r4 above) is not needed during code generation and optimization, so this task is postponed until these phases complete. Since the same nodes represent intermediate and assembly code, the code generator needs one representation for both.
Assembly code is text, so intermediate opcodes are also represented as text. To avoid the necessity of creating new strings at compile time, the system abstracts constants, identifiers, and register numbers out of the text. For example, the instruction subl3 r2,r3,r4 is represented with the "skeleton" subl3 r%1,r%0,r%2 plus bindings for the "pattern variables" %i. The system enumerates all useful skeletons during training and stores them in a table. Opcodes are thus represented as indices into this string table. The compiler has not yet accommodated full C, but the size of the table may be estimated. A production C compiler generated over 26,000 instructions for an 11,000-line testbed, but used fewer than 900 distinct instruction variants. Intermediate codes and target instructions that are always optimized out might increase this figure somewhat, but even so the table should not exceed 40kb even on the VAX, because the average skeleton takes less than 25 bytes, including four bytes for the pointer to each.
For nodes with n children, the first n pattern variables denote the result registers of the children, and bindings for the rest are stored locally. For example, the instruction subl3 r2,r3,r4 is represented as a node with the following fields:

op = 39        where opcode[39] = "subl3 r%1,r%0,r%2"
kids[0] = pointer to first child
kids[1] = pointer to second child
vars[0] = "4"

The bindings for the pattern variables %0 and %1 are never stored in this node because they are available (after register assignment) in the children's vars fields. Pattern variable %2 is stored in vars[0] because it is the first (and only) pattern variable that needs local storage; this cell is empty until registers are assigned.
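
Pulling these fields together, a minimal sketch of the node structure the text implies is given below; the exact types and array sizes are assumptions, and count is the reference count used later by the production optimizer:

struct node {
    int op;                /* index into the opcode[] skeleton table */
    struct node *kids[2];  /* children; the result register of kids[i] binds %i */
    char *vars[2];         /* locally stored pattern variables, e.g. vars[0] = "4" */
    int count;             /* reference count, maintained as the dag is edited */
};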
Specifying the Code Generator
Here are a few lines from the specification that defines the intermediate code and the naive VAX code generator:

%shape 0 1
GLOBAL    moval _%0,r%1
%shape 2 2
IADD      addl3 r%1,r%0,r%2
ISUB      subl3 r%1,r%0,r%2
%shape 2
ILT       cmpl r%0,r%1; jlss L%2
ISET      movl r%1,(r%0)
...
Except for the %shape directives, this specification forms two columns. The first lists the intermediate code's opcodes, and the second gives equivalent but naive assembly code. Thus the intermediate code IADD is to be replaced with the VAX skeleton addl3 r%1,r%0,r%2, and the intermediate code ILT (for "integer less-than") is to be replaced with the instructions cmpl r%0,r%1 and jlss L%2.

The %shape directives describe features shared by the opcodes that follow. Each lists one or two numbers. The first number specifies the number of children of subsequent opcodes. For example, opcodes GLOBAL and moval _%0,r%1 are leaves, and the remaining opcodes above are binary.

The presence of a second number indicates that a register must be allocated to hold the target instruction's result. The number specifies the pattern variable to which the index of the register must be bound. For example, moval _%0,r%1 needs a register allocated and bound to %1, opcodes addl3 r%1,r%0,r%2 and subl3 r%1,r%0,r%2 need a register allocated and bound to %2, and the remaining instructions above need no result register at all.
When building an abstract syntax dag, the front end sets the opcode fields using values from the first column. If the intermediate code uses a constant field - in the examples above, GLOBAL needs the name of a global variable and ILT needs a label number - the front end stores it in the appropriate pattern variable. The automatically generated code generators do the rest.
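
For illustration, the front end's work for the GLOBAL leaf of the earlier example might look as follows; newnode and intern are hypothetical helpers, and GLOBAL stands for that opcode's index from the first column:

struct node *n = newnode(GLOBAL);  /* opcode field set from the first column */
n->vars[0] = intern("up");         /* GLOBAL's constant field, the name, in %0 */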
The compiler using these code generators is not yet complete, but it appears that a naive code generator for, say, ANSI C will require about three pages of lines like those above. The register allocator is retargeted by changing a table if the machine uses general registers; as with most retargetable code generators, machines with asymmetric register sets may require some recoding.
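
As an illustration of such a table (the name and contents are assumptions, not the paper's), retargeting the allocator for a general-register machine might amount to editing a list like:

char *regname[] = { "r1", "r2", "r3", "r4", "r5" };  /* registers the allocator may use */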
The Training Code Generator
The specification above is automatically compiled into a training code generator, whose general outlines appear below:

char *opcode[MAXOPS] = {
    ...
    /* 36 */ "IADD",
    /* 37 */ "addl3 r%1,r%0,r%2",
    /* 38 */ "ISUB",
    /* 39 */ "subl3 r%1,r%0,r%2",
    ...
};
rewrite(a)
register struct node *a;
{
    switch (a->op) {
    ...
    case 36: L36: /* IADD */
        rewrite(a->kids[0]);
        rewrite(a->kids[1]);
        a->op = 37;
        goto L37;
    case 37: L37: /* addl3 r%1,r%0,r%2 */
        (optimizing case analysis to go here)
        break;
    ...
    }
    combine(a);    (only in training version)
}
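
The surrounding text implies a per-expression driver around rewrite roughly like the following sketch; genregs and emit are assumed names for the postponed register-assignment and output phases, not the paper's:

rewrite(dag);   /* rewrite intermediate code into optimized assembly, in place */
genregs(dag);   /* assumed: bind the registers of temporaries such as r4 */
emit(dag);      /* assumed: print each node's skeleton with its bindings */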
Initially, the code generator uses only those opcodes that appeared in the specification of the naive code generator, so the initial opcode list holds exactly the two columns from the specification.
The routine rewrite is the automatically generated, integrated code generator and optimizer. It accepts a pointer to a dag decorated with the simple intermediate code, and it rewrites the dag in place to represent optimized assembly code. The string opcodes are recoded as a range of contiguous integers primarily so that rewrite can decode them with an efficient switch statement. Each opcode has a distinct case that rewrites its particular opcode and jumps off to the case that handles the new opcode just introduced into the dag.

Cases for intermediate codes recursively rewrite any children, then change the node's opcode field to represent the specification's naive target instruction, and finally jump to the case for that target instruction. The training code generator has no compiled code to improve these instructions, so their cases break out of the switch and call combine, which is a retargetable peephole optimizer [2].
The production code generator replaces the call on combine with hard-coded case analysis in the cases for target instructions. This case analysis takes the form of an if-then-else chain that may edit the dag and jump off to the case that handles the new opcode. An example is presented in due course.

While the code is most easily introduced in the form above, it is actually optimized slightly. The code generator generator does not emit redundant branches, so some cases fall into their successor. (Recall that C cases exit only on an explicit break.) For example, the goto L37 above is really omitted.
Also, the pattern above would have the production code generator's case analysis overwrite a->op (sometimes more than once) before leaving the switch statement. rewrite reads this field only upon entry, so it can be safely out-of-date until the break. Thus the code generator slides the assignment to a->op down just before the break, which guarantees that each invocation of rewrite sets it exactly once. In a sense, the program counter encodes the proper value for the opcode field while control remains inside the switch statement. This results in redundant assignments to the opcode field when rewrite re-encounters a multiply-referenced node that has been previously traversed and rewritten, but moving the assignment saves more than it sacrifices.
Two arrays not shown parallel the opcode array. They record for each opcode the number of children and the number of the pattern variable that denotes any result register. rewrite does not need these arrays because their values are compiled into the code; for example, the IADD case has the proper number of recursive calls compiled in, so it need examine no table to learn how many children it has. These arrays are needed only by the register allocator and output routine, which need to know where to store register names and how many children to traverse. Flags in the nodes (namely, zeros in the first unused slots in kids and vars) were used initially but rejected because maintaining them cost almost as much as maintaining the useful data.
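
A sketch of those two parallel arrays, with assumed names; per the text, only the register allocator and the output routine consult them:

char nkids[MAXOPS];   /* nkids[op]: how many children to traverse */
char resvar[MAXOPS];  /* resvar[op]: which pattern variable %i names the result register */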
The Peephole Optimizer and Trace
The training routine combine is a retargetable peephole optimizer. A programmer captures the semantics of the target machine's instructions in a bi-directional grammar for translation between assembly language and register transfers. A machine-independent optimizer uses this machine description to translate pairs and triples of assembler skeletons to register transfer skeletons, which it symbolically simulates to learn their combined effect. It then searches the machine description for an instruction with this combined effect. If it finds one whose cost does not exceed the cost of the original instructions, it rewrites the dag to use the new instruction. If the value produced by an instruction is used several times, its cost is divided equally between its users. A full review of this technique is beyond the scope of this paper, but Reference 2 elaborates. The current implementation adds instruction costs, and the machine descriptions have been re-engineered so that, for example, the current, nearly complete VAX description takes only 59 lines.
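
For flavor only - the actual notation of the machine description in Reference 2 differs - an entry pairing an assembly skeleton with its register-transfer effect and a cost might read:

addl3 r%1,r%0,r%2        r[%2] = r[%0] + r[%1];        cost 1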
During training, the optimizer records every optimization. For example, when it replaces moval _%0,r%1 and movl (r%0),r%1 with movl _%0,r%1 (the moval is the first child of the movl, so the former's result register, r%1, is denoted by r%0 in the latter), the optimizer adds the following record to its growing optimization trace:

self==movl (r%0),r%1
kid0==moval _%0,r%1
new=movl _%0,r%1
refs<=1
a0=b0
a1=a1
result=1
The first three lines are self-explanatory. The fourth reports that, according to the cost metric in the machine description, the optimization pays off only if the child is referenced just once. The next two lines note that the new instruction's %0 is the old child's %0, and the new instruction's %1 is the old parent's %1. The last line above reports that the result register of the new instruction is to be bound to %1. The specification of the code generator names the pattern variable corresponding to the result register for each naive instruction, but the new instruction above has not been seen before, so the optimizer must infer and report the pattern variable corresponding to its result register.
The Production Code Generator
To produce the production system, the code generator generator accepts the trace above and the specification of the naive code generator. It produces an optimizing code generator that is like the naive one presented above, except the opcode list is extended to include all the new instruction variants generated during training, optimizing case analysis is inserted at the head of each case that handles a target instruction, and the call on combine is omitted. Here, for example, are the production versions of the cases presented above:
case 36: L36: /* IADD */
    rewrite(a->kids[0]);
    rewrite(a->kids[1]);
case 37: L37: /* addl3 r%1,r%0,r%2 */
    b = a->kids[0];
    if (b->op == 127 /* mull3 $%1,r%0,r%2 */
        && b->vars[0] == CON4) {
        a->kids[0] = b->kids[0];
        goto L93; /* moval (r%1)[r%0],r%2 */
    }
    if (...
    a->op = 37;
    break;
The conditional looks for a sequence that multiplies a register by four and adds it to another register. The expression b->vars[0] == CON4 compares the %1 from mull3 $%1,r%0,r%2 with the constant string "4". It uses b->vars[0] because %1 is the first pattern variable of b that requires local storage. Strings are stored uniquely in a constant table so that an address comparison can be substituted for what would otherwise be a character-by-character comparison. If the conditional succeeds, the dag is rewritten in place, so the "then" arm overwrites a's fields. In this case, the new values of %1 and %2 are the same as the old ones, so only the change to %0 requires code, which promotes a grandchild.
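
A minimal sketch of such unique storage, i.e. string interning, appears below; the table size, hash function, and names are assumptions. With every string routed through intern, equal strings share one address, so CON4 is simply intern("4") and the comparison above is a single pointer test:

#include <string.h>

char *intern(char *s)
{
    static char *tab[1024];            /* sketch only: no overflow handling */
    unsigned h = 0;
    char *p;
    for (p = s; *p; p++)
        h = h*31 + *p;                 /* simple string hash */
    for (h %= 1024; tab[h]; h = (h+1) % 1024)
        if (strcmp(tab[h], s) == 0)
            return tab[h];             /* seen before: return the stored copy */
    return tab[h] = strdup(s);         /* first occurrence: store and return */
}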
If the conditional fails, the code generator looks for another pattern, at the point of the ellipsis above. If no optimization applies, control falls off the chain of ifs into code that updates a->op and returns.
In the optimization above, the new instruction costs no more than the one originally pointed to by a, so the replacement pays off regardless of the number of uses of b. When the new instruction costs more than a, the replacement generally pays off when a + b/n >= c, where n is the number of uses of b, and a, b, and c denote the costs of a, b, and the new instruction, respectively. All but n are known when the compiler is generated, so the code generator generator computes the largest n for which the replacement pays off and inserts a clause like b->count <= 2 in the optimization's enabling condition (e.g. after the comparison with CON4 above). Different cost metrics (like space, expected time, worst-case time) yield different comparands.
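
To make the arithmetic concrete with assumed unit costs: if a costs 1, b costs 4, and the new instruction costs c = 2, then a + b/n >= c reduces to n <= 4, so the generated clause would be b->count <= 4.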
To support such comparisons, the code generator maintains reference counts as it edits the dag. Consider the example above. It edits the dag so that a references b->kids[0] instead of b. Thus it is necessary to decrement b->count. If the result is zero, then all reference counts are correct: node b is vanishing, but a inherits b's references to its children, so these children have the same number of references before and after the edit. But if --b->count exceeds zero, then b is referenced elsewhere. It still references its children, and now a will too, so the reference counts for b's children must be incremented. Thus the actual then-clause above is

if (--b->count)
    ++b->kids[0]->count;
a->kids[0] = b->kids[0];
goto L93;    /* moval (r%1)[r%0],r%2 */
In cases where b points to a leaf, the counts are maintained with just --b->count. And in cases where the optimization's enabling condition establishes that b->count was one, then even the --b->count is omitted.
Node storage is not reclaimed above because even the simplest implementation consumed almost as much time as the case analysis itself. The compiler thus allocates nodes from a fixed pool and then frees the entire pool at once at the end of the expression, block, or procedure. (All three of these compilation units have been used with this system.)
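
A minimal sketch of that allocation discipline, with assumed names and pool size:

static struct node pool[8192];   /* fixed pool; the size is an assumption */
static int nused;                /* nodes handed out so far */

struct node *newnode(int op)
{
    struct node *p = &pool[nused++];  /* no per-node free() bookkeeping */
    p->op = op;
    p->count = 0;
    return p;
}

void freeall(void)    /* called at the end of the expression, block, or procedure */
{
    nused = 0;        /* release the entire pool at once */
}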
The case analysis above is close to typical. An "average" one has two comparisons, two assignments, and a simple --b->count. A few perform no assignments at all, because all important fields are already in the right place. Of course, an assignment to a->op occurs just before control leaves the switch.
The code generator is fast. a and b are in registers, so each line above takes just one or two VAX instructions, and the entire fragment takes just 17. It has not yet been possible to compile a thorough testbed, but it appears that a complete rewrite should not require more than 60kb.
It is also possible to eliminate most of the jumps above. Rather than ending a change with goto Ln, the code generator generator could simply place case n and its code at the point of the goto. Since most labels are the target of exactly one goto, most of the branches would vanish. This optimization is performed by some existing compilers.
Case analysis like that above could be generated without training on a testbed. The trace encodes simple peephole optimization rules, and there exist mechanisms for enumerating such rules without training on a testbed [6, 7]. These mechanisms are immune to training failures, which can cause the production system to emit code that is sub-optimal (but never incorrect). Experiments have shown that training failures are rare [3], and training does have advantages. It allows the production system to test only rules known to have been useful, and it allows the code generator generator to sort if-then-else chains so that the most common patterns are tested first.
The compiler above gets all of its optimizations from a record of replacements made by a retargetable peephole optimizer, but it could easily accept rewriting rules from other sources as well. The system has already been adapted to accept hand-written optimization rules, and it is a natural client for rules discovered by exhaustive enumeration [8].
Discussion
Two emerging compilers use the techniques above. One uses a modified pcc as a front end and has largely complete back ends for the VAX and the MC68020. The interface between its front end and generated code generators is somewhat less efficient than that shown above. At present, this compiler runs in about 55% of the time taken by pcc. The other compiler uses a new front end and precisely
Citations
Journal ArticleDOI
TL;DR: This article presents a framework for combining constant propagation, value numbering, and unreachable-code elimination, and shows how to combine two such frameworks and how to reason about the properties of the resulting framework.
Abstract: Modern optimizing compilers use several passes over a program's intermediate representation to generate good code. Many of these optimizations exhibit a phase-ordering problem. Getting the best code may require iterating optimizations until a fixed point is reached. Combining these phases can lead to the discovery of more facts about the program, exposing more opportunities for optimization. This article presents a framework for describing optimizations. It shows how to combine two such frameworks and how to reason about the properties of the resulting framework. The structure of the framework provides insight into when a combination yields better results. To make the ideas more concrete, this article presents a framework for combining constant propagation, value numbering, and unreachable-code elimination. It is an open question as to what other frameworks can be combined in this way.

173 citations


Cites background from "Automatic generation of fast optimi..."

  • ...These are called peephole optimizations because the compiler looks through a “peephole”, a very small window, into the code [16, 42, 17, 23, 24]....


Journal ArticleDOI
TL;DR: A framework that enables the exploration, both analytically and experimentally, of properties of code-improving transformations and a tool that automatically produces a transformer that implements the transformations specified in Gospel is presented.
Abstract: Although code transformations are routinely applied to improve the performance of programs for both scalar and parallel machines, the properties of code-improving transformations are not well understood. In this article we present a framework that enables the exploration, both analytically and experimentally, of properties of code-improving transformations. The major component of the framework is a specification language, Gospel, for expressing the conditions needed to safely apply a transformation and the actions required to change the code to implement the transformation. The framework includes a technique that facilitates an analytical investigation of code-improving transformations using the Gospel specifications. It also contains a tool, Genesis, that automatically produces a transformer that implements the transformations specified in Gospel. We demonstrate the usefulness of the framework by exploring the enabling and disabling properties of transformations. We first present analytical results on the enabling and disabling properties of a set of code transformations, including both traditional and parallelizing transformations, and then describe experimental results showing the types of transformations and the enabling and disabling interactions actually found in a set of programs.

130 citations

Proceedings ArticleDOI
06 Apr 2008
TL;DR: NOLTIS is a near-optimal, linear time instruction selection algorithm for DAG expressions that is easy to implement, fast, and effective with a demonstrated average code size improvement of 5.1% compared to the traditional tree decomposition and tiling approach.
Abstract: Instruction selection is a key component of code generation. High quality instruction selection is of particular importance in the embedded space where complex instruction sets are common and code size is a prime concern. Although instruction selection on tree expressions is a well understood and easily solved problem, instruction selection on directed acyclic graphs is NP-complete. In this paper we present NOLTIS, a near-optimal, linear time instruction selection algorithm for DAG expressions. NOLTIS is easy to implement, fast, and effective with a demonstrated average code size improvement of 5.1% compared to the traditional tree decomposition and tiling approach.

30 citations

Journal ArticleDOI
TL;DR: A program that compiles BURS tables into a combination of hard code and data is described, which is not just faster but also significantly smaller than their predecessors.
Abstract: SUMMARY Code generators based on bottom-up rewrite systems (BURS) are automatically generated from machinedescription grammars. They produce locally optimal code for expression trees, but their tables are large and require compile-time interpretation. This paper describes a program that compiles BURS tables into a combination of hard code and data. Hard-coding exposed important opportunities for compression that were previously hidden in the tables, so the hard-coded code generators are not just faster but also significantly smaller than their predecessors. A VAX code generator takes 21.4Kbytes and identifies optimal assembly code in about 50 VAX instructions per node.

28 citations

Patent
12 Feb 2008
TL;DR: In this article, an efficient binary translator uses peephole translation rules to directly translate executable code from one instruction set to another, using superoptimization techniques that enable the translator to automatically learn translation rules for translating code from the source to target instruction set architecture.
Abstract: An efficient binary translator uses peephole translation rules to directly translate executable code from one instruction set to another. In a preferred embodiment, the translation rules are generated using superoptimization techniques that enable the translator to automatically learn translation rules for translating code from the source to target instruction set architecture.

25 citations

References
Proceedings ArticleDOI
01 Jun 1984
TL;DR: Some researchers rely heavily on the semantic components [GaFS~], while others, including ourselves, have emphasized the use of syntax.
Abstract: 3. Syntax and Semantics. A major issue in designing a Graham-Glanville style code generator is the degree of syntactic or semantic specification of the target machine. Broadly speaking, the "syntactic" component of the specification is the machine description grammar. The "semantic" components are the semantic attributes, semantic predicates, semantic actions, and evaluation order constraints influencing the parsing actions. Any machine description methodology is likely to use the same information, but the information is described and considered by the parser in different ways. Some researchers rely heavily on the semantic components [GaFS~], while others, including ourselves, have emphasized the use of syntax.

30 citations

Proceedings ArticleDOI
01 Jun 1984
TL;DR: This global flow analysis allows optimization across basic blocks of instructions, and the use of tables created at compiler-generation time minimizes the overhead of discovering optimizable instructions.
Abstract: Peep is an architectural description driven peephole optimizer, that is being adapted for use in the Portable Standard Lisp compiler. Tables of optimizable instructions are generated prior to the creation of the compiler from the architectural description of the target machine. Peep then performs global flow analysis on the target machine code and optimizes instructions as defined in the table. This global flow analysis allows optimization across basic blocks of instructions, and the use of tables created at compiler-generation time minimizes the overhead of discovering optimizable instructions.

28 citations

Proceedings ArticleDOI
01 Jul 1986
TL;DR: A compiler with a code generator and machine-directed peephole optimizer that are tightly integrated that helps make the compiler simple, fast, and retargetable.
Abstract: This paper describes a compiler with a code generator and machine-directed peephole optimizer that are tightly integrated. Both functions are performed by a single rule-based rewriting system that matches and replaces patterns. This organization helps make the compiler simple, fast, and retargetable. It also corrects certain phase-ordering problems.

25 citations


"Automatic generation of fast optimi..." refers methods in this paper

  • ...Interpretive peephole optimizers have been driven by traces from retargetable peephole optimizers [3] and integrated with interpretive code generators [4], but the current work is distinguished by the production of a hard-coded, optimizing code generator....


Journal ArticleDOI
TL;DR: This paper describes a system that automatically infers rules by tracking the behaviour of a description‐directed optimizer on a testbed, and it adapts a classical optimizer to interpret these rules efficiently.
Abstract: Peephole optimizers that are driven by machine descriptions are generally more thorough but less efficient than their classical rule-directed counterparts. This paper describes a system that addresses this shortcoming. It automatically infers rules by tracking the behaviour of a description-directed optimizer on a testbed, and it adapts a classical optimizer to interpret these rules efficiently. Experiments show that an easily constructed testbed can generate rules similar to those in a large hand-written rulebase. This software forms part of a compiler that simplifies retargeting by substituting peephole optimization for case analysis.

20 citations


"Automatic generation of fast optimi..." refers background or methods in this paper

  • ...Experiments have shown that training failures are rare [3], and training does have advantages....


  • ...Interpretive peephole optimizers have been driven by traces from retargetable peephole optimizers [3] and integrated with interpretive code generators [4], but the current work is distinguished by the production of a hard-coded, optimizing code generator....


Proceedings ArticleDOI
01 Jul 1986
TL;DR: A compiler construction tool that automates much of the case analysis necessary to exploit special purpose instructions on a target machine is designed and built, and a working prototype of the instruction set analyzer needed in the framework outlined by [Giegerich 83].
Abstract: I have designed and built a compiler construction tool that automates much of the case analysis necessary to exploit special purpose instructions on a target machine. Given a suitable description of the target machine, my analysis identifies instruction sequences that are equivalent to single instructions. During code generation, these equivalences can be used to avoid inefficient instruction sequences in favor of more efficient instructions.I present a working prototype of the instruction set analyzer needed in the framework outlined by [Giegerich 83]. In contrast to the work presented in [Davidson and Fraser 80, 84], I analyze machine descriptions during compiler construction, rather than analyzing instruction sequences that occur during code generation. [R Kessler 84] describes a system which analyzes machine descriptions during compiler construction, but which which is limited to discovering instructions that are equivalent to instruction sequences of length 2. The techniques presented here can identify instruction sequences of arbitrary length that are equivalent to single instructions.I have applied this analysis to the descriptions of two machines, and used the results to replace hand-written case analysis routines in an otherwise table-driven code generator [Henry 84].

18 citations


"Automatic generation of fast optimi..." refers methods in this paper

  • ...Pennello has described a technique for replacing an LR parsing table and its interpreter with equivalent optimized assembly code [9]....


Frequently Asked Questions (1)
Q1. What have the authors contributed in "Automatic generation of fast optimizing code generators" ?

This paper describes a system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer. The code generators are produced as follows.