
Journal ArticleDOI

Automatic generation of fast optimizing code generators

01 Jun 1988-Vol. 23, Iss: 7, pp 79-84

TL;DR: A system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer, which obviates most inter-phase communication costs.
Abstract: This paper describes a system that accepts compact specifications of an intermediate code and target machine and produces program code for an integrated code generator and peephole optimizer. A compiler for most of C uses this package. It emits code comparable to pcc1’s, but it runs over five times faster on preliminary benchmarks. This compiler also runs over twice as fast as a version of pcc2 with a hand-coded, VAX-specific code generator. The code generators are produced as follows. A programmer describes a naive code generator by means of a non-procedural specification. The programmer also prepares a machine description for a retargetable peephole optimizer [2]. These two systems are used together to compile a testbed, and the compiler records each peephole optimization as it is made. This record and the specification of the naive code generator are compiled into a fast, integrated code generator and optimizer. This production code generator then takes the place of the slower “training” version. The production code generator and optimizer are integrated to the point that the code to be generated is communicated from one to the other by encoding it in the program counter, which obviates most inter-phase communication costs. Interpretive peephole optimizers have been driven by traces from retargetable peephole optimizers [3] and integrated with interpretive code generators [4], but the current work is distinguished by the production of a hard-coded, optimizing code generator. Historically, retargetable code generators (i.e., those not largely rewritten for each new machine)
Topics: Peephole optimization (70%), Code generation (70%), Compiler (59%), Peephole (54%), Code (cryptography) (51%)



Automatic Generation of Fast Optimizing Code Generators
Christopher W. Fraser
AT&T Bell Laboratories
Murray Hill, NJ 07974
Alan L. Wendt
Department of Computer Science
University of Arizona
Tucson, AZ 85721
Introduction
This paper describes a system that accepts compact
specifications of an intermediate code and target
machine and produces program code for an inte-
grated code generator and peephole optimizer. A
compiler for most of C uses this package. It emits
code comparable to pcc1’s, but it runs over five
times faster on preliminary benchmarks. This com-
piler also runs over twice as fast as a version of pcc2
with a hand-coded, VAX-specific code generator.
The code generators are produced as follows. A
programmer describes a naive code generator by
means of a non-procedural specification. The pro-
grammer also prepares a machine description for a
retargetable peephole optimizer [2]. These two sys-
tems are used together to compile a testbed, and the
compiler records each peephole optimization as it is
made. This record and the specification of the naive
code generator are compiled into a fast, integrated
code generator and optimizer. This production code
generator then takes the place of the slower “train-
ing” version. The production code generator and
optimizer are integrated to the point that the code
to be generated is communicated from one to the
other by encoding it in the program counter, which
obviates most inter-phase communication costs.
Interpretive peephole optimizers have been driv-
en by traces from retargetable peephole optimizers
[3] and integrated with interpretive code generators
[4], but the current work is distinguished by the
production of a hard-coded, optimizing code generator.
Historically, retargetable code generators (i.e.,
those not largely rewritten for each new machine)
have applied a fixed, compile-time interpreter to tables
that have been automatically generated from
formal specifications [5]. The code generators described
below interpret no tables, which helps them
run fast.
Permission to copy without fee all or part of this material is granted provided
that the copies are not made or distributed for direct commercial advantage,
the ACM copyright notice and the title of the publication and its date appear,
and notice is given that copying is by permission of the Association for
Computing Machinery. To copy otherwise, or to republish, requires a fee and/
or specific permission.
© 1988 ACM 0-89791-269-1/88/0006/0079 $1.50
Language Design and Implementation
Atlanta, Georgia, June 22-24, 1988
Representation
Both the training and production code generators
accept the same input - an “abstract syntax dag”
built by the front end. They use dags rather than
trees to accommodate source language features that
implicitly reuse values (like C’s auto-increment and
augmented and multiple assignment) as well as
front ends that eliminate common subexpressions
as they create nodes. Front ends may confine them-
selves to trees if the source language permits and if
common subexpression elimination is not desired.
The front end compiles, for example, the C statement
up[r-c+7]=0 into a tree annotated with intermediate
code:
    ISET → ICONST 0
     ↓
    IADD → GLOBAL up
     ↓
    IMUL → ICONST 4
     ↓
    IADD → ICONST 7
     ↓
    ISUB → IDEREF → GLOBAL c
     ↓
    IDEREF
     ↓
    GLOBAL r
The front end has propagated types and folded
them into the opcodes (e.g. the I prefix flags integer
opcodes) so that the back end need not understand
the front end’s type system, which is typically more
complex than the back end’s. The front end has
also exposed the multiplication implicit in array indexing,
so it needs the sizes and alignments of the
basic datatypes, but these are easily isolated in a
small table.
The code generators rewrite dag nodes in place,
replacing the intermediate code with naive and then
optimized assembly code. In the example above,
each node is first rewritten with a single instruction
and then combined with one or more of its descendants
via peephole optimization. On the VAX,
for example, the subtree rooted at the ISUB above
is ultimately replaced with the instruction subl3
_c,_r,r4, and the rest of the tree is replaced with
clrl _up+4*7[r4]. That is, the final tree is:
    clrl _up+4*7[r4]
     ↓
    subl3 _c,_r,r4
The clrl occupies the node originally occupied by
the ISET, and the subl3 occupies the node originally
occupied by the ISUB. The actual register
assignment for temporaries (like r4 above) is not
needed during code generation and optimization, so
this task is postponed until these phases complete.
Since the same nodes represent intermediate and
assembly code, the code generator needs one rep-
resentation for both.
Assembly code is text, so
intermediate opcodes are also represented as text.
To avoid the necessity of creating new strings at
compile time, the system abstracts constants, identifiers,
and register numbers out of the text. For
example, the instruction subl3 r2,r3,r4 is represented
with the “skeleton” subl3 r%1,r%0,r%2
plus bindings for the “pattern variables” %i. The
system enumerates all useful skeletons during training
and stores them in a table. Opcodes are thus
represented as indices into this string table. The
compiler has not yet accommodated full C, but the
size of the table may be estimated. A production C
compiler generated over 26,000 instructions for an
11,000-line testbed, but used fewer than 900 distinct
instruction variants. Intermediate codes and target
instructions that are always optimized out might
increase this figure somewhat, but even so the table
should not exceed 40kb even on the VAX, because
the average skeleton takes less than 25 bytes, in-
cluding four bytes for the pointer to each.
For nodes with n children, the first n pattern
variables denote the result registers of the children,
and bindings for the rest are stored locally. For
example, the instruction subl3 r2,r3,r4 is represented
as a node with the following fields:
    op = 39 where opcode[39] =
        “subl3 r%1,r%0,r%2”
    kids[0] = pointer to first child
    kids[1] = pointer to second child
    vars[0] = “4”
The bindings for the pattern variables %0 and %1 are
never stored in this node because they are available
(after register assignment) in the children’s vars
fields. Pattern variable %2 is stored in vars[0] because
it is the first (and only) pattern variable that
needs local storage; this cell is empty until registers
are assigned.
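The node layout above can be sketched in C as follows. This is a minimal illustration rather than the authors' code: only the field names op, kids, and vars come from the text, while the limits MAXKIDS and MAXVARS, the two-entry skeleton table, the parallel arrays nkids and result, and the helpers binding and expand are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define MAXKIDS 2   /* assumed limit; the paper does not state one */
#define MAXVARS 3   /* assumed limit */

struct node {
    int op;                      /* index into the skeleton table */
    struct node *kids[MAXKIDS];  /* children */
    const char *vars[MAXVARS];   /* locally stored pattern variables */
};

/* Hypothetical two-entry skeleton table; the real one is generated. */
static const char *opcode[] = {
    "moval _%0,r%1",       /* 0: leaf, %0 and %1 are stored locally */
    "subl3 r%1,r%0,r%2",   /* 1: binary, only %2 is stored locally */
};
static const int nkids[]  = { 0, 2 };  /* children per opcode */
static const int result[] = { 1, 2 };  /* pattern variable naming the result register */

/* Binding of pattern variable i: a child's result register when i < n,
   otherwise the node's own vars[i - n]. */
static const char *binding(const struct node *a, int i)
{
    int n = nkids[a->op];
    if (i < n) {
        const struct node *k = a->kids[i];
        return k->vars[result[k->op] - nkids[k->op]];
    }
    return a->vars[i - n];
}

/* Expand a node's skeleton into buf (capacity len), replacing each %digit. */
static void expand(const struct node *a, char *buf, size_t len)
{
    size_t used = 0;
    if (len == 0)
        return;
    buf[0] = '\0';
    for (const char *s = opcode[a->op]; *s && used + 1 < len; s++) {
        if (s[0] == '%' && s[1] >= '0' && s[1] <= '9') {
            const char *b = binding(a, s[1] - '0');
            size_t n = strlen(b);
            if (used + n >= len)
                break;              /* out of room: stop with buf terminated */
            memcpy(buf + used, b, n);
            used += n;
            s++;                    /* skip the digit */
        } else {
            buf[used++] = *s;
        }
        buf[used] = '\0';
    }
}
```

With two moval leaves whose result registers are bound to "3" and "2" and a parent whose vars[0] is "4", expand on the parent yields the text of subl3 r2,r3,r4, which mirrors how an output routine could print instructions once registers are assigned.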
Specifying the Code Generator
Here are a few lines from the specification that defines
the intermediate code and the naive VAX code
generator:
    %shape 0 1
    GLOBAL  moval _%0,r%1
    %shape 2 2
    IADD    addl3 r%1,r%0,r%2
    ISUB    subl3 r%1,r%0,r%2
    %shape 2
    ILT     cmpl r%0,r%1; jlss L%2
    ISET    movl r%1,(r%0)
    . . .
Except for the %shape directives, this specification
forms two columns. The first lists the intermediate
code’s opcodes, and the second gives equivalent but
naive assembly code. Thus the intermediate code
IADD is to be replaced with the VAX skeleton addl3
r%1,r%0,r%2, and the intermediate code ILT (for
“integer less-than”) is to be replaced with the instructions
cmpl r%0,r%1 and jlss L%2.
The %shape directives describe features shared
by the opcodes that follow. Each lists one or two
numbers. The first number specifies the number
of children of subsequent opcodes.
For example,
opcodes GLOBAL and moval _%0,r%1 are leaves, and
the remaining opcodes above are binary.
The presence of a second number indicates that
a register must be allocated to hold the target instruction’s
result. The number specifies the pattern
variable to which the index of the register must
be bound. For example, moval _%0,r%1 needs a
register allocated and bound to %1, opcodes addl3
r%1,r%0,r%2 and subl3 r%1,r%0,r%2 need a register
allocated and bound to %2, and the remaining
instructions above need no result register at all.
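As a rough illustration of how such a specification could be read, the sketch below parses spec lines into an opcode list with two parallel arrays. It is an assumption-laden example, not the authors' code generator generator: the names parse_line, opname, nkids, and result, the fixed table sizes, and the choice to give each intermediate opcode the same shape as its naive target instruction are all hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAXOPS 64                 /* assumed table capacity */

static char opname[MAXOPS][128];  /* intermediate opcode, then its skeleton */
static int nkids[MAXOPS];         /* number of children per opcode */
static int result[MAXOPS];        /* result pattern variable, or -1 for none */
static int nops;

/* Parse one spec line. A %shape line updates the current shape (*kids, *res);
   any other line appends an intermediate opcode and its naive skeleton.
   Returns 0 on success and -1 on a malformed line or a full table. */
static int parse_line(const char *line, int *kids, int *res)
{
    char ic[32], skel[96];
    int k = 0, r = 0;

    if (strncmp(line, "%shape", 6) == 0) {
        int got = sscanf(line + 6, "%d %d", &k, &r);
        if (got < 1)
            return -1;
        *kids = k;
        *res = (got == 2) ? r : -1;   /* no second number: no result register */
        return 0;
    }
    if (sscanf(line, "%31s %95[^\n]", ic, skel) != 2 || nops + 2 > MAXOPS)
        return -1;
    strcpy(opname[nops], ic);
    nkids[nops] = *kids;
    result[nops] = *res;
    nops++;
    strcpy(opname[nops], skel);
    nkids[nops] = *kids;
    result[nops] = *res;
    nops++;
    return 0;
}
```

Feeding the sample lines above through parse_line in order leaves six table entries, with each intermediate opcode immediately followed by its naive skeleton, which matches the numbering used in the generated opcode array shown in the next section.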
When building an abstract syntax dag, the front
end sets the opcode fields using values from the first
column. If the intermediate code uses a constant
field - in the examples above, GLOBAL needs the
name of a global variable and ILT needs a label
number - the front end stores it in the appropriate
pattern variable. The automatically generated code
generators do the rest.
The compiler using these code generators is not
yet complete, but it appears that a naive code gen-
erator for, say, ANSI C will require about three
pages of lines like those above. The register alloca-
tor is retargeted by changing a table if the machine
uses general registers; as with most retargetable
code generators, machines with asymmetric regis-
ter sets may require some recoding.
The Training Code Generator
The specification above is automatically compiled
into a training code generator, whose general out-
lines appear below:
    char *opcode[MAXOPS] = {
        . . .
        /* 36 */ "IADD",
        /* 37 */ "addl3 r%1,r%0,r%2",
        /* 38 */ "ISUB",
        /* 39 */ "subl3 r%1,r%0,r%2",
        . . .
    };

    rewrite(a)
    register struct node *a;
    {
        switch (a->op) {
        . . .
        case 36: L36: /* IADD */
            rewrite(a->kids[0]);
            rewrite(a->kids[1]);
            a->op = 37;
            goto L37;
        case 37: L37:
            /* addl3 r%1,r%0,r%2 */
            (optimizing case analysis to go here)
            break;
        . . .
        }
        combine(a); (only in training version)
    }
Initially, the code generator uses only those opcodes
that appeared in the specification of the naive code
generator, so the initial opcode list holds exactly
the two columns from the specification.
The routine rewrite is the automatically gener-
ated, integrated code generator and optimizer. It
accepts a pointer to a dag decorated with the sim-
ple intermediate code, and it rewrites the dag in
place to represent optimized assembly code. The
string opcodes are recoded as a range of contigu-
ous integers primarily so that rewrite can decode
them with an efficient switch statement. Each op-
code has a distinct case that rewrites its particular
opcode and jumps off to the case that handles the
new opcode just introduced into the dag.
Cases for intermediate codes recursively rewrite
any children, then change the node’s opcode field
to represent the specification’s naive target instruc-
tion, and finally jump to the case for that target in-
struction. The training code generator has no com-
piled code to improve these instructions, so their
cases break out of the switch and call combine,
which is a retargetable peephole optimizer [2].
The production code generator replaces the call
on combine with hard-coded case analysis in the
cases for target instructions. This case analysis
takes the form of an if-then-else chain that may edit
the dag and jump off to the case that handles the
new opcode. An example is presented in due course.
While the code is most easily introduced in the
form above, it is actually optimized slightly. The
code generator generator does not emit redundant
branches, so some cases fall into their successor.
(Recall that C cases exit only on an explicit break.)
For example, the goto L37 above is really omitted.
Also, the pattern above would have the pro-
duction code generator’s case analysis overwrite
a->op (sometimes more than once) before leaving
the switch statement. rewrite reads this field only
upon entry, so it can be safely out-of-date until the
break. Thus the code generator slides the assign-
ment to a->op down just before the break, which
guarantees that each invocation of rewrite sets it
exactly once. In a sense, the program counter en-
codes the proper value for the opcode field while
control remains inside the switch statement. This
results in redundant assignments to the opcode field
when rewrite re-encounters a multiply-referenced
node that has been previously traversed and rewrit-
ten, but moving the assignment saves more than it
sacrifices.
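The two refinements just described (cases that fall into their successor, and the assignment to a->op slid down to the break) can be seen in a small compilable miniature of the training version. This is a sketch under assumptions: the LEAF opcode, the two-field node, and the counting stub that stands in for combine are invented for the example, and only the opcode numbers 36 through 39 come from the listing above.

```c
#include <stddef.h>

/* Opcode numbers 36..39 follow the listing above; LEAF is made up. */
enum { LEAF = 0, IADD = 36, ADDL3 = 37, ISUB = 38, SUBL3 = 39 };

struct node {
    int op;
    struct node *kids[2];
};

static int combine_calls;   /* counts calls to the stand-in optimizer */

/* Stand-in for the retargetable peephole optimizer; it only counts. */
static void combine(struct node *a)
{
    (void)a;
    combine_calls++;
}

static void rewrite(struct node *a)
{
    switch (a->op) {
    case IADD:
        rewrite(a->kids[0]);
        rewrite(a->kids[1]);
        /* fall through: the redundant "goto L37" is not emitted */
    case ADDL3:
        /* (optimizing case analysis would go here in production) */
        a->op = ADDL3;      /* assignment slid down to just before the break */
        break;
    case ISUB:
        rewrite(a->kids[0]);
        rewrite(a->kids[1]);
        /* fall through */
    case SUBL3:
        a->op = SUBL3;
        break;
    default:                /* leaves need no rewriting in this sketch */
        break;
    }
    combine(a);             /* only in the training version */
}
```

Calling rewrite a second time on a node that already holds ADDL3 enters the target case directly and stores the same opcode again, which is the redundant assignment the text accepts as the cheaper trade-off.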
Two arrays not shown parallel the opcode array.
They record for each opcode the number of chil-
dren and the number of the pattern variable that
denotes any result register. rewrite does not need
these arrays because their values are compiled into
the code; for example, the IADD case has the proper
number of recursive calls compiled in, so it need ex-
amine no table to learn how many children it has.
These arrays are needed by only the register allocator
and output routine, which need to know where
to store register names and how many children to
traverse. Flags in the nodes (namely, zeros in the
first unused slots in kids and vars) were used initially
but rejected because maintaining them cost
almost as much as maintaining the useful data.
The Peephole Optimizer and Trace
The training routine combine is a retargetable peep-
hole optimizer. A programmer captures the se-
mantics of the target machine’s instructions in a
bi-directional grammar for translation between as-
sembly language and register transfers. A machine-
independent optimizer uses this machine descrip-
tion to translate pairs and triples of assembler skele-
tons to register transfer skeletons, which it sym-
bolically simulates to learn their combined effect.
It then searches the machine description for an in-
struction with this combined effect. If it finds one
whose cost does not exceed the cost of the original
instructions, it rewrites the dag to use the new in-
struction. If the value produced by an instruction
is used several times, its cost is divided equally be-
tween its users. A full review of this technique is be-
yond the scope of this paper, but Reference 2 elab-
orates. The current implementation adds instruc-
tion costs and machine descriptions re-engineered
so that, for example, the current, nearly complete
VAX description takes only 59 lines.
During training, the optimizer records every optimization.
For example, when it replaces moval
_%0,r%1 and movl (r%0),r%1 with
    movl _%0,r%1
(the moval is the first child of the movl, so the former’s
result register, r%1, is denoted by r%0 in the
latter), the optimizer adds the following record to
its growing optimization trace:
    self==movl (r%0),r%1
    kid0==moval _%0,r%1
    new=movl _%0,r%1
    refs<=1
    a0=b0
    a1=a1
    result=1
The first three lines are self-explanatory. The fourth
reports that, according to the cost metric in the machine
description, the optimization pays off only if
the child is referenced just once. The next two lines
note that the new instruction’s %0 is the old child’s
%0, and the new instruction’s %1 is the old parent’s
%1. The last line above reports that the result register
of the new instruction is to be bound to %1.
The specification of the code generator names the
pattern variable corresponding to the result register
for each naive instruction, but the new instruction
above has not been seen before, so the optimizer
must infer and report the pattern variable corre-
sponding to its result register.
The Production Code Generator
To produce the production system, the code generator
generator accepts the trace above and the
specification of the naive code generator. It produces
an optimizing code generator that is like the
naive one presented above, except the opcode list is
extended to include all the new instruction variants
generated during training, optimizing case analysis
is inserted at the head of each case that handles a
target instruction, and the call on combine is omitted.
Here, for example, are the production versions
of the cases presented above:
    case 36: L36: /* IADD */
        rewrite(a->kids[0]);
        rewrite(a->kids[1]);
    case 37: L37: /* addl3 r%1,r%0,r%2 */
        b = a->kids[0];
        if (b->op == 127 /* mull3 $%1,r%0,r%2 */
            && b->vars[0] == CON4) {
            a->kids[0] = b->kids[0];
            goto L93; /* moval (r%1)[r%0],r%2 */
        }
        if ( . . .
        a->op = 37;
        break;
The conditional looks for a sequence that multiplies
a register by four and adds it to another register.
The expression b->vars[0] == CON4 compares
the %1 from mull3 $%1,r%0,r%2 with the
constant string “4”. It uses b->vars[0] because %1
is the first pattern variable of b that requires local
storage. Strings are stored uniquely in a constant
table so that an address comparison can be substituted
for what would otherwise be a character-by-character
comparison. If the conditional succeeds,
the dag is rewritten in place, so the “then” arm
overwrites a’s fields. In this case, the new values
of %1 and %2 are the same as the old ones, so only
the change to %0 requires code, which promotes a
grandchild.
If the conditional fails, the code generator looks
for another pattern, at the point of the ellipsis
above. If no optimization applies, control falls off
the chain of ifs into code that updates a->op and
returns.
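The production case analysis and the unique-string table it relies on can be combined into one compilable miniature. The sketch below is illustrative only: the opcode numbers 36, 37, 93, and 127 follow the listing above, but the REG leaf, the intern function, its fixed-size table, and the way CON4 is initialized are assumptions, and reference counts are left out here.

```c
#include <stddef.h>
#include <string.h>

/* Opcode numbers follow the paper's listing; REG is a made-up leaf. */
enum { REG = 1, IADD = 36, ADDL3 = 37, MOVAL_IDX = 93, MULL3_CON = 127 };

struct node {
    int op;
    struct node *kids[2];
    const char *vars[2];
};

/* Tiny constant table: equal strings always yield the same address. */
#define NSTRINGS 32
static char strings[NSTRINGS][16];
static int nstrings;

static const char *intern(const char *s)
{
    for (int i = 0; i < nstrings; i++)
        if (strcmp(strings[i], s) == 0)
            return strings[i];
    if (nstrings == NSTRINGS || strlen(s) >= sizeof strings[0])
        return NULL;                    /* table full or string too long */
    return strcpy(strings[nstrings++], s);
}

static const char *CON4;                /* caller sets CON4 = intern("4") */

static void rewrite(struct node *a)
{
    struct node *b;

    switch (a->op) {
    case IADD:
        rewrite(a->kids[0]);
        rewrite(a->kids[1]);
        /* fall through */
    case ADDL3:                         /* addl3 r%1,r%0,r%2 */
        b = a->kids[0];
        if (b->op == MULL3_CON && b->vars[0] == CON4) {  /* address compare */
            a->kids[0] = b->kids[0];    /* promote the grandchild */
            goto L93;
        }
        a->op = ADDL3;                  /* no rule applied */
        break;
    case MOVAL_IDX: L93:                /* moval (r%1)[r%0],r%2 */
        a->op = MOVAL_IDX;
        break;
    default:                            /* REG and MULL3_CON: nothing to do */
        break;
    }
}
```

An IADD whose first child multiplies a register by the interned "4" ends up as opcode 93 with the grandchild promoted, while the same tree with any other constant falls off the chain of ifs and keeps opcode 37.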
In the optimization above, the new instruction
costs no more than the one originally pointed to
by a, so the replacement pays off regardless of the
number of uses of b. When the new instruction
costs more than a, the replacement generally pays
off when a + b/n ≥ c, where n is the number of uses
of b, and a, b, and c denote the costs of a, b, and the
new instruction, respectively. All but n are known
when the compiler is generated, so the code generator
generator computes the largest n for which
the replacement pays off and inserts a clause like
b->count <= 2 in the optimization’s enabling condition
(e.g. after the comparison with CON4 above).
Different cost metrics (like space, expected time,
worst-case time) yield different comparands.
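The bound on n follows from rearranging the pay-off condition, read here as a + b/n ≥ c: when c exceeds a, this is the same as n ≤ b/(c - a). The helper below is a hypothetical sketch of that computation with non-negative integer costs; the name max_uses and the convention of returning -1 when no bound is needed are assumptions made for the example.

```c
/* Largest n satisfying a + b/n >= c for non-negative integer costs,
   where a and b are the costs of the parent and child instructions and
   c is the cost of their replacement.
   Returns -1 when the replacement pays off for every n (c <= a), and
   0 when it never pays off. */
static int max_uses(int a, int b, int c)
{
    if (c <= a)
        return -1;          /* no clause on b->count is needed */
    /* a + b/n >= c  <=>  b >= n*(c - a)  <=>  n <= b/(c - a);
       integer division floors because both operands are non-negative. */
    return b / (c - a);
}
```

For example, with costs a = 2, b = 4, and c = 4 the bound is 2, which corresponds to a clause such as b->count <= 2 in the enabling condition.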
To support such comparisons, the code generator
maintains reference counts as it edits the dag.
Consider the example above. It edits the dag so
that a references b->kids[0] instead of b. Thus
it is necessary to decrement b->count. If the result
is zero, then all reference counts are correct:
node b is vanishing, but a inherits b’s references to
its children, so these children have the same number
of references before and after the edit. But if
--b->count exceeds zero, then b is referenced elsewhere.
It still references its children, and now a will
too, so the reference counts for b’s children must be
incremented. Thus the actual then-clause above is
    if (--b->count)
        ++b->kids[0]->count;
    a->kids[0] = b->kids[0];
    goto L93; /* moval (r%1)[r%0],r%2 */
In cases where b points to a leaf, the counts are
maintained with just --b->count. And in cases
where the optimization’s enabling condition establishes
that b->count was one, then even the
--b->count is omitted.
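The reference-count bookkeeping can be checked in isolation. The function below is a small sketch of the then-clause without the jump; the name promote_grandchild and the trimmed-down node with only op, count, and kids are assumptions for the example.

```c
#include <stddef.h>

struct node {
    int op;
    int count;               /* number of references to this node */
    struct node *kids[2];
};

/* Make a reference b's first child instead of b, where b is a->kids[0],
   keeping reference counts consistent. */
static void promote_grandchild(struct node *a)
{
    struct node *b = a->kids[0];

    if (--b->count)              /* b is still referenced elsewhere ...  */
        ++b->kids[0]->count;     /* ... so its child gains a's reference */
    a->kids[0] = b->kids[0];
}
```

When b had a single reference, its count drops to zero and the grandchild's count is unchanged, since a simply takes over b's reference; when b is shared, the grandchild ends up with one more reference than before.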
Node storage is not reclaimed above because even
the simplest implementation consumed almost as
much time as the case analysis itself. The compiler
thus allocates nodes from a fixed pool and then frees
the entire pool at once at the end of the expression,
block, or procedure. (All three of these compilation
units have been used with this system.)
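A fixed pool of this kind needs very little code. The following is one possible sketch, not the compiler's actual allocator: POOLSIZE, newnode, and free_all are hypothetical names, and the node fields are the ones used in the examples above.

```c
#include <stddef.h>

struct node {
    int op;
    int count;
    struct node *kids[2];
    const char *vars[2];
};

#define POOLSIZE 1024            /* assumed capacity */
static struct node pool[POOLSIZE];
static size_t next_free;         /* index of the first unused node */

/* Hand out the next node from the pool, or NULL when it is exhausted. */
static struct node *newnode(int op)
{
    struct node *p;

    if (next_free == POOLSIZE)
        return NULL;
    p = &pool[next_free++];
    p->op = op;
    p->count = 0;
    p->kids[0] = p->kids[1] = NULL;
    p->vars[0] = p->vars[1] = NULL;
    return p;
}

/* Release every node at once, e.g. at the end of an expression,
   block, or procedure. */
static void free_all(void)
{
    next_free = 0;
}
```

Because nodes are never freed individually, the optimizer's dag edits can simply abandon a vanished node; its storage comes back when free_all resets the pool.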
The case analysis above is close to typical. An
“average” one has two comparisons, two assignments,
and a simple --b->count. A few perform
no assignments at all, because all important fields
are already in the right place. Of course, an assignment
to a->op occurs just before control leaves the
switch.
The code generator is fast. a and b are in registers,
so each line above takes just one or two VAX
instructions, and the entire fragment takes just 17.
It has not yet been possible to compile a thorough
testbed, but it appears that a complete rewrite
should not require more than 60kb.
It is also possible to eliminate most of the jumps
above. Rather than ending a change with goto
Ln, the code generator generator could simply place
case n and its code at the point of the goto. Since
most labels are the target of exactly one goto, most
of the branches would vanish. This optimization is
performed by some existing compilers.
Case analysis like that above could be generated
without training on a testbed. The trace encodes
simple peephole optimization rules, and there exist
mechanisms for enumerating such rules without
training on a testbed [6, 7]. These mechanisms are
immune to training failures, which can cause the
production system to emit code that is sub-optimal
(but never incorrect). Experiments have shown that
training failures are rare [3], and training does have
advantages. It allows the production system to test
only rules known to have been useful, and it al-
lows the code generator generator to sort if-then-
else chains so that the most common patterns are
tested first.
The compiler above gets all of its optimizations
from a record of replacements made by a retargetable
peephole optimizer, but it could easily accept
rewriting rules from other sources as well. The
system has already been adapted to accept hand-
written optimization rules, and it is a natural client
for rules discovered by exhaustive enumeration [8].
Discussion
Two emerging compilers use the techniques above.
One uses a modified pcc1 as a front end and has
largely complete back ends for the VAX and the
MC68020. The interface between its front end and
generated code generators is somewhat less efficient
than that shown above. At present, this compiler
runs in about 55% of the time taken by pcc1. The
other compiler uses a new front end and precisely
Citations

Journal ArticleDOI
Cliff Click, Keith D. Cooper
TL;DR: This article presents a framework for combining constant propagation, value numbering, and unreachable-code elimination, and shows how to combine two such frameworks and how to reason about the properties of the resulting framework.
Abstract: Modern optimizing compilers use several passes over a program's intermediate representation to generate good code. Many of these optimizations exhibit a phase-ordering problem. Getting the best code may require iterating optimizations until a fixed point is reached. Combining these phases can lead to the discovery of more facts about the program, exposing more opportunities for optimization. This article presents a framework for describing optimizations. It shows how to combine two such frameworks and how to reason about the properties of the resulting framework. The structure of the framework provides insight into when a combination yields better results. To make the ideas more concrete, this article presents a framework for combining constant propagation, value numbering, and unreachable-code elimination. It is an open question as to what other frameworks can be combined in this way.

164 citations


Cites background from "Automatic generation of fast optimi..."

  • ...These are called peephole optimizations because the compiler looks through a “peephole”, a very small window, into the code [16, 42, 17, 23, 24]....



Journal ArticleDOI
Deborah Whitfield, Mary Lou Soffa
TL;DR: A framework that enables the exploration, both analytically and experimentally, of properties of code-improving transformations and a tool that automatically produces a transformer that implements the transformations specified in Gospel is presented.
Abstract: Although code transformations are routinely applied to improve the performance of programs for both scalar and parallel machines, the properties of code-improving transformations are not well understood. In this article we present a framework that enables the exploration, both analytically and experimentally, of properties of code-improving transformations. The major component of the framework is a specification language, Gospel, for expressing the conditions needed to safely apply a transformation and the actions required to change the code to implement the transformation. The framework includes a technique that facilitates an analytical investigation of code-improving transformations using the Gospel specifications. It also contains a tool, Genesis, that automatically produces a transformer that implements the transformations specified in Gospel. We demonstrate the usefulness of the framework by exploring the enabling and disabling properties of transformations. We first present analytical results on the enabling and disabling properties of a set of code transformations, including both traditional and parallelizing transformations, and then describe experimental results showing the types of transformations and the enabling and disabling interactions actually found in a set of programs.

124 citations


Proceedings ArticleDOI
06 Apr 2008
TL;DR: NOLTIS is a near-optimal, linear time instruction selection algorithm for DAG expressions that is easy to implement, fast, and effective with a demonstrated average code size improvement of 5.1% compared to the traditional tree decomposition and tiling approach.
Abstract: Instruction selection is a key component of code generation. High quality instruction selection is of particular importance in the embedded space where complex instruction sets are common and code size is a prime concern. Although instruction selection on tree expressions is a well understood and easily solved problem, instruction selection on directed acyclic graphs is NP-complete. In this paper we present NOLTIS, a near-optimal, linear time instruction selection algorithm for DAG expressions. NOLTIS is easy to implement, fast, and effective with a demonstrated average code size improvement of 5.1% compared to the traditional tree decomposition and tiling approach.

29 citations


Patent
Sorav Bansal, Alex Aiken
12 Feb 2008
Abstract: An efficient binary translator uses peephole translation rules to directly translate executable code from one instruction set to another. In a preferred embodiment, the translation rules are generated using superoptimization techniques that enable the translator to automatically learn translation rules for translating code from the source to target instruction set architecture.

25 citations


Proceedings ArticleDOI
Christopher W. Fraser
21 Jun 1989
TL;DR: Each specification is compiled into a fast, monolithic C program that accepts dags, annotated with intermediate code, and generates, optimizes, and emits code for the target machine.
Abstract: Each specification is compiled into a fast, monolithic C program that accepts dags (directed acyclic graphs) annotated with intermediate code, and generates, optimizes, and emits code for the target machine. The code generators are used with a front end for ANSI C. The resulting compilers emit code similar to pcc's, but they run about twice as fast. The compilers are in use by small research groups at Bell Labs and Princeton University and by classes at Princeton.

24 citations


Cites background from "Automatic generation of fast optimi..."

  • ...This project grew out of experience with a system that tracked the operation of a high-tech peephole optimizer and generated a hard-coded code generator from the trace [7]....



References

Journal ArticleDOI
Henry Massalin
01 Oct 1987
Abstract: Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the functions. The key idea in the superoptimizer is a probabilistic test that makes exhaustive searches practical for programs of useful size. The search space is defined by the processor's instruction set, which may include the whole set, but it is typically restricted to a subset. By constraining the instructions and observing the effect on the output program, one can gain insight into the design of instruction sets. In addition, superoptimized programs may be used by peephole optimizers to improve the quality of generated code, or by assembly language programmers to improve manually written code.
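The search described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the five-operation, single-register instruction set, the probe inputs, and the bounded verification range are hypothetical and are not taken from the superoptimizer itself. Programs are enumerated in order of increasing length; a few probe inputs cheaply reject most candidates (standing in for the probabilistic test), and survivors are then checked over a larger range.

```python
from itertools import product

# Hypothetical one-register instruction set (not from the paper).
OPS = {
    "neg": lambda r: -r,
    "inc": lambda r: r + 1,
    "dec": lambda r: r - 1,
    "dbl": lambda r: r * 2,
    "not": lambda r: ~r,
}


def run(program, x):
    """Execute a sequence of opcode names on a register initialised to x."""
    r = x
    for op in program:
        r = OPS[op](r)
    return r


def superoptimize(target, max_len=4, probes=(0, 1, -7), domain=range(-64, 64)):
    """Return the first shortest program matching target on domain, or None."""
    for length in range(1, max_len + 1):          # shortest programs first
        for program in product(OPS, repeat=length):
            # Quick filter: most wrong candidates fail on a few probe inputs.
            if any(run(program, x) != target(x) for x in probes):
                continue
            # Slower check over the whole (assumed, bounded) domain.
            if all(run(program, x) == target(x) for x in domain):
                return program
    return None


print(superoptimize(lambda x: 2 * x + 2))
```

Agreement on a bounded domain is not a proof of equivalence, so a result from a search like this would still need independent verification.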

245 citations


Journal ArticleDOI
TL;DR: Experiments indicate that naive code generators can give good code if used with PO, a peephole optimizer that uses a symbolic machine description to simulate pairs of adjacent instructions, replacing them, where possible, with an equivalent single instruction.
Abstract: Peephole optimizers improve object code by replacing certain sequences of instructions with better sequences. This paper describes PO, a peephole optimizer that uses a symbolic machine description to simulate pairs of adjacent instructions, replacing them, where possible, with an equivalent single instruction. As a result of this organization, PO is machine independent and can be described formally and concisely: when PO is finished, no instruction, and no pair of adjacent instructions, can be replaced with a cheaper single instruction that has the same effect. This thoroughness allows PO to relieve code generators of much case analysis; for example, they might produce only load/add-register sequences and rely on PO to, where possible, discard them in favor of add-memory, add-immediate, or increment instructions. Experiments indicate that naive code generators can give good code if used with PO.
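The load/add-register example in this abstract can be sketched as a toy pass over adjacent instruction pairs. This is only an illustration under stated assumptions: the tuple encoding, the opcode names (`loadi`, `load`, `add`, `addi`, `addm`, `inc`), and the rule that registers listed in `temps` are dead after their single use are all hypothetical. In particular, the hand-written rules in `combine` stand in for PO's symbolic machine description, which the sketch does not attempt to model.

```python
# Toy peephole pass; opcodes and the temps-are-dead assumption are hypothetical.

def combine(first, second, temps):
    """Return one instruction equivalent to the adjacent pair, or None."""
    if second[0] != "add" or first[0] not in ("loadi", "load"):
        return None
    _, tmp, operand = first          # load a constant or memory cell into tmp
    _, dst, src = second             # dst = dst + src
    # Safe only when the add consumes the loaded temporary, which then dies.
    if src != tmp or tmp not in temps or dst == tmp:
        return None
    if first[0] == "load":
        return ("addm", dst, operand)        # add-memory
    if operand == 1:
        return ("inc", dst)                  # increment
    return ("addi", dst, operand)            # add-immediate


def peephole(code, temps):
    """Replace adjacent pairs with a single instruction where a rule applies."""
    out = []
    for instr in code:
        merged = combine(out[-1], instr, temps) if out else None
        if merged is not None:
            out[-1] = merged
        else:
            out.append(instr)
    return out


naive = [
    ("loadi", "t1", 1), ("add", "r2", "t1"),
    ("load", "t2", "x"), ("add", "r2", "t2"),
    ("loadi", "r3", 5), ("add", "r2", "r3"),   # r3 is not a temp: kept as is
]
print(peephole(naive, temps={"t1", "t2"}))
```

The last pair is left alone because `r3` is not known to be dead, which is the kind of case analysis a real optimizer would settle from liveness information rather than from a fixed set of temporaries.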

119 citations


"Automatic generation of fast optimi..." refers background in this paper

  • ...The programmer also prepares a machine description for a retargetable peephole optimizer [2]....


  • ...The training code generator has no compiled code to improve these instructions, so their cases break out of the switch and call combine, which is a retargetable peephole optimizer [2]....



Journal ArticleDOI
TL;DR: A classification of automated retargetable code generation techniques and a survey of the work on these techniques is presented.
Abstract: A classification of automated retargetable code generation techniques and a survey of the work on these techniques is presented. Retargetable code generation research is classified into three categories: interpretive code generation, pattern-matched code generation, and table-driven code generation. Interpretive code generation approaches generate code for a virtual machine and then expand into real target code. Pattern-matched code generation approaches separate the machine description from the code generation algorithm. Table-driven code generation approaches employ a formal machine description and use a code-generator generator to produce code generators automatically. An analysis of these techniques and a critique of automatic code generation algorithms are presented.

95 citations


Proceedings ArticleDOI
01 Jul 1986
TL;DR: LR parsers can be made to run 6 to 10 times as fast as the best table-interpretive LR parsers, and a factor of 2 to 4 increase in total table size can be expected, depending upon whether syntactic error recovery is required.
Abstract: LR parsers can be made to run 6 to 10 times as fast as the best table-interpretive LR parsers. The resulting parse time is negligible compared to the time required by the remainder of a typical compiler containing the parser. A parsing speed of 1/2 million lines per minute on a computer similar to a VAX 11/780 was achieved, up from an interpretive speed of 40,000 lines per minute. A speed of 240,000 lines per minute on an Intel 80286 was achieved, up from an interpretive speed of 37,000 lines per minute. The improvement is obtained by translating the parser's finite state control into assembly language. States become code memory addresses. The current input symbol resides in a register and a quick sequence of register-constant comparisons determines the next state, which is merely jumped to. The parser's push-down stack is implemented directly on a hardware stack. The stack contains code memory addresses rather than the traditional state numbers. The strongly-connected components of the directed graph induced by the parser's terminal and nonterminal transitions are examined to determine a typically small subset of the states that require parse-time stack-overflow-check code when hardware does not provide the check automatically. The increase in speed is at the expense of space: a factor of 2 to 4 increase in total table size can be expected, depending upon whether syntactic error recovery is required.
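The contrast between a table-interpretive control and one translated into code can be sketched on something far simpler than an LR parser. The example below is an assumption-laden analogy, not the paper's method: it uses a small finite automaton for optionally signed decimal integers, has no parse stack, and is written in Python rather than assembly. In the second version each state is a function, so the "current state" is a reference to code rather than an index into a table.

```python
DIGITS = "0123456789"

# Table-interpretive version: a generic loop looks up (state, symbol class).
TABLE = {
    ("start", "sign"): "signed",
    ("start", "digit"): "digits",
    ("signed", "digit"): "digits",
    ("digits", "digit"): "digits",
}


def classify(ch):
    """Map a character to the symbol class used as a table key."""
    if ch in DIGITS:
        return "digit"
    return "sign" if ch in "+-" else "other"


def accepts_table(text):
    state = "start"
    for ch in text:
        state = TABLE.get((state, classify(ch)))
        if state is None:
            return False
    return state == "digits"


# Hard-coded version: each state is code that compares the input symbol
# against constants and yields the next state (another piece of code).
def start(ch):
    if ch in DIGITS:
        return digits
    return signed if ch in "+-" else None


def signed(ch):
    return digits if ch in DIGITS else None


def digits(ch):
    return digits if ch in DIGITS else None


def accepts_coded(text):
    state = start
    for ch in text:
        state = state(ch)
        if state is None:
            return False
    return state is digits


print(accepts_table("+42"), accepts_coded("+42"))
```

Python cannot show the speedups reported in the abstract; the sketch only shows the structural change of replacing table lookups with comparisons embedded in per-state code.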

55 citations


"Automatic generation of fast optimi..." refers methods in this paper

  • ...Pennello has described a technique for replacing an LR parsing table and its interpreter with equivalent optimized assembly code [9]....



Journal ArticleDOI
William M. Waite
TL;DR: This paper examines a common design for a lexical analyser and its supporting modules and recommends several specific design and optimization strategies that are also valid for software other than lexical analysers.
Abstract: This paper examines a common design for a lexical analyser and its supporting modules. An implementation of the design was tuned to produce the best possible performance. In effect, many of the optimizations that one would expect of a production-quality compiler were carried out by hand. After measuring the cost of tokenizing two large programs with this version, the code was ‘detuned’ to remove specific optimizations and the measurements were repeated. In all cases, the basic algorithm was unchanged, so that the difference in cost is an indication of the effectiveness of the optimization. Comparisons were also made with a tool-generated lexical analyser for the same task. On the basis of the measurements, several specific design and optimization strategies are recommended. These recommendations are also valid for software other than lexical analysers.

35 citations


Network Information
Performance
Metrics
No. of citations received by the Paper in previous years
Year  Citations
2020  1
2013  1
2011  2
2009  1
2008  3
2007  1