Submitted for confidential review to: The 2008 International Symposium on Code Generation and Optimization
Near-Optimal Instruction Selection on DAGs
Instruction selection is a key component of code generation. High quality instruction selection is of particular
importance in the embedded space where complex instruction sets are common and code size is a prime concern.
Although instruction selection on tree expressions is a well understood and easily solved problem, instruction
selection on directed acyclic graphs is NP-complete. In this paper we present NOLTIS, a near-optimal, linear
time instruction selection algorithm for DAG expressions. NOLTIS is easy to implement, fast, and effective with
demonstrated average code size improvements of 1.48%.
1. Introduction
The instruction selection problem is to find an efficient mapping from the compiler’s target-independent inter-
mediate representation (IR) of a program to a target-specific assembly listing. Instruction selection is particularly
important when targeting architectures with complex instruction sets, such as the Intel x86 architecture. In these
architectures there are typically several possible implementations of the same IR operation, each with different
properties (e.g., on x86 an addition of one can be implemented by an inc, add, or lea instruction). CISC ar-
chitectures are popular in the embedded space as a rich, variable-length instruction set can make more efficient
use of limited memory resources.
Code size, which is often ignored in the workstation space, is an important optimization goal when targeting
embedded processors. Embedded designs often have a small, fixed amount of on-chip memory in which to store and
execute code. A small difference in code size could necessitate a costly redesign. Instruction selection is
an important part of code size optimization since the instruction selector is responsible for effectively exploiting
the complexity of the target instruction set. Ideally, the instruction selector would be able to find the optimal
mapping from IR code to assembly code.
In the most general case, instruction selection is undecidable since an optimal instruction selector could
solve the halting problem (halting side-effect free code would be replaced by a nop and non-halting code by
an empty infinite loop). Because of this, instruction selection is usually defined as finding an optimal tiling of
the intermediate code with a set of predefined machine instruction tiles. Each tile is a mapping from IR code to
assembly code and has an associated cost. An optimal instruction tiling minimizes the total cost of the tiling. If
the IR is a sequence of expression trees, then efficient optimal tiling algorithms exist [3]. However, if a more
expressive directed acyclic graph (DAG) representation [1] is used the problem becomes NP-complete [4, 8, 33].
In this paper we describe NOLTIS, a near-optimal, linear time instruction selection algorithm for expression
DAGs. NOLTIS builds upon existing instruction selection techniques. Empirically it is nearly optimal (an
optimal result is found more than 99% of the time and the non-optimal solutions are very close to optimal). We
show that NOLTIS significantly decreases code size compared to existing heuristics. The primary contribution
of this paper is our near-optimal, linear time DAG tiling algorithm, NOLTIS. In addition, we
• prove that the DAG tiling problem is NP-complete without relying on restrictions such as two-address instructions, register constraints, or tile label matching,
• describe an optimal 0-1 integer programming formulation of the DAG tiling problem,
• and provide an extensive evaluation of our algorithm, as well as an evaluation of other DAG tiling heuristics, including heuristics which first decompose the DAG into trees and then optimally tile the trees.
The remainder of this paper is organized as follows. Section 2 provides additional background and related
work. Section 3 formally defines the problem we solve as well as proves its hardness. Section 4 describes the
NOLTIS algorithm. Section 5 describes a 0-1 integer program formulation of the problem we use to evaluate
the optimality of the NOLTIS algorithm. Section 6 describes our implementation of the algorithm. Section 7
provides detailed empirical comparisons of the NOLTIS algorithm with other techniques. Section 8 discusses
some limitations of our approach and opportunities for future work, and Section 9 provides a summary.
2. Background
The classical approach to instruction selection has been to perform tiling on expression trees. This was initially
done using dynamic programming [3, 36] for a variety of machine models including stack machines, multi-
register machines, infinite register machines, and superscalar machines [7]. These techniques have been further
developed to yield code-generator generators [9, 20] which take a declarative specification of an architecture
and, at compiler-compile time, generate an instruction selector. These code-generator generators either perform
the dynamic programming at compile time [2, 13, 15] or use BURS (bottom-up rewrite system) tree parsing
theory [32, 34] to move the dynamic programming to compiler-compile time [16, 35]. In this paper we describe
the NOLTIS algorithm, which uses an optimal tree matcher to find a near-optimal tiling of an expression DAG.
Although we use a simple compile-time dynamic programming matcher, the NOLTIS algorithm could also
easily use a BURS approach to matching.
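To make the tree-tiling baseline concrete, the sketch below shows a compile-time dynamic-programming tree tiler in the spirit of the matchers cited above. The Node/Tile representation and helper callbacks are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    kind: str                                  # e.g. 'add', 'const', 'reg'
    children: List["Node"] = field(default_factory=list)

@dataclass
class Tile:
    kind: str                                  # node kind required at the tile root
    cost: int
    matches: Callable[[Node], bool]            # extra structural test at the root
    inputs: Callable[[Node], List[Node]]       # subtrees the tile leaves uncovered

def tile_tree(node: Node, tiles: List[Tile], memo=None):
    """Optimal tiling of an expression *tree* by bottom-up dynamic programming.

    The best cost at a node is the cheapest matching tile plus the best costs
    of the subtrees that the tile does not cover (its inputs).
    Returns (total_cost, list_of_chosen_tiles).
    """
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    best = None
    for t in tiles:
        if t.kind != node.kind or not t.matches(node):
            continue
        cost, chosen = t.cost, [t]
        for child in t.inputs(node):
            child_cost, child_tiles = tile_tree(child, tiles, memo)
            cost += child_cost
            chosen = chosen + child_tiles
        if best is None or cost < best[0]:
            best = (cost, chosen)
    if best is None:
        raise ValueError(f"no tile matches a node of kind {node.kind!r}")
    memo[id(node)] = best
    return best
```

A BURS-style generator would precompute the same minimization at compiler-compile time; the compile-time recurrence itself is unchanged.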
Tiling expression DAGs is significantly more difficult than tiling expression trees. DAG tiling has been shown
to be NP-complete for one-register machines [8] and for two-address, infinite register machine models [4]. Two-address
machines have instructions of the form r_i ← r_i op r_j and r_i ← r_j. Since one of the source operands
gets overwritten, the difficulty lies in minimizing the number of moves inserted to prevent values from being
overwritten. Even with infinite registers and simple, single node tiles, the move minimization problem is NP-
complete although approximation algorithms exist [4]. DAG tiling remains difficult on a three-address, infinite
register machine if the exterior tile nodes have labels that must match [33]. These labels may correspond to value
storage locations (e.g. register classes or memory) or to value types. Such labels are unnecessary if instruction
selection is separated from register allocation and if the IR has already fully determined the value types of edges
in the expression DAG. However, we show in Section 3 that the problem remains NP-complete even without
labels.
Although DAG tiling is NP-complete in general, for some tile sets it can be solved in polynomial time [14].
If a tree tiling algorithm is adapted to tile a DAG and a DAG optimal tile set is used to perform the tiling, the
result is an optimal tiling of the DAG. Although the tile sets for several architectures were found to be DAG
optimal in [14], these tile sets used a simple cost model and the DAG optimality of the tile set is not preserved
if a more complex cost model, such as code size, is used. For example, if the tiles in Figure 1 all had unit cost,
they would be DAG optimal, but with the cost metric shown in Figure 1 they are not.
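To see why the cost model matters at shared nodes, consider a toy DAG (not the exact DAG of Figure 1(b)) in which the constant 8 feeds two additions, a = x + 8 and b = y + 8, tiled with the Figure 1(a) tiles:

```python
# Tile costs taken from Figure 1(a)
ADD_CONST_REG = 5   # add const, reg -> out
MOVE_CONST    = 5   # move const -> out
ADD_IN_REG    = 1   # add in, reg -> out

# Option A: fold the shared constant into each addition (2 tiles).
cost_a = 2 * ADD_CONST_REG                 # = 10
# Option B: materialize the constant once, then use register adds (3 tiles).
cost_b = MOVE_CONST + 2 * ADD_IN_REG       # = 7

# With unit costs, A wins (2 tiles vs. 3); with the Figure 1 costs, B wins.
# The best way to tile a shared node thus depends on the cost model, which is
# why DAG optimality of a tile set is not preserved under a code-size model.
print(cost_a, cost_b)                      # 10 7
```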
Traditionally, DAG tiling is performed by using a heuristic to break up the DAG into a forest of expression
trees [5]. More heavyweight solutions, which solve the problem optimally, include using binate covering
[27, 28], using constraint logic programming [26], using integer linear programming [31] or performing
exhaustive search [23]. In addition, we describe a 0-1 integer programming representation of the problem
in Section 5. These techniques all exhibit worst-case exponential behavior. Although these techniques may
be desirable when code quality is of utmost importance and compile-time costs are immaterial, we believe
that our linear time, near-optimal algorithm provides excellent code quality without sacrificing compile-time
performance.
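For reference, one natural 0-1 formulation of the tiling problem (a sketch only; the exact 0-1 program developed in Section 5 may differ in its variables and constraints) introduces a binary variable x_{v,t} for every node v and every tile t that matches at v, writing matches(v) for the set of tiles matching at v and inputs(v, t) for the DAG nodes that must supply tile t's inputs when it is matched at v:

```latex
\begin{align*}
\min \quad & \sum_{v \in V} \sum_{t \in \mathrm{matches}(v)} \mathrm{cost}(t)\, x_{v,t} \\
\text{s.t.} \quad & \sum_{t \in \mathrm{matches}(v)} x_{v,t} \ge 1
    && \forall v \in V \text{ with } \mathrm{indegree}(v) = 0 \\
& x_{v,t} \le \sum_{t' \in \mathrm{matches}(w)} x_{w,t'}
    && \forall v \in V,\ t \in \mathrm{matches}(v),\ w \in \mathrm{inputs}(v, t) \\
& x_{v,t} \in \{0, 1\}
\end{align*}
```

The first constraint forces every DAG root to be covered; the second forces every value a chosen tile consumes to be produced by some tile chosen at the corresponding node.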
An alternative, non-tiling, method of instruction selection, which is better suited for linear, as opposed to
tree-like, IRs, is to incorporate instruction selection into peephole optimization [10, 11, 17, 18, 24]. In peephole
optimization [30], pattern matching transformations are performed over a small window of instructions, the
“peephole.” This window may be either a physical window, where the instructions considered are only those
scheduled next to each other in the current instruction list, or a logical window where the instructions considered
are just those that are data or control related to the instruction currently being scanned. When performing
peephole-based instruction selection, the peepholer simply converts a window of IR operations into target-
specific instructions. If a logical window is being used, then this technique can be considered a heuristic method
for tiling a DAG.
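As a rough sketch of the logical-window idea (illustrative only; the operand and def-use field names are assumptions about the IR, not the paper's):

```python
def logical_window(inst, max_size=3):
    """Collect a window of instructions data-related to `inst`: the
    instruction itself plus the producers of its operands, found by
    following def-use links rather than textual adjacency."""
    window, worklist = [inst], list(inst.operands)
    while worklist and len(window) < max_size:
        value = worklist.pop()
        producer = value.defining_inst        # assumed def-use link in the IR
        if producer is not None and producer not in window:
            window.append(producer)
            worklist.extend(producer.operands)
    return window
```

A peephole-based selector would then try to rewrite each such window as a single target instruction, which on a DAG-shaped IR amounts to a greedy form of tiling.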

[Figure 1. An example of instruction selection on a DAG. (a) The tile set used (commutative tiles are omitted): add in1, in2 → out (cost 1); add const, reg → out (cost 5); move const → out (cost 5); add in, reg → out (cost 1). (b) Two possible tilings of an expression DAG over x, 8, and y. In a simple cost model where every tile has a unit cost the top tiling would be optimal, but with the cost model shown the lower tiling is optimal.]
Instruction selection algorithms have been successfully adapted to solve the technology mapping problem
in the automated circuit design domain [25]. Many domain-specific extensions to the basic tiling algorithm
have been proposed (see [12, 21] for references), but, to the best of our knowledge, all DAG tiling algorithms
proposed in this area have resorted to simple, domain-specific heuristics for decomposing the DAG into trees
before performing the tiling.
3. Problem Description
Given an expression DAG which represents the computation of a basic block and a set of architecture-specific
instruction tiles, we wish to find an optimal tiling of the DAG which corresponds to the minimum cost instruction
sequence. The expression DAG consists of nodes representing operations (such as add or load) and operands
(such as a constant or memory location). We refer to a node with multiple parents as a shared node. The set
of tiles consists of a collection of expression trees each with an assigned cost. If a leaf of an expression tree is
not an operand, it is assumed that the inputs for that leaf node will be available from a register (these are unallocated temporaries, not actual hard registers). Similarly, the
output of the tree is assumed to be written to a register. A tile matches a node in the DAG if the root of the tile is
the same kind of node as the DAG node and the subtrees of the tile recursively match the children of the DAG
node. In order for a tiling to be legal and complete, the inputs of each tile must be available as the outputs of
other tiles in the tiling, and all the root nodes of the DAG (those nodes with zero in-degree) must be matched
to tiles. The optimal tiling is the legal and complete tiling where the sum of the costs of the tiles is minimized.
More formally, we define an optimal instruction tiling as follows:
Definition. Let K be a set of node kinds; G = (V, E) be a directed acyclic graph where each node v ∈ V has
a kind k(v) ∈ K, a set of children ch(v) ∈ 2^V such that ∀c ∈ ch(v), (v → c) ∈ E, and a unique ordering of its
children nodes o_v : ch(v) → {1, 2, ..., |ch(v)|}; T be a set of tree tiles t_i = (V_i, E_i) where similarly every node
v_i ∈ V_i has a kind k(v_i) ∈ K ∪ {◦} such that k(v_i) = ◦ implies outdegree(v_i) = 0 (nodes with kind ◦ denote
the edge of a tile and, instead of corresponding to an operation or operand, serve to link tiles together), children
nodes ch(v_i) ∈ 2^{V_i}, and an ordering o_{v_i}; and cost : T → Z+ be a cost function which assigns a cost to each
tree tile. We say a node v ∈ V matches tree t_i with root r ∈ V_i iff k(v) = k(r), |ch(v)| = |ch(r)|, and, for all
c ∈ ch(v) and c_i ∈ ch(r), o_v(c) = o_r(c_i) implies that either k(c_i) = ◦ or c matches the tree rooted at c_i. For a
given matching of v and t_i and a tree tile node v_i ∈ V_i, we define m_{v,t_i} : V_i → V to return the node in V which
matches with the subtree rooted at v_i. A mapping f : V → 2^T from each DAG node to a set of tree tiles is legal
iff ∀v ∈ V:

    t_i ∈ f(v) ⟹ v matches t_i
    indegree(v) = 0 ⟹ |f(v)| > 0
    ∀t_i ∈ f(v), ∀v_i ∈ t_i : k(v_i) = ◦ ⟹ |f(m_{v,t_i}(v_i))| > 0

An optimal instruction tiling is a legal mapping f which minimizes

    ∑_{v ∈ V} ∑_{t_i ∈ f(v)} cost(t_i)
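Read operationally, the legality conditions and the objective can be checked directly. The following sketch assumes a dictionary-based representation of the mapping f and helper callbacks; the names are illustrative, not taken from the paper.

```python
def is_legal(f, dag_roots, matches, boundary_nodes):
    """f maps every DAG node to the (possibly empty) set of tiles chosen at it.

    matches(v, t)        -- does tile t match the DAG rooted at v?
    boundary_nodes(v, t) -- the DAG nodes that t's boundary leaves map to,
                            i.e. m_{v,t}(v_i) for each v_i with kind '◦'
    dag_roots            -- nodes of the DAG with indegree 0
    """
    for v, chosen in f.items():
        for t in chosen:
            if not matches(v, t):                 # condition 1: tile must match
                return False
            for w in boundary_nodes(v, t):        # condition 3: inputs produced
                if not f.get(w):
                    return False
    return all(f.get(r) for r in dag_roots)       # condition 2: roots covered

def tiling_cost(f, cost):
    """The objective: the summed cost of every chosen (node, tile) pair."""
    return sum(cost(t) for chosen in f.values() for t in chosen)
```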
In some versions of the instruction tiling problem, the name of the storage location a tile writes or reads
is important. For example, some tiles might write to memory or read from a specific register class. In this
case, there is an additional constraint that a tile’s inputs must not only match with other tiles’ outputs, but the
names of the respective input and output must also match. In practice, if instruction selection is performed

Frequently Asked Questions (11)
Q1. What have the authors contributed in "Near-optimal instruction selection on dags"?

In this paper the authors present NOLTIS, a near-optimal, linear time instruction selection algorithm for DAG expressions. 

Although the NOLTIS algorithm is linear in the size of the program, its running time is largely determined by how efficiently the matching of a single node to a set of tiles can be performed. 

In order to solve the nearly half million tiling problems, the authors utilized a cluster of Pentium 4 machines ranging in speed from 2.8 GHz to 3.0 GHz.

A scheduling pass, which converts the code from DAG form into an assembly listing, attempts to minimize the register pressure of the schedule using Sethi-Ullman numbering [36] (a brief sketch of this numbering follows this list).

In order to establish the near-optimality of their algorithm, the authors formulate the instruction tiling problem as a 0-1 integer program which can be solved to optimality using a commercial solver. 

Given a Boolean expression consisting of variables u ∈ U and Boolean connectives {∨,∧,¬}, the authors construct an instance of the optimal instruction tiling problem as follows: 

Three benchmarks, 400.perlbench, 453.povray, and 471.omnetpp, do not execute properly due to issues unrelated to the instruction selector.

The second pass of dynamic programming could be made more efficient by intelligently recomputing only portions of the DAG.
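The Sethi-Ullman numbering mentioned earlier in this list labels each node of an expression tree with the minimum number of registers needed to evaluate it without spilling; scheduling the register-hungrier child first achieves that minimum. A minimal sketch for expression trees follows (illustrative only, assuming a simple Node with a children list; the paper's scheduler works on DAGs, so its actual pass is necessarily more involved):

```python
def sethi_ullman(node):
    """Minimum registers needed to evaluate an expression-tree node."""
    if not node.children:                    # leaf: one register to hold it
        return 1
    needs = sorted((sethi_ullman(c) for c in node.children), reverse=True)
    # Children are evaluated in order of decreasing need; each earlier result
    # occupies one register while the later children are evaluated.
    return max(n + i for i, n in enumerate(needs))

def schedule(node, order=None):
    """Postorder schedule that visits the register-hungrier child first."""
    order = [] if order is None else order
    for child in sorted(node.children, key=sethi_ullman, reverse=True):
        schedule(child, order)
    order.append(node)
    return order
```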