Submitted for confidential review to: The 2008 International Symposium on Code Generation and Optimization
Near-Optimal Instruction Selection on DAGs
Instruction selection is a key component of code generation. High quality instruction selection is of particular
importance in the embedded space where complex instruction sets are common and code size is a prime concern.
Although instruction selection on tree expressions is a well understood and easily solved problem, instruction
selection on directed acyclic graphs is NP-complete. In this paper we present NOLTIS, a near-optimal, linear
time instruction selection algorithm for DAG expressions. NOLTIS is easy to implement, fast, and effective with
demonstrated average code size improvements of 1.48%.
1. Introduction
The instruction selection problem is to find an efficient mapping from the compiler’s target-independent inter-
mediate representation (IR) of a program to a target-specific assembly listing. Instruction selection is particularly
important when targeting architectures with complex instruction sets, such as the Intel x86 architecture. In these
architectures there are typically several possible implementations of the same IR operation, each with different
properties (e.g., on x86 an addition of one can be implemented by an inc, add, or lea instruction). CISC ar-
chitectures are popular in the embedded space as a rich, variable-length instruction set can make more efficient
use of limited memory resources.
Code size, which is often ignored in the workstation space, is an important optimization goal when targeting
embedded processors. Embedded designs often have a small, fixed amount of on-chip memory in which to store and
execute code. A small difference in code size could necessitate a costly redesign. Instruction selection is
an important part of code size optimization since the instruction selector is responsible for effectively exploiting
the complexity of the target instruction set. Ideally, the instruction selector would be able to find the optimal
mapping from IR code to assembly code.
In the most general case, instruction selection is undecidable since an optimal instruction selector could
solve the halting problem (halting side-effect free code would be replaced by a nop and non-halting code by
an empty infinite loop). Because of this, instruction selection is usually defined as finding an optimal tiling of
the intermediate code with a set of predefined machine instruction tiles. Each tile is a mapping from IR code to
assembly code and has an associated cost. An optimal instruction tiling minimizes the total cost of the tiling. If
the IR is a sequence of expression trees, then efficient optimal tiling algorithms exist [3]. However, if a more
expressive directed acyclic graph (DAG) representation [1] is used the problem becomes NP-complete [4, 8, 33].
In this paper we describe NOLTIS, a near-optimal, linear time instruction selection algorithm for expression
DAGs. NOLTIS builds upon existing instruction selection techniques. Empirically it is nearly optimal (an
optimal result is found more than 99% of the time and the non-optimal solutions are very close to optimal). We
show that NOLTIS significantly decreases code size compared to existing heuristics. The primary contribution
of this paper is our near-optimal, linear time DAG tiling algorithm, NOLTIS. In addition, we
• prove that the DAG tiling problem is NP-complete without relying on restrictions such as two-address instructions, register constraints, or tile label matching,
• describe an optimal 0-1 integer programming formulation of the DAG tiling problem,
• and provide an extensive evaluation of our algorithm, as well as an evaluation of other DAG tiling heuristics, including heuristics which first decompose the DAG into trees and then optimally tile the trees.
The remainder of this paper is organized as follows. Section 2 provides additional background and related
work. Section 3 formally defines the problem we solve as well as proves its hardness. Section 4 describes the
NOLTIS algorithm. Section 5 describes a 0-1 integer program formulation of the problem we use to evaluate
the optimality of the NOLTIS algorithm. Section 6 describes our implementation of the algorithm. Section 7
provides detailed empirical comparisons of the NOLTIS algorithm with other techniques. Section 8 discusses
some limitations of our approach and opportunities for future work, and Section 9 provides a summary.
2. Background
The classical approach to instruction selection has been to perform tiling on expression trees. This was initially
done using dynamic programming [3, 36] for a variety of machine models including stack machines, multi-
register machines, infinite register machines, and superscalar machines [7]. These techniques have been further
developed to yield code-generator generators [9, 20] which take a declarative specification of an architecture
and, at compiler-compile time, generate an instruction selector. These code-generator generators either perform
the dynamic programming at compile time [2, 13, 15] or use BURS (bottom-up rewrite system) tree parsing
theory [32, 34] to move the dynamic programming to compiler-compile time [16, 35]. In this paper we describe
the NOLTIS algorithm, which uses an optimal tree matcher to find a near-optimal tiling of an expression DAG.
Although we use a simple compile-time dynamic programming matcher, the NOLTIS algorithm could also
easily use a BURS approach to matching.
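To make the tree-tiling baseline concrete, the sketch below shows a compile-time dynamic-programming tree tiler in the spirit of the matchers cited above. The Node/Tile representation and helper callbacks are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    kind: str                                  # e.g. 'add', 'const', 'reg'
    children: List["Node"] = field(default_factory=list)

@dataclass
class Tile:
    kind: str                                  # node kind required at the tile root
    cost: int
    matches: Callable[[Node], bool]            # extra structural test at the root
    inputs: Callable[[Node], List[Node]]       # subtrees the tile leaves uncovered

def tile_tree(node: Node, tiles: List[Tile], memo=None):
    """Optimal tiling of an expression *tree* by bottom-up dynamic programming.

    The best cost at a node is the cheapest matching tile plus the best costs
    of the subtrees that the tile does not cover (its inputs).
    Returns (total_cost, list_of_chosen_tiles).
    """
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    best = None
    for t in tiles:
        if t.kind != node.kind or not t.matches(node):
            continue
        cost, chosen = t.cost, [t]
        for child in t.inputs(node):
            child_cost, child_tiles = tile_tree(child, tiles, memo)
            cost += child_cost
            chosen = chosen + child_tiles
        if best is None or cost < best[0]:
            best = (cost, chosen)
    if best is None:
        raise ValueError(f"no tile matches a node of kind {node.kind!r}")
    memo[id(node)] = best
    return best
```

A BURS-style generator would precompute the same minimization at compiler-compile time; the compile-time recurrence itself is unchanged.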
Tiling expression DAGs is significantly more difficult than tiling expression trees. DAG tiling has been shown
to be NP-complete for one-register machines [8] and for two-address, infinite register machine models [4]. Two-address
machines have instructions of the form r_i ← r_i op r_j and r_i ← r_j. Since one of the source operands
gets overwritten, the difficulty lies in minimizing the number of moves inserted to prevent values from being
overwritten. Even with infinite registers and simple, single node tiles, the move minimization problem is NP-
complete although approximation algorithms exist [4]. DAG tiling remains difficult on a three-address, infinite
register machine if the exterior tile nodes have labels that must match [33]. These labels may correspond to value
storage locations (e.g. register classes or memory) or to value types. Such labels are unnecessary if instruction
selection is separated from register allocation and if the IR has already fully determined the value types of edges
in the expression DAG. However, we show in Section 3 that the problem remains NP-complete even without
labels.
Although DAG tiling is NP-complete in general, for some tile sets it can be solved in polynomial time [14].
If a tree tiling algorithm is adapted to tile a DAG and a DAG optimal tile set is used to perform the tiling, the
result is an optimal tiling of the DAG. Although the tile sets for several architectures were found to be DAG
optimal in [14], these tile sets used a simple cost model and the DAG optimality of the tile set is not preserved
if a more complex cost model, such as code size, is used. For example, if the tiles in Figure 1 all had unit cost,
they would be DAG optimal, but with the cost metric shown in Figure 1 they are not.
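To see why the cost model matters at shared nodes, consider a toy DAG (not the exact DAG of Figure 1(b)) in which the constant 8 feeds two additions, a = x + 8 and b = y + 8, tiled with the Figure 1(a) tiles:

```python
# Tile costs taken from Figure 1(a)
ADD_CONST_REG = 5   # add const, reg -> out
MOVE_CONST    = 5   # move const -> out
ADD_IN_REG    = 1   # add in, reg -> out

# Option A: fold the shared constant into each addition (2 tiles).
cost_a = 2 * ADD_CONST_REG                 # = 10
# Option B: materialize the constant once, then use register adds (3 tiles).
cost_b = MOVE_CONST + 2 * ADD_IN_REG       # = 7

# With unit costs, A wins (2 tiles vs. 3); with the Figure 1 costs, B wins.
# The best way to tile a shared node thus depends on the cost model, which is
# why DAG optimality of a tile set is not preserved under a code-size model.
print(cost_a, cost_b)                      # 10 7
```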
Traditionally, DAG tiling is performed by using a heuristic to break up the DAG into a forest of expression
trees [5]. More heavyweight solutions, which solve the problem optimally, include using binate covering
[27, 28], using constraint logic programming [26], using integer linear programming [31] or performing
exhaustive search [23]. In addition, we describe a 0-1 integer programming representation of the problem
in Section 5. These techniques all exhibit worst-case exponential behavior. Although these techniques may
be desirable when code quality is of utmost importance and compile-time costs are immaterial, we believe
that our linear time, near-optimal algorithm provides excellent code quality without sacrificing compile-time
performance.
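For reference, one natural 0-1 formulation of the tiling problem (a sketch only; the exact 0-1 program developed in Section 5 may differ in its variables and constraints) introduces a binary variable x_{v,t} for every node v and every tile t that matches at v, writing matches(v) for the set of tiles matching at v and inputs(v, t) for the DAG nodes that must supply tile t's inputs when it is matched at v:

```latex
\begin{align*}
\min \quad & \sum_{v \in V} \sum_{t \in \mathrm{matches}(v)} \mathrm{cost}(t)\, x_{v,t} \\
\text{s.t.} \quad & \sum_{t \in \mathrm{matches}(v)} x_{v,t} \ge 1
    && \forall v \in V \text{ with } \mathrm{indegree}(v) = 0 \\
& x_{v,t} \le \sum_{t' \in \mathrm{matches}(w)} x_{w,t'}
    && \forall v \in V,\ t \in \mathrm{matches}(v),\ w \in \mathrm{inputs}(v, t) \\
& x_{v,t} \in \{0, 1\}
\end{align*}
```

The first constraint forces every DAG root to be covered; the second forces every value a chosen tile consumes to be produced by some tile chosen at the corresponding node.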
An alternative, non-tiling, method of instruction selection, which is better suited for linear, as opposed to
tree-like, IRs, is to incorporate instruction selection into peephole optimization [10, 11, 17, 18, 24]. In peephole
optimization [30], pattern matching transformations are performed over a small window of instructions, the
“peephole.” This window may be either a physical window, where the instructions considered are only those
scheduled next to each other in the current instruction list, or a logical window where the instructions considered
are just those that are data or control related to the instruction currently being scanned. When performing
peephole-based instruction selection, the peepholer simply converts a window of IR operations into target-
specific instructions. If a logical window is being used, then this technique can be considered a heuristic method
for tiling a DAG.
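As a rough sketch of the logical-window idea (illustrative only; the operand and def-use field names are assumptions about the IR, not the paper's):

```python
def logical_window(inst, max_size=3):
    """Collect a window of instructions data-related to `inst`: the
    instruction itself plus the producers of its operands, found by
    following def-use links rather than textual adjacency."""
    window, worklist = [inst], list(inst.operands)
    while worklist and len(window) < max_size:
        value = worklist.pop()
        producer = value.defining_inst        # assumed def-use link in the IR
        if producer is not None and producer not in window:
            window.append(producer)
            worklist.extend(producer.operands)
    return window
```

A peephole-based selector would then try to rewrite each such window as a single target instruction, which on a DAG-shaped IR amounts to a greedy form of tiling.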

[Figure 1. An example of instruction selection on a DAG. (a) The tile set used (commutative tiles are omitted): add in1, in2 → out (cost 1); add const, reg → out (cost 5); move const → out (cost 5); add in, reg → out (cost 1). (b) Two possible tilings of an expression DAG over x, 8, and y. In a simple cost model where every tile has a unit cost the top tiling would be optimal, but with the cost model shown the lower tiling is optimal.]
Instruction selection algorithms have been successfully adapted to solve the technology mapping problem
in the automated circuit design domain [25]. Many domain-specific extensions to the basic tiling algorithm
have been proposed (see [12, 21] for references), but, to the best of our knowledge, all DAG tiling algorithms
proposed in this area have resorted to simple, domain-specific heuristics for decomposing the DAG into trees
before performing the tiling.
3. Problem Description
Given an expression DAG which represents the computation of a basic block and a set of architecture-specific
instruction tiles, we wish to find an optimal tiling of the DAG which corresponds to the minimum cost instruction
sequence. The expression DAG consists of nodes representing operations (such as add or load) and operands
(such as a constant or memory location). We refer to a node with multiple parents as a shared node. The set
of tiles consists of a collection of expression trees each with an assigned cost. If a leaf of an expression tree is
not an operand, it is assumed that the inputs for that leaf node will be available from a register (these are unallocated temporaries, not actual hard registers). Similarly, the
output of the tree is assumed to be written to a register. A tile matches a node in the DAG if the root of the tile is
the same kind of node as the DAG node and the subtrees of the tile recursively match the children of the DAG
node. In order for a tiling to be legal and complete, the inputs of each tile must be available as the outputs of
other tiles in the tiling, and all the root nodes of the DAG (those nodes with zero in-degree) must be matched
to tiles. The optimal tiling is the legal and complete tiling where the sum of the costs of the tiles is minimized.
More formally, we define an optimal instruction tiling as follows:
Definition. Let K be a set of node kinds; G = (V, E) be a directed acyclic graph where each node v ∈ V has
a kind k(v) ∈ K, a set of children ch(v) ∈ 2^V such that ∀c ∈ ch(v), (v → c) ∈ E, and a unique ordering of its
children nodes o_v : ch(v) → {1, 2, ..., |ch(v)|}; T be a set of tree tiles t_i = (V_i, E_i) where similarly every node
v_i ∈ V_i has a kind k(v_i) ∈ K ∪ {◦} such that k(v_i) = ◦ implies outdegree(v_i) = 0 (nodes with kind ◦ denote
the edge of a tile and, instead of corresponding to an operation or operand, serve to link tiles together), children
nodes ch(v_i) ∈ 2^{V_i}, and an ordering o_{v_i}; and cost : T → Z+ be a cost function which assigns a cost to each
tree tile. We say a node v ∈ V matches tree t_i with root r ∈ V_i iff k(v) = k(r), |ch(v)| = |ch(r)|, and, for all
c ∈ ch(v) and c_i ∈ ch(r), o_v(c) = o_r(c_i) implies that either k(c_i) = ◦ or c matches the tree rooted at c_i. For a
given matching of v and t_i and a tree tile node v_i ∈ V_i, we define m_{v,t_i} : V_i → V to return the node in V which
matches with the subtree rooted at v_i. A mapping f : V → 2^T from each DAG node to a set of tree tiles is legal
iff ∀v ∈ V:

    t_i ∈ f(v) ⟹ v matches t_i
    indegree(v) = 0 ⟹ |f(v)| > 0
    ∀t_i ∈ f(v), ∀v_i ∈ t_i : k(v_i) = ◦ ⟹ |f(m_{v,t_i}(v_i))| > 0

An optimal instruction tiling is a legal mapping f which minimizes

    ∑_{v ∈ V} ∑_{t_i ∈ f(v)} cost(t_i)
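Read operationally, the legality conditions and the objective can be checked directly. The following sketch assumes a dictionary-based representation of the mapping f and helper callbacks; the names are illustrative, not taken from the paper.

```python
def is_legal(f, dag_roots, matches, boundary_nodes):
    """f maps every DAG node to the (possibly empty) set of tiles chosen at it.

    matches(v, t)        -- does tile t match the DAG rooted at v?
    boundary_nodes(v, t) -- the DAG nodes that t's boundary leaves map to,
                            i.e. m_{v,t}(v_i) for each v_i with kind '◦'
    dag_roots            -- nodes of the DAG with indegree 0
    """
    for v, chosen in f.items():
        for t in chosen:
            if not matches(v, t):                 # condition 1: tile must match
                return False
            for w in boundary_nodes(v, t):        # condition 3: inputs produced
                if not f.get(w):
                    return False
    return all(f.get(r) for r in dag_roots)       # condition 2: roots covered

def tiling_cost(f, cost):
    """The objective: the summed cost of every chosen (node, tile) pair."""
    return sum(cost(t) for chosen in f.values() for t in chosen)
```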
In some versions of the instruction tiling problem, the name of the storage location a tile writes or reads
is important. For example, some tiles might write to memory or read from a specific register class. In this
case, there is an additional constraint that a tile’s inputs must not only match with other tiles’ outputs, but the
names of the respective input and output must also match. In practice, if instruction selection is performed

Frequently Asked Questions (11)
Q1. What have the authors contributed in "Near-optimal instruction selection on dags"?

In this paper the authors present NOLTIS, a near-optimal, linear time instruction selection algorithm for DAG expressions. 

Although the NOLTIS algorithm is linear in the size of the program, its running time is largely determined by how efficiently the matching of a single node to a set of tiles can be performed. 

In order to solve the nearly half million tiling problems, the authors utilized a cluster of Pentium 4 machines ranging in speed from 2.8 GHz to 3.0 GHz.

A scheduling pass, which converts the code from DAG form into an assembly listing, attempts to minimize the register pressure of the schedule using Sethi-Ullman numbering [36] (a brief sketch of this numbering follows this list).

In order to establish the near-optimality of their algorithm, the authors formulate the instruction tiling problem as a 0-1 integer program which can be solved to optimality using a commercial solver. 

Given a Boolean expression consisting of variables u ∈ U and Boolean connectives {∨,∧,¬}, the authors construct an instance of the optimal instruction tiling problem as follows: 

Three benchmarks, 400.perlbench, 453.povray, and 471.omnetpp, do not execute properly due to issues unrelated to the instruction selector.

The second pass of dynamic programming could be made more efficient by intelligently recomputing only portions of the DAG.
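The Sethi-Ullman numbering mentioned earlier in this list labels each node of an expression tree with the minimum number of registers needed to evaluate it without spilling; scheduling the register-hungrier child first achieves that minimum. A minimal sketch for expression trees follows (illustrative only, assuming a simple Node with a children list; the paper's scheduler works on DAGs, so its actual pass is necessarily more involved):

```python
def sethi_ullman(node):
    """Minimum registers needed to evaluate an expression-tree node."""
    if not node.children:                    # leaf: one register to hold it
        return 1
    needs = sorted((sethi_ullman(c) for c in node.children), reverse=True)
    # Children are evaluated in order of decreasing need; each earlier result
    # occupies one register while the later children are evaluated.
    return max(n + i for i, n in enumerate(needs))

def schedule(node, order=None):
    """Postorder schedule that visits the register-hungrier child first."""
    order = [] if order is None else order
    for child in sorted(node.children, key=sethi_ullman, reverse=True):
        schedule(child, order)
    order.append(node)
    return order
```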