Proceedings ArticleDOI

Minimization of Memory Traffic in High-Level Synthesis

TL;DR: A new transformation for the scheduling of memory-access operations in High-Level Synthesis that removes load and store instructions which become redundant or unnecessary during the transformation of loops, resulting in a significant reduction in the number of memory operations.
Abstract: This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are commonly observed in video compression, image convolution, hydrodynamics and mechatronics. Our transformation removes load and store instructions which become redundant or unnecessary during the transformation of loops. The advantage of this reduction is the decrease of secondary memory bandwidth demands. This technique is implemented in our Percolation-Based Scheduler which we used to conduct experiments on core numerical benchmarks. Our results demonstrate a significant reduction in the number of memory operations and an increase in performance on these benchmarks.

Summary (3 min read)

1 Introduction

  • Traditionally, one of the goals in High-Level Synthesis (HLS) is the minimization of storage requirements for synthesized designs [5, 7, 15, 19].
  • Examples of such behaviors are abundant in video compression, image convolution, speech recognition, hydrodynamics and mechatronics.
  • The authors' strategy for optimizing memory access is to eliminate the redundancy found in memory interaction when scheduling memory operations.
  • The authors' transformation has many significant benefits.

2 Program Model

  • In their model, a program is represented by a control data-flow graph where each node corresponds to operations performed in one time step and the edges between nodes represent flow-of-control.
  • Parallelizing a program involves the compaction of multiple operations into one node (subject to resource availability).
  • For the purpose of dependency analysis on memory operations, each memory operation carries a symbolic expression, a string that formulates the indexing function without iv's.
  • The result of variable folding can be a constant or another variable (which is recursively folded, until the beginning of the loop is reached).

3 Memory Disambiguation

  • Memory disambiguation is the ability to determine if two memory access instructions are aliases for the same location [1].
  • Next, operation two's symbolic expression is subtracted from operation one's.
  • If the resultant expression is not linear then the disambiguator returns CANT TELL; otherwise the gcd of the coefficients of the equation is solved for (a worked instance follows this list).
  • If the disambiguator cannot determine that two memory operations refer to the same location, the authors follow the conservative approach that there is a dependence between them (i.e., no optimization can be done).
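A small worked instance of the GCD test (our example, not one from the paper): for accesses a[2i+1] and a[4j], the difference equation is (2i + 1) - 4j = 0, i.e. 2i - 4j = -1. Since gcd(2, 4) = 2 does not divide -1, the equation has no integer solution, so the two accesses can never refer to the same location.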

4 Reducing Memory Traffic

  • The authors' solution to reducing the amount of memory traffic in HLS is to make explicit the redundancy in memory interaction within the behavior and eliminate those extraneous operations.
  • The authors' technique is employed during scheduling rather than as a pre-pass or post-pass phase: a pre-pass phase may not remove all redundancy, since other optimizations can create opportunities that would not otherwise have existed, while a post-pass phase cannot derive as compact a schedule, since operations eliminated on the critical path allow further schedule refinement.

4.1 Algorithm in Detail

  • Fig. 4 shows the main algorithm for removing unnecessary memory operations.
  • This function is invoked in their Percolation-Based Scheduler [17] by the move_op transform (or any suitable local code motion routine in other systems) when moving a memory operation into a previous step that contains other memory operations.
  • If the operation is loop-invariant, the function remove_inv_mem_op tries to remove it.
  • If two operations refer to the same location then the appropriate action is taken depending upon their types.
  • The load-after-load and load-after-store cases will be discussed shortly.

4.1.1 Removing Invariants

  • Removing invariant memory operations is slightly different from general loop invariant removal.
  • Traditional loop invariant removal moves an invariant into a pre-loop time step.
  • Conceptually, invariant loads are "inputs" to the loop, while invariant stores are "outputs".
  • Therefore, loads must be placed in pre-loop steps and stores must be placed in loop exit steps.
  • If the hoisting conditions are met, then the operation can be hoisted out of the loop.

4.1.2 Load-After-Load Optimization

  • The load-after-load optimization is applied in situations where a load operation accesses a memory value that has been previously loaded and no intervening modification has occurred to that location's value (i.e., there is no intermittent store).
  • As a matter of correctness, move operations are only inserted into the nodes in mem_op's latency field if op's definition reaches those nodes.
  • The number of registers used does not increase (a proof appears in [9]).

4.1.3 Load-After-Store Optimization

  • The load-after-store optimization is used to remove a load operation which accesses a value that a store operation previously wrote to the memory.
  • The precise case when the load-after-store optimization fails to remove a redundant load is composed of three conditions: (1) a move in this step results in a read-wrong; (2) a move in the previous step results in a write-live; (3) no free storage cell exists in the previous time step.
  • In practice, this situation occurs very infrequently.
  • The load-after-store algorithm determines in which step to place a move operation.

4.2 Example

  • Applying their transformation to the behavior in the introduction will eliminate the loads of b[i][j-1] and b[i][j-2] and the invariant load and store of a[i].
  • For the invariants, the store cannot be removed unless the load is also removed.
  • Once the load is hoisted, the store can then be hoisted as well.

5 Experiments and Results

  • Four memory-intensive benchmarks were used to study their transformation: three numerical algorithms (prefix sums, tri-diagonal elimination and general linear recurrence equations), which are core routines in many algorithms (as discussed in the introduction), adapted from [10], and a two-dimensional hydrodynamics implicit computation adapted from [20].
  • In the first experiment, schedules were generated with the number of memory ports constrained between one and four and no functional unit (FU) constraints.
  • The results of experiments one and two are found in Tables 1 and 2, respectively.
  • When a load operation is removed, the dependent operations can move to earlier time steps.

6 Conclusion

  • In this paper the authors have presented a new local scheduler transformation which optimizes the accessing of a secondary memory, thereby reducing memory traffic.
  • This method is based on the redundancy found in memory access instructions both within and across iterations of a loop.
  • The authors have presented their algorithm in detail and provided results of its application to several benchmarks which demonstrate the utility and power of this memory minimization transformation.
  • The authors believe that this transformation, when used in conjunction with other traditional HLS transformations, should yield better designs for memory-intensive applications.


Minimization of Memory Traffic in High-Level Synthesis
David J. Kolson Alexandru Nicolau Nikil Dutt
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92717-3425
Abstract

This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are commonly observed in video compression, image convolution, hydrodynamics and mechatronics. Our transformation removes load and store instructions which become redundant or unnecessary during the transformation of loops. The advantage of this reduction is the decrease of secondary memory bandwidth demands. This technique is implemented in our Percolation-Based Scheduler which we used to conduct experiments on core numerical benchmarks. Our results demonstrate a significant reduction in the number of memory operations and an increase in performance on these benchmarks.
1 Introduction

Traditionally, one of the goals in High-Level Synthesis (HLS) is the minimization of storage requirements for synthesized designs [5, 7, 15, 19]. As the focus of HLS shifts towards the synthesis of designs for inherently memory-intensive behaviors [6, 8, 14, 16, 18], memory optimization becomes crucial to obtaining acceptable performance. Examples of such behaviors are abundant in video compression, image convolution, speech recognition, hydrodynamics and mechatronics. The memory-intensive nature of these behaviors necessitates the use of a secondary store (e.g., a memory system), since a sufficiently large primary store (e.g., register storage) would be impractical. This memory is explicitly addressed in a synthesized system by memory operations containing indexing functions. However, due to bottlenecks in the access of memory systems, memory accessing operations must be effectively scheduled so as to improve performance.

(This work was supported in part by NSF grant CCR8704367, ONR grant N0001486K0215 and a UCI Faculty Research Grant.)

Our strategy for optimizing memory access is to eliminate the redundancy found in memory interaction when scheduling memory operations. Such redundancies can be found within loop iterations, possibly over multiple paths, as well as across loop iterations. During loop pipelining [17, 13] redundancy is exhibited when values loaded from and/or stored to the memory in one iteration are loaded from and/or stored to the memory in future iterations. For example, consider the behavior:
for i = 1 to N
    for j = 1 to N
        a[i] := a[i] + 1/2 * (b[i][j-1] + b[i][j-2])
        b[i][j] := F(b[i][j])
    end
end
The inner loop would normally require four load and two store instructions per iteration. However, after application of our transformation, the inner loop contains only one load and one store.
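To make that count concrete, the following sketch (our Python reconstruction of the stated effect, not a figure from the paper; the helper names are ours and the first inner iterations are elided for clarity) shows the loop once the reused values are carried in locals and the invariant access to a[i] is hoisted and sunk:

def transformed(a, b, N, F):
    # After the transformation: per inner iteration, the only memory
    # traffic is one load of b[i][j] and one store to b[i][j].
    for i in range(1, N + 1):
        ra = a[i]                   # invariant load, hoisted to a pre-loop step
        r2 = b[i][0]                # carries the b[i][j-2] value
        r1 = b[i][1]                # carries the b[i][j-1] value
        for j in range(2, N + 1):   # boundary iterations elided for clarity
            t = b[i][j]             # the single remaining load
            ra = ra + 0.5 * (r1 + r2)
            v = F(t)
            b[i][j] = v             # the single remaining store
            r2, r1 = r1, v          # values reused across iterations
        a[i] = ra                   # invariant store, sunk to a loop-exit step

Note that b[i][j-1] in iteration j+1 is exactly the value F(b[i][j]) just stored, which is why r1 is refreshed from v rather than re-loaded.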
Previous work in reducing memory accessing balances load operations with computation [3]. However, their algorithm only removes redundant loads and only deals with single-dimensional arrays and single flow-of-control. In [6] a model for access dependencies is used to optimize memory system organization. In [16] background memory management is discussed, but no details of an algorithm are present; therefore, it is not clear what approach is taken in determining redundancy removal, nor the general applicability of the technique. Related work includes the minimization of registers [8, 14], the minimization of the interconnection between secondary store and functional units [11], and the assignment of arrays to storage [18].

Our transformation has many significant benefits. By eliminating unnecessary memory operations that occur on the critical dependency path through the code, the performance of the resulting schedule can increase dramatically: the length of the critical path can be shortened, thus generating more compact schedules and reducing code size. Also, due to our transformation's local nature, it integrates easily into other parallelizing transformations [4, 13]. Another benefit is the possible savings in hardware due to the decrease in memory bandwidth requirements and/or the exploration of more cost-effective implementations.
2 Program Model

In our model, a program is represented by a control data-flow graph where each node corresponds to operations performed in one time step and the edges between nodes represent flow-of-control. Initially, each node contains only one operation. Parallelizing a program involves the compaction of multiple operations into one node (subject to resource availability).

Memory operations contain an indexing function, composed of a constant base and induction variables (iv's), and either a destination for load operations or a value for store operations. (We use the term argument to refer to destination if the operation is a load or to value if the operation is a store.) The semantics of a load operation are that issuing a load reserves the destination (local storage) at issuance time (i.e., destination is unavailable during the load's latency). For the purpose of dependency analysis on memory operations, each contains a symbolic expression, which is a string that formulates the indexing function without iv's. (During loop pipelining these expressions must be updated w.r.t. iv's.)
The initial_analysis algorithm in Fig. 1 computes initial program information. Detecting loop invariants and iv's and building iv use-def chains can be done with standard algorithms found in [1] and stored into a database. The function build_symbolic_exprs creates symbolic expressions for each memory operation in the program by getting the iv definitions that define the current operation's indexing function and deriving an expression for each. Next, the base of the memory structure is added to each expression. An operation is then annotated with its expression, combining multiple expressions into one of the form "((expr1) or ... or (exprN))".
The function derive_expr constructs the expression "(LoopId * Const)" if the iv is self-referencing (e.g., i = i + const), where LoopId is the identifier of the loop over which the iv inducts and Const is a constant derived from the constant in the iv operation multiplied by a data size and possibly other variables and constants. (For clarity we present a simplified algorithm; more complex analysis, based on [12], has been implemented in our scheduler.) In the introductory example, the data size for the j loop is the array element size and for the i loop is the size of a column (or row) of data. If the iv is defined in terms of another iv (e.g., i = j + 1, where j is an iv) then recursive calls are made on all definitions of that other iv. In this case, marking of iv's is necessary to detect cyclic dependencies, which are handled by a technique called variable folding. Essentially, variable folding determines an initial value of a variable on input to the loop or resulting from the first iteration (i.e., values which are loop-carried are not considered) from the reverse-flow of the graph. The result can be a constant or another variable (which is recursively folded, until the beginning of the loop is reached).
Fig. 2 shows a sample behavior and its CDFG annotated with symbolic expressions. The load from A builds the expression "((8 * L0) + (4 * L0))", which is the addition of 2 (the const for iv j) times 4 (the element size) and 1 (the const for iv i) times 4. The second loop over k adds the expression "(400 * L1)". Finally the base address of A is added. For the store operation, the expression "(12 * L0)" is created, which is 1 times 4 times 3 (the constant in the behavior). Due to the +1 in the index expression, the constant 4 is added to the base address of A.
Procedure initial_analysis(program)
begin
    /* Detect loop invariants. */
    /* Detect induction variables in program. */
    /* Build iv use-def chains in program. */
    build_symbolic_exprs(program)
end initial_analysis

Procedure build_symbolic_exprs(program)
begin
    foreach mem_op in program
        /* Set iv_defs to the possible iv defs found in DB. */
        foreach iv_group in iv_defs
            new_expr = derive_expr(iv_group)
            /* Add Base of mem_op to new_expr. */
            /* Annotate mem_op with new_expr. */
        end
    end
end build_symbolic_exprs

function derive_expr(iv_group)
begin
    foreach iv in iv_group
        if (/* iv is marked */) then
            /* Do variable folding. */
        else if (/* iv is self-referencing */)
            /* Return the string "(Const * LoopId)" */
        else
            /* Mark iv, then recursively derive */
            /* the iv that defines this iv. */
        end if
    end
end derive_expr

Figure 1: Initial program analysis.
L0)" is created which is 1 times 4 times 3 (the constantin
the behavior). Due to the +1 in the index expression, the
constant 4 is added to the base address of
A
.
3 Memory Disambiguation

Memory disambiguation is the ability to determine if two memory access instructions are aliases for the same location [1]. In our context, we are interested in static memory disambiguation, or the ability to disambiguate memory references during scheduling. In the general case, memory indexing functions can be arbitrarily complex due to explicit and implicit induction variables and loop index increments. Therefore, a simplistic pattern matching approach to matching loads and stores over loop iterations cannot provide the power of memory aliasing analysis. For instance, in the following behavior, if arrays a and b are aliases:
for i = 1 to N
    a[i] := 1/2 * b[i-1] + 1/3 * a[i-2]
    Coef[i] := b[i] + 1
end

pattern matching will fail to find the redundancy.
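(Concretely, on our reading: if a and b alias, the value stored to a[i] is re-loaded in the same iteration as b[i] and again in the next iteration as b[i-1]; a matcher keyed on array names treats a and b as unrelated and misses both reuses, while address-level analysis exposes them.)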
In our scheduler, memory disambiguation is based on the well-known greatest common divisor, or GCD, test [2].

.
.
.
for k = 1 to 100
j = 0
for i = 1 to 100
b = A[k][i+j]
j = j + 2
A[k][3i+1] = value
end
end
Load
1
+
j+
2
i3
*
1
-
4
*
k
400
+
+
baseA
+
*
4
*
+
ij
400 k
*
+
baseA
+
4
Store
((baseA + 4) + ((12 * L0) + (400 * L1)))
(baseA + (((8 * L0) + (4 * L0)) + (400 * L1)))
Figure 2: Symbolic expressions example.
Performing memory disambiguation on two operations, op1 and op2, involves determining if the difference equation

    (op1's symbolic expression) - (op2's symbolic expression) = 0

has any integer solution. Fig. 3 contains an algorithm to disambiguate two memory references. This algorithm works by iterating over all expressions of operations one and two, thereby testing each possible address that the two operations can have. The first step in disambiguating two expressions is to convert them into the sum-of-products form "((a * b) + ... + (y * z))". Next, operation two's expression is subtracted from operation one's. If the resultant expression is not linear then the disambiguator returns CANT_TELL; otherwise the gcd of the coefficients of the equation is solved for. If the gcd does not divide all terms, there is no dependence between op1 and op2.
Returning to the example in Fig. 2, if the load from iteration i+1 is overlapped with the store from iteration i, the disambiguator determines that the updated expression for the load minus the store's expression is 0, exposing the redundancy in loading a value which has just been computed.
If the disambiguator cannot determine that two memory operations refer to the same location, we follow the conservative approach that there is a dependence between them (i.e., no optimization can be done). Assertions (source-level statements such as: certain arrays reside in disjoint memory space, absolute bounds on loops, etc.) can be used to allay this. Also, providing the user with the information the disambiguator has derived and querying for a result to the dependence question is an alternate, interactive approach.
function disambiguate(op1, op2)
begin
    foreach ex1 in op1's expressions
        foreach ex2 in op2's expressions
            /* Convert ex1 and ex2 into sum-of-products form. */
            /* Set expr to ex1 - ex2. */
            /* If expr is not linear, return CANT_TELL. */
            /* Solve GCD of coefficients of expr. */
            /* If a solution exists (gcd divides all terms) return EQUAL */
            /* else return NOT_EQUAL. */
        end
    end
end disambiguate

Figure 3: Disambiguating memory references.
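As a complement to the pseudocode, here is a minimal runnable sketch of the GCD test over the coefficient dictionaries introduced at the end of Section 2 (our code; the paper's disambiguator works on symbolic strings, so the representation and the name gcd_test are assumptions):

from functools import reduce
from math import gcd

def gcd_test(expr1, expr2):
    # Form the difference equation expr1 - expr2 = 0 and decide whether
    # it can have an integer solution.
    diff = dict(expr1)
    for var, c in expr2.items():
        diff[var] = diff.get(var, 0) - c
    const = -diff.pop("const", 0)          # move the constant to the RHS
    coeffs = [c for c in diff.values() if c != 0]
    if not coeffs:                         # no iv terms survive
        return "EQUAL" if const == 0 else "NOT_EQUAL"
    g = reduce(gcd, (abs(c) for c in coeffs))
    return "EQUAL" if const % g == 0 else "NOT_EQUAL"

On Fig. 2's expressions, gcd_test(load_expr, store_expr) finds no surviving iv terms and a nonzero constant, so within one iteration the two accesses never alias; once the load's expression is updated for overlap with the previous iteration's store (one more L0 step plus the initial-value offsets that variable folding supplies), the difference collapses to 0 and the redundancy is exposed.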
function redundant_elimination(op, from_step)
begin
    if (INVARIANT(op)) then
        status = remove_inv_mem_op(op, from_step)
        /* if op was removed, return REMOVED */
    foreach memory operation, mem_op, in to_step
        if (disambiguate(op, mem_op) == EQUAL) then
            switch op-mem_op
                case load-load:
                    return do_load_load_opt(op, mem_op)
                case load-store:
                    status = try_load_store_opt(op, mem_op)
                    /* if op was removed, return REMOVED */
                case store-store:
                    /* If op's arg and mem_op's arg have */
                    /* the same reaching defs, delete op, */
                    /* and update necessary information. */
                    return REMOVED
                case store-load:
                    return ANTI-DEPENDENCE
            end
        end if
    end
end redundant_elimination

Figure 4: Redundant elimination algorithm.
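The same dispatch, rendered as a small self-contained Python sketch (our rendering for illustration only; MemOp, the handler table, and the status strings are assumptions, and the invariant check and step bookkeeping of the real scheduler are abstracted away):

from dataclasses import dataclass

@dataclass
class MemOp:
    kind: str    # "load" or "store"
    expr: dict   # symbolic expression (coefficient dictionary, as above)
    arg: str     # destination (load) or stored value (store)

def redundant_elimination(op, prev_step_ops, disambiguate, handlers):
    # op is being moved into the previous step; prev_step_ops are the
    # memory operations already scheduled there. Invariant removal elided.
    for mem_op in prev_step_ops:
        if disambiguate(op.expr, mem_op.expr) != "EQUAL":
            continue
        case = (op.kind, mem_op.kind)
        if case == ("load", "load"):
            return handlers["load_load"](op, mem_op)
        if case == ("load", "store"):
            if handlers["load_store"](op, mem_op) == "REMOVED":
                return "REMOVED"
        if case == ("store", "store"):
            return handlers["store_store"](op, mem_op)
        if case == ("store", "load"):
            return "ANTI_DEPENDENCE"   # false dependence: keep the ordering
    return "NO_OPT"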
4 Reducing Memory Traffic

Our solution to reducing the amount of memory traffic in HLS is to make explicit the redundancy in memory interaction within the behavior and eliminate those extraneous operations. Our technique is employed during scheduling rather than as a pre-pass or post-pass phase; a pre-pass phase may not remove all redundancy, since other optimizations can create opportunities that may not have otherwise existed, while a post-pass phase cannot derive as compact a schedule, since operations eliminated on the critical path allow further schedule refinement.
4.1 Algorithm in Detail

Fig. 4 shows the main algorithm for removing unnecessary memory operations. This function is invoked in our Percolation-based scheduler [17] by the move_op transform (or any suitable local code motion routine in other systems) when moving a memory operation into a previous step that contains other memory operations.
The function redundant_elimination checks to see if the memory operation is invariant. If so, then the function remove_inv_mem_op tries to remove it. If it is not invariant or could not be removed, then op is checked against each memory operation in the previous step for possible optimization. If two operations refer to the same location then the appropriate action is taken depending upon their types. The load-after-load and load-after-store cases will be discussed shortly. In the case of a store-after-store, the first operation is dead and can be removed if it stores the same argument as the second and the argument has the same reaching definitions. We choose to simply remove op, rather than removing mem_op and moving op into its place. For the store-after-load, nothing is done, as this is a false (anti-) dependency that should be preserved for correctness. Status reflecting the outcome is returned, allowing operations to continue to move if no redundancy was found.
4.1.1 Removing Invariants

Removing invariant memory operations is slightly different from general loop invariant removal. Traditional loop invariant removal moves an invariant into a pre-loop time step. For load operations this is correct; for store operations it is not. Conceptually, invariant loads are "inputs" to the loop, while invariant stores are "outputs". Therefore, loads must be placed in pre-loop steps and stores must be placed in loop exit steps.

An algorithm to perform invariant removal appears in Fig. 5. The conditions necessary for loop invariant removal (adapted from [1]) are: 1) the step that op is in must dominate all loop exits (i.e., op must be executed every iteration), 2) only one definition of the variable (for loads) or memory location (for stores) occurs in the loop, and 3) no other definition of the variable or memory location reaches their users. Additionally, store operations require that the definition of the argument be the same at the loop exits so that correctness is preserved. If these conditions are met, then the operation can be hoisted out of the loop. If condition 2 fails and the operation is a load, it still might be possible to hoist the operation if a register can be allocated to the loaded value for the duration of the loop.
4.1.2 Load-After-Load Optimization

The load-after-load optimization is applied in situations where a load operation accesses a memory value that has been previously loaded and no intervening modification has occurred to that location's value (i.e., there is no intermittent store). In Fig. 5 the load-after-load optimization is found. The idea behind this optimization is to insert move operations into nodes in mem_op's latency field which will transfer the value without re-loading it. As a matter of correctness, move operations are only inserted into the nodes in mem_op's latency field if op's definition reaches those nodes. Finally, op is deleted from the program graph and the local information is updated.
function remove_inv_mem_op(op, from_step)
begin
    /* Conditions necessary for hoisting: */
    /* 1. from_step must dominate all exit nodes. */
    /* 2. Only one definition exists. */
    /* 3. No other defs reach users. */
    /* 4. (stores) Defs of argument are same at loop exits. */
    if (/* conditions met */) then
        /* Move op to pre-loop steps if it's a load */
        /* and all post-loop steps if it's a store. */
        return REMOVED
    end if
    return NO_OPT
end remove_inv_mem_op

function do_load_load_opt(op, mem_op)
begin
    /* Set field to the nodes at the latency of mem_op. */
    foreach node in field
        if (/* node is reachable by op */) then
            /* Create move from mem_op's arg to the */
            /* arg of op. Add this move to node. */
        end if
    end
    /* Delete op and update necessary information. */
    return REMOVED
end do_load_load_opt

Figure 5: Supporting removal routines.
Although move operations are introduced into the schedule, the number of registers used does not increase (a proof appears in [9]).
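A toy illustration of the rewrite (ours; a single path, with the schedule modeled as a dict from step number to operation strings, and the helper name hypothetical):

def load_after_load(schedule, early, late, addr, dst_early, dst_late):
    # Replace the redundant re-load of addr at step `late` with a
    # register move from the earlier load's destination.
    assert f"{dst_early} = load {addr}" in schedule[early]
    schedule[late].remove(f"{dst_late} = load {addr}")
    schedule[late].append(f"{dst_late} = move {dst_early}")
    return schedule

sched = {1: ["r1 = load a[i]"], 2: ["r2 = load a[i]"]}
load_after_load(sched, 1, 2, "a[i]", "r1", "r2")
# sched[2] is now ["r2 = move r1"]: the value is forwarded and a
# memory port is freed in step 2.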
4.1.3 Load-After-Store Optimization

The load-after-store optimization is used to remove a load operation which accesses a value that a store operation previously wrote to the memory. Due to limited resources it is possible that this optimization cannot be applied. Consider the partial code fragment:

Step 1:  a[i] := b      b := ...
Step 2:  c := a[i]

To eliminate the load c := a[i] and replace it with the move operation c := b in step 2 would violate program semantics, because it introduces a read-wrong conflict. The move operation must be placed in step 1 to guarantee correct results. However, in this code fragment:
Step 1:  a[i] := b      c := ...
Step 2:  c := a[i]      d := f(c)

placing a move operation in step 1 will violate program semantics because it introduces a write-live conflict; the move must be inserted into step 2. Notice that in both cases the transformation is still possible; analysis is required to determine which step is applicable.

function try_load_store_opt(op, from_step, mem_op, to_step)
begin
    node = from_step
    if (/* there is a read-wrong conflict */) then
        node = to_step
    end if
    if (/* there is a write-live conflict */) then
        if (/* free cell exists */) then
            /* Create move of mem_op's arg to free cell. */
            /* Add move op to to_step. */
            /* Create move of free cell to op's arg. */
            /* Add move op to from_step. */
        else
            return NO_OPT
        end if
    else
        /* Create move of mem_op's arg to op's arg. */
        /* Add move op to node. */
    end if
    /* Delete op and update necessary information. */
    return REMOVED
end try_load_store_opt

Figure 6: Load-After-Store Algorithm.
This optimization might not be feasible in the following situation:

Step 1:  a[i] := b      b := ...      c := ...
Step 2:  c := a[i]      d := f(c)

Semantics are violated by placing c := b into either time step. However, if a free storage cell exists, then the optimization can be done:

Step 1:  a[i] := b      b := ...      c := ...      e := b
Step 2:  c := e         d := f(c)
Therefore, the precise case when the load-after-store optimization fails to remove a redundant load is composed of three conditions:

1. A move in this step results in a read-wrong.
2. A move in the previous step results in a write-live.
3. No free storage cell exists in the previous time step.

In practice, this situation occurs very infrequently.
The load-after-store optimization algorithm is found in Fig. 6. This algorithm determines in which step to place a move operation. Initially, the step that op is in is tried. If a read-wrong conflict occurs, the previous step is tried. If a write-live conflict arises, a free cell is necessary to transfer the value; in this case, two move operations are added to the schedule. If a free cell is not available, no optimization is done. If no conflicts occur (or they can be alleviated by switching steps) then a move operation is inserted. Finally, the load operation is deleted and necessary information updated.
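The case analysis reduces to a small decision procedure; here it is as a Python sketch (ours; the three booleans abstract the dataflow checks the scheduler actually performs):

def pick_move_step(read_wrong_in_from, write_live_in_to, free_cell_in_to):
    # Decide where the move replacing a load-after-store can be placed.
    if not read_wrong_in_from:
        return "from_step"       # the load's own step is safe
    if not write_live_in_to:
        return "to_step"         # fall back to the store's step
    if free_cell_in_to:
        return "via_free_cell"   # two moves through a spare register
    return "no_opt"              # the rare triple-conflict failure case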
4.2 Example

Applying our transformation to the behavior in the introduction will eliminate the loads of b[i][j-1] and b[i][j-2] and the invariant load and store of a[i]. During loop pipelining, the load of b[i][j-2] for iteration j+1 is the same as b[i][j-1] from iteration j, since (j+1) - 2 = j - 1. For the invariants, the store cannot be removed unless the load is also removed. Once the load is hoisted, the store can then be hoisted as well.
5 Experiments and Results

Four memory-intensive benchmarks were used to study our transformation: three numerical algorithms (prefix sums, tri-diagonal elimination and general linear recurrence equations), which are core routines in many algorithms (as discussed in the introduction), adapted from [10], and a two-dimensional hydrodynamics implicit computation adapted from [20].

Latencies used for scheduling these behaviors were two steps for add/subtract, three steps for multiply, and five steps for load/store. Also, the memory model adopted here assumed that:

  • memory ports are homogeneous,
  • each port has its own address calculator,
  • the memory is pipelined with no bank conflicts.
With these assumptions, two experiments were conducted. In the first, schedules were generated with the number of memory ports constrained between one and four and no functional unit (FU) constraints. Two schedules were produced for each benchmark, with the sole difference between them the application of our transformation. The goal of this experiment was to isolate the difference in transformed schedules without the bias of FU constraints. In the second experiment, schedules were generated with one to four memory ports, two adder units and one multiplier unit. This experiment was designed to study performance in the presence of realistic FU resources.
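As a back-of-envelope illustration of why port count and the transformation interact (our arithmetic, not a figure from the paper): the introductory inner loop issues six memory operations per iteration before the transformation and two after, so memory issue alone needs ceil(6/P) steps before versus ceil(2/P) after on P homogeneous ports; at P = 2 that floor drops from three steps to one.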
For each experiment, the number of steps in the schedule of the innermost loop was counted. The GLR equations benchmark (marked with a ?) has two loops at the same innermost nesting level; the results indicate the summation of the number of steps in both loops. The results of experiments one and two are found in Tables 1 and 2, respectively. The column labelled "RE" indicates application of our transformation. The columns collectively labelled "Number of Ports" contain the number of steps in the innermost loop for the respective FU and memory port parameters.
The results for experiment one (Table 1) demonstrate that this optimization considerably reduces the number of cycles for the inner loop. In the prefix sums and tri-diagonal elimination benchmarks, a performance limited by the latency of a load is achieved with a sufficient number of ports. Since a latency of 5 cycles was used for load operations and not all loads can be eliminated, the schedule length cannot be any shorter. The same characteristic is exhibited by the GLR equations benchmark, although computational latency causes a longer schedule length, while the hydrodynamics benchmark exhibits improvement.

Citations
Journal ArticleDOI
TL;DR: A survey of the state-of-the-art techniques used in performing data and memory-related optimizations in embedded systems, covering a broad spectrum of optimization techniques that address memory architectures at varying levels of granularity.
Abstract: We present a survey of the state-of-the-art techniques used in performing data and memory-related optimizations in embedded systems. The optimizations are targeted directly or indirectly at the memory subsystem, and impact one or more out of three important cost metrics: area, performance, and power dissipation of the resulting implementation. We first examine architecture-independent optimizations in the form of code transformations. We next cover a broad spectrum of optimization techniques that address memory architectures at varying levels of granularity, ranging from register files to on-chip memory, data caches, and dynamic memory (DRAM). We end with memory addressing related issues.

405 citations


Cites methods from "Minimization of Memory Traffic in H..."

  • ...This is illustrated by the work on local loop transformations to reduce the memory access in procedural descriptions [Kolson et al. 1994]; the work on multidimensional loop scheduling for buffer reduction [Passos et al. 1995]; and the PHIDEO project where loop transformations on periodic streams…...


Proceedings ArticleDOI
22 Jun 2001
TL;DR: A compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations is proposed that indicates significant reductions in data transfer activity between SPM and off-chip memory.
Abstract: Optimizations aimed at improving the efficiency of on-chip memories are extremely important. We propose a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations. Experimental results obtained using a generic cost model indicate significant reductions in data transfer activity between SPM and off-chip memory.

296 citations

Proceedings ArticleDOI
10 Jun 2002
TL;DR: An optimization algorithm is proposed that targets the reduction of extra off-chip memory accesses caused by inter-processor communication by increasing the application-wide reuse of data that resides in the scratch-pad memories of processors.
Abstract: In this paper, we present a compiler strategy to optimize data accesses in regular array-intensive applications running on embedded multiprocessor environments. Specifically, we propose an optimization algorithm that targets the reduction of extra off-chip memory accesses caused by inter-processor communication. This is achieved by increasing the application-wide reuse of data that resides in the scratch-pad memories of processors. Our experimental results obtained on four array-intensive image processing applications indicate that exploiting inter-processor data sharing can reduce the energy-delay product by as much as 33.8% (and 24.3% on average) on a four-processor embedded system. The results also show that the proposed strategy is robust in the sense that it gives consistently good results over a wide range of several architectural parameters.

97 citations

Proceedings ArticleDOI
02 Oct 1995
TL;DR: In this paper, several DSP system design principles are presented which are valid for a large class of memory-intensive algorithms, and it is shown that for this class of applications, compile-time data caching decisions not only have a large effect on the performance, but also can have an even larger impact on the overall system cost and power consumption.
Abstract: In this paper, several DSP system design principles are presented which are valid for a large class of memory-intensive algorithms. Our main focus lies on the optimization of the memory and I/O, since these are dominant cost factors in the domain of video and imaging applications. This has resulted in several formalizable mapping principles, which allow to prevent the memory from becoming a bottleneck. First, it is shown that for this class of applications, compile-time data caching decisions not only have a large effect on the performance, but also can have an even larger effect on the overall system cost and power consumption. This is illustrated by means of experiments in which the whole range of no cache up to large cache sizes is scanned. Next, it is shown that when enforcing constant I/O rates to reduce buffer sizes, the area gain may be far more important than the small performance decrease associated with it. A technique to achieve this in an efficient way is proposed. The main test-vehicle which is used throughout the paper to demonstrate our approach is the class of motion estimation type algorithms.

25 citations

Journal ArticleDOI
TL;DR: A new transformation for the scheduling of memory-access operations in high-level synthesis that removes load and store instructions which become redundant or unnecessary during the transformation of loops, resulting in a significant reduction in the number of memory operations.
Abstract: This paper presents a new transformation for the scheduling of memory-access operations in high-level synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are commonly observed in video compression, image convolution, hydrodynamics and mechatronics. Our transformation removes load and store instructions which become redundant or unnecessary during the transformation of loops. The advantage of this reduction is the decrease of secondary memory bandwidth demands. This technique is implemented in our Percolation-Based Scheduler which we used to conduct experiments on a suite of memory-intensive benchmarks. Our results demonstrate a significant reduction in the number of memory operations and an increase in performance on these benchmarks.

22 citations

References
Book
01 Jan 1993
TL;DR: This book provides the general mathematical background needed for loop transformations (including those basic mathematical algorithms), discusses data dependence, and introduces the major transformations.
Abstract: Automatic transformation of a sequential program into a parallel form is a subject that presents a great intellectual challenge and promises great practical rewards. There is a tremendous investment in existing sequential programs, and scientists and engineers continue to write their application programs in sequential languages (primarily in Fortran),but the demand for increasing speed is constant. The job of a restructuring compiler is to discover the dependence structure of a given program and transform the program in a way that is consistent with both that dependence structure and the characteristics of the given machine. Much attention in this field of research has been focused on the Fortran do loop. This is where one expects to find major chunks of computation that need to be performed repeatedly for different values of the index variable. Many loop transformations have been designed over the years, and several of them can be found in any parallelizing compiler currently in use in industry or at a university research facility. Loop Transformations for Restructuring Compilers: The Foundations provides a rigorous theory of loop transformations. The transformations are developed in a consistent mathematical framework using objects like directed graphs, matrices and linear equations. The algorithms that implement the transformations can then be precisely described in terms of certain abstract mathematical algorithms. The book provides the general mathematical background needed for loop transformations (including those basic mathematical algorithms), discusses data dependence, and introduces the major transformations. The next volume will build a detailed theory of loop transformations based on the material developed here. Loop Transformations for Restructuring Compilers: The Foundations presents a theory of loop transformations that is rigorous and yet reader-friendly.

253 citations


"Minimization of Memory Traffic in H..." refers methods in this paper

  • ...In our scheduler, memory disambiguation is based on the well-known greatest common divisor, or GCD test [2]....


Journal ArticleDOI
TL;DR: A metric to estimate the optimal execution time of DO loops on particular processors is described, parameterized by the memory bandwidth and peak floating-point rate of the processor, as well as the length of the pipelines used in the functional units.

154 citations


"Minimization of Memory Traffic in H..." refers background in this paper

  • ...Previous work in reducing memory accessing balances load operations with computation [3]....


Proceedings ArticleDOI
24 Jun 1990
TL;DR: It is argued that starting from an optimal schedule is one of the most important factors in scheduling because it offers the user flexibility to tune the heuristics and gives him a good bound for the resource constrained schedule.
Abstract: A new approach called Percolation Based Synthesis for the scheduling phase of High Level Synthesis (HLS) is presented. We discuss some new techniques (which are implemented in our tools) for compaction of flow graphs beyond basic blocks limits, which can produce order of magnitude speed ups versus serial execution. Our algorithm applies to programs with conditional jumps, loops and multicycle pipelined operations. In order to schedule under resource constraints we start by first finding the optimal schedule (without constraints) and then add heuristics to map the optimal schedule onto the given system. We argue that starting from an optimal schedule is one of the most important factors in scheduling because it offers the user flexibility to tune the heuristics and gives him a good bound for the resource constrained schedule. This scheduling algorithm is integrated with synthesis tool which uses VHDL as input description and produces a structural netlist of generic register-transfer components and a unit based control table as output. We show that our algorithm obtains better results than previously published algorithms.

125 citations


"Minimization of Memory Traffic in H..." refers background or methods in this paper

  • ...This function is invoked in our Percolation-based scheduler [17] by the move op transform...


  • ...During loop pipelining [17, 13] redundancy is exhibited when values loaded from and/or stored to the memory in one iteration are loaded from and/or stored to the memory in future iterations....


Journal ArticleDOI
TL;DR: LAGER is an integrated computer-aided design system for algorithm-specific integrated circuit design, targeted at applications such as speech processing, image processing, telecommunications, and robot control, and allows easy integration of novel CAD tools.
Abstract: LAGER is an integrated computer-aided design system for algorithm-specific integrated circuit design, targeted at applications such as speech processing, image processing, telecommunications, and robot control. LAGER provides user interfaces at behavioral, structural, and physical levels and allows easy integration of novel CAD tools. LAGER consists of a behavioral mapper and a silicon assembler. The behavioral mapper maps the behavior onto a parameterized structure to produce microcode and parameter values. The silicon assembler then translates the filled-out structural description into a physical layout, and, with the aid of simulation tools, the user can fine tune the data path by iterating this process. The silicon assembler can also be used without the behavioral mapper for high-sample-rate applications. A number of algorithm-specific ICs designed with LAGER have been fabricated and tested, and as examples, a robot arm controller chip and a real-time image segmentation chip are described. >

104 citations


"Minimization of Memory Traffic in H..." refers background in this paper

  • ...Traditionally, one of the goals in High-Level Synthesis (HLS) is the minimization of storage requirements for synthesized designs [5, 7, 15, 19]....


Proceedings ArticleDOI
07 Nov 1993
TL;DR: A multiport memory allocation problem for hierarchical, i.e. multi-dimensional, data streams is described and a memory allocation algorithm is presented which only considers interconnect costs, but memory size and other cost factors can be taken into account.
Abstract: A multiport memory allocation problem for hierarchical, i.e. multi-dimensional, data streams is described. Memory allocation techniques are used in high level synthesis for foreground and background memory allocation, the design of data format converters, and the design of synchronous inter-processor communication hardware. The techniques presented in this paper differ from other approaches in the sense that data streams are considered to be design entities and are not expanded to individual samples. A formal model for hierarchical data streams is given and a memory allocation algorithm is presented. The algorithm comprises two steps: data routing and assignment of signal delays to memories. A number of sub-problems are formulated as ILP programs. In the presented form, the allocation algorithm only considers interconnect costs, but memory size and other cost factors can be taken into account. The presented work is implemented in the memory allocation tool MEDEA which is part of the PHIDEO synthesis system.

99 citations


"Minimization of Memory Traffic in H..." refers background in this paper

  • ...Related work includes the minimization of registers [8, 14], the minimization of the interconnection between secondary-store and functional units [11], and the assignment of arrays to storage [18]....


Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "Minimization of memory traffic in high-level synthesis"?

This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This technique is implemented in their Percolation-Based Scheduler which the authors used to conduct experiments on core numerical benchmarks. 

Future work will address this interaction.