Pipeline Vectorization
Markus Weinhardt and Wayne Luk, Member, IEEE
Abstract—This paper presents pipeline vectorization, a method
for synthesizing hardware pipelines based on software vector-
izing compilers. The method improves efficiency and ease of
development of hardware designs, particularly for users with
little electronics design experience. We propose several loop
transformations to customize pipelines to meet hardware resource
constraints while maximizing available parallelism. For runtime
reconfigurable systems, we apply hardware specialization to
increase circuit utilization. Our approach is especially effective
for highly repetitive computations in digital signal processor
(DSP) and multimedia applications. Case studies using field-programmable gate array (FPGA)-based platforms are presented to
demonstrate the benefits of our approach and to evaluate tradeoffs
between alternative implementations. For instance, the loop-tiling
transformation has been found to improve vectorization perfor-
mance 30–40 times over a PC-based software implementation,
depending on whether runtime reconfiguration (RTR) is used.
Index Terms—High-level synthesis, parallelization, pipelining,
reconfigurable computing, vectorization.
I. INTRODUCTION
Many application developers recognize that the key to ef-
fective use of custom computing systems is to maximize
their available parallelism. This task, which has to be achieved
while meeting specific hardware resource constraints, is diffi-
cult to perform by hand.
Vectorizing compilers have proved successful in detecting
and exploiting parallelism for conventional processors with a
fixed architecture. A vector execution unit adapted for digital
signal processor (DSP) and multimedia processing has also been
identified as an important component of novel computer archi-
tectures such as the vector IRAM [1]. This paper presents an ap-
proach for automatically producing optimized pipelined circuits
from a high-level program using techniques derived from soft-
ware vectorizing compilers. The compile-time and runtime re-
configurability of field programmable gate arrays (FPGAs) can
also be efficiently exploited.
Our approach, which we call pipeline vectorization [2]–[4],
essentially involves the synthesis of pipelined processors that
execute inner loops of programs. Data dependence analysis sim-
ilar to software vectorization is performed, which determines if
a pipeline can be generated for a loop. As a result, our approach
generates circuits that exhibit more parallelism than those produced
by many other automatic high-level hardware design tools.
Manuscript received February 29, 1999; revised April 6, 2000. This work
was supported by a European Union training project financed by the Commis-
sion in the TMR program, the U.K. Engineering and Physical Sciences Re-
search Council, Embedded Solutions Ltd., and Xilinx Inc. This paper was rec-
ommended by Associate Editor R. Camposano.
The authors are with the Department of Computing, Imperial Col-
lege, London SW7 2BZ, U.K. (e-mail: m.weinhardt@computer.org;
wl@doc.ic.ac.uk).
Publisher Item Identifier S 0278-0070(01)00943-5.
There are significant differences between pipeline vector-
ization and software vectorization. For instance, our approach
covers a wider range of loops since it does not consider
out-of-order execution. It can be used with a variety of storage
allocation methods. In contrast to software vectorization, we do
not explicitly generate vector instructions. Instead, all instruc-
tions of the loop body are vectorized and chained by pipelining
input data through the entire dataflow graph synthesized from
the loop body. To widen the applicability of our technique, we
devise several loop transformations that adjust the amount of
hardware used in vectorized loops to the available hardware
resources. For reconfigurable implementations, we explore
methods to increase circuit utilization by runtime circuit
specialization and runtime reconfiguration (RTR).
Our approach includes the synthesis of nonpipelined circuitry
for nonvectorizable loops and conditional and sequential pro-
gram code. It can be used in two modes—hardware mode and
codesign mode. In hardware mode, a processor is generated for
the entire program (which includes only synthesizable opera-
tions as defined in Section III-A1), rendering descriptions from
a high-level sequential programming language into an efficient
hardware description language (HDL). Alternatively, in code-
sign mode, parts of the program (such as nonsynthesizable or
highly irregular parts) remain in software to be executed on
a host microprocessor. This mode results in a hardware-soft-
ware codesign system with data and control transfer between
host processor and custom hardware automatically being imple-
mented.
This paper is organized as follows. First, we discuss relevant
previous work. Section III then presents the core pipeline vec-
torization design flow, the main contribution of this paper. Next,
Sections IV and V describe the other important contributions:
optimizing loop transformations and runtime circuit specializa-
tion. Section VI reports on a prototype compiler implementa-
tion and Section VII provides case studies and results evalu-
ating pipeline vectorization. Finally, Section VIII presents con-
clusions and future work.
II. BACKGROUND
Increasingly, system descriptions are written in a high-level
software language [5]. This method simplifies algorithm
development and facilitates experiments to map different
components into hardware. Our approach not only supports this
method, but it goes a step further by providing a framework in
which efficient hardware can be produced automatically from a
software description. It is particularly suited for reconfigurable
systems, which have been shown to be useful in various appli-
cation domains [6]. We attempt to simplify the programming
of these systems, which typically consist of a host processor
and FPGA hardware boards [7], [8]. Programming difficulties
include analyzing the tradeoffs between software and hard-
ware, designing software and hardware parts using different
languages and tools, and debugging the interface between these
parts.
Some tools approach these problems by using a programming
language input to specify both software and hardware in a uni-
form manner. This enables systematic analysis of the tradeoffs
as well as automatic synthesis of the interface between software
and hardware. However, while a sequential program is a nat-
ural specification for the host processor, hardware coprocessors
synthesized from sequential code often fail to exploit the hard-
ware’s parallelism sufficiently. For instance, this is the case in
the PRISM system [9]. The configuration and communication
overhead is often larger than the achieved speedup itself. Better
results can be obtained for systems that integrate a micropro-
cessor core and reconfigurable hardware on a chip. An example
of this approach is the Garp chip [10] and its C-based compiler
[11].
Guccione adopts data-parallel C vector operations on data
streams to describe pipelined circuits [12], [13]. However, this
method requires the user to learn a new programming language
that is only used for the hardware part of an application. The
same is true for Transmogrifier C [14], a research compiler al-
lowing only task-level parallelism.
Hardware programming systems based on communicating
sequential processes such as OCCAM [15] and Handel-C
[16] are suitable for control intensive applications. However,
the user has to specify parallel operations explicitly. As for
data-parallel C, the software parts of an application must be
written in a different language and interfaced manually to the
hardware parts.
In the application-specific integrated circuit domain,
high-level synthesis systems generate register-transfer struc-
tures from behavioral (algorithmic) specifications [17]–[19].
These methods employ sophisticated scheduling, resource
allocation, and binding techniques for general processor archi-
tectures. They perform an exhaustive design-space exploration,
which makes the tools very slow, especially if compared to a
software compiler. A commercial high-level synthesis system
is Synopsys’ Behavioral Compiler [20], which can handle array
data and generate memory accesses. However, to pipeline loops,
the user has to analyze loop-carried dependences manually (as
defined in Section III-A3) and specify a safe initiation interval,
which preserves the dependences and the original order of
memory reads and writes. Another system, C2Verilog [21],
uses ANSI C to produce a Verilog register-transfer structure
using high-level synthesis techniques. However, it does not
perform loop pipelining.
Several research projects address the automatic synthesis of
pipelined circuits from program loops. The closest to our ap-
proach is the NAPA C compiler [22]. However, that system tar-
gets specifically the NAPA processor [23] and considers only in-
nermost loops. No automatic vectorization or optimizing trans-
formations similar to ours are reported. The scheduling of in-
structions is performed on an atomic basis and, thus, is less
flexible than hardware pipelining, which can also use internally
pipelined operators. Finally, a loop parallelization method is
also included in the Alpha System [24]. However, it is restricted
to producing linear systolic arrays whereas our techniques can
vectorize more general programs.
Fig. 1. Core design flow.
III. CORE DESIGN FLOW
We first present the core pipeline-vectorization design flow,
as shown in Fig. 1. Most parts of the figure are relevant for both
the codesign and the hardware mode. Only the software branch
on the bottom left side and the hardware–software partitioning
phase do not exist for the hardware mode. The core design flow
consists of three major phases: preprocessing, hardware syn-
thesis, and partitioning and integration. They are described in
the following sections. The extension of the core design flow to
include loop transformations will be covered in Section IV.
A. Preprocessing
Preprocessing consists of four steps: hardware candidate se-
lection (Section III-A1), loop normalization (Section III-A2),
dependence analysis (Section III-A3), and removing vector de-
pendences (Section III-A4). These steps are necessary to per-
form hardware synthesis effectively.
1) Hardware Candidate Selection: This step selects pro-
gram parts suitable for hardware synthesis. Regular iterative
computations, which perform identical operations on a large
set of data, are likely to achieve high performance in hardware.
Hence, loops are natural candidates for hardware processors.
We attempt to vectorize innermost FOR loops and generate
pipelines for them. FOR loops have an induction (index)
variable necessary for normalization and vectorization and
have predetermined loop counts. Thus, they can be handled
by efficient control circuitry. It is possible to transform some
WHILE loops to FOR loops by induction variable detection
[25]. Other WHILE loops, outer loops, and other program
constructs are nonpipeline candidates and, therefore, are
unlikely to result in fast efficient hardware. They should only
be considered in combination with pipeline candidates or left
for software execution. However, our procedure will consider
all loops since the loop transformations presented in Section IV
rearrange loop nests.
There are some additional restrictions for the candidates: they
must not contain nonsynthesizable operations such as recursive
function calls, external operating system calls, or library calls.¹
In hardware mode, programs containing any nonsynthesiz-
able operations are not considered legal input. Thus, the entire
legal input program is a candidate and candidate selection only
distinguishes between pipeline and nonpipeline candidates.
The example edge detector program in Fig. 2 will be used to
synthesize a pipeline circuit. Its inner loop is a pipeline candi-
date and its outer loop (by definition) a nonpipeline candidate.
We use C language syntax here because our compiler prototype
(Section VI) uses a C front end. Our approach is valid for any
sequential imperative programming language.
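Fig. 2 itself is not reproduced here; the following C sketch illustrates the kind of loop nest the discussion assumes. The array names, image size, threshold, and the particular gradient computed are assumptions for illustration, not the paper's exact code.

    #define N 256                            /* image size, assumed          */
    #define THRESH 32                        /* threshold, assumed           */

    void edge_detect(unsigned char a[N][N], unsigned char b[N][N])
    {
        int x, y, d;
        for (y = 1; y < N - 1; y++) {        /* outer loop: nonpipeline candidate  */
            for (x = 1; x < N - 1; x++) {    /* inner FOR loop: pipeline candidate */
                /* gradient built from index-shifted reads of three rows of a     */
                d = (a[y-1][x+1] + 2*a[y][x+1] + a[y+1][x+1])
                  - (a[y-1][x-1] + 2*a[y][x-1] + a[y+1][x-1]);
                if (d < 0)
                    d = -d;                  /* conditional reassignment           */
                b[y][x] = (d > THRESH) ? 255 : 0;   /* output array only written   */
            }
        }
    }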
2) Loop Normalization: For vectorization, we have to nor-
malize the pipeline candidate loops by the following transfor-
mations.
• Remove all additional induction variables and normalize
the loop's lower bound to zero and its step to one (induc-
tion variable substitution [26]).
• Normalize the index expressions to linear expressions of
the induction variable (subscript normalization [26]). For
induction variable i, the resulting expression has the form
a*i + c; the constant a is called the access's stride.
If one or more index expressions cannot be normalized, the
loop is only a nonpipeline candidate. In particular, indirect array
accesses, where array elements are accessed by intermediate re-
sults, prevent vectorization, as in software vectorizing compilers
[26].
Normalizing the inner candidate loop in Fig. 2 creates a new
loop header with a lower bound of zero and a step of one and
substitutes the original index variable accordingly in the loop body.
¹Nonrecursive function calls can be inlined. Therefore, we assume, without
loss of generality, that no function calls exist in the candidates.
Fig. 2. Edge detector program.
3) Dependence Analysis: The next processing stage ana-
lyzes pipeline candidate loops for dependences. There are three
general types of dependences [27]. True or flow dependence
occurs when a variable is assigned or defined in one statement
and used in a subsequently executed statement. Antidependence
occurs when a variable is used in one statement and reassigned
in a subsequently executed statement. Output dependence oc-
curs when a variable is assigned in one statement and reassigned
in a subsequently executed statement. General dependences
are either loop independent or loop carried. The former occurs
between statements in the same loop iteration and the latter
between statements in different iterations. For loop-carried de-
pendences, the dependence distance is the number of iterations
between the statements that cause the dependence. In a loop
nest, we determine the loop-carried dependences at each level of
the hierarchy since only these affect the loop-level parallelism.
Since pipeline execution overlaps the loop iterations but
maintains their order, memory writes are never out of order.
Hence, we only have to consider true dependences but not
anti or output dependences. Therefore, pipeline vectorization
applies to more loops than software vectorization. We utilize
standard dependence analysis methods [27] to detect these
dependences. Unfortunately, these methods are not completely
accurate. In some cases, they cannot determine the absence of
dependences, thus failing to detect potential parallelism.
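The following loops, constructed here for illustration rather than taken from the paper, show the distinction between loop-independent and loop-carried true dependences.

    int i, t, n = 16, a[16] = {0}, b[16];

    /* Loop-independent true dependence: t is defined and used in the same
       iteration, so it does not restrict loop-level parallelism.           */
    for (i = 0; i < n; i++) {
        t = a[i] * 2;
        b[i] = t + 1;
    }

    /* Loop-carried true dependence with distance 2: a[i] is written in
       iteration i and read again two iterations later.                     */
    for (i = 2; i < n; i++)
        a[i] = a[i - 2] + 1;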
Next, we check if the detected true loop-carried dependences
occur in all loop iterations with the same dependence distance.
We call these dependences regular. All dependences stemming
from scalar variables and from array accesses with the same
stride are regular [3]. It is possible to synthesize hardware
obeying these dependences, but the resulting circuits may
contain feedback cycles. In contrast to software vectorization,
regular dependences do not prevent pipeline synthesis, although
they can reduce parallelism because the feedback paths restrict
the speedup achieved by pipelining in a later processing stage.
Irregular dependences can be handled provided that the orig-
inal order of read and write accesses of the arrays involved
are maintained. However, this usually requires many sequential
memory accesses and is only feasible with very fast memories
such as on-chip memories.
In the program in Fig. 2, there are no dependences from array
accesses: the input array is only read and the output array is only written.
Fig. 3. Fibonacci numbers program.
The following example shows that our approach can deal with
loops not usually vectorized by software vectorizing compilers.
Consider a loop whose body reads an array element that a later
iteration overwrites (a sketch is given below). Such a loop
has a loop-carried dependence stemming from the assignment to
the array, but no value written to memory is read in a subsequent
loop iteration. Only out-of-order execution of the assignments
would lead to a real dependence. Hence, it is an antidependence
and we can disregard it and vectorize the loop.
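The original loop listing is not reproduced above; a minimal loop with the property just described (an illustrative stand-in, not the paper's code) is:

    int i, n = 16, a[16] = {0};

    /* a[i+1] is read in iteration i and overwritten in iteration i+1: only a
       loop-carried antidependence, so the loop can still be vectorized.    */
    for (i = 0; i < n - 1; i++)
        a[i] = a[i + 1] / 2;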
In contrast, the program computing the Fibonacci num-
bers in Fig. 3 contains true loop-carried dependences. The as-
signment to each array element depends on the two previous assignments.
Since both dependences are regular (dependence distances 1 and
2), we can generate a pipeline circuit for this program. But the
dependences have to be handled as described in the next para-
graph.
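Fig. 3 is not reproduced here; the loop it describes has the following shape (array name and bounds are assumptions):

    int i, n = 16, fib[16];

    fib[0] = 0;
    fib[1] = 1;
    for (i = 2; i < n; i++)
        fib[i] = fib[i - 1] + fib[i - 2];   /* true loop-carried dependences,
                                               distances 1 and 2            */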
4) Removing Vector Dependences: The hardware synthesis
technique presented in the next section cannot handle true loop-
carried dependences stemming from array (vector) accesses, as
shown in Fig. 3. Therefore, pipeline candidate loops have to be
transformed to remove them in the following way. First, vector
accesses depending on earlier iterations are substituted by new
scalar variables in the candidate loop body. Next, at the end of
the loop body, instructions are inserted that assign these vari-
ables to the values they depend on in the original program. For
dependences with dependence distances larger than one, addi-
tional variables and assignments are inserted. Finally, assign-
ments to initialize the variables are added before the loop. For
the Fibonacci number program, the two dependent array accesses are
substituted. Fig. 4 shows the resulting transformed program. It
only contains dependences stemming from the new scalar vari-
ables. In every iteration, their values from the
previous iteration are read. These dependences will be handled
by the hardware synthesis phase.
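A sketch of the transformation applied to the Fibonacci loop follows (Fig. 4 is not reproduced here; the scalar names fib1 and fib2 are illustrative):

    int i, n = 16, fib[16], fib1, fib2;

    fib[0] = 0;
    fib[1] = 1;
    fib1 = fib[1];                 /* holds the value of fib[i-1]            */
    fib2 = fib[0];                 /* holds the value of fib[i-2]            */
    for (i = 2; i < n; i++) {
        fib[i] = fib1 + fib2;      /* no array element written here is read
                                      in a later iteration                   */
        fib2 = fib1;               /* assignments inserted at the end of the
                                      loop body update the new scalars       */
        fib1 = fib[i];
    }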
B. Hardware Synthesis
The hardware synthesis phase contains three steps:
dataflow graph generation (Section III-B1) and extension
(Section III-B2), pipelining (Section III-B3), and controller
synthesis (Section III-B4).
For those candidate loops that pass the dependence test, in-
dependent pipeline circuits are synthesized. They are later inte-
grated in largerdesigns or instantiated in separate configurations
(see Section III-C).
Various storage allocation schemes can be used. For instance,
scalar variables can be held in hardware registers and arrays
can be stored on off-chip memory. On some FPGA families,
small arrays of data can also be stored in very fast on-chip
memory.
Fig. 4. Transformed Fibonacci program.
Array elements are fed to the pipeline as continuous
data streams through vector inputs and output streams are
written back to local memory through vector outputs. In this
way, one loop iteration is executed every pipeline cycle. All
element addresses for the linear array accesses can be computed
in parallel with the loop computations. Thus, they do not slow
down the application. However, for arbitrary accesses, address
computations depend on loop computation results and must
be scheduled accordingly, thereby slowing down the circuits.
Pointer accesses are indirect accesses to the entire host memory
space and are only possible in tightly coupled architectures
with direct memory access (DMA).
We first generate an acyclic dataflow graph for the loop body.
Next, regular loop-carried dependences are resolved, possibly
introducing feedback cycles. Finally, the circuit is pipelined and
a controller is synthesized.
1) Dataflow Graph Generation: We generate an acyclic
combinational dataflow graph for the loop body by analyzing
its internal dependences and allocating a new operator for each
operation in the program’s expressions. This simple “direct
compilation” shares no resources within the loop body but
later allows overlapping loop iterations by pipelining. We treat
array accesses and scalar variables uniformly. Since the loop
body can only contain linear code and conditional statements,
we can use a control flow/data flow transformation [19] to
generate one combined dataflow graph for the entire loop body.
It computes all program branches in parallel and uses multi-
plexers to select the correct values of conditionally assigned
variables. Resources in these mutually exclusive paths can
be shared without interfering with pipelining (cf. the PISYN
system [18]). For instance, an adder and a subtractor with the
same inputs can be replaced by a combined adder/subtractor if
their outputs are not required concurrently. Our dataflow graph
generation is similar to the method used in the Transmogrifier C
compiler [14], but our method avoids unnecessary memory
accesses: when an input value remains unchanged in one branch
of a conditional statement, we do not read the old value in and
write the unchanged value back. Instead, write-enable signals
are generated for the RAM accesses to write values only if the
appropriate conditions are met.
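As a rough illustration of this transformation (variable names and the concrete conditions are assumptions, and the real target is a dataflow graph rather than C code):

    int a = 3, b = 4, c = 1, t = 5, i = 0, x, we, out[16];

    /* Source form of a loop body with conditional assignments:             */
    x = a + b;
    if (c > 0)
        x = a - b;                 /* scalar conditionally reassigned        */
    if (x > t)
        out[i] = x;                /* array element conditionally written    */

    /* Corresponding dataflow-graph view for one iteration: both branches
       are evaluated, a multiplexer selects the scalar value, and a
       write-enable gates the RAM write (no read-modify-write of out[i]).   */
    x  = (c > 0) ? (a - b) : (a + b);
    we = (x > t);                  /* out[i] is written only when we is true */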
To further reduce redundant memory accesses, index-shifted
accesses to the same array are combined and realized by shift
registers [28]. Using these delayed values of the input stream
avoids accessing the same value in memory more than once and
reduces the number of required vector inputs. This reduction is
crucial since all vector input streams must be read and all output
streams written once for every loop iteration. Thus, the pipeline
throughput directly depends on the number of vector inputs and
outputs.
Fig. 5. Edge detector dataflow graph.
Fig. 6. Incomplete Fibonacci dataflow graph.
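For example (an illustrative loop, not taken from the paper), the three index-shifted reads of array a below require only a single vector input; the two older values are kept in delay registers inside the pipeline:

    int x, n = 16, a[16] = {0}, b[16];

    for (x = 1; x < n - 1; x++)
        b[x] = a[x - 1] + a[x] + a[x + 1];

    /* Conceptually, per pipeline cycle:
         a_m1 = a_0;  a_0 = a_p1;  a_p1 = <next element of the a stream>;
       so each element of a is fetched from memory only once.               */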
Fig. 5 shows the dataflow graph for the edge detector program
from Fig. 2. There are three vector inputs for the values read
from the input array and a vector output for the result array.
The shift registers are represented by delay elements. Note that
the conditional statement of the loop body is implemented by a
multiplexer, which either selects the conditionally assigned value
or the unchanged value generated in the previous statement.
The dataflow graph generated for the transformed Fibonacci
program in Fig. 4 is shown in Fig. 6. Obviously, this circuit does
not produce correct outputs since the loop-carried dependences
stemming from the new scalar variables are not accounted for.
The registers are only initialized but never updated. The next
section shows how the circuit can be altered to produce correct
output.
2) Dependences and Feedback Cycles: If a loop has reg-
ular loop-carried dependences, the dataflow graph must be ex-
tended to use the correct values upon which a computation de-
pends. The transformation in Section III-A4 substituted depen-
dent array accesses by scalar variables. Thus, all loop-carried
dependences remaining in a pipeline candidate stem from scalar
variables. Such dependences are treated in the following way.
Since one loop iteration is executed every pipeline cycle, the
input register of such a variable (which is read and written in
the loop) must always contain the value computed in the pre-
vious pipeline cycle. To achieve this, a multiplexer is added at
the register’s input. It selects the input value during initializa-
tion and the feedback value during normal operation depending
on an external control signal provided by the environment.
Fig. 7 shows the result for the Fibonacci program in Fig. 4.
For the two feedback variables, multiplexers have been inserted between
the inputs and the registers storing the variables. During initial-
ization, an external control signal selects the input values
that are used for the first loop iteration. All subsequent iterations
select the other inputs of the multiplexers, which are connected to
the output values of the variables in the previous iteration. In
this example, a feedback cycle from the output of the adder to
the register holding the most recently computed value is created
(bold lines in Fig. 7). However, not all dependences result in
feedback cycles.
Fig. 7. Complete Fibonacci dataflow graph.
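A cycle-level C model of one such feedback register and its input multiplexer might look as follows; the signal names, and the use of the Fibonacci values 1 and 0 as initial inputs, are assumptions for illustration.

    int init = 1;                         /* external control signal: 1 only
                                             during initialization           */
    int fib1 = 0, fib2 = 0, fib1_next, fib2_next, adder_out;

    adder_out = fib1 + fib2;              /* combinational adder             */
    fib1_next = init ? 1 : adder_out;     /* mux in front of the fib1 register */
    fib2_next = init ? 0 : fib1;          /* mux in front of the fib2 register */
    fib1 = fib1_next;                     /* both registers update together  */
    fib2 = fib2_next;                     /*   on the pipeline clock edge    */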
This example shows that the feedback operators synthe-
sized by pipeline vectorization are more general than those
available in single-instruction multiple-data (SIMD) parallel
programming languages [27]. Such languages feature special
REDUCE or SCAN operations but they are limited to single
operators with direct output feedback to one input (for instance,
ADD-SCAN for an accumulator). The same is true for software
vectorizing compilers, which extract these operations. Arbi-
trarily customized feedback units, such as the one shown in Fig. 7,
are only possible when a pipeline is implemented in hardware.
3) Pipelining and Timing: So far, we have generated a
dataflow graph that computes one loop iteration once all input
registers are set. It may not be very efficient because the
combinational delays of chained operators may accumulate to
a long critical path. The critical path delay can be reduced by
pipelining, effectively overlapping different loop iterations, and
thereby improving the performance. Although the latency is
also increased, it often has only a minimal effect since the time
for filling and flushing the pipeline is normally negligible.
Theoretically, it is possible to pipeline an acyclic dataflow
graph very deeply and run it at a very high clock speed. In a
practical implementation, however, the system clock cycle is
restricted by the combinational delay of the controller (cf. Sec-
tion III-B4). Since most pipelines are fed by data from external
memory, we also require that an external memory access (as-
suming synchronous RAM) completes in one cycle. Hence, we
choose an appropriate clock-cycle time for each target architecture
(cf. Section VI-C).

References
M. Wolfe, High-Performance Compilers for Parallel Computing. Addison-Wesley.
D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer.
J. R. Hauser and J. Wawrzynek, "Garp: a MIPS processor with a reconfigurable coprocessor," Proc. IEEE Symp. FPGAs for Custom Computing Machines (FCCM), 1997.
H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers. ACM Press.
M. E. Wolf and M. S. Lam, "A loop transformation theory and an algorithm to maximize parallelism," IEEE Trans. Parallel Distrib. Syst., 1991.