Pipeline Vectorization
Markus Weinhardt and Wayne Luk, Member, IEEE
Abstract—This paper presents pipeline vectorization, a method
for synthesizing hardware pipelines based on software vector-
izing compilers. The method improves efficiency and ease of
development of hardware designs, particularly for users with
little electronics design experience. We propose several loop
transformations to customize pipelines to meet hardware resource
constraints while maximizing available parallelism. For runtime
reconfigurable systems, we apply hardware specialization to
increase circuit utilization. Our approach is especially effective
for highly repetitive computations in digital signal processor
(DSP) and multimedia applications. Case studies using field-programmable gate array (FPGA)-based platforms are presented to
demonstrate the benefits of our approach and to evaluate tradeoffs
between alternative implementations. For instance, the loop-tiling
transformation has been found to improve vectorization perfor-
mance 30–40 times over a PC-based software implementation,
depending on whether runtime reconfiguration (RTR) is used.
Index Terms—High-level synthesis, parallelization, pipelining,
reconfigurable computing, vectorization.
I. INTRODUCTION
Many application developers recognize that the key to ef-
fective use of custom computing systems is to maximize
their available parallelism. This task, which has to be achieved
while meeting specific hardware resource constraints, is diffi-
cult to perform by hand.
Vectorizing compilers have proved successful in detecting
and exploiting parallelism for conventional processors with a
fixed architecture. A vector execution unit adapted for digital
signal processor (DSP) and multimedia processing has also been
identified as an important component of novel computer archi-
tectures such as the vector IRAM [1]. This paper presents an ap-
proach for automatically producing optimized pipelined circuits
from a high-level program using techniques derived from soft-
ware vectorizing compilers. The compile-time and runtime re-
configurability of field programmable gate arrays (FPGAs) can
also be efficiently exploited.
Our approach, which we call pipeline vectorization [2]–[4],
essentially involves the synthesis of pipelined processors that
execute inner loops of programs. Data dependence analysis sim-
ilar to software vectorization is performed, which determines if
a pipeline can be generated for a loop. As a result, our approach
generates circuits that exhibit more parallelism than those produced
by many other automatic high-level hardware design tools.
Manuscript received February 29, 1999; revised April 6, 2000. This work
was supported by a European Union training project financed by the Commis-
sion in the TMR program, the U.K. Engineering and Physical Sciences Re-
search Council, Embedded Solutions Ltd., and Xilinx Inc. This paper was rec-
ommended by Associate Editor R. Camposano.
The authors are with the Department of Computing, Imperial Col-
lege, London SW7 2BZ, U.K. (e-mail: m.weinhardt@computer.org;
wl@doc.ic.ac.uk).
Publisher Item Identifier S 0278-0070(01)00943-5.
There are significant differences between pipeline vector-
ization and software vectorization. For instance, our approach
covers a wider range of loops since it does not consider
out-of-order execution. It can be used with a variety of storage
allocation methods. In contrast to software vectorization, we do
not explicitly generate vector instructions. Instead, all instruc-
tions of the loop body are vectorized and chained by pipelining
input data through the entire dataflow graph synthesized from
the loop body. To widen the applicability of our technique, we
devise several loop transformations that adjust the amount of
hardware used in vectorized loops to the available hardware
resources. For reconfigurable implementations, we explore
methods to increase circuit utilization by runtime circuit
specialization and runtime reconfiguration (RTR).
Our approach includes the synthesis of nonpipelined circuitry
for nonvectorizable loops and conditional and sequential pro-
gram code. It can be used in two modes—hardware mode and
codesign mode. In hardware mode, a processor is generated for
the entire program (which includes only synthesizable opera-
tions as defined in Section III-A1), rendering descriptions from
a high-level sequential programming language into an efficient
hardware description language (HDL). Alternatively, in code-
sign mode, parts of the program (such as nonsynthesizable or
highly irregular parts) remain in software to be executed on
a host microprocessor. This mode results in a hardware-soft-
ware codesign system with data and control transfer between
host processor and custom hardware automatically being imple-
mented.
This paper is organized as follows. First, we discuss relevant
previous work. Section III then presents the core pipeline vec-
torization design flow, the main contribution of this paper. Next,
Sections IV and V describe the other important contributions:
optimizing loop transformations and runtime circuit specializa-
tion. Section VI reports on a prototype compiler implementa-
tion and Section VII provides case studies and results evalu-
ating pipeline vectorization. Finally, Section VIII presents con-
clusions and future work.
II. BACKGROUND
Increasingly, system descriptions are written in a high-level
software language [5]. This method simplifies algorithm
development and facilitates experiments to map different
components into hardware. Our approach not only supports this
method, but it goes a step further by providing a framework in
which efficient hardware can be produced automatically from a
software description. It is particularly suited for reconfigurable
systems, which have been shown to be useful in various appli-
cation domains [6]. We attempt to simplify the programming
of these systems, which typically consist of a host processor
and FPGA hardware boards [7], [8]. Programming difficulties
include analyzing the tradeoffs between software and hard-
ware, designing software and hardware parts using different
languages and tools, and debugging the interface between these
parts.
Some tools approach these problems by using a programming
language input to specify both software and hardware in a uni-
form manner. This enables systematic analysis of the tradeoffs
as well as automatic synthesis of the interface between software
and hardware. However, while a sequential program is a nat-
ural specification for the host processor, hardware coprocessors
synthesized from sequential code often fail to exploit the hard-
ware’s parallelism sufficiently. For instance, this is the case in
the PRISM system [9]. The configuration and communication
overhead is often larger than the achieved speedup itself. Better
results can be obtained for systems that integrate a micropro-
cessor core and reconfigurable hardware on a chip. An example
of this approach is the Garp chip [10] and its C-based compiler
[11].
Guccione adopts data-parallel C vector operations on data
streams to describe pipelined circuits [12], [13]. However, this
method requires the user to learn a new programming language
that is only used for the hardware part of an application. The
same is true for Transmogrifier C [14], a research compiler al-
lowing only task-level parallelism.
Hardware programming systems based on communicating
sequential processes such as OCCAM [15] and Handel-C
[16] are suitable for control intensive applications. However,
the user has to specify parallel operations explicitly. As for
data-parallel C, the software parts of an application must be
written in a different language and interfaced manually to the
hardware parts.
In the application-specific integrated circuit domain,
high-level synthesis systems generate register-transfer struc-
tures from behavioral (algorithmic) specifications [17]–[19].
These methods employ sophisticated scheduling, resource
allocation, and binding techniques for general processor archi-
tectures. They perform an exhaustive design-space exploration,
which makes the tools very slow, especially if compared to a
software compiler. A commercial high-level synthesis system
is Synopsys’ Behavioral Compiler [20], which can handle array
data and generate memory accesses. However, to pipeline loops,
the user has to analyze loop-carried dependences manually (as
defined in Section III-A3) and specify a safe initiation interval,
which preserves the dependences and the original order of
memory reads and writes. Another system, C2Verilog [21],
uses ANSI C to produce a Verilog register-transfer structure
using high-level synthesis techniques. However, it does not
perform loop pipelining.
Several research projects address the automatic synthesis of
pipelined circuits from program loops. The closest to our ap-
proach is the NAPA C compiler [22]. However, that system tar-
gets specifically the NAPA processor [23] and considers only in-
nermost loops. No automatic vectorization or optimizing trans-
formations similar to ours are reported. The scheduling of in-
structions is performed on an atomic basis and, thus, is less
flexible than hardware pipelining, which can also use internally
pipelined operators. Finally, a loop parallelization method is
also included in the Alpha System [24]. However, it is restricted
to producing linear systolic arrays whereas our techniques can
vectorize more general programs.
Fig. 1. Core design flow.
III. CORE DESIGN FLOW
We first present the core pipeline-vectorization design flow,
as shown in Fig. 1. Most parts of the figure are relevant for both
the codesign and the hardware mode. Only the software branch
on the bottom left side and the hardware–software partitioning
phase do not exist for the hardware mode. The core design flow
consists of three major phases: preprocessing, hardware syn-
thesis, and partitioning and integration. They are described in
the following sections. The extension of the core design flow to
include loop transformations will be covered in Section IV.
A. Preprocessing
Preprocessing consists of four steps: hardware candidate se-
lection (Section III-A1), loop normalization (Section III-A2),
dependence analysis (Section III-A3), and removing vector de-
pendences (Section III-A4). These steps are necessary to per-
form hardware synthesis effectively.
1) Hardware Candidate Selection: This step selects pro-
gram parts suitable for hardware synthesis. Regular iterative
computations, which perform identical operations on a large
set of data, are likely to achieve high performance in hardware.
Hence, loops are natural candidates for hardware processors.
We attempt to vectorize innermost FOR loops and generate
pipelines for them. FOR loops have an induction (index)
variable necessary for normalization and vectorization and
have predetermined loop counts. Thus, they can be handled
by efficient control circuitry. It is possible to transform some
WHILE loops to FOR loops by induction variable detection
[25]. Other WHILE loops, outer loops, and other program
constructs are nonpipeline candidates and, therefore, are
unlikely to result in fast efficient hardware. They should only
be considered in combination with pipeline candidates or left
for software execution. However, our procedure will consider
all loops since the loop transformations presented in Section IV
rearrange loop nests.
There are some additional restrictions for the candidates: they
must not contain nonsynthesizable operations such as recursive
function calls, external operating system calls, or library calls.¹
In hardware mode, programs containing any nonsynthesiz-
able operations are not considered legal input. Thus, the entire
legal input program is a candidate and candidate selection only
distinguishes between pipeline and nonpipeline candidates.
The example edge detector program in Fig. 2 will be used to
synthesize a pipeline circuit. Its inner loop is a pipeline candi-
date and its outer loop (by definition) a nonpipeline candidate.
We use C language syntax here because our compiler prototype
(Section VI) uses a C front end. Our approach is valid for any
sequential imperative programming language.
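Fig. 2 itself is not reproduced here; the following C sketch illustrates the kind of loop nest the discussion assumes. The array names, image size, threshold, and the particular gradient computed are assumptions for illustration, not the paper's exact code.

    #define N 256                            /* image size, assumed          */
    #define THRESH 32                        /* threshold, assumed           */

    void edge_detect(unsigned char a[N][N], unsigned char b[N][N])
    {
        int x, y, d;
        for (y = 1; y < N - 1; y++) {        /* outer loop: nonpipeline candidate  */
            for (x = 1; x < N - 1; x++) {    /* inner FOR loop: pipeline candidate */
                /* gradient built from index-shifted reads of three rows of a     */
                d = (a[y-1][x+1] + 2*a[y][x+1] + a[y+1][x+1])
                  - (a[y-1][x-1] + 2*a[y][x-1] + a[y+1][x-1]);
                if (d < 0)
                    d = -d;                  /* conditional reassignment           */
                b[y][x] = (d > THRESH) ? 255 : 0;   /* output array only written   */
            }
        }
    }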
2) Loop Normalization: For vectorization, we have to nor-
malize the pipeline candidate loops by the following transfor-
mations.
• Remove all additional induction variables and normalize
the loop's lower bound to zero and its step to one (induc-
tion variable substitution [26]).
• Normalize the index expressions to linear expressions of
the induction variable (subscript normalization [26]). For
induction variable i, the resulting expression has the form
a*i + c; the constant a is called the access's stride.
If one or more index expressions cannot be normalized, the
loop is only a nonpipeline candidate. In particular, indirect array
accesses, where array elements are accessed by intermediate re-
sults, prevent vectorization, as in software vectorizing compilers
[26].
Normalizing the inner candidate loop in Fig. 2 creates a new
loop header with a lower bound of zero and a step of one and
substitutes the original index variable accordingly in the loop body.
¹Nonrecursive function calls can be inlined. Therefore, we assume, without
loss of generality, that no function calls exist in the candidates.
Fig. 2. Edge detector program.
3) Dependence Analysis: The next processing stage ana-
lyzes pipeline candidate loops for dependences. There are three
general types of dependences [27]. True or flow dependence
occurs when a variable is assigned or defined in one statement
and used in a subsequently executed statement. Antidependence
occurs when a variable is used in one statement and reassigned
in a subsequently executed statement. Output dependence oc-
curs when a variable is assigned in one statement and reassigned
in a subsequently executed statement. General dependences
are either loop independent or loop carried. The former occurs
between statements in the same loop iteration and the latter
between statements in different iterations. For loop-carried de-
pendences, the dependence distance is the number of iterations
between the statements that cause the dependence. In a loop
nest, we determine the loop-carried dependences at each level of
the hierarchy since only these affect the loop-level parallelism.
Since pipeline execution overlaps the loop iterations but
maintains their order, memory writes are never out of order.
Hence, we only have to consider true dependences but not
anti or output dependences. Therefore, pipeline vectorization
applies to more loops than software vectorization. We utilize
standard dependence analysis methods [27] to detect these
dependences. Unfortunately, these methods are not completely
accurate. In some cases, they cannot determine the absence of
dependences, thus failing to detect potential parallelism.
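The following loops, constructed here for illustration rather than taken from the paper, show the distinction between loop-independent and loop-carried true dependences.

    int i, t, n = 16, a[16] = {0}, b[16];

    /* Loop-independent true dependence: t is defined and used in the same
       iteration, so it does not restrict loop-level parallelism.           */
    for (i = 0; i < n; i++) {
        t = a[i] * 2;
        b[i] = t + 1;
    }

    /* Loop-carried true dependence with distance 2: a[i] is written in
       iteration i and read again two iterations later.                     */
    for (i = 2; i < n; i++)
        a[i] = a[i - 2] + 1;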
Next, we check if the detected true loop-carried dependences
occur in all loop iterations with the same dependence distance.
We call these dependences regular. All dependences stemming
from scalar variables and from array accesses with the same
stride are regular [3]. It is possible to synthesize hardware
obeying these dependences, but the resulting circuits may
contain feedback cycles. In contrast to software vectorization,
regular dependences do not prevent pipeline synthesis, although
they can reduce parallelism because the feedback paths restrict
the speedup achieved by pipelining in a later processing stage.
Irregular dependences can be handled provided that the orig-
inal order of read and write accesses of the arrays involved
are maintained. However, this usually requires many sequential
memory accesses and is only feasible with very fast memories
such as on-chip memories.
In the program in Fig. 2, there are no dependences from array
accesses: the input array is only read and the output array is only written.
Fig. 3. Fibonacci numbers program.
The following example shows that our approach can deal with
loops not usually vectorized by software vectorizing compilers.
Consider a loop whose body reads an array element that a later
iteration overwrites (a sketch is given below). Such a loop
has a loop-carried dependence stemming from the assignment to
the array, but no value written to memory is read in a subsequent
loop iteration. Only out-of-order execution of the assignments
would lead to a real dependence. Hence, it is an antidependence
and we can disregard it and vectorize the loop.
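The original loop listing is not reproduced above; a minimal loop with the property just described (an illustrative stand-in, not the paper's code) is:

    int i, n = 16, a[16] = {0};

    /* a[i+1] is read in iteration i and overwritten in iteration i+1: only a
       loop-carried antidependence, so the loop can still be vectorized.    */
    for (i = 0; i < n - 1; i++)
        a[i] = a[i + 1] / 2;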
In contrast, the program computing the Fibonacci num-
bers in Fig. 3 contains true loop-carried dependences. The as-
signment to each array element depends on the two previous assignments.
Since both dependences are regular (dependence distances 1 and
2), we can generate a pipeline circuit for this program. But the
dependences have to be handled as described in the next para-
graph.
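Fig. 3 is not reproduced here; the loop it describes has the following shape (array name and bounds are assumptions):

    int i, n = 16, fib[16];

    fib[0] = 0;
    fib[1] = 1;
    for (i = 2; i < n; i++)
        fib[i] = fib[i - 1] + fib[i - 2];   /* true loop-carried dependences,
                                               distances 1 and 2            */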
4) Removing Vector Dependences: The hardware synthesis
technique presented in the next section cannot handle true loop-
carried dependences stemming from array (vector) accesses, as
shown in Fig. 3. Therefore, pipeline candidate loops have to be
transformed to remove them in the following way. First, vector
accesses depending on earlier iterations are substituted by new
scalar variables in the candidate loop body. Next, at the end of
the loop body, instructions are inserted that assign these vari-
ables to the values they depend on in the original program. For
dependences with dependence distances larger than one, addi-
tional variables and assignments are inserted. Finally, assign-
ments to initialize the variables are added before the loop. For
the Fibonacci number program, the two dependent array accesses are
substituted. Fig. 4 shows the resulting transformed program. It
only contains dependences stemming from the new scalar vari-
ables. In every iteration, their values from the
previous iteration are read. These dependences will be handled
by the hardware synthesis phase.
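A sketch of the transformation applied to the Fibonacci loop follows (Fig. 4 is not reproduced here; the scalar names fib1 and fib2 are illustrative):

    int i, n = 16, fib[16], fib1, fib2;

    fib[0] = 0;
    fib[1] = 1;
    fib1 = fib[1];                 /* holds the value of fib[i-1]            */
    fib2 = fib[0];                 /* holds the value of fib[i-2]            */
    for (i = 2; i < n; i++) {
        fib[i] = fib1 + fib2;      /* no array element written here is read
                                      in a later iteration                   */
        fib2 = fib1;               /* assignments inserted at the end of the
                                      loop body update the new scalars       */
        fib1 = fib[i];
    }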
B. Hardware Synthesis
The hardware synthesis phase contains three steps:
dataflow graph generation (Section III-B1) and extension
(Section III-B2), pipelining (Section III-B3), and controller
synthesis (Section III-B4).
For those candidate loops that pass the dependence test, in-
dependent pipeline circuits are synthesized. They are later inte-
grated in largerdesigns or instantiated in separate configurations
(see Section III-C).
Various storage allocation schemes can be used. For instance,
scalar variables can be held in hardware registers and arrays
can be stored on off-chip memory. On some FPGA families,
small arrays of data can also be stored in very fast on-chip
memory.
Fig. 4. Transformed Fibonacci program.
Array elements are fed to the pipeline as continuous
data streams through vector inputs and output streams are
written back to local memory through vector outputs. In this
way, one loop iteration is executed every pipeline cycle. All
element addresses for the linear array accesses can be computed
in parallel with the loop computations. Thus, they do not slow
down the application. However, for arbitrary accesses, address
computations depend on loop computation results and must
be scheduled accordingly, thereby slowing down the circuits.
Pointer accesses are indirect accesses to the entire host memory
space and are only possible in tightly coupled architectures
with direct memory access (DMA).
We first generate an acyclic dataflow graph for the loop body.
Next, regular loop-carried dependences are resolved, possibly
introducing feedback cycles. Finally, the circuit is pipelined and
a controller is synthesized.
1) Dataflow Graph Generation: We generate an acyclic
combinational dataflow graph for the loop body by analyzing
its internal dependences and allocating a new operator for each
operation in the program’s expressions. This simple “direct
compilation” shares no resources within the loop body but
later allows overlapping loop iterations by pipelining. We treat
array accesses and scalar variables uniformly. Since the loop
body can only contain linear code and conditional statements,
we can use a control flow/data flow transformation [19] to
generate one combined dataflow graph for the entire loop body.
It computes all program branches in parallel and uses multi-
plexers to select the correct values of conditionally assigned
variables. Resources in these mutually exclusive paths can
be shared without interfering with pipelining (cf. the PISYN
system [18]). For instance, an adder and a subtractor with the
same inputs can be replaced by a combined adder/subtractor if
their outputs are not required concurrently. Our dataflow graph
generation is similar to the method used in the Transmogrifier C
compiler [14], but our method avoids unnecessary memory
accesses: when an input value remains unchanged in one branch
of a conditional statement, we do not read the old value in and
write the unchanged value back. Instead, write-enable signals
are generated for the RAM accesses to write values only if the
appropriate conditions are met.
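As a rough illustration of this transformation (variable names and the concrete conditions are assumptions, and the real target is a dataflow graph rather than C code):

    int a = 3, b = 4, c = 1, t = 5, i = 0, x, we, out[16];

    /* Source form of a loop body with conditional assignments:             */
    x = a + b;
    if (c > 0)
        x = a - b;                 /* scalar conditionally reassigned        */
    if (x > t)
        out[i] = x;                /* array element conditionally written    */

    /* Corresponding dataflow-graph view for one iteration: both branches
       are evaluated, a multiplexer selects the scalar value, and a
       write-enable gates the RAM write (no read-modify-write of out[i]).   */
    x  = (c > 0) ? (a - b) : (a + b);
    we = (x > t);                  /* out[i] is written only when we is true */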
To further reduce redundant memory accesses, index-shifted
accesses to the same array are combined and realized by shift
registers [28]. Using these delayed values of the input stream
avoids accessing the same value in memory more than once and
reduces the number of required vector inputs. This reduction is
crucial since all vector input streams must be read and all output
streams written once for every loop iteration. Thus, the pipeline
throughput directly depends on the number of vector inputs and
outputs.
Fig. 5. Edge detector dataflow graph.
Fig. 6. Incomplete Fibonacci dataflow graph.
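For example (an illustrative loop, not taken from the paper), the three index-shifted reads of array a below require only a single vector input; the two older values are kept in delay registers inside the pipeline:

    int x, n = 16, a[16] = {0}, b[16];

    for (x = 1; x < n - 1; x++)
        b[x] = a[x - 1] + a[x] + a[x + 1];

    /* Conceptually, per pipeline cycle:
         a_m1 = a_0;  a_0 = a_p1;  a_p1 = <next element of the a stream>;
       so each element of a is fetched from memory only once.               */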
Fig. 5 shows the dataflow graph for the edge detector program
from Fig. 2. There are three vector inputs for the values read
from the input array and a vector output for the result array.
The shift registers are represented by delay elements. Note that
the conditional statement of the loop body is implemented by a
multiplexer, which either selects the conditionally assigned value
or the unchanged value generated in the previous statement.
The dataflow graph generated for the transformed Fibonacci
program in Fig. 4 is shown in Fig. 6. Obviously, this circuit does
not produce correct outputs since the loop-carried dependences
stemming from the new scalar variables are not accounted for.
The registers are only initialized but never updated. The next
section shows how the circuit can be altered to produce correct
output.
2) Dependences and Feedback Cycles: If a loop has reg-
ular loop-carried dependences, the dataflow graph must be ex-
tended to use the correct values upon which a computation de-
pends. The transformation in Section III-A4 substituted depen-
dent array accesses by scalar variables. Thus, all loop-carried
dependences remaining in a pipeline candidate stem from scalar
variables. Such dependences are treated in the following way.
Since one loop iteration is executed every pipeline cycle, the
input register of such a variable (which is read and written in
the loop) must always contain the value computed in the pre-
vious pipeline cycle. To achieve this, a multiplexer is added at
the register’s input. It selects the input value during initializa-
tion and the feedback value during normal operation depending
on an external control signal provided by the environment.
Fig. 7 shows the result for the Fibonacci program in Fig. 4.
For the two feedback variables, multiplexers have been inserted between
the inputs and the registers storing the variables. During initial-
ization, an external control signal selects the input values
that are used for the first loop iteration. All subsequent iterations
select the other inputs of the multiplexers, which are connected to
the output values of the variables in the previous iteration. In
this example, a feedback cycle from the output of the adder to
the register holding the most recently computed value is created
(bold lines in Fig. 7). However, not all dependences result in
feedback cycles.
Fig. 7. Complete Fibonacci dataflow graph.
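A cycle-level C model of one such feedback register and its input multiplexer might look as follows; the signal names, and the use of the Fibonacci values 1 and 0 as initial inputs, are assumptions for illustration.

    int init = 1;                         /* external control signal: 1 only
                                             during initialization           */
    int fib1 = 0, fib2 = 0, fib1_next, fib2_next, adder_out;

    adder_out = fib1 + fib2;              /* combinational adder             */
    fib1_next = init ? 1 : adder_out;     /* mux in front of the fib1 register */
    fib2_next = init ? 0 : fib1;          /* mux in front of the fib2 register */
    fib1 = fib1_next;                     /* both registers update together  */
    fib2 = fib2_next;                     /*   on the pipeline clock edge    */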
This example shows that the feedback operators synthe-
sized by pipeline vectorization are more general than those
available in single-instruction multiple-data (SIMD) parallel
programming languages [27]. Such languages feature special
REDUCE or SCAN operations but they are limited to single
operators with direct output feedback to one input (for instance,
ADD-SCAN for an accumulator). The same is true for software
vectorizing compilers, which extract these operations. Arbi-
trarily customized feedback units, such as the one shown in Fig. 7,
are only possible when a pipeline is implemented in hardware.
3) Pipelining and Timing: So far, we have generated a
dataflow graph that computes one loop iteration once all input
registers are set. It may not be very efficient because the
combinational delays of chained operators may accumulate to
a long critical path. The critical path delay can be reduced by
pipelining, effectively overlapping different loop iterations, and
thereby improving the performance. Although the latency is
also increased, it often has only a minimal effect since the time
for filling and flushing the pipeline is normally negligible.
Theoretically, it is possible to pipeline an acyclic dataflow
graph very deeply and run it at a very high clock speed. In a
practical implementation, however, the system clock cycle is
restricted by the combinational delay of the controller (cf. Sec-
tion III-B4). Since most pipelines are fed by data from external
memory, we also require that an external memory access (as-
suming synchronous RAM) completes in one cycle. Hence, we
choose an appropriate clock-cycle time for each target architecture
(cf. Section VI-C).

References
M. Wolfe, High-Performance Compilers for Parallel Computing. Addison-Wesley.
D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer.
J. R. Hauser and J. Wawrzynek, "Garp: a MIPS processor with a reconfigurable coprocessor," Proc. IEEE Symp. FPGAs for Custom Computing Machines (FCCM), 1997.
H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers. ACM Press.
M. E. Wolf and M. S. Lam, "A loop transformation theory and an algorithm to maximize parallelism," IEEE Trans. Parallel Distrib. Syst., 1991.