Pipeline vectorization
Frequently Asked Questions (14)
Q2. What future work does the paper propose?
Future work will include combining the fine-grain vectorization presented in this paper with coarse-grain task-level parallelism. Strategies to transform entire loop nests will also be studied, and automatic partitioning will be included in the authors' compiler prototype. Further extensions will allow users to include manually designed hardware blocks and to synthesize digit-serial designs.
Q3. What is the purpose of loop unrolling?
In software compilers, loop unrolling is an important technique to increase basic block sizes, extending the scope of local optimizations.
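As a minimal illustration of the transformation itself (not code from the paper), the sketch below unrolls a simple multiply-add loop by a factor of four. The function names and the unroll factor are chosen only for this example.

```python
def saxpy_rolled(a, x, y):
    """Reference loop: one multiply-add per iteration."""
    out = [0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_unrolled_by_4(a, x, y):
    """Same computation with the loop body replicated four times."""
    n = len(x)
    out = [0] * n
    main = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        # Four independent statements now sit in a single basic block.
        out[i] = a * x[i] + y[i]
        out[i + 1] = a * x[i + 1] + y[i + 1]
        out[i + 2] = a * x[i + 2] + y[i + 2]
        out[i + 3] = a * x[i + 3] + y[i + 3]
    for i in range(main, n):  # remainder loop for leftover iterations
        out[i] = a * x[i] + y[i]
    return out
```

The unrolled body gives an optimizer four statements to schedule together instead of one, which is the larger basic block the answer refers to.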
Q4. How many cycles are needed to perform all vector accesses to external memory?
The pipeline cycle must contain at least as many clock cycles as are needed to perform all vector accesses to external memory required for one loop iteration.
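A rough sketch of this lower bound follows. It assumes, purely for illustration, that each memory bank serves one access per clock cycle and that accesses can be spread evenly over the banks. Neither assumption is stated in this excerpt, and the function name is hypothetical.

```python
import math


def min_pipeline_cycle_length(accesses_per_iteration, banks=1):
    """Lower bound on clock cycles per pipeline cycle due to memory traffic.

    Assumption (not from the source): one access per bank per clock cycle.
    """
    if banks < 1:
        raise ValueError("need at least one memory bank")
    # A pipeline cycle always takes at least one clock cycle.
    return max(1, math.ceil(accesses_per_iteration / banks))
```

For example, three vector accesses per iteration over a single bank force at least three clock cycles per pipeline cycle under these assumptions, while two banks would lower the bound to two.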
Q5. What is the effect of a regular loop on the dataflow graph?
If a loop has regular loop-carried dependences, the dataflow graph must be extended to use the correct values upon which a computation depends.
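One way to picture such an extension is a chain of feedback registers that carries a result forward to the iteration that needs it. The sketch below is a software model of that idea for the recurrence y[i] = y[i - distance] + x[i]. The recurrence and the names are assumptions made for illustration, not the paper's construction.

```python
from collections import deque


def run_recurrence(x, distance, initial):
    """Compute y[i] = y[i - distance] + x[i] with a delay line of registers."""
    if len(initial) != distance:
        raise ValueError("need one initial value per feedback register")
    # regs[0] is the oldest value, i.e. the result from `distance` iterations ago.
    regs = deque(initial, maxlen=distance)
    y = []
    for value in x:
        result = regs[0] + value
        regs.append(result)  # shifts the delay line, dropping the oldest value
        y.append(result)
    return y
```

With `distance=1` this degenerates to a running sum; larger distances need correspondingly more registers in the feedback path.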
Q6. What is the effect of hardware sharing on the resulting circuit?
Hardware sharing may increase the amount of routing to the shared resource, increasing both delay and size of the resulting circuit.
Q7. How do you store the input and output values in the pipeline?
By storing them in registers, all input and output values are presented to the pipeline synchronously at the beginning of a pipeline cycle.
Q8. What is the dependence distance for loop-carried dependences?
For loop-carried dependences, the dependence distance is the number of iterations between the statements that cause the dependence.
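For a concrete case, consider a loop whose body writes `a[i + write_offset]` and reads `a[i + read_offset]`. The sketch below computes the distance of the resulting flow dependence and checks it against a brute-force search. The constant-offset access pattern and both function names are assumptions for this example.

```python
def dependence_distance(write_offset, read_offset):
    """Iterations between writing a[i + write_offset] and reading it back.

    Returns None when the read does not happen in a later iteration.
    """
    distance = write_offset - read_offset
    return distance if distance > 0 else None


def brute_force_distance(write_offset, read_offset, iterations=32):
    """Find the first later iteration that reads what iteration 0 wrote."""
    written = 0 + write_offset
    for j in range(1, iterations):
        if j + read_offset == written:
            return j
    return None
```

For the statement `a[i] = a[i - 2] + b[i]`, the write offset is 0 and the read offset is -2, so the value produced in one iteration is consumed two iterations later.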
Q9. How many smaller RTR processing elements can be used?
The XC6216 is large enough to implement the controller and 54 CTR processing elements or 90 smaller specialized RTR processing elements.
Q10. What is the way to unroll a loop?
In these cases, it is very beneficial to partially unroll a loop, thereby adjusting the circuit size to the given hardware resources, and vectorize the next outer loop.
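The excerpt does not say how the unroll factor is chosen. As a rough illustration only, the hypothetical helper below picks the largest factor that divides the trip count evenly and whose replicated loop body still fits a given cell budget. The cost model (cells scale linearly with the factor) is an assumption.

```python
def choose_unroll_factor(trip_count, cells_per_copy, cells_available):
    """Largest divisor of trip_count whose unrolled body fits the budget.

    Assumes circuit size grows linearly with the unroll factor.
    Falls back to 1 (no unrolling) when nothing larger fits.
    """
    best = 1
    for factor in range(1, trip_count + 1):
        divides = trip_count % factor == 0
        fits = factor * cells_per_copy <= cells_available
        if divides and fits:
            best = factor
    return best
```

Restricting the choice to divisors of the trip count avoids a remainder loop, which keeps the example simple; a real tool might relax that.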
Q11. What is the simplest way to generate a one-hot controller?
For FPGA implementations where there are abundant latches, the authors generate a one-hot controller triggered by an external START signal.
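The sketch below models a one-hot state register in software: exactly one bit is set, and each clock moves that bit to the next position. The idle state at index 0, the wrap-around to idle, and the function name are assumptions for illustration; the authors' actual controller is not shown in this excerpt.

```python
def one_hot_step(state, start):
    """Advance a one-hot state register by one clock.

    state[0] is the idle state; the register needs at least two states.
    """
    n = len(state)
    nxt = [0] * n
    for i, bit in enumerate(state):
        if not bit:
            continue
        if i == 0:
            # Idle: wait here until START is asserted.
            nxt[1 if start else 0] = 1
        else:
            # Active states advance unconditionally; the last returns to idle.
            nxt[(i + 1) % n] = 1
    return nxt
```

Starting from `[1, 0, 0]`, the register stays idle while `start` is false, then walks through `[0, 1, 0]` and `[0, 0, 1]` before returning to idle. One flip-flop per state keeps the next-state logic trivial, which suits devices with plentiful latches.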
Q12. What is the advantage of loop merging?
The last line in Table II shows that the advantage of loop merging is limited for one memory bank since too many memory accesses have to be performed sequentially in one cycle.
Q13. What is the process of initializing the pipeline loop’s index variable?
The controller initializes the pipeline loop’s index variable and then repeatedly loops through the clock cycles that complete a pipeline cycle.
Q14. How long does the assignment take to complete?
If the resulting delay of an assignment becomes larger than the clock-cycle time of the pipelined circuit, the assignment is performed in several cycles.
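The excerpt does not state how many cycles are used. A natural reading, taken here as an assumption, is the smallest whole number of clock cycles that covers the delay. The function name and the nanosecond units are hypothetical.

```python
import math


def cycles_for_assignment(delay_ns, clock_ns):
    """Clock cycles needed so that cycles * clock_ns >= delay_ns.

    Assumption: the assignment takes the smallest sufficient whole number
    of cycles, and at least one.
    """
    if clock_ns <= 0:
        raise ValueError("clock period must be positive")
    return max(1, math.ceil(delay_ns / clock_ns))
```

Under this reading, an assignment with a 25 ns delay on a 10 ns clock would span three cycles, while one that fits within a single period stays at one cycle.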