Topic

Loop unrolling

About: Loop unrolling is a loop transformation that replicates a loop's body so that each trip through the rewritten loop covers several original iterations; it is also known as loop unwinding. Over its lifetime, 777 publications have appeared within this topic, receiving 13,486 citations.
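A minimal C sketch of the transformation (the reduction kernel and the factor of 4 are illustrative choices, not drawn from any paper below):

```c
#include <stddef.h>

/* Original loop: one element per iteration. */
float sum_rolled(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by a factor of 4: four elements per iteration,
 * plus a scalar epilogue for the n % 4 leftover elements. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++)          /* epilogue */
        s += a[i];
    return s;
}
```

Besides cutting branch and index-update overhead, using four independent partial sums rather than one accumulator breaks the floating-point add dependence chain, which is often where most of the measured speedup comes from.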


Papers
Proceedings ArticleDOI
Michael Wolfe
29 Jun 2020
TL;DR: This work focuses on the importance of compilers in supercomputing, comparing and contrasting the advantages and impacts of compiler solutions to the "Performance + Portability + Productivity" problem with language and runtime solutions.
Abstract: Between a problem statement and its solution as a computer simulation are several steps: choosing a method, writing a program, compiling to machine code, making runtime decisions, and executing on hardware. Here we will look at the middle three decision points. What decisions should be and must be left to the programmer? What decisions should be and must be relegated to a compiler? What decisions should be and must be left until runtime? Given my background, I will focus a great deal on the importance of compilers in supercomputing, and compare and contrast the advantages and impacts of compiler solutions to the "Performance + Portability + Productivity" problem with language and runtime solutions.

729 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: This work performs auto-tuning over a large optimization space for GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.
Abstract: Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMPP, a high-level directive-based language and source-to-source compiler that can generate CUDA / OpenCL code. However, programming with high-level languages may mean a loss of performance compared to using low-level languages. Our work shows that it is possible to improve the performance of a high-level language by using auto-tuning. We perform auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and show results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision. The results show that our auto-tuned HMPP-generated implementations are significantly faster than the default HMPP implementation and can meet or exceed the performance of manually coded CUDA / OpenCL implementations.
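As an illustration of the kind of search space such a tuner explores, here is a hedged C sketch of one point in that space: a loop variant parameterized by a tile size and an unroll factor. TILE and UNROLL are hypothetical compile-time knobs standing in for HMPP directive parameters, not actual HMPP syntax:

```c
/* One candidate in the tuning space: a tiled, unrolled loop variant.
 * An auto-tuner compiles this once per (TILE, UNROLL) pair, e.g.
 *     cc -O2 -DTILE=256 -DUNROLL=4 ...
 * times each variant on the target, and keeps the fastest. */
#ifndef TILE
#define TILE 256
#endif
#ifndef UNROLL
#define UNROLL 4
#endif

void saxpy_variant(float *y, const float *x, float a, int n) {
    for (int t = 0; t < n; t += TILE) {              /* tiling */
        int end = t + TILE < n ? t + TILE : n;
        int i = t;
        for (; i + UNROLL <= end; i += UNROLL)       /* unrolling */
            for (int u = 0; u < UNROLL; u++)         /* constant-bound loop:
                                                        compilers flatten it */
                y[i + u] += a * x[i + u];
        for (; i < end; i++)                         /* epilogue */
            y[i] += a * x[i];
    }
}
```

The point of exposing the knobs as compile-time constants is that each variant is fully specialized, so the measured time reflects the configuration rather than runtime branching on parameters.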

402 citations

Proceedings ArticleDOI
22 Feb 2017
TL;DR: This work systematically explores the trade-offs in hardware cost by searching the design-variable configurations, and proposes a specific dataflow for hardware CNN acceleration that minimizes memory access and data movement while maximizing resource utilization to achieve high performance.
Abstract: As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply-and-accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, a >3.2× enhancement over state-of-the-art FPGA implementations of the VGG model.
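For orientation, the four loop levels are, roughly: the kernel window, the input feature maps, the spatial positions within a feature map, and the output feature maps. A hedged C sketch of that loop nest follows; the naming, memory layout, and stride-1/no-padding assumptions are illustrative, not the paper's notation. Unrolling a given level fixes how many MAC units run in parallel in hardware, while tiling it fixes how much data the on-chip buffers must hold:

```c
/* Convolution layer over flattened arrays:
 *   out: M x R x C          (output feature maps)
 *   in:  N x (R+K-1) x (C+K-1)   (input feature maps, stride 1, no padding)
 *   wgt: M x N x K x K      (kernel weights)
 * The four loop levels the design space is built from are marked. */
void conv_layer(float *out, const float *in, const float *wgt,
                int M, int N, int R, int C, int K) {
    int H = R + K - 1, W = C + K - 1;    /* input spatial dims */
    for (int m = 0; m < M; m++)          /* output feature maps  */
      for (int r = 0; r < R; r++)        /* spatial positions    */
        for (int c = 0; c < C; c++) {
          float acc = 0.0f;
          for (int n = 0; n < N; n++)            /* input feature maps */
            for (int kr = 0; kr < K; kr++)       /* kernel window      */
              for (int kc = 0; kc < K; kc++)
                acc += in[(n * H + (r + kr)) * W + (c + kc)]
                     * wgt[((m * N + n) * K + kr) * K + kc];
          out[(m * R + r) * C + c] = acc;
        }
}
```

Interchanging, tiling, and unrolling these six loop indices (grouped into the four levels above) is exactly the design space the paper searches before fixing the accelerator's dataflow.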

348 citations

Proceedings ArticleDOI
20 Apr 2004
TL;DR: This paper presents the architecture of a fully pipelined AES encryption processor on a single chip FPGA by using loop unrolling and inner-round and outer-round pipelining techniques, and achieves a maximum throughput of 21.54 Gbits/s.
Abstract: This paper presents the architecture of a fully pipelined AES encryption processor on a single-chip FPGA. By using loop unrolling together with inner-round and outer-round pipelining, a maximum throughput of 21.54 Gbits/s is achieved. A fast, area-efficient composite-field implementation of the byte-substitution phase is designed using an optimum number of pipeline stages for the FPGA. The 21.54 Gbits/s throughput is achieved using 84 block RAMs and 5177 slices of a Virtex-II Pro FPGA, with a latency of 31 cycles and a throughput-per-area rate of 4.2 Mbps/slice.
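Here the loop being unrolled is the AES round loop: instead of iterating one round circuit ten times, every round gets its own copy of the logic, which in hardware becomes its own pipeline stage so that many blocks are in flight at once. A hedged C sketch of the control structure only; aes_round is a placeholder that performs just the round-key addition, with the real SubBytes/ShiftRows/MixColumns steps omitted for brevity:

```c
#include <stdint.h>

/* Placeholder round function (AddRoundKey only; a real AES round also
 * applies SubBytes, ShiftRows, and MixColumns to the 16-byte state). */
static void aes_round(uint8_t state[16], const uint8_t roundkey[16]) {
    for (int i = 0; i < 16; i++)
        state[i] ^= roundkey[i];
}

/* Rolled form: one round circuit reused ten times. */
void aes128_rolled(uint8_t s[16], const uint8_t rk[11][16]) {
    for (int i = 0; i < 16; i++) s[i] ^= rk[0][i];   /* initial key add */
    for (int r = 1; r <= 10; r++)
        aes_round(s, rk[r]);
}

/* Fully unrolled form: every round is a distinct copy of the logic.
 * In hardware, each call maps to its own stage, and registers inserted
 * between (and inside) rounds let one block enter every cycle. */
void aes128_unrolled(uint8_t s[16], const uint8_t rk[11][16]) {
    for (int i = 0; i < 16; i++) s[i] ^= rk[0][i];
    aes_round(s, rk[1]);  aes_round(s, rk[2]);  aes_round(s, rk[3]);
    aes_round(s, rk[4]);  aes_round(s, rk[5]);  aes_round(s, rk[6]);
    aes_round(s, rk[7]);  aes_round(s, rk[8]);  aes_round(s, rk[9]);
    aes_round(s, rk[10]);
}
```

In software the unrolled form mainly saves loop overhead; in the paper's hardware setting it is what makes the inner-round and outer-round pipelining, and hence the quoted throughput, possible.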

260 citations

Journal ArticleDOI
TL;DR: The authors study run-time methods to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate and present performance results from experiments conducted on the Encore Multimax, illustrating that run-time reordering of loop indexes can have a significant impact on performance.
Abstract: The authors study run-time methods to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run-time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. The authors utilize symbolic transformation rules to produce: inspector procedures that perform execution time preprocessing, and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. The authors present performance results from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indexes can have a significant impact on performance.
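The inspector/executor pattern works as follows: the inspector walks the loop's (possibly indirection-based) memory references at run time and assigns each iteration to a wavefront; the executor then runs the wavefronts in order, with all iterations inside one wavefront free to run in parallel. A hedged C sketch for a loop of the form a[w[i]] += a[r[i]] (the index arrays and helper names are illustrative, not the paper's code):

```c
#include <stdlib.h>

/* Inspector: assign each iteration of
 *     for (i = 0; i < n; i++) a[w[i]] += a[r[i]];
 * to a wavefront so that no two iterations in the same wavefront touch
 * a common element of a (length m). Tracking the last wavefront that
 * touched each element covers flow, anti-, and output dependences,
 * though it is conservative: it also serializes read-read sharing.
 * Returns the number of wavefronts. */
int inspector(const int *w, const int *r, int n, int m, int *wave) {
    int *last = calloc(m, sizeof(int));  /* last wavefront per element */
    int nwaves = 0;
    for (int i = 0; i < n; i++) {
        int lw = last[w[i]] > last[r[i]] ? last[w[i]] : last[r[i]];
        wave[i] = lw + 1;
        last[w[i]] = wave[i];
        last[r[i]] = wave[i];
        if (wave[i] > nwaves) nwaves = wave[i];
    }
    free(last);
    return nwaves;
}

/* Executor: run wavefronts in order; iterations within one wavefront
 * are independent, so the inner loop could carry an OpenMP
 * "parallel for" pragma on a shared-memory machine like the Multimax. */
void executor(double *a, const int *w, const int *r, int n,
              const int *wave, int nwaves) {
    for (int v = 1; v <= nwaves; v++)
        for (int i = 0; i < n; i++)      /* parallelizable */
            if (wave[i] == v)
                a[w[i]] += a[r[i]];
}
```

A real executor would bucket iteration indices by wavefront once instead of rescanning all n iterations per wavefront; the scan just keeps the sketch short.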

256 citations


Network Information
Related Topics (5)

Compiler: 26.3K papers, 578.5K citations, 85% related
Cache: 59.1K papers, 976.6K citations, 83% related
Parallel algorithm: 23.6K papers, 452.6K citations, 81% related
Programming paradigm: 18.7K papers, 467.9K citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  19
2022  35
2021  11
2020  29
2019  27
2018  49