Topic

Loop unrolling

About: Loop unrolling is a loop transformation that replicates a loop's body so that each trip through the rewritten loop covers several original iterations; it is also known as loop unwinding. Over its lifetime, 777 publications have appeared within this topic, receiving 13,486 citations.
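A minimal C sketch of the transformation (the reduction kernel and the factor of 4 are illustrative choices, not drawn from any paper below):

```c
#include <stddef.h>

/* Original loop: one element per iteration. */
float sum_rolled(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by a factor of 4: four elements per iteration,
 * plus a scalar epilogue for the n % 4 leftover elements. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++)          /* epilogue */
        s += a[i];
    return s;
}
```

Besides cutting branch and index-update overhead, using four independent partial sums rather than one accumulator breaks the floating-point add dependence chain, which is often where most of the measured speedup comes from.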


Papers
Proceedings ArticleDOI
Michael Wolfe
29 Jun 2020
TL;DR: This work focuses on the importance of compilers in supercomputing, comparing and contrasting the advantages and impacts of compiler solutions to the "Performance + Portability + Productivity" problem with language and runtime solutions.
Abstract: Between a problem statement and its solution as a computer simulation are several steps: choosing a method, writing a program, compiling to machine code, making runtime decisions, and executing on hardware. Here we will look at the middle three decision points. What decisions should be and must be left to the programmer? What decisions should be and must be relegated to a compiler? What decisions should be and must be left until runtime? Given my background, I will focus a great deal on the importance of compilers in supercomputing, and compare and contrast the advantages and impacts of compiler solutions to the "Performance + Portability + Productivity" problem with language and runtime solutions.

729 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: This work performs auto-tuning over a large optimization space for GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and shows results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision.
Abstract: Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMPP, a high-level directive-based language and source-to-source compiler that can generate CUDA / OpenCL code. However, programming with high-level languages may mean a loss of performance compared to using low-level languages. Our work shows that it is possible to improve the performance of a high-level language by using auto-tuning. We perform auto-tuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and show results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision. The results show that our auto-tuned HMPP-generated implementations are significantly faster than the default HMPP implementation and can meet or exceed the performance of manually coded CUDA / OpenCL implementations.
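As an illustration of the kind of search space such a tuner explores, here is a hedged C sketch of one point in that space: a loop variant parameterized by a tile size and an unroll factor. TILE and UNROLL are hypothetical compile-time knobs standing in for HMPP directive parameters, not actual HMPP syntax:

```c
/* One candidate in the tuning space: a tiled, unrolled loop variant.
 * An auto-tuner compiles this once per (TILE, UNROLL) pair, e.g.
 *     cc -O2 -DTILE=256 -DUNROLL=4 ...
 * times each variant on the target, and keeps the fastest. */
#ifndef TILE
#define TILE 256
#endif
#ifndef UNROLL
#define UNROLL 4
#endif

void saxpy_variant(float *y, const float *x, float a, int n) {
    for (int t = 0; t < n; t += TILE) {              /* tiling */
        int end = t + TILE < n ? t + TILE : n;
        int i = t;
        for (; i + UNROLL <= end; i += UNROLL)       /* unrolling */
            for (int u = 0; u < UNROLL; u++)         /* constant-bound loop:
                                                        compilers flatten it */
                y[i + u] += a * x[i + u];
        for (; i < end; i++)                         /* epilogue */
            y[i] += a * x[i];
    }
}
```

The point of exposing the knobs as compile-time constants is that each variant is fully specialized, so the measured time reflects the configuration rather than runtime branching on parameters.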

402 citations

Proceedings ArticleDOI
22 Feb 2017
TL;DR: This work systematically explores the trade-offs in hardware cost by searching the design-variable configurations, and proposes a specific dataflow for hardware CNN acceleration that minimizes memory access and data movement while maximizing resource utilization to achieve high performance.
Abstract: As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply-and-accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing the end-to-end VGG-16 CNN model, achieving 645.25 GOPS of throughput and 47.97 ms of latency, a >3.2× enhancement over state-of-the-art FPGA implementations of the VGG model.
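For orientation, the four loop levels are, roughly: the kernel window, the input feature maps, the spatial positions within a feature map, and the output feature maps. A hedged C sketch of that loop nest follows; the naming, memory layout, and stride-1/no-padding assumptions are illustrative, not the paper's notation. Unrolling a given level fixes how many MAC units run in parallel in hardware, while tiling it fixes how much data the on-chip buffers must hold:

```c
/* Convolution layer over flattened arrays:
 *   out: M x R x C          (output feature maps)
 *   in:  N x (R+K-1) x (C+K-1)   (input feature maps, stride 1, no padding)
 *   wgt: M x N x K x K      (kernel weights)
 * The four loop levels the design space is built from are marked. */
void conv_layer(float *out, const float *in, const float *wgt,
                int M, int N, int R, int C, int K) {
    int H = R + K - 1, W = C + K - 1;    /* input spatial dims */
    for (int m = 0; m < M; m++)          /* output feature maps  */
      for (int r = 0; r < R; r++)        /* spatial positions    */
        for (int c = 0; c < C; c++) {
          float acc = 0.0f;
          for (int n = 0; n < N; n++)            /* input feature maps */
            for (int kr = 0; kr < K; kr++)       /* kernel window      */
              for (int kc = 0; kc < K; kc++)
                acc += in[(n * H + (r + kr)) * W + (c + kc)]
                     * wgt[((m * N + n) * K + kr) * K + kc];
          out[(m * R + r) * C + c] = acc;
        }
}
```

Interchanging, tiling, and unrolling these six loop indices (grouped into the four levels above) is exactly the design space the paper searches before fixing the accelerator's dataflow.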

348 citations

Proceedings ArticleDOI
20 Apr 2004
TL;DR: This paper presents the architecture of a fully pipelined AES encryption processor on a single chip FPGA by using loop unrolling and inner-round and outer-round pipelining techniques, and achieves a maximum throughput of 21.54 Gbits/s.
Abstract: This paper presents the architecture of a fully pipelined AES encryption processor on a single-chip FPGA. By using loop unrolling together with inner-round and outer-round pipelining, a maximum throughput of 21.54 Gbits/s is achieved. A fast, area-efficient composite-field implementation of the byte-substitution phase is designed using an optimum number of pipeline stages for the FPGA. The 21.54 Gbits/s throughput is achieved using 84 block RAMs and 5177 slices of a Virtex-II Pro FPGA, with a latency of 31 cycles and a throughput-per-area rate of 4.2 Mbps/slice.
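Here the loop being unrolled is the AES round loop: instead of iterating one round circuit ten times, every round gets its own copy of the logic, which in hardware becomes its own pipeline stage so that many blocks are in flight at once. A hedged C sketch of the control structure only; aes_round is a placeholder that performs just the round-key addition, with the real SubBytes/ShiftRows/MixColumns steps omitted for brevity:

```c
#include <stdint.h>

/* Placeholder round function (AddRoundKey only; a real AES round also
 * applies SubBytes, ShiftRows, and MixColumns to the 16-byte state). */
static void aes_round(uint8_t state[16], const uint8_t roundkey[16]) {
    for (int i = 0; i < 16; i++)
        state[i] ^= roundkey[i];
}

/* Rolled form: one round circuit reused ten times. */
void aes128_rolled(uint8_t s[16], const uint8_t rk[11][16]) {
    for (int i = 0; i < 16; i++) s[i] ^= rk[0][i];   /* initial key add */
    for (int r = 1; r <= 10; r++)
        aes_round(s, rk[r]);
}

/* Fully unrolled form: every round is a distinct copy of the logic.
 * In hardware, each call maps to its own stage, and registers inserted
 * between (and inside) rounds let one block enter every cycle. */
void aes128_unrolled(uint8_t s[16], const uint8_t rk[11][16]) {
    for (int i = 0; i < 16; i++) s[i] ^= rk[0][i];
    aes_round(s, rk[1]);  aes_round(s, rk[2]);  aes_round(s, rk[3]);
    aes_round(s, rk[4]);  aes_round(s, rk[5]);  aes_round(s, rk[6]);
    aes_round(s, rk[7]);  aes_round(s, rk[8]);  aes_round(s, rk[9]);
    aes_round(s, rk[10]);
}
```

In software the unrolled form mainly saves loop overhead; in the paper's hardware setting it is what makes the inner-round and outer-round pipelining, and hence the quoted throughput, possible.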

260 citations

Journal ArticleDOI
TL;DR: The authors study run-time methods to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate and present performance results from experiments conducted on the Encore Multimax, illustrating that run-time reordering of loop indexes can have a significant impact on performance.
Abstract: The authors study run-time methods to automatically parallelize and schedule iterations of a do loop in certain cases where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run-time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. The authors utilize symbolic transformation rules to produce: inspector procedures that perform execution time preprocessing, and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. The authors present performance results from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indexes can have a significant impact on performance.
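The inspector/executor pattern works as follows: the inspector walks the loop's (possibly indirection-based) memory references at run time and assigns each iteration to a wavefront; the executor then runs the wavefronts in order, with all iterations inside one wavefront free to run in parallel. A hedged C sketch for a loop of the form a[w[i]] += a[r[i]] (the index arrays and helper names are illustrative, not the paper's code):

```c
#include <stdlib.h>

/* Inspector: assign each iteration of
 *     for (i = 0; i < n; i++) a[w[i]] += a[r[i]];
 * to a wavefront so that no two iterations in the same wavefront touch
 * a common element of a (length m). Tracking the last wavefront that
 * touched each element covers flow, anti-, and output dependences,
 * though it is conservative: it also serializes read-read sharing.
 * Returns the number of wavefronts. */
int inspector(const int *w, const int *r, int n, int m, int *wave) {
    int *last = calloc(m, sizeof(int));  /* last wavefront per element */
    int nwaves = 0;
    for (int i = 0; i < n; i++) {
        int lw = last[w[i]] > last[r[i]] ? last[w[i]] : last[r[i]];
        wave[i] = lw + 1;
        last[w[i]] = wave[i];
        last[r[i]] = wave[i];
        if (wave[i] > nwaves) nwaves = wave[i];
    }
    free(last);
    return nwaves;
}

/* Executor: run wavefronts in order; iterations within one wavefront
 * are independent, so the inner loop could carry an OpenMP
 * "parallel for" pragma on a shared-memory machine like the Multimax. */
void executor(double *a, const int *w, const int *r, int n,
              const int *wave, int nwaves) {
    for (int v = 1; v <= nwaves; v++)
        for (int i = 0; i < n; i++)      /* parallelizable */
            if (wave[i] == v)
                a[w[i]] += a[r[i]];
}
```

A real executor would bucket iteration indices by wavefront once instead of rescanning all n iterations per wavefront; the scan just keeps the sketch short.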

256 citations


Network Information
Related Topics (5)

Compiler: 26.3K papers, 578.5K citations, 85% related
Cache: 59.1K papers, 976.6K citations, 83% related
Parallel algorithm: 23.6K papers, 452.6K citations, 81% related
Programming paradigm: 18.7K papers, 467.9K citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  19
2022  35
2021  11
2020  29
2019  27
2018  49