Proceedings ArticleDOI

Accelerate Non-unit Stride Convolutions with Winograd Algorithms

TLDR
In this paper, a universal approach is proposed to construct non-unit stride convolution algorithms for any given stride and filter sizes from Winograd algorithms, which otherwise cannot be applied directly to accelerate non-unit stride convolutions.
Abstract
While computer vision tasks target increasingly challenging scenarios, the need for real-time processing of images rises as well, requiring more efficient methods to accelerate convolutional neural networks. For unit stride convolutions, we use FFT-based methods and Winograd algorithms to compute matrix convolutions, which effectively lower the computing complexity by reducing the number of multiplications. For non-unit stride convolutions, we usually cannot directly apply those algorithms to accelerate the computations. In this work, we propose a novel universal approach to construct non-unit stride convolution algorithms for any given stride and filter sizes from Winograd algorithms. Specifically, we first demonstrate the steps to decompose an arbitrary convolutional kernel and apply the Winograd algorithms separately to compute non-unit stride convolutions. We then present the derivation of this method and a proof by construction to confirm the validity of this approach. Finally, we discuss the minimum number of multiplications and additions necessary for non-unit stride convolutions and evaluate the performance of the decomposed Winograd algorithms. From our analysis of the computational complexity, the new approach can benefit from 1.5x to 3x fewer multiplications. In our experiments on real DNN layers, we acquired around 1.3x speedup (T_old / T_new) of the Winograd algorithms against the conventional convolution algorithm in various experiment settings.
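The decomposition described in the abstract can be sketched in one dimension: a stride-s convolution splits into s interleaved unit-stride sub-convolutions whose results are summed, and each sub-convolution then becomes a candidate for a standard Winograd kernel. The NumPy sketch below is our minimal illustration of that construction, not the paper's implementation; conv1d_valid stands in for a Winograd-accelerated routine such as F(2, 3).

```python
import numpy as np

def conv1d_valid(x, g):
    """Unit-stride 'valid' cross-correlation; stands in for a
    Winograd-accelerated kernel in a real implementation."""
    n = len(x) - len(g) + 1
    return np.array([np.dot(x[i:i + len(g)], g) for i in range(n)])

def strided_conv_decomposed(x, g, s):
    """Stride-s convolution via decomposition: split the input and the
    filter into s interleaved sub-sequences and sum the s unit-stride
    sub-convolutions (each of which is Winograd-friendly)."""
    n_out = (len(x) - len(g)) // s + 1
    y = np.zeros(n_out)
    for p in range(s):                  # phase p = 0 .. s-1
        x_p = x[p::s]                   # every s-th input sample
        g_p = g[p::s]                   # every s-th filter tap
        if len(g_p) == 0:
            continue
        y += conv1d_valid(x_p, g_p)[:n_out]
    return y

# Check against a direct stride-s convolution.
x, g, s = np.random.randn(16), np.random.randn(5), 2
direct = np.array([np.dot(x[i:i + len(g)], g)
                   for i in range(0, len(x) - len(g) + 1, s)])
assert np.allclose(strided_conv_decomposed(x, g, s), direct)
```

The identity behind the sketch is that the stride-s output y[n] = sum_k g[k] x[sn + k] regroups, by k = sm + p, into s unit-stride correlations of the decimated sequences x_p and g_p.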


Citations
Journal ArticleDOI

Evaluating FFT-based algorithms for strided convolutions on ARMv8 architectures

TL;DR: Experimental results with convolutional layers in popular CNNs show that the rearrangement-based method generally exceeds the sampling-based one under the same optimizations in most cases, and that both methods achieve much better performance than GEMM-based ones when the kernels, feature maps, and batch sizes are large.
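The two families of methods named in this TL;DR can be contrasted in a few lines. The sketch below is our 1D illustration using NumPy's FFT, not code from the paper: the sampling-based method computes a unit-stride FFT convolution and discards the outputs a stride-s convolution skips, while the rearrangement-based method splits the input and kernel into s interleaved sub-sequences first, so no work is wasted.

```python
import numpy as np

def fft_conv1d_valid(x, g):
    """Unit-stride 'valid' cross-correlation computed via FFT."""
    n = len(x) + len(g) - 1
    X = np.fft.rfft(x, n)
    G = np.fft.rfft(g[::-1], n)       # reversed taps: correlation via convolution
    full = np.fft.irfft(X * G, n)
    return full[len(g) - 1 : len(x)]

def strided_conv_sampling(x, g, s):
    """Sampling-based: run a full unit-stride FFT convolution,
    then keep every s-th output (the rest is wasted work)."""
    return fft_conv1d_valid(x, g)[::s]

def strided_conv_rearranged(x, g, s):
    """Rearrangement-based: split input and kernel into s interleaved
    sub-sequences, convolve each pair, and sum; nothing is discarded."""
    n_out = (len(x) - len(g)) // s + 1
    y = np.zeros(n_out)
    for p in range(s):
        g_p = g[p::s]
        if len(g_p):
            y += fft_conv1d_valid(x[p::s], g_p)[:n_out]
    return y

x, g, s = np.random.randn(64), np.random.randn(7), 2
assert np.allclose(strided_conv_sampling(x, g, s),
                   strided_conv_rearranged(x, g, s))
```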
Journal ArticleDOI

RiSA: A Reinforced Systolic Array for Depthwise Convolutions and Embedded Tensor Reshaping

TL;DR: RiSA as mentioned in this paper improves the area and energy efficiency for MobileNet-V1 inference by 1.91× and 1.31×, respectively, compared to Eyeriss v2.
Journal ArticleDOI

Low-Cost Online Convolution Checksum Checker

TL;DR: ConvGuard as discussed by the authors utilizes a newly introduced invariance condition of convolution to predict the output checksum using only the pixels at the border of the input image, which reduces the power required for accumulating the input pixels without requiring large buffers to hold intermediate checksum results.
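The TL;DR does not reproduce the invariance condition itself. The following NumPy/SciPy sketch shows a checksum invariance of this kind, as our reconstruction under standard convolution definitions rather than ConvGuard's exact formulation: for a full convolution the output checksum factorizes as sum(y) = sum(x) * sum(k), and for a 'valid' convolution only pixels near the input border deviate from that factorized weight, so the correction involves border pixels alone.

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.randn(8, 8)   # input image
k = np.random.randn(3, 3)   # convolution kernel
H, W = x.shape
r, c = k.shape

# Full convolution: every input pixel meets every kernel tap exactly
# once, so the output checksum factorizes.
y_full = convolve2d(x, k, mode='full')
assert np.isclose(y_full.sum(), x.sum() * k.sum())

# 'valid' convolution: interior pixels still carry weight k.sum();
# only pixels near the border carry partial kernel sums, so the
# checksum is predictable from the factorized term plus a
# border-only correction.
y_valid = convolve2d(x, k, mode='valid')
weights = convolve2d(np.ones((H - r + 1, W - c + 1)),
                     k[::-1, ::-1], mode='full')   # per-pixel weight map
assert np.isclose(y_valid.sum(), (weights * x).sum())
```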
Journal ArticleDOI

FPGA-based reflection image removal using cognitive neural networks

TL;DR: This paper generalizes the assumptions behind reflection-removal methods, which typically rely on additional information or impose new limitations, with an FPGA implementation aimed at effective energy utilization; the proposed algorithm is shown to be effective compared with current CNN-based approaches.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Journal ArticleDOI

ImageNet classification with deep convolutional neural networks

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Proceedings ArticleDOI

Fully convolutional networks for semantic segmentation

TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Proceedings ArticleDOI

Rethinking the Inception Architecture for Computer Vision

TL;DR: In this article, the authors explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.