Showing papers by "Rezaul Chowdhury published in 2021"

PDF

Open Access

Proceedings Article•DOI•

Fast Stencil Computations using Fast Fourier Transforms

[...]

Zafar Ahmad¹, Rezaul Chowdhury¹, Rathish Das², Pramod Ganapathi¹, Aaron Gregory¹, Yimin Zhu¹ - Show less +2 more•Institutions (2)

Stony Brook University¹, University of Waterloo²

06 Jul 2021

TL;DR: In this article, the authors present two efficient parallel algorithms for performing linear stencil computations using Fast Fourier Transform (DFT) preconditioning on a Krylov subspace method.

...read moreread less

Abstract: Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3× to 8.5× faster for aperiodic stencil problems.

...read moreread less

8 citations

Proceedings Article•DOI•

Low-Span Parallel Algorithms for the Binary-Forking Model

[...]

Zafar Ahmad¹, Rezaul Chowdhury¹, Rathish Das², Pramod Ganapathi¹, Aaron Gregory¹, Mohammad Mahdi Javanmard¹ - Show less +2 more•Institutions (2)

Stony Brook University¹, University of Waterloo²

06 Jul 2021

TL;DR: In this paper, a randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work was proposed for the binary-forking model.

...read moreread less

Abstract: The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(nlog n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a loglog n -factor blow-up in work as well as a near-optimal O(log n loglog log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n logloglog n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n le 10 ^10,000 ).

...read moreread less

4 citations

Journal Article•DOI•

Parallel Divide-and-Conquer Algorithms for Bubble Sort, Selection Sort and Insertion Sort

[...]

Pramod Ganapathi¹, Rezaul Chowdhury¹•Institutions (1)

Stony Brook University¹

02 Aug 2021-The Computer Journal

TL;DR: This work presents efficient parallel recursive divide-and-conquer algorithms for bubble sort, selection sort, and insertion sort that have excellent data locality and are highly parallel.

...read moreread less

Abstract: We present efficient parallel recursive divide-and-conquer algorithms for bubble sort, selection sort, and insertion sort. Our algorithms have excellent data locality and are highly parallel. The computational complexity of our insertion sort is ${{\\mathcal{O}}}\\left ({n^{\\log _2 3}}\\right )$ in contrast to ${{\\mathcal{O}}}\\left ({n^2}\\right )$ of standard insertion sort.

...read moreread less

4 citations

Proceedings Article•DOI•

Understanding Recursive Divide-and-Conquer Dynamic Programs in Fork-Join and Data-Flow Execution Models

[...]

Poornima Nookala¹, Zafar Ahmad², Mohammad Mahdi Javanmard², Martin Kong³, Rezaul Chowdhury², Robert W. Harrison² - Show less +2 more•Institutions (3)

Illinois Institute of Technology¹, Stony Brook University², University of Oklahoma³

01 Jun 2021

TL;DR: In this article, the performance of data-flow implementations of recursive divide-and-conquer based DP algorithms compare with fork-join implementations on shared-memory multicore machines, and the results confirm that a dataflow based implementation outperforms its fork- join based counterpart when due to artificial dependencies, the fork-Join implementation fails to generate enough subtasks to keep all processors busy and does not have enough data locality to compensate for the lost performance.

...read moreread less

Abstract: On shared-memory multicore machines, classic two-way recursive divide-and-conquer algorithms are implemented using common fork-join based parallel programming paradigms such as Intel Cilk+ or OpenMP. However, in such parallel paradigms, the use of joins for synchronization may lead to artificial dependencies among function calls which are not implied by the underlying DP recurrence. These artificial dependencies can increase the span asymptotically and thus reduce parallelism. From a practical perspective, they can lead to resource underutilization, i.e., threads becoming idle. To eliminate such artificial dependencies, task-based runtime systems and data-flow parallel paradigms, such as Concurrent Collections (CnC), PaRSEC, and Legion have been introduced. Such parallel paradigms and runtime systems overcome the limitations of fork-join parallelism by specifying data dependencies at a finer granularity and allowing tasks to execute as soon as dependencies are satisfied.In this paper, we investigate how the performance of data-flow implementations of recursive divide-and-conquer based DP algorithms compare with fork-join implementations. We have designed and implemented data-flow versions of DP algorithms in Intel CnC and compared the performance with fork-join based implementations in OpenMP. Considering different execution parameters (e.g., algorithmic properties such as recursive base size as well as machine configuration such as the number of physical cores, etc), our results confirm that a data-flow based implementation outperforms its fork-join based counter-part when due to artificial dependencies, the fork-join implementation fails to generate enough subtasks to keep all processors busy and does not have enough data locality to compensate for the lost performance. This phenomena happens when the input size of the DP algorithm is small or we have a huge number of compute cores in the system. As a result, with a fixed computation resource, moving from small input to larger input, fork-join implementation of DP algorithms outperforms the corresponding data-flow implementation. However, for a fixed size problem, moving the computation to a compute node with a larger number of cores, data-flow implementation outperforms the corresponding fork-join implementation.

...read moreread less

3 citations

DOI•

Side-chain Packing Using SE(3)-Transformer

[...]

Akhil Jindal¹, Sergei Kotelnikov¹, Dzmitry Padhorny¹, Dima Kozakov¹, Yimin Zhu¹, Rezaul Chowdhury¹, Sandor Vajda² - Show less +3 more•Institutions (2)

Stony Brook University¹, Boston University²

01 Dec 2021

2 citations

Book Chapter•DOI•

Algorithm Design for Tensor Units

[...]

Rezaul Chowdhury¹, Francesco Silvestri², Flavio Vella³•Institutions (3)

Stony Brook University¹, University of Padua², Free University of Bozen-Bolzano³

27 Aug 2021

TL;DR: In this paper, the authors propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices and use it for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation.

...read moreread less

Abstract: To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.

...read moreread less

2 citations

Journal Article•DOI•

A Unified Framework to Discover Permutation Generation Algorithms

[...]

Pramod Ganapathi¹, Rezaul Chowdhury¹•Institutions (1)

Stony Brook University¹

17 Nov 2021-The Computer Journal

Posted Content•

Fast Stencil Computations using Fast Fourier Transforms

[...]

Zafar Ahmad¹, Rezaul Chowdhury¹, Rathish Das², Pramod Ganapathi¹, Aaron Gregory¹, Yimin Zhu¹ - Show less +2 more•Institutions (2)

Stony Brook University¹, University of Waterloo²

14 May 2021-arXiv: Data Structures and Algorithms

TL;DR: In this paper, the authors present two efficient parallel algorithms for performing linear stencil computations using Fast Fourier Transform (FFT) preconditioning on a Krylov subspace method.

...read moreread less

Abstract: Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform $\Theta(NT)$ work, where $N$ is the size of the spatial grid and $T$ is the number of timesteps, our algorithms perform $o(NT)$ work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly $10^7$ cells for around $10^5$ timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3$\times$ to 8.5$\times$ faster for aperiodic stencil problems.

...read moreread less