Author

Rezaul Chowdhury

Bio: Rezaul Chowdhury is an academic researcher at Stony Brook University. His research focuses on caching and cache-oblivious algorithms. He has an h-index of 17 and has co-authored 83 publications receiving 1,419 citations. His previous affiliations include the University of Texas at Austin and Boston University.


Papers
Proceedings ArticleDOI
20 May 2013
TL;DR: It is shown that an optimistic parallelization approach can sometimes avoid locks and atomic instructions during dynamic load balancing, and two new types of high-performance lock-free parallel BFS algorithms and their variants, based on centralized job queues and distributed randomized work-stealing, are implemented.
Abstract: Dynamic load-balancing in parallel algorithms typically requires locks and/or atomic instructions for correctness. We have shown that sometimes an optimistic parallelization approach can be used to avoid locks and atomic instructions during dynamic load balancing. In this approach one allows potentially conflicting operations to run in parallel in the hope that everything will complete without conflicts, and if occasional inconsistencies do arise, they can be handled without compromising the overall correctness of the program. We have used this approach to implement two new types of high-performance lock-free parallel BFS algorithms and their variants, based on centralized job queues and distributed randomized work-stealing, respectively. These algorithms are implemented in Intel Cilk++ and shown to be scalable and faster than two state-of-the-art multicore parallel BFS algorithms by Leiserson and Schardl (SPAA, 2010) and Hong et al. (PACT, 2011), where the algorithm described in the first paper is also free of locks and atomic instructions but does not use optimistic parallelization. Our implementations can also very efficiently handle scale-free graphs, which frequently arise in real-world settings such as the World Wide Web, social networks, and biological interaction networks.
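To illustrate the optimistic approach described above, here is a minimal level-synchronous BFS sketch (not the authors' Cilk++ implementation): threads expand the current frontier and write levels without locks or atomics, relying on the benign-race argument that conflicting writes store the same value and duplicate frontier entries only cause redundant work. The function names, thread count, and std::thread scheme are illustrative.

// Minimal illustrative sketch of "optimistic" lock-free level-synchronous BFS
// (not the authors' Cilk++ implementation). Threads expand the frontier and
// write to the shared 'level' array with plain, unsynchronized stores; the
// races are benign here because all conflicting writes store the same value
// and duplicate frontier entries only cause redundant work. Strictly speaking
// these are data races under the C++ memory model, so production code would
// use relaxed atomics; the sketch only mirrors the optimistic idea.
#include <cstdio>
#include <thread>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists

void bfs_optimistic(const Graph& g, int source, std::vector<int>& level,
                    int num_threads = 4) {
    level.assign(g.size(), -1);
    level[source] = 0;
    std::vector<int> frontier = {source};

    for (int d = 0; !frontier.empty(); ++d) {
        std::vector<std::vector<int>> next(num_threads);  // private output buffers
        std::vector<std::thread> workers;
        for (int t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < frontier.size(); i += num_threads)
                    for (int v : g[frontier[i]])
                        if (level[v] == -1) {      // optimistic, unsynchronized check
                            level[v] = d + 1;      // benign race: same value written
                            next[t].push_back(v);  // v may be enqueued more than once
                        }
            });
        }
        for (auto& w : workers) w.join();

        frontier.clear();
        for (const auto& buf : next)
            frontier.insert(frontier.end(), buf.begin(), buf.end());  // duplicates are harmless
    }
}

int main() {
    Graph g = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    std::vector<int> level;
    bfs_optimistic(g, 0, level);
    for (std::size_t v = 0; v < level.size(); ++v)
        std::printf("vertex %zu: level %d\n", v, level[v]);
    return 0;
}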

8 citations

Proceedings ArticleDOI
06 Jul 2020
TL;DR: The gap between cache-oblivious and cache-adaptive analysis is closed by showing how to carry out a smoothed analysis of cache-adaptive algorithms via random reshuffling of memory fluctuations, suggesting that cache-obliviousness is a solid foundation for achieving cache-adaptivity when the memory profile is not overly tailored to the algorithm's structure.
Abstract: Cache-adaptive analysis was introduced to analyze the performance of an algorithm when the cache (or internal memory) available to the algorithm dynamically changes size. These memory-size fluctuations are, in fact, the common case in multi-core machines, where threads share cache and RAM. An algorithm is said to be efficiently cache-adaptive if it achieves optimal utilization of the dynamically changing cache. Cache-adaptive analysis was inspired by cache-oblivious analysis. Many (or even most) optimal cache-oblivious algorithms have an $(a,b,c)$-regular recursive structure. Such $(a, b, c)$-regular algorithms include Longest Common Subsequence, All Pairs Shortest Paths, Matrix Multiplication, Edit Distance, the Gaussian Elimination Paradigm, etc. Bender et al. (2016) showed that some of these optimal cache-oblivious algorithms remain optimal even when the cache changes size dynamically, but that in general they can be as much as a logarithmic factor away from optimal. However, their analysis depends on constructing highly structured, worst-case memory profiles, i.e., sequences of fluctuations in cache size. These worst-case profiles seem fragile, suggesting that the logarithmic gap may be an artifact of an unrealistically powerful adversary. We close the gap between cache-oblivious and cache-adaptive analysis by showing how to perform a smoothed analysis of cache-adaptive algorithms via random reshuffling of memory fluctuations. Remarkably, we also show the limits of several natural forms of smoothing, including random perturbations of the cache size and randomizing the algorithm's starting time. Nonetheless, we show that if one takes an arbitrary profile and randomly shuffles when "significant events" occur within the profile, then the shuffled profile becomes optimally cache-adaptive in expectation, even when the initial profile is adversarially constructed. These results suggest that cache-obliviousness is a solid foundation for achieving cache-adaptivity when the memory profile is not overly tailored to the algorithm's structure.
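For readers unfamiliar with the $(a, b, c)$-regular terminology, the following minimal C++ sketch (not taken from the paper) shows the canonical example: cache-oblivious recursive matrix multiplication, which spawns a = 8 recursive subproblems whose inputs shrink by a factor of b = 2 per dimension. The base-case block size of 32 and the serial recursion are arbitrary choices for illustration.

// Illustrative sketch of an (a,b,c)-regular cache-oblivious algorithm:
// recursive matrix multiplication C += A*B with a = 8 subproblems, each on
// inputs halved per dimension (b = 2). Textbook structure, not the paper's code.
#include <cstdio>
#include <vector>

// n is the full row-major stride; size is the side of the current block.
void mm_rec(const double* A, const double* B, double* C, int size, int n) {
    if (size <= 32) {  // base case: block small enough to fit in cache
        for (int i = 0; i < size; ++i)
            for (int k = 0; k < size; ++k)
                for (int j = 0; j < size; ++j)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        return;
    }
    const int h = size / 2;             // b = 2: halve each dimension
    const double* A00 = A;              const double* A01 = A + h;
    const double* A10 = A + h * n;      const double* A11 = A + h * n + h;
    const double* B00 = B;              const double* B01 = B + h;
    const double* B10 = B + h * n;      const double* B11 = B + h * n + h;
    double* C00 = C;                    double* C01 = C + h;
    double* C10 = C + h * n;            double* C11 = C + h * n + h;
    // a = 8 recursive subproblems (two per output quadrant).
    mm_rec(A00, B00, C00, h, n);  mm_rec(A01, B10, C00, h, n);
    mm_rec(A00, B01, C01, h, n);  mm_rec(A01, B11, C01, h, n);
    mm_rec(A10, B00, C10, h, n);  mm_rec(A11, B10, C10, h, n);
    mm_rec(A10, B01, C11, h, n);  mm_rec(A11, B11, C11, h, n);
}

int main() {
    const int n = 64;  // power of two keeps the recursion simple
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    mm_rec(A.data(), B.data(), C.data(), n, n);
    std::printf("C[0][0] = %g (expected %d)\n", C[0], n);
    return 0;
}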

8 citations

Proceedings ArticleDOI
06 Jul 2021
TL;DR: In this article, the authors present two efficient parallel algorithms for performing linear stencil computations, using DFT preconditioning on a Krylov subspace method to obtain a direct solver that is both fast and general.
Abstract: Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3× to 8.5× faster for aperiodic stencil problems.
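A minimal sketch of the Fourier-evolution idea for the periodic case only (not the authors' algorithms, which also cover aperiodic boundaries via DFT-preconditioned Krylov solves): one step of a linear periodic stencil is a circular convolution, so in Fourier space T steps amount to multiplying each mode by the stencil's symbol raised to the power T, letting the final grid be computed from the initial data in one shot. The naive O(N^2) DFT, the 3-point averaging stencil, and the grid/timestep sizes below are arbitrary simplifications for brevity; the paper uses fast Fourier transforms.

// Illustrative sketch: evolve a 1D periodic linear stencil T steps at once
// by exponentiating its symbol in Fourier space. Naive DFT used for brevity.
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
const double PI = 3.14159265358979323846;

std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    const int n = static_cast<int>(x.size());
    std::vector<cd> y(n);
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            y[k] += x[j] * std::polar(1.0, sign * 2.0 * PI * k * j / n);
    return y;
}

// Evolve u by T steps of u[i] <- cm1*u[i-1] + c0*u[i] + cp1*u[i+1] (periodic).
std::vector<double> evolve(const std::vector<double>& u0,
                           double cm1, double c0, double cp1, long T) {
    const int n = static_cast<int>(u0.size());
    std::vector<cd> u(u0.begin(), u0.end());
    std::vector<cd> U = dft(u, -1);
    for (int k = 0; k < n; ++k) {
        // Eigenvalue (symbol) of the stencil at frequency k, raised to the power T.
        cd w = std::polar(1.0, 2.0 * PI * k / n);
        cd symbol = cm1 * std::conj(w) + c0 + cp1 * w;
        U[k] *= std::pow(symbol, static_cast<double>(T));
    }
    std::vector<cd> uT = dft(U, +1);
    std::vector<double> out(n);
    for (int i = 0; i < n; ++i) out[i] = uT[i].real() / n;  // inverse DFT scaling
    return out;
}

int main() {
    std::vector<double> u0(16, 0.0);
    u0[8] = 1.0;                                   // point heat source
    auto uT = evolve(u0, 0.25, 0.5, 0.25, 1000);   // 1000 timesteps in one shot
    for (double v : uT) std::printf("%.4f ", v);
    std::printf("\n");
    return 0;
}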

8 citations

01 Jan 2019
TL;DR: In this paper, the authors argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycore) and distributed-memory settings.
Abstract: We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
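A minimal C++ sketch of such a grid-based recursive divide-and-conquer dynamic program (illustrative only; the LCS example, 64-cell base case, and serial execution are simplifications of what the paper analyzes): the DP table is split into quadrants filled recursively in dependency order, and the two off-diagonal quadrants are independent and could be spawned in parallel.

// Illustrative sketch of grid-based recursive divide-and-conquer for a dynamic
// program (LCS length). Quadrants are filled recursively in dependency order:
// top-left, then top-right and bottom-left (independent, parallelizable),
// then bottom-right. The full table is kept for simplicity; the actual
// algorithms achieve much better space and cache bounds.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct LCS {
    std::string a, b;
    std::vector<std::vector<int>> D;  // D[i][j] = LCS length of a[0..i), b[0..j)

    // Fill D over rows [ilo, ihi) x cols [jlo, jhi); row 0 / col 0 stay zero.
    void fill(int ilo, int ihi, int jlo, int jhi) {
        if ((ihi - ilo) * (jhi - jlo) <= 64) {    // base case: small block
            for (int i = ilo; i < ihi; ++i)
                for (int j = jlo; j < jhi; ++j)
                    D[i][j] = (a[i - 1] == b[j - 1])
                                  ? D[i - 1][j - 1] + 1
                                  : std::max(D[i - 1][j], D[i][j - 1]);
            return;
        }
        int im = (ilo + ihi) / 2, jm = (jlo + jhi) / 2;
        fill(ilo, im, jlo, jm);                   // top-left first
        fill(ilo, im, jm, jhi);                   // top-right   } independent,
        fill(im, ihi, jlo, jm);                   // bottom-left } parallelizable
        fill(im, ihi, jm, jhi);                   // bottom-right last
    }

    int run() {
        int n = (int)a.size(), m = (int)b.size();
        D.assign(n + 1, std::vector<int>(m + 1, 0));
        fill(1, n + 1, 1, m + 1);
        return D[n][m];
    }
};

int main() {
    LCS lcs{"dynamicprogram", "divideandconquer", {}};
    std::printf("LCS length = %d\n", lcs.run());
    return 0;
}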

7 citations

Proceedings ArticleDOI
09 Sep 2015
TL;DR: A hybrid method is presented that simultaneously exploits CPU and GPU cores to provide the best performance for any chosen parameters of the approximation scheme; the underlying CUDA kernels achieve more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms.
Abstract: Motivation: Despite several reported successes in accelerating molecular modeling and simulation tools with programmable GPUs (Graphics Processing Units), the general focus has been on fast computation with small molecules, primarily due to the limited memory size on the GPU. Moreover, simultaneous use of CPU and GPU cores for a single kernel execution, a necessity for achieving high parallelism, has also not been fully considered.
Results: We present fast computation methods for molecular mechanical (Lennard-Jones and Coulombic) and generalized Born solvation energetics which run on commodity multicore CPUs and manycore GPUs. The key idea is to trade off accuracy of pairwise, long-range atomistic energetics for higher speed of execution. A simple yet efficient CUDA kernel for GPU acceleration is presented which ensures high arithmetic intensity and memory efficiency. Our CUDA kernel uses a cache-friendly, recursive and linear-space octree data structure to handle very large molecular structures with up to several million atoms. Based on this CUDA kernel, we present a hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme. Our CUDA kernels achieve more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms. The hybrid method is shown to achieve the best performance for all values of the approximation parameter.
Availability: The source code and binaries are freely available as PMEOPA (Parallel Molecular Energetic using Octree Pairwise Approximation) and downloadable from http://cvcweb.ices.utexas.edu/software.
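As a rough illustration of the energetics being accelerated (not the PMEOPA code), the sketch below sums Lennard-Jones and Coulomb terms over atom pairs with the pair loop split across CPU threads; the octree approximation and the CUDA kernel are omitted, and all parameter values and atom coordinates are placeholders.

// Illustrative pairwise energetics sketch: Lennard-Jones 12-6 plus Coulomb
// energy over atom pairs, with the pair loop split cyclically across threads.
// The real software approximates distant interactions with an octree and
// offloads work to a CUDA kernel; parameters here are placeholder values.
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

struct Atom { double x, y, z, charge; };

double pair_energy(const Atom& a, const Atom& b,
                   double epsilon, double sigma, double coulomb_k) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    double r = std::sqrt(dx * dx + dy * dy + dz * dz);
    double sr6 = std::pow(sigma / r, 6);
    double lj = 4.0 * epsilon * (sr6 * sr6 - sr6);        // Lennard-Jones 12-6
    double coul = coulomb_k * a.charge * b.charge / r;    // Coulomb
    return lj + coul;
}

double total_energy(const std::vector<Atom>& atoms, int num_threads = 4) {
    const int n = static_cast<int>(atoms.size());
    std::vector<double> partial(num_threads, 0.0);        // one accumulator per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = t; i < n; i += num_threads)       // cyclic split of outer loop
                for (int j = i + 1; j < n; ++j)
                    // epsilon, sigma, and the Coulomb constant are placeholders
                    partial[t] += pair_energy(atoms[i], atoms[j], 0.2, 3.4, 332.0);
        });
    }
    for (auto& w : workers) w.join();
    double e = 0.0;
    for (double p : partial) e += p;
    return e;
}

int main() {
    std::vector<Atom> atoms = {
        {0.0, 0.0, 0.0, +0.4}, {3.8, 0.0, 0.0, -0.4},
        {0.0, 3.8, 0.0, +0.4}, {3.8, 3.8, 0.0, -0.4},
    };
    std::printf("total energy (arbitrary units) = %f\n", total_energy(atoms));
    return 0;
}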

7 citations


Cited by
01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently, namely those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and an 1840-node Intel Paragon performs up to 165 times faster than a single Cray C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.
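A minimal single-process C++ sketch of the spatial-decomposition idea (illustrative only, not the paper's message-passing implementation): atoms are binned into cells whose side is at least the interaction cutoff, so each atom only examines its own and neighboring cells, which is what makes short-range molecular dynamics parallelize well by region. The lattice test case, box size, and cutoff below are arbitrary.

// Illustrative cell-list sketch for short-range interactions. Real MD codes
// assign these spatial regions to different processors and exchange boundary
// atoms via message passing; here everything runs in one process.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Pos { double x, y, z; };

// Count interacting pairs (distance < cutoff) inside a cubic box using a cell list.
long count_pairs(const std::vector<Pos>& atoms, double box, double cutoff) {
    const int m = std::max(1, static_cast<int>(box / cutoff));  // cells per side
    const double cell = box / m;                                // cell side >= cutoff
    auto cell_of = [&](const Pos& p, int& cx, int& cy, int& cz) {
        cx = std::min(m - 1, static_cast<int>(p.x / cell));
        cy = std::min(m - 1, static_cast<int>(p.y / cell));
        cz = std::min(m - 1, static_cast<int>(p.z / cell));
    };
    std::vector<std::vector<int>> cells(m * m * m);
    for (int i = 0; i < (int)atoms.size(); ++i) {
        int cx, cy, cz;
        cell_of(atoms[i], cx, cy, cz);
        cells[(cx * m + cy) * m + cz].push_back(i);
    }

    long pairs = 0;
    for (int i = 0; i < (int)atoms.size(); ++i) {
        int cx, cy, cz;
        cell_of(atoms[i], cx, cy, cz);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                    if (nx < 0 || ny < 0 || nz < 0 || nx >= m || ny >= m || nz >= m)
                        continue;                               // stay inside the box
                    for (int j : cells[(nx * m + ny) * m + nz]) {
                        if (j <= i) continue;                   // count each pair once
                        double ddx = atoms[i].x - atoms[j].x;
                        double ddy = atoms[i].y - atoms[j].y;
                        double ddz = atoms[i].z - atoms[j].z;
                        if (ddx * ddx + ddy * ddy + ddz * ddz < cutoff * cutoff)
                            ++pairs;
                    }
                }
    }
    return pairs;
}

int main() {
    std::vector<Pos> atoms;
    for (int i = 0; i < 10; ++i)
        for (int j = 0; j < 10; ++j)
            atoms.push_back({i * 1.0, j * 1.0, 0.0});   // 10x10 unit lattice
    std::printf("pairs within cutoff: %ld\n", count_pairs(atoms, 10.0, 1.5));
    return 0;
}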

29,323 citations

Book
02 Jan 1991

1,377 citations

Proceedings ArticleDOI
16 Jun 2013
TL;DR: A systematic model of the tradeoff space fundamental to stencil pipelines is presented, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule are presented.
Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
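The canonical Halide illustration of this algorithm/schedule split is the 3x3 blur below, adapted from the Halide project's well-known example; it assumes a Halide installation, and exact API details may differ between Halide versions.

// Sketch adapted from Halide's canonical 3x3 blur example: the "algorithm"
// defines what is computed, the "schedule" separately defines tiling,
// vectorization, and parallelism. Requires the Halide library.
#include "Halide.h"
#include <cstdio>
using namespace Halide;

Func blur_3x3(Func input) {
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // The algorithm: no commitment to evaluation order or storage.
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3.0f;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3.0f;

    // The schedule: tiling, vectorization, and parallelism, chosen separately.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    return blur_y;
}

int main() {
    Func input("input");
    Var x("x"), y("y");
    input(x, y) = cast<float>(x + y);               // synthetic input image
    Buffer<float> out = blur_3x3(input).realize({64, 64});
    std::printf("out(0,0) = %f\n", out(0, 0));
    return 0;
}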

1,074 citations

Journal ArticleDOI
TL;DR: In this review, methods to adjust the polar solvation energy and to improve the performance of MM/PBSA and MM/GBSA calculations are reviewed and discussed, and guidance is provided for practically applying these methods in drug design and related research fields.
Abstract: Molecular mechanics Poisson-Boltzmann surface area (MM/PBSA) and molecular mechanics generalized Born surface area (MM/GBSA) are arguably very popular methods for binding free energy prediction since they are more accurate than most scoring functions of molecular docking and less computationally demanding than alchemical free energy methods. MM/PBSA and MM/GBSA have been widely used in biomolecular studies such as protein folding, protein-ligand binding, protein-protein interaction, etc. In this review, methods to adjust the polar solvation energy and to improve the performance of MM/PBSA and MM/GBSA calculations are reviewed and discussed. The latest applications of MM/GBSA and MM/PBSA in drug design are also presented. This review intends to provide readers with guidance for practically applying MM/PBSA and MM/GBSA in drug design and related research fields.
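In outline, the quantity these end-point methods estimate has the standard textbook decomposition (stated here for orientation, not taken from this review): $\Delta G_{\mathrm{bind}} \approx \langle \Delta E_{\mathrm{MM}} \rangle + \langle \Delta G_{\mathrm{solv}} \rangle - T\Delta S$, where $\Delta E_{\mathrm{MM}} = \Delta E_{\mathrm{int}} + \Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}}$ and $\Delta G_{\mathrm{solv}} = \Delta G_{\mathrm{PB/GB}} + \Delta G_{\mathrm{SA}}$. The polar solvation term is computed with the Poisson-Boltzmann (PB) or generalized Born (GB) model, the nonpolar term is estimated from the solvent-accessible surface area, and the entropy term $-T\Delta S$ is often approximated or omitted.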

822 citations

Journal ArticleDOI
TL;DR: Docking against homology-modeled targets also becomes possible for proteins whose structures are not known, and the druggability of the compounds and their specificity against a particular target can be calculated for further lead optimization processes.
Abstract: Molecular docking methodology explores the behavior of small molecules in the binding site of a target protein. As more protein structures are determined experimentally using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, molecular docking is increasingly used as a tool in drug discovery. Docking against homology-modeled targets also becomes possible for proteins whose structures are not known. With the docking strategies, the druggability of the compounds and their specificity against a particular target can be calculated for further lead optimization processes. Molecular docking programs perform a search algorithm in which the conformation of the ligand is evaluated recursively until the convergence to the minimum energy is reached. Finally, an affinity scoring function, ΔG [U_total in kcal/mol], is employed to rank the candidate poses as the sum of the electrostatic and van der Waals energies. The driving forces for these specific interactions in biological systems aim toward complementarities between the shape and electrostatics of the binding site surfaces and the ligand or substrate.
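A generic force-field scoring function of the kind described above (illustrative, not tied to any particular docking program) ranks a pose by summing pairwise terms over protein-ligand atom pairs: $\Delta G \approx \sum_{i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}}$, where the first sum is the 12-6 van der Waals term, the second is the Coulomb electrostatic term with a (possibly distance-dependent) dielectric $\varepsilon(r_{ij})$, and $r_{ij}$ is the distance between atoms $i$ and $j$.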

817 citations