Author

Rezaul Chowdhury

Bio: Rezaul Chowdhury is an academic researcher from Stony Brook University. The author has contributed to research in topics including Cache and Cache-oblivious algorithms. The author has an h-index of 17 and has co-authored 83 publications receiving 1419 citations. Previous affiliations of Rezaul Chowdhury include the University of Texas at Austin and Boston University.


Papers
Journal ArticleDOI
TL;DR: Experimental results show that, in the average case, with the exception of creation-phase data movement, the algorithm outperforms the min-max heap of Strothotte in all other aspects.
Abstract: In this paper, we present improved algorithms for min–max pair heaps introduced by S. Olariu et al. (A Mergeable Double-ended Priority Queue – The Comp. J. 34, 423–427, 1991). We also show that in the worst case this structure, though slightly costlier to create, is better than the min–max heaps of Strothotte (Min–max Heaps and Generalized Priority Queues – CACM, 29(10), 996–1000, Oct. 1986) with respect to deletion, and is equally good for insertion when an improved technique using binary search is applied. Experimental results show that, in the average case, with the exception of creation-phase data movement, our algorithm outperforms the min–max heap of Strothotte in all other aspects.

3 citations
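
The insertion speedup mentioned above rests on a classical observation: the keys along the path from the root to any leaf of a heap are sorted, so the depth at which a new key belongs can be found by binary search over that path, using O(log log n) comparisons instead of one per level. The sketch below is a hypothetical illustration of that idea on an ordinary binary min-heap, not the min–max pair heap analyzed in the paper; data movement stays O(log n), only the comparison count shrinks.

```python
import heapq
import random

def insert_with_binary_search(heap, x):
    """Insert x into a binary min-heap stored in a Python list, locating its
    final depth by binary search over the root-to-leaf ancestor path (whose
    keys are sorted) instead of the usual one-comparison-per-level sift-up."""
    heap.append(x)
    i = len(heap) - 1
    path, j = [], i
    while j > 0:                       # collect ancestor indices of the new slot
        j = (j - 1) // 2
        path.append(j)
    path.reverse()                     # path[0] is the root
    lo, hi = 0, len(path)              # find the first ancestor whose key exceeds x
    while lo < hi:
        mid = (lo + hi) // 2
        if heap[path[mid]] > x:        # O(log log n) comparisons in total
            hi = mid
        else:
            lo = mid + 1
    pos = i                            # shift ancestors at depth >= lo down one level
    for k in range(len(path) - 1, lo - 1, -1):
        heap[pos] = heap[path[k]]
        pos = path[k]
    heap[pos] = x

# Sanity check against the standard library's binary heap.
data = [random.randint(0, 999) for _ in range(1000)]
h = []
for v in data:
    insert_with_binary_search(h, v)
assert [heapq.heappop(h) for _ in range(len(h))] == sorted(data)
```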

Proceedings ArticleDOI
TL;DR: In this article, the authors capture the races in a parallel program as a directed acyclic graph (DAG), formulate a space-time tradeoff problem for mitigating their cost with reducers, and give a pseudo-polynomial time algorithm for series-parallel DAGs.
Abstract: A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races often lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutual-exclusion lock or allowing atomic accesses to the memory location. However, such solutions can reduce parallelism by serializing all accesses to that location. For associative and commutative updates, reducers allow parallel race-free updates at the expense of using some extra space. We ask the following question. Given a fixed budget of extra space to mitigate the cost of races in a parallel program, which memory locations should be assigned reducers and how should the space be distributed among the reducers in order to minimize the overall running time? We argue that the races can be captured by a directed acyclic graph (DAG), with nodes representing memory cells and arcs representing read-write dependencies between cells. We then formulate our optimization problem on DAGs. We concentrate on a variation of this problem where space reuse among reducers is allowed by routing extra space along a source to sink path of the DAG and using it in the construction of reducers along the path. We consider two reducers and the corresponding duration functions (i.e., reduction time as a function of space budget). We generalize our race-avoiding space-time tradeoff problem to a discrete resource-time tradeoff problem with general non-increasing duration functions and resource reuse over paths. For general DAGs, the offline problem is strongly NP-hard under all three duration functions, and we give approximation algorithms. We also prove hardness of approximation for the general resource-time tradeoff problem and give a pseudo-polynomial time algorithm for series-parallel DAGs.

3 citations
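
The optimization problem described above can be made concrete with a toy instance: split a fixed space budget among DAG nodes, where each node's duration is a non-increasing function of the space it receives, so as to minimize the critical-path length. The brute-force sketch below uses a made-up four-node DAG and a made-up duration function purely for illustration; the paper's hardness results, approximation algorithms, and pseudo-polynomial algorithm for series-parallel DAGs are not reproduced here.

```python
from itertools import product

# Hypothetical toy instance: nodes are memory cells, arcs are read-write
# dependencies, and duration(v, s) is a made-up non-increasing function giving
# the time spent at v when it receives s units of extra reducer space.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
base = {"a": 8.0, "b": 6.0, "c": 4.0, "d": 10.0}

def duration(v, s):
    return base[v] / (1 + s)

def makespan(alloc):
    """Critical-path length of the DAG when node v costs duration(v, alloc[v])."""
    memo = {}
    def finish(v):
        if v not in memo:
            memo[v] = duration(v, alloc[v]) + max(
                (finish(u) for u in edges[v]), default=0.0)
        return memo[v]
    return max(finish(v) for v in edges)

def best_allocation(budget):
    """Exhaustive search over integer splits of the space budget among nodes;
    the paper instead gives approximation algorithms for general DAGs and a
    pseudo-polynomial algorithm for series-parallel DAGs."""
    nodes = list(edges)
    best = (float("inf"), None)
    for split in product(range(budget + 1), repeat=len(nodes)):
        if sum(split) <= budget:
            cand = dict(zip(nodes, split))
            best = min(best, (makespan(cand), cand), key=lambda t: t[0])
    return best

print(best_allocation(budget=3))   # (optimal makespan, allocation) for the toy DAG
```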

Proceedings ArticleDOI
01 Dec 2020
TL;DR: In this article, the authors propose a lightweight optimization approach for MapReduce systems that analyzes the observed frequency distribution of a repetitive task to identify an optimal offset parameter to add to the hash function, thereby minimizing the makespan.
Abstract: Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers, thereby reducing the overall execution time. In this paper, we propose a lightweight optimization approach for MapReduce systems to minimize the makespan for repetitive tasks involving a typical frequency distribution. Our idea is to analyze the observed frequency distribution for the given task so as to identify an optimal offset parameter $c$ to add to the hash function to minimize the makespan. For two different bucketing methods - modulo labeling and consecutive binning - we present efficient algorithms for finding the optimal value of $c$. Finally, we present simulation results for both bucketing methods. The results vary with the data distribution and the number of reducers, but the approach generally reduces the makespan by 20% on average for power-law distributions. These results are confirmed with experiments on well-known real-world data sets.

2 citations
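
To make the offset idea concrete, the sketch below simulates one simplified variant of consecutive binning: keys are hashed into B buckets as (k + c) mod B, and contiguous ranges of buckets map to reducers, so shifting c rotates heavy keys relative to the range boundaries and changes the makespan. The bucket count, reducer count, hash form, and brute-force search over c are all assumptions made for illustration; the paper's exact bucketing methods and efficient algorithms for choosing c are not reproduced here.

```python
import random
from collections import Counter

B, R = 64, 8   # assumed numbers of hash buckets and reducers (illustrative only)

def reducer_loads(freq, c):
    """Consecutive binning: key k lands in bucket (k + c) % B, and bucket b is
    assigned to reducer b * R // B, i.e. contiguous runs of buckets per reducer."""
    load = [0] * R
    for k, f in freq.items():
        load[((k + c) % B) * R // B] += f
    return load

def best_offset(freq):
    """Brute force over every offset; the paper derives the optimum with
    efficient algorithms driven by the observed frequency distribution."""
    return min(range(B), key=lambda c: max(reducer_loads(freq, c)))

# Skewed synthetic workload (power-law-like key frequencies).
keys = [int(random.paretovariate(1.2)) for _ in range(100_000)]
freq = Counter(keys)
c = best_offset(freq)
print("makespan with c = 0:", max(reducer_loads(freq, 0)))
print("best offset:", c, "makespan:", max(reducer_loads(freq, c)))
```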

Journal ArticleDOI
TL;DR: It has been shown that the construction of a min-max pair heap with n elements requires at least 2.07n element comparisons, and a new algorithm has been devised that lowers the upper bound to 2.43n.

Abstract: In this paper, lower and upper bounds for min-max pair heap construction are presented. It has been shown that the construction of a min-max pair heap with n elements requires at least 2.07n element comparisons. A new algorithm for creating min-max pair heaps has been devised that lowers the upper bound to 2.43n.

2 citations
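
For intuition about "comparisons per element" during heap construction, the sketch below builds a classic Atkinson-style min-max heap bottom-up and counts element comparisons. Note this is the standard min-max heap, not the min-max pair heap of Olariu et al. analyzed in the paper, so the constant it reports need not match the 2.07n lower bound or 2.43n upper bound quoted above.

```python
import random

comparisons = 0

def less(a, b):
    """Comparison wrapper so element comparisons can be counted."""
    global comparisons
    comparisons += 1
    return a < b

def on_min_level(i):
    # Levels alternate min/max; the root (index 0, level 0) is a min level.
    return ((i + 1).bit_length() - 1) % 2 == 0

def trickle_down(A, i):
    n, minlvl = len(A), on_min_level(i)
    while True:
        kids = [c for c in (2 * i + 1, 2 * i + 2) if c < n]
        if not kids:
            return
        cand = kids + [g for c in kids for g in (2 * c + 1, 2 * c + 2) if g < n]
        m = cand[0]
        for j in cand[1:]:             # extreme (smallest/largest) descendant
            if (less(A[j], A[m]) if minlvl else less(A[m], A[j])):
                m = j
        better = less(A[m], A[i]) if minlvl else less(A[i], A[m])
        if m in kids:
            if better:
                A[i], A[m] = A[m], A[i]
            return
        if not better:                 # m is a grandchild that needs no move
            return
        A[i], A[m] = A[m], A[i]        # pull the extreme grandchild up
        p = (m - 1) // 2
        if (less(A[p], A[m]) if minlvl else less(A[m], A[p])):
            A[m], A[p] = A[p], A[m]    # restore the intervening max/min level
        i = m                          # continue two levels down

def build_min_max_heap(A):
    """Bottom-up (Floyd-style) construction of a classic min-max heap."""
    for i in range(len(A) // 2 - 1, -1, -1):
        trickle_down(A, i)

n = 1 << 16
A = [random.random() for _ in range(n)]
build_min_max_heap(A)
assert A[0] == min(A) and max(A) == max(A[1:3])   # min at root, max on level 1
print(comparisons / n)                            # comparisons per element
```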

Proceedings ArticleDOI
01 May 2022
TL;DR: This paper outlines the design and implementation of the code generation approach in FOURST for automatically generating FFT-based stencil solvers, and discusses the performance profiles and limitations of both FOURST and a tiling-based code generator on high-end modern hardware.
Abstract: Stencil computations are ubiquitous in modern grid-based physical simulations. In this paper, we present FOURST, a compiler that generates programs computing time-iterated linear periodic and aperiodic stencil computations with fast Fourier transform methods. This paper outlines the design and implementation of the code generation approach in FOURST for automatically generating FFT-based stencil solvers. We present experimental results on the state-of-the-art Ookami supercomputer housing Fujitsu A64FX and Intel Skylake processors to study the performance of FOURST and the state-of-the-art tiling-based optimizing code generator PLuTo on various stencil shapes and varying numbers of time iterations. We discuss the performance profiles and limitations of both approaches on high-end modern hardware.

2 citations
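
The core idea behind FFT-based solvers for linear periodic stencils is that one time step is a circular convolution with a fixed kernel, so T steps amount to one forward FFT, a pointwise power of the kernel's transform, and one inverse FFT. The NumPy sketch below illustrates that equivalence on a 1D three-point stencil; it is a didactic example, not code generated by FOURST.

```python
import numpy as np

# One step of a linear, periodic stencil is a circular convolution with a
# fixed kernel, so T steps equal multiplying by the kernel's DFT raised to
# the T-th power in Fourier space.
n, T = 1 << 12, 1000
u0 = np.random.rand(n)

# 3-point heat-like stencil: u[i] <- 0.25*u[i-1] + 0.5*u[i] + 0.25*u[i+1]
kernel = np.zeros(n)
kernel[[-1, 0, 1]] = [0.25, 0.5, 0.25]

# Direct time iteration (the kind of loop nest a tiling compiler optimizes).
u = u0.copy()
for _ in range(T):
    u = 0.25 * np.roll(u, 1) + 0.5 * u + 0.25 * np.roll(u, -1)

# FFT route: one forward transform, a pointwise power, one inverse transform.
u_fft = np.real(np.fft.ifft(np.fft.fft(u0) * np.fft.fft(kernel) ** T))

print(np.max(np.abs(u - u_fft)))   # agreement up to floating-point error
```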


Cited by
01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently: those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and a 1840-node Intel Paragon performs up to 165 times faster than a single Cray C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

29,323 citations
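
The spatial-decomposition algorithm described above relies on binning atoms into cells no smaller than the force cutoff, so each atom interacts only with atoms in its own and neighboring cells. The sketch below shows that cell-list idea serially for Lennard-Jones forces in reduced units, with hypothetical parameters; the paper's algorithm additionally distributes the cells across message-passing processors, which is not shown here.

```python
import numpy as np
from itertools import product

def lj_forces_cell_list(pos, box, rc):
    """Short-range Lennard-Jones forces (reduced units) using a cell list:
    each atom interacts only with atoms in its own and neighboring cells.
    Serial version; the paper's spatial-decomposition algorithm distributes
    these cells across message-passing processors."""
    ncell = max(1, int(box // rc))                 # cells at least rc wide
    cell_of = (pos / (box / ncell)).astype(int) % ncell
    cells = {}
    for idx, c in enumerate(map(tuple, cell_of)):
        cells.setdefault(c, []).append(idx)
    forces = np.zeros_like(pos)
    for c, atoms in cells.items():
        neigh_cells = {tuple((np.array(c) + d) % ncell)
                       for d in product((-1, 0, 1), repeat=3)}
        neigh = [j for nc in neigh_cells for j in cells.get(nc, [])]
        for i in atoms:
            for j in neigh:
                if j <= i:                         # count each pair once
                    continue
                r = pos[i] - pos[j]
                r -= box * np.round(r / box)       # minimum-image convention
                r2 = float(r @ r)
                if r2 < rc * rc:
                    inv6 = 1.0 / r2 ** 3
                    f = (48.0 * inv6 * inv6 - 24.0 * inv6) / r2 * r
                    forces[i] += f
                    forces[j] -= f
    return forces

pos = np.random.rand(500, 3) * 10.0                # 500 atoms in a 10x10x10 box
f = lj_forces_cell_list(pos, box=10.0, rc=2.5)
print(f.shape, np.abs(f).max())
```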

Book
02 Jan 1991

1,377 citations

Proceedings ArticleDOI
16 Jun 2013
TL;DR: A systematic model of the tradeoff space fundamental to stencil pipelines is presented, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule are presented.
Abstract: Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

1,074 citations
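
The parallelism/locality/recomputation tradeoff described above can be illustrated without Halide by scheduling a two-stage blur (horizontal then vertical) in two ways: materializing the whole intermediate stage at once, versus fusing the producer into tiles of the consumer at the cost of recomputing overlapping rows, which is roughly what a Halide compute-at schedule expresses. The NumPy sketch below is an illustrative analogy, not Halide-generated code.

```python
import numpy as np

def blur_breadth_first(im):
    """Stage-at-a-time schedule: materialize all of blur_x, then run blur_y.
    No redundant work, but the full intermediate streams through memory."""
    bx = (im[:, :-2] + im[:, 1:-1] + im[:, 2:]) / 3.0
    return (bx[:-2] + bx[1:-1] + bx[2:]) / 3.0

def blur_tiled_fused(im, tile=64):
    """Fused-per-tile schedule: compute just the slice of blur_x each output
    tile needs, recomputing the two overlapping rows at every tile boundary
    in exchange for locality."""
    h, w = im.shape[0] - 2, im.shape[1] - 2
    out = np.empty((h, w))
    for y0 in range(0, h, tile):
        y1 = min(y0 + tile, h)
        bx = (im[y0:y1 + 2, :-2] + im[y0:y1 + 2, 1:-1] + im[y0:y1 + 2, 2:]) / 3.0
        out[y0:y1] = (bx[:-2] + bx[1:-1] + bx[2:]) / 3.0
    return out

im = np.random.rand(1024, 1024)
assert np.allclose(blur_breadth_first(im), blur_tiled_fused(im))
```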

Journal ArticleDOI
TL;DR: In this review, methods to adjust the polar solvation energy and to improve the performance of MM/PBSA and MM/GBSA calculations are reviewed and discussed, and guidance is provided for practically applying these methods in drug design and related research fields.
Abstract: Molecular mechanics Poisson-Boltzmann surface area (MM/PBSA) and molecular mechanics generalized Born surface area (MM/GBSA) are arguably very popular methods for binding free energy prediction since they are more accurate than most scoring functions of molecular docking and less computationally demanding than alchemical free energy methods. MM/PBSA and MM/GBSA have been widely used in biomolecular studies such as protein folding, protein-ligand binding, protein-protein interaction, etc. In this review, methods to adjust the polar solvation energy and to improve the performance of MM/PBSA and MM/GBSA calculations are reviewed and discussed. The latest applications of MM/GBSA and MM/PBSA in drug design are also presented. This review intends to provide readers with guidance for practically applying MM/PBSA and MM/GBSA in drug design and related research fields.

822 citations
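
For context, the quantity MM/PBSA and MM/GBSA estimate is usually written with the standard end-state decomposition below; this is textbook background stated from general knowledge, not an equation quoted from the review.

```latex
\Delta G_{\text{bind}} \approx \langle G_{\text{complex}}\rangle
  - \langle G_{\text{receptor}}\rangle - \langle G_{\text{ligand}}\rangle,
\qquad
G \approx E_{\text{MM}} + G_{\text{polar}}^{\text{PB or GB}}
  + G_{\text{nonpolar}} - T S_{\text{conf}}
```

Here E_MM is the gas-phase molecular mechanics energy, the polar solvation term comes from the Poisson-Boltzmann or generalized Born model, the nonpolar term is commonly taken proportional to the solvent-accessible surface area, and the entropy term is often approximated or omitted; the adjustments surveyed in the review concern chiefly the polar solvation term.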

Journal ArticleDOI
TL;DR: Docking against homology-modeled targets also becomes possible for proteins whose structures are not known, and the druggability of the compounds and their specificity against a particular target can be calculated for further lead optimization processes.
Abstract: Molecular docking methodology explores the behavior of small molecules in the binding site of a target protein. As more protein structures are determined experimentally using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, molecular docking is increasingly used as a tool in drug discovery. Docking against homology-modeled targets also becomes possible for proteins whose structures are not known. With the docking strategies, the druggability of the compounds and their specificity against a particular target can be calculated for further lead optimization processes. Molecular docking programs perform a search algorithm in which the conformation of the ligand is evaluated recursively until convergence to the minimum energy is reached. Finally, an affinity scoring function, ΔG (U_total in kcal/mol), is employed to rank the candidate poses as the sum of the electrostatic and van der Waals energies. The driving forces for these specific interactions in biological systems aim toward complementarities between the shape and electrostatics of the binding site surfaces and the ligand or substrate.

817 citations
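
The "sum of the electrostatic and van der Waals energies" mentioned above typically takes the familiar pairwise force-field form sketched below; this is the generic textbook form, not an equation quoted from the article, and practical scoring functions add solvation, hydrogen-bond, and entropy terms on top of it.

```latex
\Delta G \approx \sum_{i \in \text{ligand}} \sum_{j \in \text{protein}}
\left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
       + \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} \right)
```

with a 12-6 Lennard-Jones term for the van der Waals contribution and a Coulomb term (often with a distance-dependent dielectric) for electrostatics.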