
Showing papers by "Rezaul Chowdhury published in 2015"


Proceedings ArticleDOI
24 Jan 2015
TL;DR: Techniques are applied to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, stencil computations, and LCS, to remove artificial dependencies while preserving cache-optimality by inheriting the DAC strategy.
Abstract: State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of the recursive subdivisions introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependencies by scheduling each task for execution as soon as all of its real dependency constraints are satisfied, while preserving cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of the 2-way DAC-based Floyd-Warshall algorithm on an $n$-node graph from $\Theta(n \log^2 n)$ to $\Theta(n)$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $\Theta((d^2 h)\, w^{\log(d+2)-1})$ to $\Theta(h)$, and LCS on two sequences of length $n$ each from $\Theta(n^{\log_2 3})$ to $\Theta(n)$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$-$5\times$ improvement in absolute running time, a $10$-$20\times$ improvement in burdened span as measured by Cilkview, and approximately the same number of L1/L2 cache misses as measured by PAPI.
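The LCS span gap is easy to visualize. In the standard LCS table, all cells on one anti-diagonal are mutually independent, so a wavefront schedule can fill the 2n-1 diagonals in Theta(n) parallel steps, whereas 2-way DAC recursion serializes quadrants and yields a Theta(n^{log2 3}) span. A minimal sketch, not the paper's implementation (the inner loop over a diagonal is the part that would run in parallel):

```python
def lcs_wavefront(a, b):
    # Standard LCS table, filled in wavefront (anti-diagonal) order.
    # All cells with i + j == k depend only on diagonals k-1 and k-2,
    # so each inner loop could run fully in parallel: 2n-1 diagonals
    # give a Theta(n) span, versus Theta(n^{log2 3}) for 2-way DAC.
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(2, n + m + 1):
        # this loop is the parallel wavefront (simulated sequentially here)
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T[n][m]
```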

29 citations


Proceedings ArticleDOI
25 May 2015
TL;DR: This paper implements parallel CORDAC algorithms for four non-trivial DP problems, namely the parenthesization problem, Floyd-Warshall's all-pairs shortest path, sequence alignment with general gap penalty (the gap problem), and protein accordion folding, and shows that the base cases of these algorithms are predominantly matrix-multiplication-like (MM-like) flexible kernels that expose many optimization opportunities not offered by traditional looping DP codes.
Abstract: Dynamic Programming (DP) problems arise in a wide range of application areas spanning from logistics to computational biology. In this paper, we show how to obtain high-performing parallel implementations for a class of DP problems by reducing them to highly utilizable flexible kernels through cache-oblivious recursive divide-and-conquer (CORDAC). We implement parallel CORDAC algorithms for four non-trivial DP problems, namely the parenthesization problem, Floyd-Warshall's all-pairs shortest path (FW-APSP), sequence alignment with general gap penalty (gap problem) and protein accordion folding. To the best of our knowledge our algorithms for protein accordion folding and the gap problem are novel. All four algorithms have asymptotically optimal cache performance, and all but FW-APSP have asymptotically more parallelism than their looping counterparts. We show that the base cases of our CORDAC algorithms are predominantly matrix-multiplication-like (MM-like) flexible kernels that expose many optimization opportunities not offered by traditional looping DP codes. As a result, one can obtain highly efficient DP implementations by optimizing those flexible kernels only. Our implementations achieve 5 -- 150× speedup over their standard loop-based DP counterparts while consuming an order of magnitude less energy on modern multicore machines with 16 -- 32 cores. We also compare our implementations with parallel tiled codes generated by existing polyhedral compilers: Polly, PoCC and PLuTo, and show that our implementations run significantly faster. Finally, we present results on manycores (Intel Xeon Phi) and clusters of multicores obtained using simple extensions for SIMD and distributed-shared-memory architectures, respectively, demonstrating the versatility of our approach. Our optimization approach is highly systematic and suitable for automation.
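The MM-like kernel at the heart of such codes can be sketched as a min-plus update with the same triple-loop structure as matrix multiplication. This is a generic illustration, not the paper's tuned kernel; in the real CORDAC FW-APSP the three arguments may alias one another, which is why the k-loop is kept outermost here:

```python
def mm_like_kernel(X, U, V):
    # Min-plus "matrix multiplication" update:
    #   X[i][j] = min(X[i][j], U[i][k] + V[k][j]) for all k.
    # Same triple-loop structure as GEMM, so tiling and vectorization
    # optimizations carry over. The k-loop is outermost so the update
    # also matches the Floyd-Warshall order when X, U, V alias.
    n = len(X)
    for k in range(n):
        for i in range(n):
            uik = U[i][k]
            for j in range(n):
                s = uik + V[k][j]
                if s < X[i][j]:
                    X[i][j] = s
    return X
```

With distinct X, U, V this is the fully independent ("D") case the paper identifies as dominating the base cases of the recursion.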

17 citations


Proceedings ArticleDOI
09 Sep 2015
TL;DR: A hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme is presented, which achieves more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms.
Abstract: Motivation. Despite several reported acceleration successes of programmable GPUs (Graphics Processing Units) for molecular modeling and simulation tools, the general focus has been on fast computation with small molecules. This was primarily due to the limited memory size on the GPU. Moreover, simultaneous use of CPU and GPU cores for a single kernel execution -- a necessity for achieving high parallelism -- has also not been fully considered. Results. We present fast computation methods for molecular mechanical (Lennard-Jones and Coulombic) and generalized Born solvation energetics which run on commodity multicore CPUs and manycore GPUs. The key idea is to trade off accuracy of pairwise, long-range atomistic energetics for higher speed of execution. A simple yet efficient CUDA kernel for GPU acceleration is presented which ensures high arithmetic intensity and memory efficiency. Our CUDA kernel uses a cache-friendly, recursive and linear-space octree data structure to handle very large molecular structures with up to several million atoms. Based on this CUDA kernel, we present a hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme. Our CUDA kernels achieve more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms. The hybrid method is shown to be able to achieve the best performance for all values of the approximation parameter. Availability. The source code and binaries are freely available as PMEOPA (Parallel Molecular Energetics using Octree Pairwise Approximation) and downloadable from http://cvcweb.ices.utexas.edu/software.
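The accuracy-for-speed trade-off can be illustrated at a single level of the hierarchy: a distant group of atoms is replaced by its total charge placed at its center of charge, turning |A|·|B| pairwise terms into one. This is a hedged sketch only; the paper's method applies such approximations recursively through an octree, with an approximation parameter deciding which cell pairs count as far apart:

```python
import math

def exact_coulomb(ga, gb):
    # Exact pairwise Coulomb-style energy between two atom groups,
    # each atom given as (x, y, z, q): sum of q1*q2 / r over all
    # cross pairs (physical constants omitted for simplicity).
    e = 0.0
    for (x1, y1, z1, q1) in ga:
        for (x2, y2, z2, q2) in gb:
            e += q1 * q2 / math.dist((x1, y1, z1), (x2, y2, z2))
    return e

def monopole_approx(ga, gb):
    # Far-field approximation: each group collapsed to its total charge
    # at its center of charge -- one division instead of |ga|*|gb|.
    # Assumes each group's total charge is nonzero.
    def center(g):
        q = sum(a[3] for a in g)
        return tuple(sum(a[i] * a[3] for a in g) / q for i in range(3)) + (q,)
    (x1, y1, z1, q1), (x2, y2, z2, q2) = center(ga), center(gb)
    return q1 * q2 / math.dist((x1, y1, z1), (x2, y2, z2))
```

For well-separated groups the relative error shrinks quadratically with (group radius / separation), which is the knob the approximation parameter turns.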

7 citations


Book ChapterDOI
10 Sep 2015
TL;DR: New generation sequencing produces massive read sets, making compression of sequence read files an important problem; since the sequential order of the reads typically conveys no biologically significant information, the reads may be freely reordered so as to facilitate compression.
Abstract: New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for many problems the two orientations of a read (original or reverse complement) are indistinguishable from an information-theoretic perspective, providing the freedom to optimize the orientation of each read.
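A minimal sketch of both freedoms (illustrative only; the chapter's actual reordering strategy is more sophisticated than plain sorting): each read is replaced by the lexicographically smaller of itself and its reverse complement, and the canonicalized reads are sorted so that similar reads become adjacent for a downstream general-purpose compressor to exploit.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def canonical(read):
    # A read and its reverse complement carry the same information,
    # so pick a canonical orientation: the lexicographic minimum.
    rc = read.translate(_COMP)[::-1]
    return min(read, rc)

def reorder_for_compression(reads):
    # Read order is biologically meaningless, so sort canonicalized
    # reads: overlapping/similar reads end up adjacent, which helps a
    # downstream compressor find long matches.
    return sorted(canonical(r) for r in reads)
```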

3 citations



Proceedings ArticleDOI
09 Sep 2015
TL;DR: The problem of computing DICA score of a given sequence of tRNAs is transformed to a polynomial multiplication problem, which can be solved in O(n log n) time using Fast Fourier Transform (FFT).
Abstract: The availability of synonymous codons (codons that can translate the same amino acid into protein) enables a protein to be encoded by many different sequences of codons/tRNAs. Autocorrelation measures the reuse of a particular codon/tRNA in succession (instead of choosing a different synonymous one) during the translation of a protein sequence. Studies show that tRNA autocorrelation in a coding sequence has important effects on its translation speed. Two different metrics available in the literature to measure autocorrelation are TPI (tRNA pairing index) and DICA (Distance Incorporated Codon Autocorrelation). TPI measures autocorrelation in sequences by counting successive transitions of tRNA usage, without considering how far apart they are in the sequence, whereas DICA measures autocorrelation by weighing the positional distance between codons in addition to the number of transitions. It has been shown that DICA correlates better to gene expression speed than TPI due to its incorporation of distance in the measure. The naive algorithm to compute the DICA score takes time quadratic in the sequence length n, which can be very expensive for long amino acid sequences. This motivates us to propose a faster algorithm for computing DICA. In this paper we show how to transform the problem of computing the DICA score of a given sequence of tRNAs to a polynomial multiplication problem, which can then be solved in O(n log n) time using the Fast Fourier Transform (FFT). The asymptotic reduction of complexity can improve the performance of DICA computation significantly, especially for long sequences.
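The reduction can be sketched as follows. For each tRNA type, build a 0/1 indicator vector over the sequence; the number of same-type pairs at every distance d is an autocorrelation, which appears in the coefficients of the product of the indicator polynomial with its own reverse, computable in O(n log n) by FFT; the distance weights are then applied in O(n). The sketch below uses an illustrative 1/d weight (the actual DICA weighting and normalization follow the paper and are omitted here) and a textbook radix-2 FFT:

```python
import cmath

def _fft(a, sign):
    # Textbook radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    # sign = -1 for the forward transform, +1 for the (unscaled) inverse.
    n = len(a)
    if n == 1:
        return a[:]
    even = _fft(a[0::2], sign)
    odd = _fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def _convolve(a, b):
    # Polynomial multiplication in O(n log n) via FFT.
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = _fft([complex(v) for v in a] + [0j] * (size - len(a)), -1)
    fb = _fft([complex(v) for v in b] + [0j] * (size - len(b)), -1)
    inv = _fft([x * y for x, y in zip(fa, fb)], +1)
    return [v.real / size for v in inv]

def fast_autocorr_score(seq, w):
    # Sum of w(j - i) over all pairs i < j with seq[i] == seq[j].
    # Multiplying the indicator polynomial of each tRNA type by its
    # reverse places sum_i x[i]*x[i+d] at coefficient n-1+d.
    n = len(seq)
    score = 0.0
    for t in set(seq):
        x = [1.0 if s == t else 0.0 for s in seq]
        conv = _convolve(x, x[::-1])
        score += sum(w(d) * conv[n - 1 + d] for d in range(1, n))
    return score
```

A naive double loop over all pairs reproduces the same score in O(n^2) per type, which is the quadratic cost the FFT reduction avoids.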