
Showing papers by "Rezaul Chowdhury published in 2015"


Proceedings ArticleDOI
24 Jan 2015
TL;DR: Techniques are applied to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, stencil computations, and LCS, to remove artificial dependencies while preserving cache-optimality by inheriting the DAC strategy.
Abstract: State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of the recursive subdivisions introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependencies by scheduling each task for execution as soon as all of its real dependency constraints are satisfied, while preserving cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of the 2-way DAC-based Floyd-Warshall algorithm on an $n$-node graph from $\Theta(n \log^2 n)$ to $\Theta(n)$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $\Theta((d^2 h)\, w^{\log(d+2)-1})$ to $\Theta(h)$, and LCS on two sequences of length $n$ each from $\Theta(n^{\log_2 3})$ to $\Theta(n)$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$-$5\times$ improvement in absolute running time, a $10$-$20\times$ improvement in burdened span as measured by Cilkview, and approximately the same number of L1/L2 cache misses as measured by PAPI.
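The LCS span gap is easy to visualize. In the standard LCS table, all cells on one anti-diagonal are mutually independent, so a wavefront schedule can fill the 2n-1 diagonals in Theta(n) parallel steps, whereas 2-way DAC recursion serializes quadrants and yields a Theta(n^{log2 3}) span. A minimal sketch, not the paper's implementation (the inner loop over a diagonal is the part that would run in parallel):

```python
def lcs_wavefront(a, b):
    # Standard LCS table, filled in wavefront (anti-diagonal) order.
    # All cells with i + j == k depend only on diagonals k-1 and k-2,
    # so each inner loop could run fully in parallel: 2n-1 diagonals
    # give a Theta(n) span, versus Theta(n^{log2 3}) for 2-way DAC.
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(2, n + m + 1):
        # this loop is the parallel wavefront (simulated sequentially here)
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T[n][m]
```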

29 citations


Proceedings ArticleDOI
25 May 2015
TL;DR: This paper implements parallel CORDAC algorithms for four non-trivial DP problems, namely the parenthesization problem, Floyd-Warshall's all-pairs shortest path, sequence alignment with general gap penalty (the gap problem), and protein accordion folding, and shows that the base cases of these algorithms are predominantly matrix-multiplication-like (MM-like) flexible kernels that expose many optimization opportunities not offered by traditional looping DP codes.
Abstract: Dynamic Programming (DP) problems arise in a wide range of application areas spanning from logistics to computational biology. In this paper, we show how to obtain high-performing parallel implementations for a class of DP problems by reducing them to highly utilizable flexible kernels through cache-oblivious recursive divide-and-conquer (CORDAC). We implement parallel CORDAC algorithms for four non-trivial DP problems, namely the parenthesization problem, Floyd-Warshall's all-pairs shortest path (FW-APSP), sequence alignment with general gap penalty (gap problem) and protein accordion folding. To the best of our knowledge our algorithms for protein accordion folding and the gap problem are novel. All four algorithms have asymptotically optimal cache performance, and all but FW-APSP have asymptotically more parallelism than their looping counterparts. We show that the base cases of our CORDAC algorithms are predominantly matrix-multiplication-like (MM-like) flexible kernels that expose many optimization opportunities not offered by traditional looping DP codes. As a result, one can obtain highly efficient DP implementations by optimizing those flexible kernels only. Our implementations achieve 5 -- 150× speedup over their standard loop-based DP counterparts while consuming an order of magnitude less energy on modern multicore machines with 16 -- 32 cores. We also compare our implementations with parallel tiled codes generated by existing polyhedral compilers: Polly, PoCC and PLuTo, and show that our implementations run significantly faster. Finally, we present results on manycores (Intel Xeon Phi) and clusters of multicores obtained using simple extensions for SIMD and distributed-shared-memory architectures, respectively, demonstrating the versatility of our approach. Our optimization approach is highly systematic and suitable for automation.
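The MM-like kernel at the heart of such codes can be sketched as a min-plus update with the same triple-loop structure as matrix multiplication. This is a generic illustration, not the paper's tuned kernel; in the real CORDAC FW-APSP the three arguments may alias one another, which is why the k-loop is kept outermost here:

```python
def mm_like_kernel(X, U, V):
    # Min-plus "matrix multiplication" update:
    #   X[i][j] = min(X[i][j], U[i][k] + V[k][j]) for all k.
    # Same triple-loop structure as GEMM, so tiling and vectorization
    # optimizations carry over. The k-loop is outermost so the update
    # also matches the Floyd-Warshall order when X, U, V alias.
    n = len(X)
    for k in range(n):
        for i in range(n):
            uik = U[i][k]
            for j in range(n):
                s = uik + V[k][j]
                if s < X[i][j]:
                    X[i][j] = s
    return X
```

With distinct X, U, V this is the fully independent ("D") case the paper identifies as dominating the base cases of the recursion.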

17 citations


Proceedings ArticleDOI
09 Sep 2015
TL;DR: A hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme is presented, which achieves more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms.
Abstract: Motivation. Despite several reported acceleration successes of programmable GPUs (Graphics Processing Units) for molecular modeling and simulation tools, the general focus has been on fast computation with small molecules. This was primarily due to the limited memory size on the GPU. Moreover, simultaneous use of CPU and GPU cores for a single kernel execution -- a necessity for achieving high parallelism -- has also not been fully considered. Results. We present fast computation methods for molecular mechanical (Lennard-Jones and Coulombic) and generalized Born solvation energetics which run on commodity multicore CPUs and manycore GPUs. The key idea is to trade off accuracy of pairwise, long-range atomistic energetics for higher speed of execution. A simple yet efficient CUDA kernel for GPU acceleration is presented which ensures high arithmetic intensity and memory efficiency. Our CUDA kernel uses a cache-friendly, recursive and linear-space octree data structure to handle very large molecular structures with up to several million atoms. Based on this CUDA kernel, we present a hybrid method which simultaneously exploits both CPU and GPU cores to provide the best performance based on selected parameters of the approximation scheme. Our CUDA kernels achieve more than two orders of magnitude speedup over serial computation for many of the molecular energetics terms. The hybrid method is shown to be able to achieve the best performance for all values of the approximation parameter. Availability. The source code and binaries are freely available as PMEOPA (Parallel Molecular Energetics using Octree Pairwise Approximation) and downloadable from http://cvcweb.ices.utexas.edu/software.
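The accuracy-for-speed trade-off can be illustrated at a single level of the hierarchy: a distant group of atoms is replaced by its total charge placed at its center of charge, turning |A|·|B| pairwise terms into one. This is a hedged sketch only; the paper's method applies such approximations recursively through an octree, with an approximation parameter deciding which cell pairs count as far apart:

```python
import math

def exact_coulomb(ga, gb):
    # Exact pairwise Coulomb-style energy between two atom groups,
    # each atom given as (x, y, z, q): sum of q1*q2 / r over all
    # cross pairs (physical constants omitted for simplicity).
    e = 0.0
    for (x1, y1, z1, q1) in ga:
        for (x2, y2, z2, q2) in gb:
            e += q1 * q2 / math.dist((x1, y1, z1), (x2, y2, z2))
    return e

def monopole_approx(ga, gb):
    # Far-field approximation: each group collapsed to its total charge
    # at its center of charge -- one division instead of |ga|*|gb|.
    # Assumes each group's total charge is nonzero.
    def center(g):
        q = sum(a[3] for a in g)
        return tuple(sum(a[i] * a[3] for a in g) / q for i in range(3)) + (q,)
    (x1, y1, z1, q1), (x2, y2, z2, q2) = center(ga), center(gb)
    return q1 * q2 / math.dist((x1, y1, z1), (x2, y2, z2))
```

For well-separated groups the relative error shrinks quadratically with (group radius / separation), which is the knob the approximation parameter turns.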

7 citations


Book ChapterDOI
10 Sep 2015
TL;DR: New generation sequencing produces massive read sets, making compression of sequence read files an important problem; since the sequential order of the reads typically conveys no biologically significant information, the reads may be freely reordered so as to facilitate compression.
Abstract: New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for many problems the two orientations of a read (original or reverse complement) are indistinguishable from an information-theoretic perspective, providing the freedom to optimize the orientation of each read.
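A minimal sketch of both freedoms (illustrative only; the chapter's actual reordering strategy is more sophisticated than plain sorting): each read is replaced by the lexicographically smaller of itself and its reverse complement, and the canonicalized reads are sorted so that similar reads become adjacent for a downstream general-purpose compressor to exploit.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def canonical(read):
    # A read and its reverse complement carry the same information,
    # so pick a canonical orientation: the lexicographic minimum.
    rc = read.translate(_COMP)[::-1]
    return min(read, rc)

def reorder_for_compression(reads):
    # Read order is biologically meaningless, so sort canonicalized
    # reads: overlapping/similar reads end up adjacent, which helps a
    # downstream compressor find long matches.
    return sorted(canonical(r) for r in reads)
```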

3 citations



Proceedings ArticleDOI
09 Sep 2015
TL;DR: The problem of computing DICA score of a given sequence of tRNAs is transformed to a polynomial multiplication problem, which can be solved in O(n log n) time using Fast Fourier Transform (FFT).
Abstract: The availability of synonymous codons (codons that can translate the same amino acid into protein) enables a protein to be encoded by many different sequences of codons/tRNAs. Autocorrelation measures the reuse of a particular codon/tRNA in succession (instead of choosing a different synonymous one) during the translation of a protein sequence. Studies show that tRNA autocorrelation in a coding sequence has important effects on its translation speed. Two different metrics available in the literature to measure autocorrelation are TPI (tRNA pairing index) and DICA (Distance Incorporated Codon Autocorrelation). TPI measures autocorrelation in sequences by counting successive transitions of tRNA usage, without considering how far apart they are in the sequence, whereas DICA measures autocorrelation by weighing the positional distance between codons in addition to the number of transitions. It has been shown that DICA correlates better to gene expression speed than TPI due to its incorporation of distance in the measure. The naive algorithm to compute the DICA score takes time quadratic in the sequence length n, which can be very expensive for long amino acid sequences. This motivates us to propose a faster algorithm for computing DICA. In this paper we show how to transform the problem of computing the DICA score of a given sequence of tRNAs to a polynomial multiplication problem, which can then be solved in O(n log n) time using the Fast Fourier Transform (FFT). The asymptotic reduction of complexity can improve the performance of DICA computation significantly, especially for long sequences.
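The reduction can be sketched as follows. For each tRNA type, build a 0/1 indicator vector over the sequence; the number of same-type pairs at every distance d is an autocorrelation, which appears in the coefficients of the product of the indicator polynomial with its own reverse, computable in O(n log n) by FFT; the distance weights are then applied in O(n). The sketch below uses an illustrative 1/d weight (the actual DICA weighting and normalization follow the paper and are omitted here) and a textbook radix-2 FFT:

```python
import cmath

def _fft(a, sign):
    # Textbook radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    # sign = -1 for the forward transform, +1 for the (unscaled) inverse.
    n = len(a)
    if n == 1:
        return a[:]
    even = _fft(a[0::2], sign)
    odd = _fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def _convolve(a, b):
    # Polynomial multiplication in O(n log n) via FFT.
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = _fft([complex(v) for v in a] + [0j] * (size - len(a)), -1)
    fb = _fft([complex(v) for v in b] + [0j] * (size - len(b)), -1)
    inv = _fft([x * y for x, y in zip(fa, fb)], +1)
    return [v.real / size for v in inv]

def fast_autocorr_score(seq, w):
    # Sum of w(j - i) over all pairs i < j with seq[i] == seq[j].
    # Multiplying the indicator polynomial of each tRNA type by its
    # reverse places sum_i x[i]*x[i+d] at coefficient n-1+d.
    n = len(seq)
    score = 0.0
    for t in set(seq):
        x = [1.0 if s == t else 0.0 for s in seq]
        conv = _convolve(x, x[::-1])
        score += sum(w(d) * conv[n - 1 + d] for d in range(1, n))
    return score
```

A naive double loop over all pairs reproduces the same score in O(n^2) per type, which is the quadratic cost the FFT reduction avoids.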