Showing papers by "Rezaul Chowdhury" published in 2019


01 Jan 2019
TL;DR: In this paper, the authors argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings.
Abstract: We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
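A minimal sketch of the recursive divide-and-conquer idea applied to a dynamic program. The abstract does not name a specific problem, so this example assumes all-pairs shortest paths and uses the well-known recursive Kleene-style formulation with min-plus products. The function names are illustrative, and the block copies made here are for readability, not cache efficiency.

```python
INF = float("inf")

def min_plus(X, Y):
    # (min, +) matrix product: out[i][j] = min over k of X[i][k] + Y[k][j]
    return [[min(x + y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def ew_min(X, Y):
    # Elementwise minimum of two equally sized matrices
    return [[min(a, b) for a, b in zip(r, s)] for r, s in zip(X, Y)]

def r_kleene(M):
    """Shortest-path closure of a square weight matrix (no negative cycles assumed)."""
    n = len(M)
    if n == 1:
        return [[min(M[0][0], 0)]]  # a vertex reaches itself at cost 0
    h = n // 2
    # Split into quadrants: A (top-left), B (top-right), C (bottom-left), D (bottom-right)
    A = [row[:h] for row in M[:h]]
    B = [row[h:] for row in M[:h]]
    C = [row[:h] for row in M[h:]]
    D = [row[h:] for row in M[h:]]
    A = r_kleene(A)                          # paths using only first-half vertices
    B = min_plus(A, B)
    C = min_plus(C, A)
    D = r_kleene(ew_min(D, min_plus(C, B)))  # second half, with detours through the first
    B = min_plus(B, D)
    C = min_plus(D, C)
    A = ew_min(A, min_plus(B, C))            # first half, with detours through the second
    return [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]
```

Each recursive call works on a quadrant a quarter the size of its parent, which is where the temporal locality described in the abstract comes from: once a subproblem fits in a given level of cache, it can be solved without further traffic to slower memory.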

7 citations


Book ChapterDOI
16 Jun 2019
TL;DR: It is shown that a class of grid-based parallel recursive divide-and-conquer algorithms can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.
Abstract: We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure.

4 citations


Proceedings ArticleDOI
17 Jun 2019
TL;DR: It is argued that under reasonable conditions the races of a program can be captured by a directed acyclic graph (DAG), with nodes representing memory cells and arcs representing read-write dependencies between cells, and proves hardness of approximation for the general resource-time tradeoff problem.
Abstract: A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races are often undesirable as they can lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutual-exclusion lock with the memory location in question or allowing atomic accesses to it. However, such solutions can reduce parallelism by serializing all accesses to that location. For associative and commutative updates to a memory cell, one can instead use a reducer, which allows parallel race-free updates at the expense of using some extra space. More extra space usually leads to more parallel updates, which in turn contributes to potentially lowering the overall execution time of the program. We start by asking the following question. Given a fixed budget of extra space for mitigating the cost of races in a parallel program, which memory locations should be assigned reducers and how should the space be distributed among those reducers in order to minimize the overall running time? We argue that under reasonable conditions the races of a program can be captured by a directed acyclic graph (DAG), with nodes representing memory cells and arcs representing read-write dependencies between cells. We then formulate our original question as an optimization problem on this DAG. We concentrate on a variation of this problem where space reuse among reducers is allowed by routing every unit of extra space along a (possibly different) source to sink path of the DAG and using it in the construction of multiple (possibly zero) reducers along the path. We consider two different ways of constructing a reducer and the corresponding duration functions (i.e., reduction time as a function of space budget). 
We generalize our race-avoiding space-time tradeoff problem to a discrete resource-time tradeoff problem with general non-increasing duration functions and resource reuse over paths of the given DAG. For general DAGs, we show that even if the entire DAG is available offline the problem is strongly NP-hard under all three duration functions, and we give approximation algorithms for solving the corresponding optimization problems. We also prove hardness of approximation for the general resource-time tradeoff problem and give a pseudo-polynomial time algorithm for series-parallel DAGs.
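To illustrate the space-for-parallelism tradeoff that a reducer offers, here is a minimal sketch, not the construction analyzed in the paper. It assumes an associative and commutative operation, gives each worker thread a private slot, and combines the slots serially at the end. The name `reduce_with_slots` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def reduce_with_slots(updates, slots, op=operator.add, identity=0):
    """Apply all updates to one logical memory cell using `slots` private cells."""
    partial = [identity] * slots  # the extra space spent on this reducer

    def worker(s):
        acc = identity
        for u in updates[s::slots]:  # every slot-th update goes to worker s
            acc = op(acc, u)
        partial[s] = acc             # each worker writes only its own slot: no race

    with ThreadPoolExecutor(max_workers=slots) as pool:
        list(pool.map(worker, range(slots)))  # wait for all workers to finish

    # Final combine; valid in any order because op is associative and commutative
    return reduce(op, partial, identity)
```

With `slots=1` every update is serialized, much as with a lock; a larger `slots` value lets more updates proceed concurrently at the cost of more space, which is the budget the optimization problem distributes among memory locations.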

4 citations


Proceedings ArticleDOI
16 Feb 2019
TL;DR: This work extends recursive divide-&-conquer algorithms to run efficiently also on manycore GPUs and distributed-memory machines without changing their basic structure, and shows that these algorithms are work-optimal and have low latency and bandwidth bounds.
Abstract: Recursive divide-&-conquer algorithms are known for solving dynamic programming (DP) problems efficiently on shared-memory multicore machines. In this work, we extend them to run efficiently also on manycore GPUs and distributed-memory machines without changing their basic structure. Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive r-way divide and conquer, where r (≥ 2) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds. We also report empirical results for our algorithms.

3 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors model the races of a parallel program as a directed acyclic graph (DAG) in order to minimize the cost of mitigating them under a fixed budget of extra space, and give a pseudo-polynomial time algorithm for series-parallel DAGs.
Abstract: A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races often lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutual-exclusion lock with the memory location or allowing atomic accesses to it. However, such solutions can reduce parallelism by serializing all accesses to that location. For associative and commutative updates, reducers allow parallel race-free updates at the expense of using some extra space. We ask the following question. Given a fixed budget of extra space to mitigate the cost of races in a parallel program, which memory locations should be assigned reducers and how should the space be distributed among the reducers in order to minimize the overall running time? We argue that the races can be captured by a directed acyclic graph (DAG), with nodes representing memory cells and arcs representing read-write dependencies between cells. We then formulate our optimization problem on DAGs. We concentrate on a variation of this problem where space reuse among reducers is allowed by routing extra space along a source to sink path of the DAG and using it in the construction of reducers along the path. We consider two reducers and the corresponding duration functions (i.e., reduction time as a function of space budget). We generalize our race-avoiding space-time tradeoff problem to a discrete resource-time tradeoff problem with general non-increasing duration functions and resource reuse over paths. For general DAGs, the offline problem is strongly NP-hard under all three duration functions, and we give approximation algorithms. We also prove hardness of approximation for the general resource-time tradeoff problem and give a pseudo-polynomial time algorithm for series-parallel DAGs.

3 citations


Posted Content
TL;DR: In this paper, the authors propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices and use it for several problems, including matrix operations (dense and sparse multiplication, Gaussian elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation.
Abstract: To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.
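A minimal sketch of how an algorithm might be organized around a native small-matrix multiply. It assumes the primitive multiplies two s x s tiles; the abstract says only that the hardware computes "a dense matrix multiplication of a given small size", so the exact shape of the primitive, and the names `tile_matmul` and `blocked_matmul`, are assumptions made for illustration.

```python
def tile_matmul(X, Y):
    """Software stand-in for the hardware primitive: multiply two s x s tiles."""
    s = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]

def blocked_matmul(A, B, s):
    """Multiply n x n matrices using only s x s tile products; n must be a multiple of s."""
    n = len(A)
    assert n % s == 0, "this sketch assumes s divides n"
    C = [[0] * n for _ in range(n)]
    calls = 0  # number of times the primitive is invoked
    for bi in range(0, n, s):
        for bj in range(0, n, s):
            for bk in range(0, n, s):
                X = [row[bk:bk + s] for row in A[bi:bi + s]]
                Y = [row[bj:bj + s] for row in B[bk:bk + s]]
                P = tile_matmul(X, Y)
                calls += 1
                # Accumulate the tile product into the output block
                for i in range(s):
                    for j in range(s):
                        C[bi + i][bj + j] += P[i][j]
    return C, calls
```

In this sketch the primitive is invoked (n/s)^3 times, so counting `calls` gives a rough feel for how cost could be charged per tile product rather than per scalar operation.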

2 citations


Book ChapterDOI
12 Sep 2019
TL;DR: The objective is to design an interference-free schedule that minimizes the maximum weighted refresh time among all edges, where the refresh time of an edge is the maximum number of time slots between two successive slots of that edge and the weights reflect given priorities.
Abstract: Current wireless networks mainly focus on delay-tolerant applications while demands for latency-sensitive applications are rising with VR/AR technologies and machine-to-machine IoT applications. In this paper we consider multi-channel, multi-radio scheduling at the MAC layer to optimize for the performance of prioritized, delay-sensitive demands. Our objective is to design an interference-free schedule that minimizes the maximum weighted refresh time among all edges, where the refresh time of an edge is the maximum number of time slots between two successive slots of that edge and the weights reflect given priorities. In the single-antenna unweighted case with k channels and n transceivers, the scheduling problem reduces to the classical edge coloring problem when \(k \ge \lfloor n/2 \rfloor \) and to strong edge coloring when \(k=1\), but it is neither edge coloring nor strong edge coloring for general k. Further, the priority requirement introduces extra challenges. In this paper we provide a randomized algorithm with an approximation factor of \(\tilde{O}\left( \max \left\{ \sqrt{\varDelta _p }, \frac{\varDelta _p}{\sqrt{k}} \right\} \log m \right) \) in expectation, where \(\varDelta _p\) denotes the maximum degree of the unweighted multi-graph, which is formed by duplicating each edge \(e_i\) \(w_i\) times (\(w_i\) is \(e_i\)’s integral priority value), and m is the number of required link communications (\(f(n) \in \tilde{O}(h(n))\) means that \(f(n) \in O\left( h(n) \log ^k(h(n)) \right) \) for some positive constant k). The results are generalized to the multi-antenna settings. We evaluate the performance of our methods in different settings using simulations.
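The objective can be made concrete with a small sketch that evaluates a given schedule; it does not construct one and does not check interference. It assumes the schedule repeats cyclically and measures the gap between successive slots of an edge as the difference of their slot indices. Both are assumptions, as are the function names.

```python
INF = float("inf")

def refresh_time(slots, period):
    """Largest gap between successive occurrences of an edge in a cyclic schedule."""
    slots = sorted(slots)
    gaps = [b - a for a, b in zip(slots, slots[1:])]
    gaps.append(slots[0] + period - slots[-1])  # wrap around to the next period
    return max(gaps)

def max_weighted_refresh(schedule, weights):
    """schedule: list of sets of edges, one set per time slot; weights: edge -> priority."""
    period = len(schedule)
    occurrences = {}
    for t, edges in enumerate(schedule):
        for e in edges:
            occurrences.setdefault(e, []).append(t)
    worst = 0
    for e, w in weights.items():
        if e not in occurrences:
            return INF  # an edge that is never scheduled is never refreshed
        worst = max(worst, w * refresh_time(occurrences[e], period))
    return worst
```

For example, with the four-slot schedule `[{"a"}, {"b"}, {"a"}, {"c"}]` and weights `{"a": 2, "b": 1, "c": 1}`, edge `a` recurs every 2 slots and the others every 4, so all three weighted refresh times equal 4.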

1 citation