Author

Dip Sankar Banerjee

Bio: Dip Sankar Banerjee is an academic researcher from the Indian Institutes of Information Technology. The author has contributed to research in the topics of Computer science and Graph (abstract data type). The author has an h-index of 7 and has co-authored 28 publications receiving 134 citations. Previous affiliations of Dip Sankar Banerjee include Ohio State University and the International Institute of Information Technology, Hyderabad.

Papers
Proceedings ArticleDOI
18 Dec 2011
TL;DR: This work uses a new model of multicore computing where the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU, to address the issues related to the design of hybrid solutions.
Abstract: The advent of multicore and many-core architectures saw them being deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general-purpose computations has attracted a lot of attention, given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing, where the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show that our hybrid algorithm for list ranking is 50% faster than the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is 25% faster than the best known GPU implementation [26].
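The hybrid execution pattern described above, where a CPU and a GPU work on disjoint portions of the input at the same time and synchronize at the end, can be sketched roughly as follows. This is a generic illustration rather than the paper's list-ranking or connected-components algorithm; the scaling workload, the 70/30 split, and all names are placeholders.

```cpp
// Minimal sketch of hybrid CPU+GPU execution (not the paper's algorithm).
// The GPU kernel runs asynchronously on its share of the data while the
// CPU processes the remainder; the copy back implicitly synchronizes.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale_gpu(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                       // GPU's share of the work
}

int main() {
    const int n = 1 << 20;
    const int gpu_n = n / 10 * 7;                     // illustrative 70/30 split
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    cudaMalloc(&dev, gpu_n * sizeof(float));
    cudaMemcpy(dev, host.data(), gpu_n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch returns immediately, so CPU and GPU now run simultaneously.
    scale_gpu<<<(gpu_n + 255) / 256, 256>>>(dev, gpu_n);
    for (int i = gpu_n; i < n; ++i) host[i] *= 2.0f;  // CPU's share of the work

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, gpu_n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[0]=%.1f host[n-1]=%.1f\n", host[0], host[n - 1]);
    return 0;
}
```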

27 citations

Journal ArticleDOI
TL;DR: This paper introduces graph pruning as a technique that aims to reduce the size of the graph, and applies it to three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP).

18 citations

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper introduces graph pruning as a technique that aims to reduce the size of the graph, and applies it to three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP).
Abstract: Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real-world data sets has been increasing. Parallelism offers only limited succor to this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for work-efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique to three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
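One way to read the prune-solve-extend recipe above, specialized to connected components, is sketched below. This is an illustrative sequential version using a simple union-find on the residual graph, not the paper's CPU+GPU implementation; the toy graph and all names are made up.

```cpp
// Hedged sketch of graph pruning for connected components: peel off
// degree-1 vertices, solve CC on the smaller residual graph, then copy
// each pruned vertex's label from the neighbour it was attached to.
#include <vector>
#include <numeric>
#include <stack>
#include <cstdio>

struct DSU {
    std::vector<int> p;
    explicit DSU(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
    int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
    void unite(int a, int b) { p[find(a)] = find(b); }
};

int main() {
    int n = 6;
    std::vector<std::vector<int>> adj(n);
    auto add = [&](int u, int v) { adj[u].push_back(v); adj[v].push_back(u); };
    add(0, 1); add(1, 2); add(2, 0); add(2, 3); add(4, 5);   // toy graph

    std::vector<int> deg(n), attach(n, -1);
    for (int v = 0; v < n; ++v) deg[v] = (int)adj[v].size();

    // Phase 1: peel degree-1 vertices, remembering who they hang off.
    std::stack<int> order;
    std::vector<bool> pruned(n, false);
    bool changed = true;
    while (changed) {
        changed = false;
        for (int v = 0; v < n; ++v) {
            if (pruned[v] || deg[v] != 1) continue;
            for (int u : adj[v]) if (!pruned[u]) { attach[v] = u; --deg[u]; break; }
            pruned[v] = true; deg[v] = 0; order.push(v); changed = true;
        }
    }

    // Phase 2: solve CC on the residual (un-pruned) graph only.
    DSU dsu(n);
    for (int v = 0; v < n; ++v)
        if (!pruned[v])
            for (int u : adj[v])
                if (!pruned[u]) dsu.unite(v, u);

    // Phase 3: extend the solution back to pruned vertices in reverse order.
    std::vector<int> comp(n);
    for (int v = 0; v < n; ++v) if (!pruned[v]) comp[v] = dsu.find(v);
    while (!order.empty()) { int v = order.top(); order.pop(); comp[v] = comp[attach[v]]; }

    for (int v = 0; v < n; ++v) printf("vertex %d -> component %d\n", v, comp[v]);
    return 0;
}
```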

17 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This work proposes CUDA-Aware CNTK (CA-CNTK), which performs low-overhead communication and shows an average improvement of 23%, 21%, and 15% per epoch for the popular CIFAR10, MNIST, and ImageNet datasets, respectively.
Abstract: Deep learning frameworks have recently gained widespread popularity due to their highly accurate prediction capabilities and the availability of low-cost processors that can perform training over a large dataset quickly. Given the high core count in modern high performance computing systems, training deep networks over large data has now become practical. In this work, targeting the Computational Network Toolkit (CNTK) framework, we propose new mechanisms and designs to boost the performance of the communication between GPU nodes. We perform a thorough analysis of the different phases of the toolkit, such as I/O, communication, and computation, to identify the bottlenecks that can potentially be alleviated using the high performance capabilities provided by CUDA-aware MPI runtimes. Using a CUDA-aware MPI library, we propose CUDA-Aware CNTK (CA-CNTK), which performs low-overhead communication. Experiments on datasets ranging from small to large demonstrate the advantage of our redesign and indicate that similar gains can be expected on deep learning frameworks with a similar execution pattern. Our designs show an average improvement of 23%, 21%, and 15% per epoch for the popular CIFAR10, MNIST, and ImageNet datasets, respectively.
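The central mechanism, passing GPU-resident buffers straight to the MPI runtime instead of staging them through host memory, can be illustrated roughly as follows. This is a generic sketch assuming a CUDA-aware MPI library (e.g., MVAPICH2-GDR or Open MPI built with CUDA support), not the CA-CNTK code; buffer names and sizes are placeholders.

```cpp
// Hedged sketch: a CUDA-aware MPI runtime accepts device pointers directly,
// removing the explicit device-to-host staging before each gradient exchange.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;                 // number of gradient elements (illustrative)
    float* d_grad = nullptr;               // gradient buffer resident on the GPU
    cudaMalloc(&d_grad, n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));

    // Non CUDA-aware path: stage through host memory on every exchange.
    std::vector<float> h_grad(n);
    cudaMemcpy(h_grad.data(), d_grad, n * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, h_grad.data(), n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    cudaMemcpy(d_grad, h_grad.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // CUDA-aware path: the runtime takes the device pointer directly, so the
    // staging copies above disappear (GPUDirect RDMA may be used underneath).
    MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}
```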

14 citations

Proceedings ArticleDOI
07 Sep 2020
TL;DR: In this paper, the authors propose a new approximate adder that employs a carry prediction method, which allows the carry to propagate in parallel for faster calculations, together with a rectification logic that enables higher accuracy for larger computations.
Abstract: Approximate computing has in recent times found significant applications in lowering the power, area, and time requirements of arithmetic operations. Several works in recent years have furthered approximate computing along these directions. In this work, we propose a new approximate adder that employs a carry prediction method. This allows the carry to propagate in parallel, enabling faster calculations. In addition to the basic adder design, we also propose a rectification logic that enables higher accuracy for larger computations. Experimental results show that our adder produces results 91.2% faster than the conventional ripple-carry adder. In terms of accuracy, adding the rectification logic to the basic design produces results that are 74% more accurate than state-of-the-art adders such as SARA [13] and BCSA [5].
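The carry-prediction idea, breaking the serial carry chain so that blocks of the addition can proceed in parallel, can be illustrated with a simple block-based approximate adder like the one below. It is a generic example in the spirit of the description, not the paper's circuit; the 4-bit block size and the generate-bit predictor are assumptions.

```cpp
// Hedged sketch of a block-based approximate adder with carry prediction:
// each block's carry-in is predicted from the most significant bit pair of
// the previous block (carry "generate" = a & b) instead of being propagated
// serially, so all block additions are independent of one another.
#include <cstdint>
#include <cstdio>

uint32_t approx_add(uint32_t a, uint32_t b, int block_bits = 4) {
    uint32_t sum = 0;
    for (int lo = 0; lo < 32; lo += block_bits) {
        uint32_t mask = (block_bits >= 32) ? 0xFFFFFFFFu : ((1u << block_bits) - 1u);
        uint32_t ab = (a >> lo) & mask;
        uint32_t bb = (b >> lo) & mask;

        // Predicted carry-in: the "generate" bit of the previous block's MSB.
        uint32_t cin = 0;
        if (lo > 0) {
            uint32_t pa = (a >> (lo - 1)) & 1u;
            uint32_t pb = (b >> (lo - 1)) & 1u;
            cin = pa & pb;
        }
        sum |= ((ab + bb + cin) & mask) << lo;   // block adds are independent
    }
    return sum;
}

int main() {
    uint32_t a = 123456789u, b = 987654321u;
    printf("exact  = %u\n", a + b);
    printf("approx = %u\n", approx_add(a, b));   // may carry a small error
    return 0;
}
```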

13 citations


Cited by
Proceedings ArticleDOI
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, along with newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

7,116 citations

01 Jan 1978
TL;DR: This ebook is the first authorized digital version of Kernighan and Ritchie's 1988 classic, The C Programming Language (2nd Ed.), and is a "must-have" reference for every serious programmer's digital library.
Abstract: This ebook is the first authorized digital version of Kernighan and Ritchie's 1988 classic, The C Programming Language (2nd Ed.). One of the best-selling programming books published in the last fifty years, "K&R" has been called everything from the "bible" to "a landmark in computer science," and it has influenced generations of programmers. Available now for all leading ebook platforms, this concise and beautifully written text is a "must-have" reference for every serious programmer's digital library. As modestly described by the authors in the Preface to the First Edition, this "is not an introductory programming manual; it assumes some familiarity with basic programming concepts like variables, assignment statements, loops, and functions. Nonetheless, a novice programmer should be able to read along and pick up the language, although access to a more knowledgeable colleague will help."

2,120 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper presents a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per-core, per-dollar, and per-joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank, and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
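The two-routine style the abstract describes, one map over edges and one over vertices applied to a frontier, can be reduced to a tiny sequential analogue, sketched below as a BFS. The real framework is parallel and switches between sparse and dense frontier representations; the names edge_map/vertex_map and the toy graph here are illustrative only.

```cpp
// Hedged sequential sketch of the edgeMap/vertexMap programming style.
#include <vector>
#include <cstdio>

using Graph = std::vector<std::vector<int>>;     // adjacency lists
using Frontier = std::vector<int>;               // current set of active vertices

// Apply f(u, v) to every edge (u, v) with u in the frontier; vertices for
// which f returns true form the next frontier.
template <class F>
Frontier edge_map(const Graph& g, const Frontier& frontier, F f) {
    Frontier next;
    for (int u : frontier)
        for (int v : g[u])
            if (f(u, v)) next.push_back(v);
    return next;
}

// Apply f to every vertex currently in the frontier.
template <class F>
void vertex_map(const Frontier& frontier, F f) {
    for (int v : frontier) f(v);
}

int main() {
    Graph g = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    std::vector<int> parent(g.size(), -1);
    parent[0] = 0;
    Frontier frontier = {0};                     // BFS from vertex 0

    while (!frontier.empty()) {
        vertex_map(frontier, [](int v) { printf("visiting %d\n", v); });
        frontier = edge_map(g, frontier, [&](int u, int v) {
            if (parent[v] != -1) return false;   // already discovered
            parent[v] = u;                       // claim v for the next level
            return true;
        });
    }
    for (size_t v = 0; v < g.size(); ++v) printf("parent[%zu] = %d\n", v, parent[v]);
    return 0;
}
```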

816 citations

Journal ArticleDOI
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Abstract: As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these Processing Units (PUs) have their own unique features and strengths, and hence CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs), such as workload partitioning, that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide insights into the workings and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
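Workload partitioning, one of the HCTs named above, amounts to splitting the input between the devices in proportion to their relative throughputs so that both finish at roughly the same time. A minimal sketch of a static split is given below; the throughput figures are placeholders that a real system would obtain by profiling.

```cpp
// Hedged sketch of static CPU-GPU workload partitioning by throughput ratio.
#include <cstdio>

int main() {
    const long n = 10000000;            // total work items
    const double cpu_rate = 2.0e8;      // items/s the CPU sustains (placeholder)
    const double gpu_rate = 1.4e9;      // items/s the GPU sustains (placeholder)

    // Give each device a share proportional to its throughput so both
    // finish at roughly the same time.
    long gpu_items = static_cast<long>(n * gpu_rate / (cpu_rate + gpu_rate));
    long cpu_items = n - gpu_items;

    printf("GPU gets %ld items, CPU gets %ld items\n", gpu_items, cpu_items);
    printf("estimated time: GPU %.3f s, CPU %.3f s\n",
           gpu_items / gpu_rate, cpu_items / cpu_rate);
    return 0;
}
```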

414 citations

Proceedings ArticleDOI
20 Oct 2018
TL;DR: This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs) and proposes an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner.
Abstract: Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months) without leveraging distributed systems. Even distributed training takes inordinate time, of which a large fraction is spent in communicating weights and gradients over the network. State-of-the-art distributed training algorithms use a hierarchy of worker-aggregator nodes. The aggregators repeatedly receive gradient updates from their allocated group of workers, and send back the updated weights. This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs). To maximize the benefits of in-network acceleration, the proposed solution, named INCEPTIONN (In-Network Computing to Exchange and Process Training Information Of Neural Networks), uniquely combines hardware and algorithmic innovations by exploiting the following three observations. (1) Gradients are significantly more tolerant to precision loss than weights and as such lend themselves better to aggressive compression without the need for complex mechanisms to avert any loss. (2) The existing training algorithms only communicate gradients in one leg of the communication, which reduces the opportunities for in-network acceleration of compression. (3) The aggregators can become a bottleneck with compression as they need to compress/decompress multiple streams from their allocated worker group. To this end, we first propose a lightweight and hardware-friendly lossy compression algorithm for floating-point gradients, which exploits their unique value characteristics. This compression not only enables significantly reducing the gradient communication with practically no loss of accuracy, but also comes with low complexity for direct implementation as a hardware block in the NIC. To maximize the opportunities for compression and avoid the bottleneck at aggregators, we also propose an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner. Without changing the mathematics of training, this algorithm leverages the associative property of the aggregation operator and enables our in-network accelerators to (1) apply compression for all communications, and (2) prevent the aggregator nodes from becoming bottlenecks. Our experiments demonstrate that INCEPTIONN reduces the communication time by 70.9-80.7% and offers a 2.2-3.1x speedup over the conventional training system, while achieving the same level of accuracy.
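Observation (1) above, that gradients tolerate precision loss, can be illustrated with a deliberately simple lossy scheme: keep the sign, exponent, and top mantissa bits of each float32 gradient and drop the rest. This generic 16-bit truncation is for illustration only, not the paper's hardware compression algorithm; the sample values are made up.

```cpp
// Hedged sketch: lossy gradient compression by truncating float32 values to
// their upper 16 bits (sign, exponent, top mantissa bits), then expanding back.
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <vector>

uint16_t compress(float g) {
    uint32_t bits;
    std::memcpy(&bits, &g, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);    // drop 16 low mantissa bits
}

float decompress(uint16_t c) {
    uint32_t bits = static_cast<uint32_t>(c) << 16;
    float g;
    std::memcpy(&g, &bits, sizeof(g));
    return g;
}

int main() {
    std::vector<float> grads = {0.0123f, -3.4e-5f, 1.7f, -0.91f};   // sample gradients
    for (float g : grads) {
        float r = decompress(compress(g));       // round trip through 16 bits
        printf("%+.7f -> %+.7f (rel err %.2e)\n", g, r, (g != 0) ? (g - r) / g : 0.0);
    }
    return 0;
}
```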

77 citations