Author

Dip Sankar Banerjee

Bio: Dip Sankar Banerjee is an academic researcher from the Indian Institutes of Information Technology. The author has contributed to research in the topics of Computer science and Graph (abstract data type). The author has an h-index of 7 and has co-authored 28 publications receiving 134 citations. Previous affiliations of Dip Sankar Banerjee include Ohio State University and the International Institute of Information Technology, Hyderabad.

Papers
Proceedings ArticleDOI
18 Dec 2011
TL;DR: This work uses a new model of multicore computing where the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU, to address the issues related to the design of hybrid solutions.
Abstract: The advent of multicore and many-core architectures saw them being deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general-purpose computations has attracted a lot of attention, given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing, where the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show that our hybrid algorithm for list ranking is 50% faster than the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is 25% faster than the best known GPU implementation [26].
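The hybrid execution pattern described above, where a CPU and a GPU work on disjoint portions of the input at the same time and synchronize at the end, can be sketched roughly as follows. This is a generic illustration rather than the paper's list-ranking or connected-components algorithm; the scaling workload, the 70/30 split, and all names are placeholders.

```cpp
// Minimal sketch of hybrid CPU+GPU execution (not the paper's algorithm).
// The GPU kernel runs asynchronously on its share of the data while the
// CPU processes the remainder; the copy back implicitly synchronizes.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale_gpu(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                       // GPU's share of the work
}

int main() {
    const int n = 1 << 20;
    const int gpu_n = n / 10 * 7;                     // illustrative 70/30 split
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    cudaMalloc(&dev, gpu_n * sizeof(float));
    cudaMemcpy(dev, host.data(), gpu_n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch returns immediately, so CPU and GPU now run simultaneously.
    scale_gpu<<<(gpu_n + 255) / 256, 256>>>(dev, gpu_n);
    for (int i = gpu_n; i < n; ++i) host[i] *= 2.0f;  // CPU's share of the work

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, gpu_n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[0]=%.1f host[n-1]=%.1f\n", host[0], host[n - 1]);
    return 0;
}
```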

27 citations

Journal ArticleDOI
TL;DR: This paper introduces graph pruning as a technique that aims to reduce the size of the graph, and applies it to three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP).

18 citations

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper introduces graph pruning as a technique that aims to reduce the size of the graph, and applies it to three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP).
Abstract: Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real-world data sets has been increasing. Parallelism offers only limited succor to this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for work-efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique to three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
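One way to read the prune-solve-extend recipe above, specialized to connected components, is sketched below. This is an illustrative sequential version using a simple union-find on the residual graph, not the paper's CPU+GPU implementation; the toy graph and all names are made up.

```cpp
// Hedged sketch of graph pruning for connected components: peel off
// degree-1 vertices, solve CC on the smaller residual graph, then copy
// each pruned vertex's label from the neighbour it was attached to.
#include <vector>
#include <numeric>
#include <stack>
#include <cstdio>

struct DSU {
    std::vector<int> p;
    explicit DSU(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
    int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
    void unite(int a, int b) { p[find(a)] = find(b); }
};

int main() {
    int n = 6;
    std::vector<std::vector<int>> adj(n);
    auto add = [&](int u, int v) { adj[u].push_back(v); adj[v].push_back(u); };
    add(0, 1); add(1, 2); add(2, 0); add(2, 3); add(4, 5);   // toy graph

    std::vector<int> deg(n), attach(n, -1);
    for (int v = 0; v < n; ++v) deg[v] = (int)adj[v].size();

    // Phase 1: peel degree-1 vertices, remembering who they hang off.
    std::stack<int> order;
    std::vector<bool> pruned(n, false);
    bool changed = true;
    while (changed) {
        changed = false;
        for (int v = 0; v < n; ++v) {
            if (pruned[v] || deg[v] != 1) continue;
            for (int u : adj[v]) if (!pruned[u]) { attach[v] = u; --deg[u]; break; }
            pruned[v] = true; deg[v] = 0; order.push(v); changed = true;
        }
    }

    // Phase 2: solve CC on the residual (un-pruned) graph only.
    DSU dsu(n);
    for (int v = 0; v < n; ++v)
        if (!pruned[v])
            for (int u : adj[v])
                if (!pruned[u]) dsu.unite(v, u);

    // Phase 3: extend the solution back to pruned vertices in reverse order.
    std::vector<int> comp(n);
    for (int v = 0; v < n; ++v) if (!pruned[v]) comp[v] = dsu.find(v);
    while (!order.empty()) { int v = order.top(); order.pop(); comp[v] = comp[attach[v]]; }

    for (int v = 0; v < n; ++v) printf("vertex %d -> component %d\n", v, comp[v]);
    return 0;
}
```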

17 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This work proposes CUDA-Aware CNTK (CA-CNTK), which performs low-overhead communication and shows an average improvement of 23%, 21%, and 15% per epoch for the popular CIFAR10, MNIST, and ImageNet datasets, respectively.
Abstract: Deep learning frameworks have recently gained widespread popularity due to their highly accurate prediction capabilities and the availability of low-cost processors that can perform training over a large dataset quickly. Given the high core count in modern high performance computing systems, training deep networks over large data has now become practical. In this work, targeting the Computational Network Toolkit (CNTK) framework, we propose new mechanisms and designs to boost the performance of the communication between GPU nodes. We perform a thorough analysis of the different phases of the toolkit, such as I/O, communication, and computation, to identify the bottlenecks that can potentially be alleviated using the high performance capabilities provided by CUDA-aware MPI runtimes. Using a CUDA-aware MPI library, we propose CUDA-Aware CNTK (CA-CNTK), which performs low-overhead communication. Experiments on datasets ranging from small to large demonstrate the advantage of our redesign and indicate that similar gains can be expected on deep learning frameworks with a similar execution pattern. Our designs show an average improvement of 23%, 21%, and 15% per epoch for the popular CIFAR10, MNIST, and ImageNet datasets, respectively.
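The central mechanism, passing GPU-resident buffers straight to the MPI runtime instead of staging them through host memory, can be illustrated roughly as follows. This is a generic sketch assuming a CUDA-aware MPI library (e.g., MVAPICH2-GDR or Open MPI built with CUDA support), not the CA-CNTK code; buffer names and sizes are placeholders.

```cpp
// Hedged sketch: a CUDA-aware MPI runtime accepts device pointers directly,
// removing the explicit device-to-host staging before each gradient exchange.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;                 // number of gradient elements (illustrative)
    float* d_grad = nullptr;               // gradient buffer resident on the GPU
    cudaMalloc(&d_grad, n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));

    // Non CUDA-aware path: stage through host memory on every exchange.
    std::vector<float> h_grad(n);
    cudaMemcpy(h_grad.data(), d_grad, n * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, h_grad.data(), n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    cudaMemcpy(d_grad, h_grad.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // CUDA-aware path: the runtime takes the device pointer directly, so the
    // staging copies above disappear (GPUDirect RDMA may be used underneath).
    MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}
```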

14 citations

Proceedings ArticleDOI
07 Sep 2020
TL;DR: In this paper, the authors propose a new approximate adder that employs a carry prediction method, which allows the carry to propagate in parallel for faster calculations, together with a rectification logic that enables higher accuracy for larger computations.
Abstract: Approximate computing has in recent times found significant applications in lowering the power, area, and time requirements of arithmetic operations. Several works in recent years have furthered approximate computing along these directions. In this work, we propose a new approximate adder that employs a carry prediction method. This allows the carry to propagate in parallel, enabling faster calculations. In addition to the basic adder design, we also propose a rectification logic that enables higher accuracy for larger computations. Experimental results show that our adder produces results 91.2% faster than the conventional ripple-carry adder. In terms of accuracy, adding the rectification logic to the basic design produces results that are 74% more accurate than state-of-the-art adders such as SARA [13] and BCSA [5].
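The carry-prediction idea, breaking the serial carry chain so that blocks of the addition can proceed in parallel, can be illustrated with a simple block-based approximate adder like the one below. It is a generic example in the spirit of the description, not the paper's circuit; the 4-bit block size and the generate-bit predictor are assumptions.

```cpp
// Hedged sketch of a block-based approximate adder with carry prediction:
// each block's carry-in is predicted from the most significant bit pair of
// the previous block (carry "generate" = a & b) instead of being propagated
// serially, so all block additions are independent of one another.
#include <cstdint>
#include <cstdio>

uint32_t approx_add(uint32_t a, uint32_t b, int block_bits = 4) {
    uint32_t sum = 0;
    for (int lo = 0; lo < 32; lo += block_bits) {
        uint32_t mask = (block_bits >= 32) ? 0xFFFFFFFFu : ((1u << block_bits) - 1u);
        uint32_t ab = (a >> lo) & mask;
        uint32_t bb = (b >> lo) & mask;

        // Predicted carry-in: the "generate" bit of the previous block's MSB.
        uint32_t cin = 0;
        if (lo > 0) {
            uint32_t pa = (a >> (lo - 1)) & 1u;
            uint32_t pb = (b >> (lo - 1)) & 1u;
            cin = pa & pb;
        }
        sum |= ((ab + bb + cin) & mask) << lo;   // block adds are independent
    }
    return sum;
}

int main() {
    uint32_t a = 123456789u, b = 987654321u;
    printf("exact  = %u\n", a + b);
    printf("approx = %u\n", approx_add(a, b));   // may carry a small error
    return 0;
}
```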

13 citations


Cited by
Proceedings ArticleDOI
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, along with newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

7,116 citations

01 Jan 1978
TL;DR: This ebook is the first authorized digital version of Kernighan and Ritchie's 1988 classic, The C Programming Language (2nd Ed.), and is a "must-have" reference for every serious programmer's digital library.
Abstract: This ebook is the first authorized digital version of Kernighan and Ritchie's 1988 classic, The C Programming Language (2nd Ed.). One of the best-selling programming books published in the last fifty years, "K&R" has been called everything from the "bible" to "a landmark in computer science," and it has influenced generations of programmers. Available now for all leading ebook platforms, this concise and beautifully written text is a "must-have" reference for every serious programmer's digital library. As modestly described by the authors in the Preface to the First Edition, this "is not an introductory programming manual; it assumes some familiarity with basic programming concepts like variables, assignment statements, loops, and functions. Nonetheless, a novice programmer should be able to read along and pick up the language, although access to a more knowledgeable colleague will help."

2,120 citations

Proceedings ArticleDOI
23 Feb 2013
TL;DR: This paper presents a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per-core, per-dollar, and per-joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank, and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
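The two-routine style the abstract describes, one map over edges and one over vertices applied to a frontier, can be reduced to a tiny sequential analogue, sketched below as a BFS. The real framework is parallel and switches between sparse and dense frontier representations; the names edge_map/vertex_map and the toy graph here are illustrative only.

```cpp
// Hedged sequential sketch of the edgeMap/vertexMap programming style.
#include <vector>
#include <cstdio>

using Graph = std::vector<std::vector<int>>;     // adjacency lists
using Frontier = std::vector<int>;               // current set of active vertices

// Apply f(u, v) to every edge (u, v) with u in the frontier; vertices for
// which f returns true form the next frontier.
template <class F>
Frontier edge_map(const Graph& g, const Frontier& frontier, F f) {
    Frontier next;
    for (int u : frontier)
        for (int v : g[u])
            if (f(u, v)) next.push_back(v);
    return next;
}

// Apply f to every vertex currently in the frontier.
template <class F>
void vertex_map(const Frontier& frontier, F f) {
    for (int v : frontier) f(v);
}

int main() {
    Graph g = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    std::vector<int> parent(g.size(), -1);
    parent[0] = 0;
    Frontier frontier = {0};                     // BFS from vertex 0

    while (!frontier.empty()) {
        vertex_map(frontier, [](int v) { printf("visiting %d\n", v); });
        frontier = edge_map(g, frontier, [&](int u, int v) {
            if (parent[v] != -1) return false;   // already discovered
            parent[v] = u;                       // claim v for the next level
            return true;
        });
    }
    for (size_t v = 0; v < g.size(); ++v) printf("parent[%zu] = %d\n", v, parent[v]);
    return 0;
}
```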

816 citations

Journal ArticleDOI
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
Abstract: As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these Processing Units (PUs) have their own unique features and strengths, and hence CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs), such as workload partitioning, that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide insights into the workings and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.
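Workload partitioning, one of the HCTs named above, amounts to splitting the input between the devices in proportion to their relative throughputs so that both finish at roughly the same time. A minimal sketch of a static split is given below; the throughput figures are placeholders that a real system would obtain by profiling.

```cpp
// Hedged sketch of static CPU-GPU workload partitioning by throughput ratio.
#include <cstdio>

int main() {
    const long n = 10000000;            // total work items
    const double cpu_rate = 2.0e8;      // items/s the CPU sustains (placeholder)
    const double gpu_rate = 1.4e9;      // items/s the GPU sustains (placeholder)

    // Give each device a share proportional to its throughput so both
    // finish at roughly the same time.
    long gpu_items = static_cast<long>(n * gpu_rate / (cpu_rate + gpu_rate));
    long cpu_items = n - gpu_items;

    printf("GPU gets %ld items, CPU gets %ld items\n", gpu_items, cpu_items);
    printf("estimated time: GPU %.3f s, CPU %.3f s\n",
           gpu_items / gpu_rate, cpu_items / cpu_rate);
    return 0;
}
```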

414 citations

Proceedings ArticleDOI
20 Oct 2018
TL;DR: This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs) and proposes an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner.
Abstract: Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months) without leveraging distributed systems. Even distributed training takes inordinate time, of which a large fraction is spent in communicating weights and gradients over the network. State-of-the-art distributed training algorithms use a hierarchy of worker-aggregator nodes. The aggregators repeatedly receive gradient updates from their allocated group of workers, and send back the updated weights. This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs). To maximize the benefits of in-network acceleration, the proposed solution, named INCEPTIONN (In-Network Computing to Exchange and Process Training Information Of Neural Networks), uniquely combines hardware and algorithmic innovations by exploiting the following three observations. (1) Gradients are significantly more tolerant to precision loss than weights and as such lend themselves better to aggressive compression without the need for complex mechanisms to avert any loss. (2) The existing training algorithms only communicate gradients in one leg of the communication, which reduces the opportunities for in-network acceleration of compression. (3) The aggregators can become a bottleneck with compression as they need to compress/decompress multiple streams from their allocated worker group. To this end, we first propose a lightweight and hardware-friendly lossy compression algorithm for floating-point gradients, which exploits their unique value characteristics. This compression not only enables significantly reducing the gradient communication with practically no loss of accuracy, but also comes with low complexity for direct implementation as a hardware block in the NIC. To maximize the opportunities for compression and avoid the bottleneck at aggregators, we also propose an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner. Without changing the mathematics of training, this algorithm leverages the associative property of the aggregation operator and enables our in-network accelerators to (1) apply compression for all communications, and (2) prevent the aggregator nodes from becoming bottlenecks. Our experiments demonstrate that INCEPTIONN reduces the communication time by 70.9-80.7% and offers a 2.2-3.1x speedup over the conventional training system, while achieving the same level of accuracy.
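Observation (1) above, that gradients tolerate precision loss, can be illustrated with a deliberately simple lossy scheme: keep the sign, exponent, and top mantissa bits of each float32 gradient and drop the rest. This generic 16-bit truncation is for illustration only, not the paper's hardware compression algorithm; the sample values are made up.

```cpp
// Hedged sketch: lossy gradient compression by truncating float32 values to
// their upper 16 bits (sign, exponent, top mantissa bits), then expanding back.
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <vector>

uint16_t compress(float g) {
    uint32_t bits;
    std::memcpy(&bits, &g, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);    // drop 16 low mantissa bits
}

float decompress(uint16_t c) {
    uint32_t bits = static_cast<uint32_t>(c) << 16;
    float g;
    std::memcpy(&g, &bits, sizeof(g));
    return g;
}

int main() {
    std::vector<float> grads = {0.0123f, -3.4e-5f, 1.7f, -0.91f};   // sample gradients
    for (float g : grads) {
        float r = decompress(compress(g));       // round trip through 16 bits
        printf("%+.7f -> %+.7f (rel err %.2e)\n", g, r, (g != 0) ? (g - r) / g : 0.0);
    }
    return 0;
}
```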

77 citations