Home
/
Authors
/
Manjunath Gorentla Venkata

Author

Manjunath Gorentla Venkata

Other affiliations: University of New Mexico, University of California, Merced, Florida State University

Bio: Manjunath Gorentla Venkata is an academic researcher from Oak Ridge National Laboratory. The author has contributed to research in topics: Partitioned global address space & InfiniBand. The author has an hindex of 11, co-authored 61 publications receiving 413 citations. Previous affiliations of Manjunath Gorentla Venkata include University of New Mexico & University of California, Merced.

Papers published on a yearly basis

2020
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2006

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

UCX: An Open Source Framework for HPC Network APIs and Beyond

[...]

Pavel Shamis¹, Manjunath Gorentla Venkata¹, M. Graham Lopez¹, Matthew B. Baker¹, Oscar Hernandez¹, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri², Davide Rossetti², Donald Becker², Duncan Poole², Christopher Lamb², Sameer Kumar³, Craig B. Stunkel³, George Bosilca⁴, Aurelien Bouteiller⁴ - Show less +16 more•Institutions (4)

Oak Ridge National Laboratory¹, Nvidia², IBM³, University of Tennessee⁴

26 Aug 2015

TL;DR: This paper presents Unified Communication X, a set of network APIs and their implementations for high throughput computing, and implements the APIs and protocols, and measures the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries.

...read moreread less

Abstract: This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype is very close to that of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 us, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any network stack (publicly known) on this hardware.

...read moreread less

129 citations

Journal Article•DOI•

A survey of MPI usage in the US exascale computing project

[...]

David E. Bernholdt¹, Swen Boehm¹, George Bosilca², Manjunath Gorentla Venkata¹, Ryan E. Grant³, Thomas Naughton¹, Howard Pritchard⁴, Martin Schulz⁵, Martin Schulz⁶, Geoffroy Vallée¹ - Show less +6 more•Institutions (6)

Oak Ridge National Laboratory¹, University of Tennessee², Sandia National Laboratories³, Los Alamos National Laboratory⁴, Lawrence Livermore National Laboratory⁵, Technische Universität München⁶

01 Jun 2018-Concurrency and Computation: Practice and Experience

49 citations

Proceedings Article•DOI•

Cheetah: A Framework for Scalable Hierarchical Collective Operations

[...]

Richard L. Graham¹, Manjunath Gorentla Venkata¹, Joshua Ladd¹, Pavel Shamis¹, Ishai Rabinovitz, Vasily Filipov, Gilad Shainer - Show less +3 more•Institutions (1)

Oak Ridge National Laboratory¹

23 May 2011

TL;DR: A new hierarchical collective communication framework is described that takes advantage of hardware-specific data-access mechanisms, and is flexible, with run-time hierarchy specification, and sharing of collective communication primitives between collective algorithms.

...read moreread less

Abstract: Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous with increasing node and core-per-node counts. Also, a growing number of data-access mechanisms, of varying characteristics, are supported within a single computer system. We describe a new hierarchical collective communication framework that takes advantage of hardware-specific data-access mechanisms. It is flexible, with run-time hierarchy specification, and sharing of collective communication primitives between collective algorithms. Data buffers are shared between levels in the hierarchy reducing collective communication management overhead. We have implemented several versions of the Message Passing Interface (MPI) collective operations, MPI Barrier() and MPI Bcast(), and run experiments using up to 49, 152 processes on a Cray XT5, and a small InfiniBand based cluster. At 49, 152 processes our barrier implementation outperforms the optimized native implementation by 75%. 32 Byte and one Mega-Byte broadcasts outperform it by 62% and 11%, respectively, with better scalability characteristics. Improvements relative to the default Open MPI implementation are much larger.

...read moreread less

35 citations

Proceedings Article•DOI•

GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

[...]

Sreeram Potluri¹, Anshuman Goswami¹, Davide Rossetti¹, Chris J. Newburn¹, Manjunath Gorentla Venkata², Neena Imam² - Show less +2 more•Institutions (2)

Nvidia¹, Oak Ridge National Laboratory²

01 Dec 2017

TL;DR: This work evaluates different design alternatives for Mellanox InfiniBand adapters in CUDA, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU, and implements a 2dstencil application kernel using NVSHMEM.

...read moreread less

Abstract: GPUs have become an essential component for building compute clusters with high compute density and high performance per watt. As such clusters scale to have 1000s of GPUs, efficiently moving data between the GPUs becomes imperative to get maximum performance. NVSHMEM is an implementation of the OpenSHMEM standard for NVIDIA GPU clusters which allows communication to be issued from inside GPU kernels. In earlier work, we have shown how NVSHMEM can be used to achieve better application performance on GPUs connected through PCIe or NVLink. As part of this effort, we implement IB verbs for Mellanox InfiniBand adapters in CUDA. We evaluate different design alternatives, taking into consideration the relaxed memory model, automatic memory access coalescing and thread hierarchy on the GPU. We also consider correctness issues that arise in these designs. We take advantage of these designs transparently or through API extensions in NVSHMEM. With micro-benchmarks, we show that a Nvidia Pascal P100 GPU is able saturate the network bandwidth using only one or two of its 56 available streaming multiprocessors (SM). On a single GPU using a single IB EDR adapter, we achieve a throughput of around 90 million messages per second. In addition, we implement a 2dstencil application kernel using NVSHMEM and compare its performance with a CUDA-aware MPI-based implementation that uses GPUDirect RDMA. Speedups in the range of 23% to 42% are seen for input sizes large enough to fill the occupancy of Nvidia Pascal P100 GPUs on 2 to 4 nodes indicating that there are gains to be had by eliminating the CPU from the communication path when all computation runs on the GPU.

...read moreread less

23 citations

Proceedings Article•DOI•

ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications

[...]

Manjunath Gorentla Venkata¹, Richard L. Graham¹, Joshua Ladd¹, Pavel Shamis¹, Ishai Rabinovitz, Vasily Filipov, Gilad Shainer - Show less +3 more•Institutions (1)

Oak Ridge National Laboratory¹

16 May 2011

TL;DR: A novel approach that fully offloads collective operations and employs only user-supplied buffers is described, which is better than production-grade Message Passing Interface (MPI) implementations and 150% better than the default Open MPI algorithm.

...read moreread less

Abstract: This paper describes the design and implementation of InfiniBand (IB) {CORE-textit{Direct}} based blocking and nonblocking broadcast operations within the Cheetah collective operation framework. It describes a novel approach that fully offloads collective operations and employs only user-supplied buffers. For a 64 rank communicator, the latency of {CORE-textit{Direct}} based hierarchical algorithm is better than production-grade Message Passing Interface (MPI) implementations, 150% better than the default Open MPI algorithm and 115% better than the shared memory optimized MVAPICH implementation for a one kilo-byte (KB) message, and for eight mega-bytes (MB) it is 48% and 64% better, respectively. Flat-topology broadcast achieves 99.9% overlap in a polling based communication-computation test, and 95.1% overlap for a wait based test, compared with 92.4% and 17.0%, respectively, for a similar Central Processing Unit (CPU) based implementation.

...read moreread less

22 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13

Collapse

Cited by

PDF

Open Access

More filters

Fast parallel algorithms for short-range molecular dynamics

[...]

Steven J. Plimpton¹•Institutions (1)

Sandia National Laboratories¹

01 May 1993

TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.

...read moreread less

Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently—those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers--the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and a 1840-node Intel Paragon performs up to 165 faster than a single Cray C9O processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

...read moreread less

29,323 citations

Posted Content•DOI•

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

[...]

Daniel Taliun¹, Daniel N. Harris², Michael D. Kessler², Jedidiah Carlson³ +191 more•Institutions (61)

06 Mar 2019-bioRxiv

TL;DR: The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation as well as resources and early insights from the sequence data.

...read moreread less

Abstract: Summary paragraph The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency

...read moreread less

662 citations

Journal Article•DOI•

Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence

[...]

Sebastian Raschka, Joshua Patterson, Corey J. Nolet

12 Feb 2020-Information-an International Interdisciplinary Journal

TL;DR: A comprehensive survey of machine learning with Python can be found in this article, where the authors cover widely-used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.

...read moreread less

Abstract: Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lies the tools and the methods that are driving it, from processing the massive piles of data generated each day to learning from and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the most preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely-used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.

...read moreread less

155 citations

Posted Content•

Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence.

[...]

Sebastian Raschka, Joshua Patterson, Corey J. Nolet

12 Feb 2020-arXiv: Learning

TL;DR: This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it.

...read moreread less

Abstract: Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lies the tools and the methods that are driving it, from processing the massive piles of data generated each day to learning from and taking useful action. Deep neural networks, along with advancements in classical ML and scalable general-purpose GPU computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the most preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely-used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.

...read moreread less

138 citations

Journal Article•DOI•

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

[...]

Ang Li¹, Shuaiwen Leon Song¹, Jieyang Chen², Jiajia Li¹, Xu Liu³, Nathan R. Tallent¹, Kevin J. Barker¹ - Show less +3 more•Institutions (3)

Pacific Northwest National Laboratory¹, Oak Ridge National Laboratory², College of William & Mary³

01 Jan 2020-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A thorough evaluation on five latest types of modern GPU interconnects from six high-end servers and HPC platforms shows that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance.

...read moreread less

Abstract: High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.

...read moreread less

118 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82

Collapse