scispace - formally typeset

Rong Shi

Researcher at Ohio State University

Publications -  9
Citations -  188

Rong Shi is an academic researcher from Ohio State University. The author has contributed to research on topics including InfiniBand and CUDA. The author has an h-index of 8 and has co-authored 9 publications receiving 165 citations.

Papers
Proceedings Article

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

TL;DR: This is the first study to propose efficient designs for GPU communication at small message sizes using the eager protocol; experimental results demonstrate up to 59% and 63% reductions in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively.
Proceedings Article

Evaluating Scalability Bottlenecks by Workload Extrapolation

TL;DR: This paper extrapolates workloads to a bottleneck node and develops PatternMiner, a semi-automatic tool that identifies how workload patterns change with scale; it is able to emulate a cluster of up to 60,000 nodes with only 8 physical machines to evaluate the NameNode and ResourceManager.
Proceedings Article

High performance MPI library over SR-IOV enabled InfiniBand clusters

TL;DR: This is the first study to offer a high-performance MPI library that supports efficient locality-aware MPI communication over SR-IOV-enabled InfiniBand clusters and significantly improves performance for point-to-point and collective operations.
Proceedings Article

HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters

TL;DR: This is the first attempt to propose a hybrid, adaptive solution that integrates all existing schemes to optimize arbitrary non-contiguous data movement using MPI datatypes on GPU clusters.
Proceedings Article

A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters

TL;DR: This paper presents a novel two-level workload partitioning approach for HPL that distributes workload based on the compute power of CPU/GPU nodes across the cluster and takes advantage of asynchronous kernel launches and CUDA copies to overlap computation and CPU-GPU data movement.