
Yibo Zhu

Researcher at Microsoft

Publications -  95
Citations -  4165

Yibo Zhu is an academic researcher at Microsoft. The author has contributed to research in topics: Computer science & Graphene. The author has an h-index of 29 and has co-authored 82 publications receiving 2955 citations. Previous affiliations of Yibo Zhu include Columbia University & Tsinghua University.

Papers
Proceedings Article

Congestion Control for Large-Scale RDMA Deployments

TL;DR: DCQCN, an end-to-end congestion control scheme for RoCEv2, is introduced and shown to dramatically improve the throughput and fairness of RoCEv2 RDMA traffic.
Proceedings Article

Mirror mirror on the ceiling: flexible wireless links for data centers

TL;DR: 3D beamforming is proposed and evaluated, in which 60 GHz signals bounce off data center ceilings to establish indirect line-of-sight between any two racks, improving link range and the number of concurrent transmissions in the data center.
Proceedings Article

Packet-Level Telemetry in Large Datacenter Networks

TL;DR: This work presents Everflow, a packet-level network telemetry system for large DCNs, reports experiments that demonstrate its scalability, and shares experiences of troubleshooting network faults gathered from running it for over 6 months in Microsoft's DCNs.
Proceedings Article

A generic communication scheduler for distributed DNN training acceleration

TL;DR: This work introduces a unified abstraction and a Dependency Proxy mechanism that enable communication scheduling without breaking the original dependencies in framework engines, along with a Bayesian Optimization approach that auto-tunes tensor partition size and other parameters for different training models under various networking conditions.
Proceedings Article

Tiresias: A GPU Cluster Manager for Distributed Deep Learning

TL;DR: Tiresias, a GPU cluster manager tailored for distributed DL training jobs, is presented; it efficiently schedules and places DL jobs to reduce their job completion times (JCT), and its performance is comparable to that of solutions assuming perfect knowledge.