Yibo Zhu
Researcher at Microsoft
Publications - 95
Citations - 4165
Yibo Zhu is an academic researcher at Microsoft. The author has contributed to research on topics including computer science and graphene, has an h-index of 29, and has co-authored 82 publications receiving 2955 citations. Previous affiliations of Yibo Zhu include Columbia University and Tsinghua University.
Papers
Proceedings Article
Congestion Control for Large-Scale RDMA Deployments
Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, Ming Zhang
TL;DR: DCQCN, an end-to-end congestion control scheme for RoCEv2, is introduced, and it is shown that DCQCN dramatically improves throughput and fairness of RoCEv2 RDMA traffic.
Proceedings Article
Mirror mirror on the ceiling: flexible wireless links for data centers
TL;DR: 3D beamforming is proposed and evaluated, in which 60 GHz signals bounce off data center ceilings to establish indirect line-of-sight between any two racks, improving link range and the number of concurrent transmissions in the data center.
Proceedings Article
Packet-Level Telemetry in Large Datacenter Networks
Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, David A. Maltz, Lihua Yuan, Ming Zhang, Ben Y. Zhao, Haitao Zheng
TL;DR: This work presents Everflow, a packet-level network telemetry system for large data center networks (DCNs), demonstrates its scalability through experiments, and shares experiences of troubleshooting network faults gathered from running it for over 6 months in Microsoft's DCNs.
Proceedings Article
A generic communication scheduler for distributed DNN training acceleration
TL;DR: This work introduces a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines, and introduces a Bayesian Optimization approach to auto-tune tensor partition size and other parameters for different training models under various networking conditions.
Proceedings Article
Tiresias: A GPU Cluster Manager for Distributed Deep Learning
Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Harry Liu, Chuanxiong Guo
TL;DR: Tiresias is presented, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT), and its performance is comparable to that of solutions assuming perfect knowledge.