Shibo Wang
Publications - 9
Citations - 2007
Shibo Wang is an academic researcher. The author has contributed to research in topics including Computer science & Language model. The author has an h-index of 7 and has co-authored 7 publications receiving 446 citations.
Papers
Posted Content
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati,James Qin,Chung-Cheng Chiu,Niki Parmar,Yu Zhang,Jiahui Yu,Wei Han,Shibo Wang,Zhengdong Zhang,Yonghui Wu,Ruoming Pang +10 more
TL;DR: This work proposes the convolution-augmented Transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.
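As a rough illustration of the architecture this summary describes, below is a minimal PyTorch sketch of a single Conformer block: two half-step ("macaron") feed-forward modules sandwiching self-attention and a convolution module, with pre-norm residual connections. The layer sizes, kernel size, and the omission of relative positional encoding are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal, illustrative sketch of a Conformer block; hyperparameters are
# assumptions, and the paper's relative positional encoding is omitted.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)  # feeds GLU
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()  # Swish
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)       # -> (batch, dim, time)
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, ff_mult=4):
        super().__init__()
        def ff():  # pre-norm feed-forward module
            return nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, ff_mult * dim), nn.SiLU(),
                                 nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ff1(x)              # half-step macaron FFN
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        x = x + self.conv(x)                   # local (convolutional) context
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Usage: a batch of 2 sequences, 100 frames, 256-dim features.
y = ConformerBlock()(torch.randn(2, 100, 256))
```

The sandwich structure is the point of the design: self-attention captures global dependencies while the depthwise convolution captures local ones, so neither component has to do both jobs.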
Proceedings ArticleDOI
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati,James Qin,Chung-Cheng Chiu,Niki Parmar,Yu Zhang,Jiahui Yu,Wei Han,Shibo Wang,Zhengdong Zhang,Yonghui Wu,Ruoming Pang +10 more
TL;DR: Conformer combines convolutional neural networks and Transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way, achieving state-of-the-art accuracy.
Posted Content
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Yu Zhang,Daniel S. Park,Wei Han,James Qin,Anmol Gulati,Joel Shor,Aren Jansen,Yuanzhong Xu,Yanping Huang,Shibo Wang,Zongwei Zhou,Bo Li,Min Ma,William Chan,Jiahui Yu,Yongqiang Wang,Liangliang Cao,Khe Chai Sim,Bhuvana Ramabhadran,Tara N. Sainath,Francoise Beaufays,Zhifeng Chen,Quoc V. Le,Chung-Cheng Chiu,Ruoming Pang,Yonghui Wu +25 more
TL;DR: In this article, the authors show that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data.
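To make the self-training half of this recipe concrete, here is a hedged toy sketch of the pseudo-labeling step: a teacher trained on a small labeled set labels a large unlabeled pool, and a student is then trained on the union. The nearest-centroid "models" and synthetic 2-D data are purely illustrative assumptions; the paper trains large Conformer ASR models on audio.

```python
# Toy illustration of self-training (pseudo-labeling); the models and data
# here are stand-in assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(x, y):
    """'Train' a nearest-centroid classifier: one mean per class."""
    return np.stack([x[y == c].mean(axis=0) for c in np.unique(y)])

def predict(centroids, x):
    return np.argmin(((x[:, None, :] - centroids) ** 2).sum(-1), axis=1)

# Small labeled set and a large unlabeled pool from two Gaussian classes.
means = np.array([[0.0, 0.0], [4.0, 4.0]])
labeled_x = rng.normal(size=(20, 2)) + np.repeat(means, 10, axis=0)
labeled_y = np.repeat([0, 1], 10)
unlabeled_x = rng.normal(size=(2000, 2)) + means[rng.integers(0, 2, size=2000)]

teacher = fit_centroids(labeled_x, labeled_y)   # trained on labels only
pseudo_y = predict(teacher, unlabeled_x)        # teacher pseudo-labels the pool
student = fit_centroids(np.vstack([labeled_x, unlabeled_x]),
                        np.concatenate([labeled_y, pseudo_y]))
```

The data-efficiency claim is that the student, trained on the much larger pseudo-labeled set, needs far fewer ground-truth labels to reach a given quality than a model trained on labels alone.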
Posted Content
Scale MLPerf-0.6 models on Google TPU-v3 Pods.
Sameer Kumar,Victor Bitorff,Dehao Chen,Chiachen Chou,Blake A. Hechtman,HyoukJoong Lee,Naveen Kumar,Peter Mattson,Shibo Wang,Tao Wang,Yuanzhong Xu,Zongwei Zhou +11 more
TL;DR: This work discusses the optimizations and techniques, including choice of optimizer, spatial partitioning, and weight-update sharding, necessary to scale to 1024 TPU chips, and identifies properties of models that make scaling them challenging, such as limited data parallelism and unscaled weights.
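To make "spatial partitioning" concrete, here is a toy numpy sketch of the idea under simplifying assumptions (a 1-D "valid" convolution, one input split across two devices): each shard borrows a halo of boundary elements from its neighbor so that local convolutions tile the global result. The real technique partitions the spatial dimensions of image models across TPU cores.

```python
# Toy sketch of spatial partitioning with halo exchange; the 1-D "valid"
# convolution and two-device split are simplifying assumptions.
import numpy as np

def conv1d_valid(x, k):
    """Plain 'valid' 1-D correlation of signal x with kernel k."""
    n = len(k)
    return np.array([np.dot(x[i:i + n], k) for i in range(len(x) - n + 1)])

signal = np.arange(16, dtype=float)
kernel = np.array([0.25, 0.5, 0.25])
halo = len(kernel) - 1   # extra inputs a shard needs from its neighbor

# Split one input across two "devices"; the left shard borrows a halo of
# boundary elements from the right shard so the local outputs tile the
# global output exactly.
left_in, right_in = np.split(signal, 2)
left_out = conv1d_valid(np.concatenate([left_in, right_in[:halo]]), kernel)
right_out = conv1d_valid(right_in, kernel)

assert np.allclose(np.concatenate([left_out, right_out]),
                   conv1d_valid(signal, kernel))
```

This is what lets a single large image be spread over many cores when the batch size (and hence data parallelism) is limited.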
Posted Content
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
TL;DR: This paper presents an approach to automatically shard the weight-update computation across replicas with efficient communication primitives and data formatting, using static analysis and transformations on the training computation graph. The approach achieves substantial speedups on typical image and language models on Cloud TPUs and requires no change to model code.
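The core idea can be demonstrated with a toy numpy simulation under some assumptions (four replicas, a flat weight vector, plain SGD): rather than all-reducing full gradients and having every replica apply the full optimizer update redundantly, gradients are reduce-scattered so each replica updates only its own weight shard, and the updated shards are then all-gathered. This is a sketch of the general pattern, not the paper's actual graph transformation.

```python
# Toy simulation of cross-replica weight-update sharding; replica count,
# shapes, and plain SGD are assumptions for illustration.
import numpy as np

num_replicas, dim, lr = 4, 8, 0.1
rng = np.random.default_rng(0)
weights = rng.normal(size=dim)                  # weights replicated everywhere
grads = [rng.normal(size=dim) for _ in range(num_replicas)]  # per-replica grads

# Baseline: all-reduce full gradients; every replica applies the full update.
baseline = weights - lr * np.mean(grads, axis=0)

# Sharded: reduce-scatter gives each replica one averaged gradient shard;
# each replica updates only its weight shard, then shards are all-gathered.
shard = dim // num_replicas
updated_shards = []
for r in range(num_replicas):
    g_shard = np.mean([g[r * shard:(r + 1) * shard] for g in grads], axis=0)
    w_shard = weights[r * shard:(r + 1) * shard] - lr * g_shard
    updated_shards.append(w_shard)

# The all-gather of shards reproduces the baseline result exactly.
assert np.allclose(np.concatenate(updated_shards), baseline)
```

The payoff grows with the optimizer: for optimizers with large auxiliary state, each replica only needs to store and update the state for its own shard.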