Zhekai Zhang
Researcher at Massachusetts Institute of Technology
Publications - 15
Citations - 778
Zhekai Zhang is an academic researcher at the Massachusetts Institute of Technology. His research focuses on topics including speedup and computer science. He has an h-index of 5 and has co-authored 9 publications receiving 495 citations.
Papers
Proceedings Article
Once for All: Train One Network and Specialize it for Efficient Deployment
TL;DR: The authors propose training a once-for-all (OFA) network that supports diverse architectural settings (depth, width, kernel size, and resolution); given a deployment scenario, a specialized sub-network is then selected from the OFA network without additional training.
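The selection step described above can be illustrated with a minimal sketch. The search-space dimensions (depth, width, kernel size, resolution) are the ones the paper names; the latency cost model and the random-search loop below are placeholders, not the paper's actual predictor or search strategy.

```python
import random

# Hypothetical sketch: after a once-for-all supernet is trained, deployment
# picks a sub-network that fits a latency budget -- no retraining needed.
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "width_mult": [0.75, 1.0, 1.25],
    "kernel_size": [3, 5, 7],
    "resolution": [128, 160, 192, 224],
}

def estimated_latency(cfg):
    # Placeholder cost model: latency grows with every dimension.
    return (cfg["depth"] * cfg["width_mult"]
            * cfg["kernel_size"] * cfg["resolution"]) / 100.0

def select_subnetwork(latency_budget, trials=1000, seed=0):
    """Randomly search the OFA space for the slowest (largest) sub-network
    that still fits the deployment latency budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        lat = estimated_latency(cfg)
        if lat <= latency_budget and (best is None or lat > best[1]):
            best = (cfg, lat)
    return best

cfg, lat = select_subnetwork(latency_budget=10.0)
print(cfg, round(lat, 2))
```

The key point the sketch captures is that specialization is a search over already-trained sub-networks, so each deployment target costs only an evaluation loop, not a training run.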
Posted Content
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
TL;DR: SpAtten is presented, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access; it proposes novel cascade token pruning to prune away unimportant tokens in the sentence.
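The token-pruning idea can be sketched as follows. This is a simplified stand-in, assuming importance is measured by the total attention a token receives; the random attention tensor and the `keep_ratio` parameter are illustrative, not SpAtten's exact scoring or thresholds.

```python
import numpy as np

# Hypothetical sketch of cascade token pruning: the least-attended tokens
# are removed, and later layers operate only on the survivors.
def cascade_prune(attn, keep_ratio):
    """attn: (heads, seq, seq) attention probabilities for one layer.
    Returns indices of tokens to keep, ranked by cumulative importance."""
    # Importance = total attention a token receives, summed over heads
    # and query positions.
    importance = attn.sum(axis=(0, 1))           # shape (seq,)
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])  # top-k, original order
    return keep

rng = np.random.default_rng(0)
seq = 8
attn = rng.random((4, seq, seq))
keep = cascade_prune(attn, keep_ratio=0.5)       # 4 surviving token indices
# The next layer's attention is computed only over the kept tokens,
# so the pruning compounds ("cascades") through the network:
attn_next = attn[:, keep][:, :, keep]
print(keep, attn_next.shape)
```

Because a pruned token never returns in later layers, the sequence length shrinks monotonically, which is what yields the compounding compute and memory-access savings the TL;DR describes.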
Proceedings ArticleDOI
SpArch: Efficient Architecture for Sparse Matrix Multiplication
TL;DR: SpArch, an efficient sparse matrix multiplication accelerator architecture that jointly optimizes data locality for both input and output matrices, is proposed; it reduces total DRAM access by 2.8x over the previous state of the art.
Posted Content
SpArch: Efficient Architecture for Sparse Matrix Multiplication
TL;DR: SpArch, as discussed by the authors, proposes a streaming-based merge that reduces the number of partial matrices by three orders of magnitude, cutting DRAM access by 5.4x, and further develops a Huffman-tree scheduler to improve the scalability of the merge for larger sparse matrices.
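The merge step that both SpArch summaries refer to can be sketched in software. Outer-product sparse matrix multiplication produces one partial-product matrix per outer product, and these must be merged by summing entries at matching coordinates; SpArch's contribution is doing this merge in streaming hardware, for which a k-way software merge below is only a stand-in. The input format (`A_cols`, `B_rows` as coordinate lists) is an illustrative assumption.

```python
import heapq
from collections import defaultdict

# Hypothetical sketch of outer-product SpGEMM with a merge step.
def outer_product_spgemm(A_cols, B_rows):
    """A_cols[k]: list of (row, val) nonzeros in column k of A.
    B_rows[k]: list of (col, val) nonzeros in row k of B.
    Returns the merged product as {(row, col): value}."""
    partials = []
    for k in A_cols:
        # One partial-product matrix per outer product, kept sorted
        # by (row, col) so the merge can stream through it.
        p = sorted((r, c, av * bv)
                   for r, av in A_cols[k]
                   for c, bv in B_rows.get(k, []))
        if p:
            partials.append(p)
    merged = defaultdict(float)
    for r, c, v in heapq.merge(*partials):   # k-way streaming merge
        merged[(r, c)] += v                  # sum colliding coordinates
    return dict(merged)

A_cols = {0: [(0, 1.0), (2, 2.0)], 1: [(1, 3.0)]}
B_rows = {0: [(0, 4.0)], 1: [(0, 5.0), (2, 1.0)]}
print(outer_product_spgemm(A_cols, B_rows))
```

The number of partial matrices to merge, and how the merges are scheduled, is exactly where the streaming merge and the Huffman-tree scheduler in the TL;DR come in.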
Proceedings ArticleDOI
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
TL;DR: SpAtten, as discussed by the authors, leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access in NLP applications, and proposes cascade token pruning to prune away unimportant tokens in the sentence.