
Zhekai Zhang

Researcher at Massachusetts Institute of Technology

Publications - 15
Citations - 778

Zhekai Zhang is an academic researcher from the Massachusetts Institute of Technology. The author has contributed to research in topics: Speedup and Computer science. The author has an h-index of 5 and has co-authored 9 publications receiving 495 citations.

Papers
Proceedings Article

Once for All: Train One Network and Specialize it for Efficient Deployment

TL;DR: In this paper, the authors propose to train a once-for-all (OFA) network that supports diverse architectural settings (depth, width, kernel size, and resolution); given a deployment scenario, a specialized subnetwork can then be selected from the OFA network without additional training.
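The selection step can be illustrated with a short sketch. Everything below is a toy stand-in: the search space is shrunk to a few values, and the latency and accuracy functions are placeholder proxies, not OFA's trained predictors. The point is only that, once the once-for-all network is trained, choosing a subnetwork for a deployment budget is a search over architectural settings with no additional training.

from itertools import product

# Hypothetical, shrunken search space; OFA's real space is far larger.
DEPTHS = [2, 3, 4]
WIDTHS = [3, 4, 6]
KERNELS = [3, 5, 7]
RESOLUTIONS = [128, 160, 192, 224]

def estimated_latency_ms(d, w, k, r):
    # Placeholder proxy standing in for a measured latency lookup table.
    return 0.05 * d * w * k * (r / 32) ** 2

def estimated_accuracy(d, w, k, r):
    # Placeholder proxy standing in for OFA's accuracy predictor.
    return 60 + 2 * d + 1.5 * w + 0.5 * k + r / 32

def select_subnetwork(latency_budget_ms):
    # Pick the highest-scoring config that fits the budget -- no retraining.
    best = None
    for d, w, k, r in product(DEPTHS, WIDTHS, KERNELS, RESOLUTIONS):
        if estimated_latency_ms(d, w, k, r) <= latency_budget_ms:
            acc = estimated_accuracy(d, w, k, r)
            if best is None or acc > best[0]:
                best = (acc, (d, w, k, r))
    return best

print(select_subnetwork(latency_budget_ms=50.0))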
Posted Content

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

TL;DR: SpAtten is presented, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access; it proposes novel cascade token pruning to prune away unimportant tokens in the sentence.
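A minimal sketch of the token-pruning idea, in NumPy. The importance score here (attention received by each token, accumulated over heads and queries) is a stand-in for SpAtten's cumulative importance score; "cascade" in the paper means tokens pruned at one layer stay pruned for all later layers, which this single-layer sketch does not show.

import numpy as np

def token_prune(attn_probs, keep_ratio):
    # attn_probs: (heads, seq, seq) softmax attention for one layer.
    # A token's importance is the attention it receives, summed over
    # heads and query positions (approximating cumulative importance).
    importance = attn_probs.sum(axis=(0, 1))        # (seq,)
    k = max(1, int(keep_ratio * importance.size))
    keep = np.sort(np.argsort(importance)[-k:])     # retained token indices
    return keep

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(token_prune(probs, keep_ratio=0.5))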
Proceedings Article

SpArch: Efficient Architecture for Sparse Matrix Multiplication

TL;DR: An efficient sparse matrix multiplication accelerator architecture, SpArch, is proposed that jointly optimizes data locality for both input and output matrices, reducing total DRAM access by 2.8x over the previous state-of-the-art.
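The locality problem SpArch targets comes from outer-product sparse matrix multiplication: each nonzero column of A times the matching row of B yields a partial-product matrix, and accumulating ("merging") those partials dominates DRAM traffic if they spill off-chip. A dictionary-based sketch of that dataflow, with hypothetical COO-style inputs rather than SpArch's actual on-chip format:

from collections import defaultdict

def outer_product_spgemm(A_cols, B_rows):
    # A_cols[k] = {i: value} is column k of A; B_rows[k] = {j: value} is row k of B.
    C = defaultdict(float)
    for k in set(A_cols) & set(B_rows):
        for i, a in A_cols[k].items():      # one k = one partial-product matrix
            for j, b in B_rows[k].items():
                C[(i, j)] += a * b          # the merge step SpArch keeps on-chip
    return dict(C)

A_cols = {0: {0: 1.0, 2: 3.0}, 1: {1: 2.0}}
B_rows = {0: {1: 4.0}, 1: {0: 5.0, 2: 6.0}}
print(outer_product_spgemm(A_cols, B_rows))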
Posted Content

SpArch: Efficient Architecture for Sparse Matrix Multiplication

TL;DR: SpArch, as discussed by the authors, proposes a streaming-based merger that reduces the number of partial matrices by three orders of magnitude, cutting DRAM access by 5.4x, and further develops a Huffman tree scheduler to improve the scalability of the merge for larger sparse matrices.
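The Huffman tree scheduler can be pictured with a classic two-way analogy: always merge the smallest partial matrices first, so the largest partials pass through as few merge rounds as possible. This heap-based sketch only counts elements moved as a rough proxy for merge traffic; the real scheduler orders merges for a many-way hardware merger.

import heapq

def huffman_merge_cost(partial_sizes):
    # Merge the two smallest partials each round (Huffman order) and
    # return the total number of elements moved across all rounds.
    heap = list(partial_sizes)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b               # cost of this merge round
        heapq.heappush(heap, a + b)  # merged partial re-enters the pool
    return total

print(huffman_merge_cost([5, 9, 2, 7, 3]))  # -> 57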
Proceedings Article

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

TL;DR: SpAtten, as discussed by the authors, leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access in NLP applications, and proposes cascade token pruning to prune away unimportant tokens in the sentence.