TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis
Vol. 45, Iss. 1, pp. 751-764
TL;DR
The hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory, are presented, and it is shown that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations.

Abstract
The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels. This paper presents the hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves performance by 4.1x and reduces energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.
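To make the scheduling claim concrete, below is a minimal sketch of what an analytical dataflow scheduler evaluates: given a convolutional layer and a candidate loop tiling, it estimates off-chip traffic in closed form so that candidate schedules can be ranked without exhaustive search. The function name, parameters, and the simplified cost model are our illustration, not the paper's implementation.

```python
# Hypothetical sketch: closed-form DRAM-traffic estimate for one conv
# layer under a candidate tiling. The cost model is deliberately simple
# and is NOT the TETRIS scheduler; it only illustrates the approach.

def dram_traffic(N, C, K, H, W, R, tc, tk, th, tw):
    """Bytes moved for a conv layer (batch N, C in / K out channels,
    H x W output, R x R filters) with tile sizes tc/tk/th/tw that
    divide their dimensions evenly."""
    n_c, n_k = C // tc, K // tk
    n_h, n_w = H // th, W // tw
    # Inputs (with halos) are re-fetched once per output-channel tile group.
    ifmap = N * n_k * C * (th + R - 1) * (tw + R - 1) * n_h * n_w
    # Weights are re-fetched once per batch/spatial tile group.
    weights = N * n_h * n_w * C * K * R * R
    # Outputs are written once; partial sums spill per input-channel tile.
    ofmap = N * K * H * W * (2 * n_c - 1)
    return 2 * (ifmap + weights + ofmap)  # 16-bit (2-byte) data

# Rank two candidate tilings for a VGG-like layer analytically.
layer = dict(N=16, C=256, K=256, H=28, W=28, R=3)
print(dram_traffic(**layer, tc=64, tk=64, th=14, tw=14))
print(dram_traffic(**layer, tc=256, tk=32, th=28, tw=28))
```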
Citations
Journal Article
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
TL;DR: The authors provide a comprehensive tutorial and survey of recent advances toward enabling efficient processing of DNNs, discuss the various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of deep neural networks, either solely via hardware design changes or via joint hardware and DNN algorithm changes.
Proceedings Article
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, William J. Dally
TL;DR: The Sparse CNN (SCNN) accelerator employs a dataflow that maintains the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements.
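The core idea is easy to demonstrate in miniature: store only the nonzero operands with their coordinates, form products over pairs of nonzeros, and scatter-accumulate each product into the output location it contributes to. The 1-D toy below is our own illustration of that compressed dataflow, not SCNN's actual microarchitecture.

```python
# Toy compressed-sparse convolution (1-D, stride 1, "valid" padding).
# Illustrative only; SCNN operates on 2-D conv layers in hardware.
import numpy as np

def compress(x):
    """Return (values, coordinates) for the nonzeros of a 1-D array."""
    idx = np.flatnonzero(x)
    return x[idx], idx

def sparse_conv1d(acts, weights, out_len):
    a_val, a_idx = compress(acts)
    w_val, w_idx = compress(weights)
    out = np.zeros(out_len)
    for av, ai in zip(a_val, a_idx):       # all-to-all nonzero pairs
        for wv, wi in zip(w_val, w_idx):
            o = ai - wi                    # output coordinate
            if 0 <= o < out_len:
                out[o] += av * wv          # scatter-accumulate
    return out

acts = np.array([0, 3.0, 0, 0, 2.0, 0, 0, 1.0])
w = np.array([1.0, 0, -1.0])
print(sparse_conv1d(acts, w, len(acts) - len(w) + 1))
# Only 3 x 2 = 6 multiplies instead of 6 x 3 = 18 dense MACs.
```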
Proceedings Article
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, Hadi Esmaeilzadeh, et al.
TL;DR: This work designs Bit Fusion, a bit-flexible accelerator composed of an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers, and compares it to two state-of-the-art DNN accelerators, Eyeriss and Stripes.
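The arithmetic being exploited is that a wide multiplication decomposes exactly into 2-bit partial products that small processing elements can compute and then fuse with shifts and adds; narrower operands simply occupy fewer elements. The sketch below (our illustration, unsigned operands only, not the Bit Fusion microarchitecture) verifies that decomposition.

```python
# Decompose a multiply into 2-bit x 2-bit partial products ("bit bricks")
# and fuse them with shift-and-add. Unsigned-only, illustrative sketch.

def split_2bit(x, bits):
    """Split an unsigned value into 2-bit digits, LSB first."""
    return [(x >> s) & 0b11 for s in range(0, bits, 2)]

def fused_mul(a, b, a_bits=8, b_bits=8):
    total = 0
    for i, ad in enumerate(split_2bit(a, a_bits)):
        for j, bd in enumerate(split_2bit(b, b_bits)):
            total += (ad * bd) << (2 * (i + j))  # shift-and-add fusion
    return total

assert fused_mul(173, 91) == 173 * 91             # 8b x 8b uses 16 bricks
assert fused_mul(5, 9, a_bits=4, b_bits=4) == 45  # 4b x 4b uses only 4
print("bit-level decomposition verified")
```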
Journal Article
Scaling for edge inference of deep neural networks
TL;DR: For edge inference, there are growing gaps between the computational complexity and energy efficiency that continued scaling of deep neural networks requires and the hardware capacity actually delivered by current CMOS technology scaling.
Proceedings Article
Timeloop: A Systematic Approach to DNN Accelerator Evaluation
Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, Joel Emer
TL;DR: Timeloop's underlying models and algorithms are described in detail, and case studies enabled by Timeloop show that dataflow and memory-hierarchy co-design plays a critical role in optimizing energy efficiency.
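What a tool like this automates can be shown in a few lines: enumerate candidate mappings of a loop nest onto a memory hierarchy, charge each level per access, and rank the mappings. The per-access energies and the one-level cost model below are made-up placeholders for illustration, not Timeloop's models.

```python
# Toy mapping-space search for an M x K @ K x N matmul with one level of
# tiling. Energy numbers and the cost model are illustrative assumptions.
from itertools import product

ENERGY_PJ = {"mac": 0.1, "sram": 5.0, "dram": 200.0}  # assumed costs

def mapping_energy(M, N, K, tm, tn, tk):
    n_m, n_n, n_k = M // tm, N // tn, K // tk
    macs = M * N * K
    # SRAM: each tile operand is read once per tile instance.
    sram = n_m * n_n * n_k * (tm * tk + tk * tn + tm * tn)
    # DRAM: each operand streams once per reuse-breaking outer loop.
    dram = n_n * M * K + n_m * K * N + n_k * M * N
    return (macs * ENERGY_PJ["mac"] + sram * ENERGY_PJ["sram"]
            + dram * ENERGY_PJ["dram"])

M = N = K = 256
best = min(product((16, 32, 64), repeat=3),
           key=lambda t: mapping_energy(M, N, K, *t))
print("lowest-energy tiling (tm, tn, tk):", best)
```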
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously and won first place in the ILSVRC 2015 classification task.
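The mechanism is a shortcut connection: each block learns a residual F(x) and emits F(x) + x, so extra depth can at worst approximate an identity mapping instead of degrading accuracy. A minimal numpy sketch follows, with dense layers standing in for the paper's conv layers; the shapes and initialization are our own.

```python
# Residual block in miniature: out = relu(F(x) + x).
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    f = relu(x @ w1)    # first transform + nonlinearity
    f = f @ w2          # second transform
    return relu(f + x)  # identity shortcut, then activation

d = 64
x = rng.standard_normal((8, d))
w1 = rng.standard_normal((d, d)) * 0.01  # near-zero weights ...
w2 = rng.standard_normal((d, d)) * 0.01  # ... make F(x) ~ 0
y = residual_block(x, w1, w2)
print(np.allclose(y, relu(x), atol=0.1))  # block starts near identity
```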
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: State-of-the-art ImageNet classification is achieved with a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
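The layer dimensions of that stack can be verified with simple output-size arithmetic; the walk-through below uses the commonly cited configuration (a 227x227 input, which resolves the paper's 224 off-by-one) and is a reconstruction from the published description rather than reference code.

```python
# Shape propagation through the AlexNet-style stack described above.
def conv_out(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1

s = 227                          # input: 227 x 227 x 3
s = conv_out(s, 11, stride=4)    # conv1, 96 maps   -> 55
s = conv_out(s, 3, stride=2)     # max-pool         -> 27
s = conv_out(s, 5, pad=2)        # conv2, 256 maps  -> 27
s = conv_out(s, 3, stride=2)     # max-pool         -> 13
s = conv_out(s, 3, pad=1)        # conv3, 384 maps  -> 13
s = conv_out(s, 3, pad=1)        # conv4, 384 maps  -> 13
s = conv_out(s, 3, pad=1)        # conv5, 256 maps  -> 13
s = conv_out(s, 3, stride=2)     # max-pool         -> 6
print(s * s * 256)               # 9216 inputs to fc6 (4096) -> fc7 (4096)
                                 # -> fc8: 1000-way softmax
```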
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that a significant improvement on prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
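That design point is easy to check with receptive-field and parameter arithmetic: stacking small 3x3 convolutions covers the same receptive field as one larger filter with fewer weights (assuming, as a simplification, an equal channel count C in and out). A quick worked check:

```python
# Stacked 3x3 convolutions vs. one large filter (weights only, C channels).
def receptive_field(num_3x3_layers):
    return 1 + 2 * num_3x3_layers  # each 3x3 layer adds 2

def params(kernel, layers, C):
    return layers * kernel**2 * C * C

C = 256
print(receptive_field(2), params(3, 2, C), params(5, 1, C))  # rf 5: 18C^2 < 25C^2
print(receptive_field(3), params(3, 3, C), params(7, 1, C))  # rf 7: 27C^2 < 49C^2
```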
Journal Article
Deep learning
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Related Papers (5)
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon