Proceedings ArticleDOI
A configurable cloud-scale DNN processor for real-time AI
Jeremy Fowers,Kalin Ovtcharov,Michael K. Papamichael,Todd Massengill,Ming Liu,Lo Daniel,Shlomi Alkalay,Michael Haselman,Logan Adams,Mahdi Ghandi,Stephen F. Heil,Prerak Patel,Adam Sapek,Gabriel Weisz,Lisa Woods,Sitaram Lanka,Steven K. Reinhardt,Adrian M. Caulfield,Eric S. Chung,Doug Burger +19 more
- pp 1-14
TLDR
This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, and achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.5 teraflops.Abstract:
Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models—aka ""real-time AI"". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.read more
Citations
More filters
Journal ArticleDOI
Towards artificial general intelligence with hybrid Tianjic chip architecture.
Jing Pei,Lei Deng,Sen Song,Sen Song,Mingguo Zhao,Youhui Zhang,Shuang Wu,Guanrui Wang,Zhe Zou,Zhenzhi Wu,Wei He,Feng Chen,Ning Deng,Si Wu,Yu Wang,Yujie Wu,Z. Yang,Cheng Ma,Guoqi Li,Wentao Han,Huanglong Li,Huaqiang Wu,Rong Zhao,Yuan Xie,Luping Shi +24 more
TL;DR: The Tianjic chip is presented, which integrates neuroscience-oriented and computer-science-oriented approaches to artificial general intelligence to provide a hybrid, synergistic platform and is expected to stimulate AGI development by paving the way to more generalized hardware platforms.
Journal ArticleDOI
A new golden age for computer architecture
TL;DR: Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.
Journal ArticleDOI
A Survey on Green 6G Network: Architecture and Technologies
TL;DR: This survey presents a detailed survey on wireless evolution towards 6G networks, characterized by ubiquitous 3D coverage, introduction of pervasive AI and enhanced network protocol stack, and related potential technologies that are helpful in forming sustainable and socially seamless networks.
Proceedings ArticleDOI
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
Eric Qin,Ananda Samajdar,Hyoukjun Kwon,Vineet Nadella,Sudarshan Srinivasan,Dipankar Das,Bharat Kaul,Tushar Krishna +7 more
TL;DR: SIGMA is proposed, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity, and includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN).
Proceedings ArticleDOI
Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
Yakun Sophia Shao,Jason Clemons,Rangharajan Venkatesan,Brian Zimmer,Matthew Fojtik,Nan Jiang,Ben Keller,Alicia Klinefelter,Nathaniel Pinckney,Priyanka Raina,Stephen G. Tell,Yanqing Zhang,William J. Dally,Joel Emer,C. Thomas Gray,Brucek Khailany,Stephen W. Keckler +16 more
TL;DR: This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements, and introduces three tiling optimizations that improve data locality.
References
More filters
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Journal ArticleDOI
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Posted Content
Caffe: Convolutional Architecture for Fast Feature Embedding
Yangqing Jia,Evan Shelhamer,Jeff Donahue,Sergey Karayev,Jonathan Long,Ross Girshick,Sergio Guadarrama,Trevor Darrell +7 more
TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Proceedings ArticleDOI
TensorFlow: a system for large-scale machine learning
Martín Abadi,Paul Barham,Jianmin Chen,Zhifeng Chen,Andy Davis,Jeffrey Dean,Matthieu Devin,Sanjay Ghemawat,Geoffrey Irving,Michael Isard,Manjunath Kudlur,Josh Levenberg,Rajat Monga,Sherry Moore,Derek G. Murray,Benoit Steiner,Paul A. Tucker,Vijay K. Vasudevan,Pete Warden,Martin Wicke,Yuan Yu,Xiaoqiang Zheng +21 more
TL;DR: TensorFlow as mentioned in this paper is a machine learning system that operates at large scale and in heterogeneous environments, using dataflow graphs to represent computation, shared state, and the operations that mutate that state.
Related Papers (5)
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi,Cliff Young,Nishant Patil,David A. Patterson,Gaurav Agrawal,Raminder Bajwa,Sarah Bates,Suresh Bhatia,Nan Boden,Albert T. Borchers,Rick Boyle,Pierre-luc Cantin,Clifford Chao,Christopher Aaron Clark,Jeremy Coriell,Michael J. Daley,Matt Dau,Jeffrey Dean,Ben Gelb,Tara Vazir Ghaemmaghami,Rajendra Gottipati,William John Gulland,Robert Hagmann,C. Richard Ho,Doug Hogberg,John Hu,Robert Hundt,D. Hurt,Julian Ibarz,Aaron Jaffey,Alek Jaworski,Alexander Kaplan,Khaitan Harshit,Daniel Killebrew,Andy Koch,Naveen Kumar,Steve Lacy,James Laudon,James Law,Diemthu Le,Chris Leary,Zhuyuan Liu,Kyle Lucke,Alan Lundin,Gordon MacKean,Adriana Maggiore,Maire Mahony,Kieran Miller,Rahul Nagarajan,Ravi Narayanaswami,Ray Ni,Kathy Nix,Thomas Norrie,Mark Omernick,Narayana Penukonda,Andrew Everett Phelps,Jonathan Ross,Matt Ross,Amir Salek,Emad Samadiani,Chris Severn,Gregory Sizikov,Matthew Snelham,Jed Souter,Dan Steinberg,Andy Swing,Mercedes Tan,Gregory Michael Thorson,Bo Tian,Horia Toma,Erick Tuttle,Vijay K. Vasudevan,Richard Walter,Walter Wang,Eric Wilcox,Doe Hyun Yoon +75 more