A configurable cloud-scale DNN processor for real-time AI

doi:10.1109/ISCA.2018.00012

Proceedings ArticleDOI

- pp 1-14

TLDR

This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, which achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.

Abstract:

Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
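
The figures in the abstract can be related with some back-of-the-envelope arithmetic. The Python sketch below is illustrative only: the 96,000 multiply-accumulate units come from the abstract, while the clock frequency (`ASSUMED_CLOCK_HZ`) and the 2000 x 2000 matrix size are assumptions chosen for the example and are not stated in the abstract. The function names are hypothetical helpers, not part of any Brainwave toolchain.

```python
"""Rough arithmetic relating MAC count, clock rate, and teraflops."""

MAC_UNITS = 96_000        # stated in the abstract
ASSUMED_CLOCK_HZ = 250e6  # ASSUMPTION: not given in the abstract


def matvec_ops(rows: int, cols: int) -> int:
    """Scalar operations in one dense matrix-vector product.

    Each of the rows * cols matrix elements contributes one multiply
    and one add, so the total is 2 * rows * cols.
    """
    return 2 * rows * cols


def peak_teraflops(mac_units: int, clock_hz: float) -> float:
    """Peak throughput if every MAC unit completes one multiply-add
    (counted as two operations) on every clock cycle."""
    return 2 * mac_units * clock_hz / 1e12


def utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of peak throughput actually achieved."""
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # Hypothetical 2000 x 2000 weight matrix applied to one input vector
    # (batch size of 1), as in a single RNN gate computation.
    print("ops in one matvec:", matvec_ops(2000, 2000))

    peak = peak_teraflops(MAC_UNITS, ASSUMED_CLOCK_HZ)
    print("assumed peak teraflops:", peak)

    # The abstract reports ten to over thirty-five teraflops.
    for achieved in (10.0, 35.0):
        print(achieved, "teraflops ->", round(utilization(achieved, peak), 2))
```

Under these assumptions, a single 2000 x 2000 matrix-vector product amounts to 8 million scalar operations, the same order of magnitude as the "over 7M operations from a single instruction" quoted above, and 35 teraflops would correspond to roughly 73% of the assumed peak. A different clock frequency would change the peak and utilization figures proportionally.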

Citations

Journal ArticleDOI

Towards artificial general intelligence with hybrid Tianjic chip architecture.

Jing Pei, +24 more

- 31 Jul 2019 -

Nature

TL;DR: The Tianjic chip is presented, which integrates neuroscience-oriented and computer-science-oriented approaches to artificial general intelligence to provide a hybrid, synergistic platform and is expected to stimulate AGI development by paving the way to more generalized hardware platforms.

Journal ArticleDOI

A new golden age for computer architecture

John L. Hennessy, +1 more

- 28 Jan 2019 -

Communications of The ACM

TL;DR: Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.

Journal ArticleDOI

A Survey on Green 6G Network: Architecture and Technologies

Tongyi Huang, +5 more

- 04 Dec 2019 -

IEEE Access

TL;DR: This paper presents a detailed survey of wireless evolution towards 6G networks, characterized by ubiquitous 3D coverage, the introduction of pervasive AI, and an enhanced network protocol stack, and of related potential technologies that are helpful in forming sustainable and socially seamless networks.

Proceedings ArticleDOI

SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training

Eric Qin, +7 more

TL;DR: SIGMA is proposed, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs) regardless of kernel shape and sparsity, and includes a novel reduction tree microarchitecture named Forwarding Adder Network (FAN).

Proceedings ArticleDOI

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

Yakun Sophia Shao, +16 more

TL;DR: This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements, and introduces three tiling optimizations that improve data locality.

References

Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: This paper proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: A deep convolutional neural network achieves state-of-the-art ImageNet classification performance; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Journal ArticleDOI

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Posted Content

Caffe: Convolutional Architecture for Fast Feature Embedding

Yangqing Jia, +7 more

- 20 Jun 2014 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Caffe is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

TensorFlow: a system for large-scale machine learning

Related Papers (5)

In-Datacenter Performance Analysis of a Tensor Processing Unit

Deep Residual Learning for Image Recognition

EIE: efficient inference engine on compressed deep neural network

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks