In-Datacenter Performance Analysis of a Tensor Processing Unit

doi:10.1145/3079856.3080246

Open Access Proceedings Article

- Vol. 45, Iss: 2, pp 1-12

TLDR

The Tensor Processing Unit (TPU) is a custom ASIC, deployed in datacenters since 2015, that accelerates the inference phase of neural networks (NN) using a matrix multiply unit with 65,536 8-bit MACs and a peak throughput of 92 TeraOps/second (TOPS).

Abstract:

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X to 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
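The peak-throughput figure in the abstract can be sanity-checked with a little arithmetic. The sketch below multiplies the MAC count by two operations per MAC (one multiply and one add, the usual counting convention) and by a clock rate. The 700 MHz clock is an assumption here, since the abstract does not state it, and the function and constant names are purely illustrative.

```python
import math

# Back-of-the-envelope check of the peak throughput quoted in the abstract.
MAC_COUNT = 65_536   # from the abstract: 8-bit MACs in the matrix multiply unit
OPS_PER_MAC = 2      # counting convention: one multiply plus one add per cycle
CLOCK_HZ = 700e6     # ASSUMPTION: 700 MHz clock; not stated in the abstract


def peak_tops(mac_count: int, ops_per_mac: int, clock_hz: float) -> float:
    """Return peak throughput in tera-operations per second (TOPS)."""
    return mac_count * ops_per_mac * clock_hz / 1e12


# 65,536 is a perfect square, so the MACs can be pictured as a square array.
array_side = math.isqrt(MAC_COUNT)

tops = peak_tops(MAC_COUNT, OPS_PER_MAC, CLOCK_HZ)
print(f"array side: {array_side} x {array_side}")
print(f"peak throughput: {tops:.2f} TOPS")
```

Under these assumptions the result rounds to the 92 TOPS quoted above. This is a peak figure only; the abstract notes that utilization is low for some applications, so achieved throughput is lower.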

Citations

Proceedings ArticleDOI

A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks

Youjie Li, +9 more

TL;DR: This paper sets out to reduce the significant communication cost of distributed training by embedding data compression accelerators in the Network Interface Cards (NICs), and proposes an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner.

Proceedings ArticleDOI

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations

Nitish Srivastava, +5 more

TL;DR: This work proposes a hardware accelerator that can accelerate both dense and sparse tensor factorizations and co-designs the hardware and a sparse storage format, which allows accessing the sparse data in vectorized and streaming fashion and maximizes the utilization of the memory bandwidth.

Posted Content

On the Opportunities and Risks of Foundation Models.

Rishi Bommasani, +113 more

- 16 Aug 2021 -

arXiv: Learning

TL;DR: The authors provide a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications.

Posted Content

Memory-Efficient Pipeline-Parallel DNN Training

Deepak Narayanan, +4 more

- 16 Jun 2020 -

arXiv: Learning

TL;DR: This work proposes PipeDream-2BW, a system that performs memory-efficient pipeline parallelism, a hybrid form of parallelism that combines data and model parallelism with input pipelining, able to accelerate the training of large language models with up to 2.5 billion parameters by up to 6.9x compared to optimized baselines.

Posted Content

Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark

Cody Coleman, +9 more

- 04 Jun 2018 -

arXiv: Learning

TL;DR: DAWNBENCH entries are analyzed to show that TTA has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods, and it is found that distributed entries can spend more than half of their time on communication.

References

Proceedings ArticleDOI

Going deeper with convolutions

Christian Szegedy, +8 more

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Journal ArticleDOI

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

- 24 May 2017 -

Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, +11 more

- 01 Dec 2015 -

International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.

Book

Computer Architecture: A Quantitative Approach

John L. Hennessy, +1 more

TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
