Open Access · Proceedings Article
Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks
Urs Köster, Tristan J. Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Howard Constable, Oguz H. Elibol, Scott Gray, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey K. Kloss, Ruby J. Pai, Naveen G. Rao, +13 more
Vol. 30, pp. 1742–1752
TLDR
The 16-bit Flexpoint data format as discussed by the authors is a complete replacement of 32-bit floating point format training and inference, designed to support modern deep network topologies without modifications.
Abstract:
Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of 32-bit floating point format training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize available dynamic range. We validate Flexpoint by training AlexNet, a deep residual network and a generative adversarial network, using a simulator implemented with the \emph{neon} deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.
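The core idea of the abstract, storing a tensor as integer mantissas that share a single exponent chosen from the tensor's dynamic range, can be sketched in a few lines. This is a minimal illustration of shared-exponent quantization, not the paper's actual Autoflex exponent-management algorithm; the function names and the exponent-selection policy (fit the largest magnitude into the signed mantissa range) are assumptions for illustration.

```python
import numpy as np

def flexpoint_quantize(x, mantissa_bits=16):
    """Quantize a tensor to a Flexpoint-like representation:
    integer mantissas sharing one power-of-two exponent, chosen so
    the largest magnitude in the tensor just fits without overflow.
    (Illustrative sketch, not the paper's exact scheme.)"""
    max_mag = np.max(np.abs(x))
    if max_mag == 0:
        return np.zeros(x.shape, dtype=np.int32), 0
    max_int = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa
    # Smallest exponent such that max_mag / 2**exp still fits in max_int.
    exp = int(np.ceil(np.log2(max_mag / max_int)))
    scale = 2.0 ** exp
    mantissas = np.clip(np.round(x / scale), -max_int - 1, max_int)
    return mantissas.astype(np.int32), exp

def flexpoint_dequantize(mantissas, exp):
    """Recover floating-point values from mantissas and shared exponent."""
    return mantissas.astype(np.float64) * (2.0 ** exp)

x = np.array([0.5, -1.25, 3.0, 0.001])
m, e = flexpoint_quantize(x)
x_hat = flexpoint_dequantize(m, e)
```

Because all elements share one exponent, arithmetic on the mantissas reduces to fixed-point integer operations, which is where the hardware efficiency claimed in the abstract comes from; the cost is that values much smaller than the tensor's maximum lose precision (here, 0.001 is rounded to the nearest multiple of 2^-13).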
Citations
Proceedings Article
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, +7 more
TL;DR: This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data that is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
Journal ArticleDOI
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey
TL;DR: This article reviews the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification, and answers the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures.
Proceedings ArticleDOI
A configurable cloud-scale DNN processor for real-time AI
Jeremy Fowers, Kalin Ovtcharov, Michael K. Papamichael, Todd Massengill, Ming Liu, Lo Daniel, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen F. Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, Doug Burger, +19 more
TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, and achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.
Journal ArticleDOI
Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis
Tal Ben-Nun, Torsten Hoefler, +1 more
TL;DR: This survey describes the parallelization of DNN training from a theoretical perspective, reviews approaches to achieving it in practice, and extrapolates potential directions for parallelism in deep learning.
Proceedings ArticleDOI
Data-Free Quantization Through Weight Equalization and Bias Correction
TL;DR: This work introduces a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection, and achieves near-original model performance on common computer vision architectures and tasks.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place on the ILSVRC 2015 classification task.
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: A deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.
Posted Content
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek G. Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay K. Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, Xiaoqiang Zheng, +39 more
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Book ChapterDOI
Identity Mappings in Deep Residual Networks
TL;DR: In this paper, it is shown that forward and backward signals can be directly propagated from one block to any other block when identity mappings are used as the skip connections and after-addition activation.
Posted Content
GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, Sepp Hochreiter, +5 more
TL;DR: In this article, a two time-scale update rule (TTUR) was proposed for training GANs with stochastic gradient descent on arbitrary GAN loss functions, using separate learning rates for the discriminator and the generator.