Author

Kailash Gopalakrishnan

Other affiliations: Stanford University
Bio: Kailash Gopalakrishnan is an academic researcher from IBM. The author has contributed to research in the topics of deep learning and artificial neural networks, has an h-index of 31, and has co-authored 108 publications receiving 7,754 citations. Previous affiliations of Kailash Gopalakrishnan include Stanford University.


Papers
Posted Content
TL;DR: The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
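The rounding scheme the abstract highlights is easy to illustrate. Below is a minimal sketch of stochastic rounding onto a fixed-point grid, assuming a hypothetical 16-bit word with a chosen number of fractional bits; the function name and parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=12, word_bits=16):
    """Quantize x onto a signed fixed-point grid with `frac_bits` fractional bits,
    rounding each value down or up with probability equal to its distance from the
    lower grid point (stochastic rounding), then saturating to `word_bits` width."""
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # fractional remainder in [0, 1)
    rounded = lower + (np.random.random_sample(scaled.shape) < prob_up)
    max_int = 2 ** (word_bits - 1) - 1             # saturate to the signed range
    min_int = -2 ** (word_bits - 1)
    return np.clip(rounded, min_int, max_int) / scale

# On average, the rounded values are unbiased estimates of the inputs.
x = np.array([0.3333, -1.25006, 2.0 ** -14])
print(stochastic_round_fixed_point(x))
```

The key property is that the expected value of the rounded number equals the original, which is what keeps the small gradient updates from being systematically lost at 16-bit precision.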

1,234 citations

Proceedings Article
06 Jul 2015
TL;DR: In this article, the effect of limited-precision data representation and computation on neural network training was studied; it was shown that deep networks can be trained using only a 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in classification accuracy.
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.

1,142 citations

Journal Article
TL;DR: In this article, the authors survey the current state of phase change memory (PCM), a nonvolatile solid-state memory technology built around the large electrical contrast between the highly resistive amorphous and highly conductive crystalline states in so-called phase change materials.
Abstract: The authors survey the current state of phase change memory (PCM), a nonvolatile solid-state memory technology built around the large electrical contrast between the highly resistive amorphous and highly conductive crystalline states in so-called phase change materials. PCM technology has made rapid progress in a short time, having passed older technologies in terms of both sophisticated demonstrations of scaling to small device dimensions, as well as integrated large-array demonstrators with impressive retention, endurance, performance, and yield characteristics. They introduce the physics behind PCM technology, assess how its characteristics match up with various potential applications across the memory-storage hierarchy, and discuss its strengths including scalability and rapid switching speed. Challenges for the technology are addressed, including the design of PCM cells for low reset current, the need to control device-to-device variability, and undesirable changes in the phase change material that c...

921 citations

Journal Article
Geoffrey W. Burr, B. N. Kurdi, J. C. Scott, Chung H. Lam, Kailash Gopalakrishnan, R. S. Shenoy
TL;DR: In this article, the authors review the candidate solid-state nonvolatile memory technologies that potentially could be used to construct a storage-class memory (SCM) and compare the potential for practical scaling to ultrahigh effective areal density for each of these candidate technologies.
Abstract: Storage-class memory (SCM) combines the benefits of a solid-state memory, such as high performance and robustness, with the archival capabilities and low cost of conventional hard-disk magnetic storage. Such a device would require a solid-state nonvolatile memory technology that could be manufactured at an extremely high effective areal density using some combination of sublithographic patterning techniques, multiple bits per cell, and multiple layers of devices. We review the candidate solid-state nonvolatile memory technologies that potentially could be used to construct such an SCM. We discuss evolutionary extensions of conventional flash memory, such as SONOS (silicon-oxide-nitride-oxide-silicon) and nanotraps, as well as a number of revolutionary new memory technologies. We review the capabilities of ferroelectric, magnetic, phase-change, and resistive random-access memories, including perovskites and solid electrolytes, and finally organic and polymeric memory. The potential for practical scaling to ultrahigh effective areal density for each of these candidate technologies is then compared.
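As a rough illustration of how the three levers mentioned in the abstract (sublithographic patterning, multiple bits per cell, multiple device layers) combine into effective areal density, here is a back-of-the-envelope calculation; the feature size, bits per cell, and layer count are hypothetical values chosen only for the example.

```python
# Back-of-the-envelope effective areal density for a hypothetical SCM cell.
# Assumes a 4F^2 footprint; F, bits_per_cell, and layers are illustrative values.
F_nm = 20              # lithographic half-pitch (hypothetical)
bits_per_cell = 2      # multi-level cell
layers = 4             # stacked device layers

cell_area_cm2 = 4 * (F_nm * 1e-7) ** 2                 # 4F^2 footprint in cm^2
density = bits_per_cell * layers / cell_area_cm2       # effective bits per cm^2
print(f"{density / 1e9:.0f} Gbit/cm^2")                # ~500 Gbit/cm^2
```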

659 citations

Posted Content
TL;DR: It is shown, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets.
Abstract: Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training - that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $\alpha$ that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.
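A minimal sketch of the clipping-plus-quantization step the abstract describes, treating the clipping parameter alpha as a plain float rather than the trainable scalar used in the paper; the helper below is illustrative and is not the authors' code.

```python
import numpy as np

def pact_quantize(x, alpha, num_bits=4):
    """Illustrative forward pass of PACT-style activation quantization: clip the
    activation to [0, alpha] (alpha is a trainable scalar in the paper), then
    quantize uniformly to 2**num_bits - 1 steps on that range."""
    levels = 2 ** num_bits - 1
    clipped = np.clip(x, 0.0, alpha)      # bounded ReLU in place of an unbounded one
    step = alpha / levels
    return np.round(clipped / step) * step

# 4-bit quantization of a few activations with a clipping level of 6.0.
acts = np.array([-0.5, 0.3, 2.7, 7.2])
print(pact_quantize(acts, alpha=6.0))     # -> approximately [0., 0.4, 2.8, 6.0]
```

Because alpha bounds the activation range, the quantization step size shrinks with it, which is why learning alpha during training recovers accuracy at very low bit widths.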

645 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Posted Content
TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
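The abstract mentions a length-normalization procedure and a coverage penalty in beam search; the sketch below follows the scoring form reported in the GNMT paper, with alpha and beta as tunable strengths. The function name and example values are illustrative.

```python
import math

def gnmt_beam_score(log_prob, target_len, attention_sums, alpha=0.6, beta=0.2):
    """Length-normalized, coverage-penalized beam-search score in the style of GNMT.

    log_prob        -- sum of per-token log-probabilities of the candidate
    target_len      -- number of tokens in the candidate
    attention_sums  -- attention mass each source position has received so far
    alpha, beta     -- length-normalization and coverage-penalty strengths
    """
    length_penalty = ((5.0 + target_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    coverage_penalty = beta * sum(math.log(min(s, 1.0)) for s in attention_sums)
    return log_prob / length_penalty + coverage_penalty

# A hypothesis that ignores a source word (low attention sum) is penalized.
print(gnmt_beam_score(-7.5, target_len=9, attention_sums=[0.9, 1.1, 0.4, 1.0]))
```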

5,737 citations

Posted Content
TL;DR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
Abstract: As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
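A minimal sketch of the triple loss the abstract describes (masked-LM cross-entropy on gold labels, distillation against the teacher's softened outputs, and a cosine term aligning hidden states); the weighting coefficients, temperature, and tensor shapes below are hypothetical, not the published settings.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distil_triple_loss(student_logits, teacher_logits, labels,
                       student_hidden, teacher_hidden,
                       T=2.0, w_mlm=1.0, w_ce=1.0, w_cos=1.0):
    """Masked-LM cross-entropy on gold labels, distillation cross-entropy against
    the teacher's temperature-softened distribution, and a cosine loss aligning
    student and teacher hidden states, combined with illustrative weights."""
    p_student = softmax(student_logits)
    p_teacher_soft = softmax(teacher_logits, T)
    p_student_soft = softmax(student_logits, T)

    mlm = -np.mean(np.log(p_student[np.arange(len(labels)), labels]))
    distill = -np.mean(np.sum(p_teacher_soft * np.log(p_student_soft), axis=-1))
    cosine = 1.0 - np.mean(
        np.sum(student_hidden * teacher_hidden, axis=-1)
        / (np.linalg.norm(student_hidden, axis=-1)
           * np.linalg.norm(teacher_hidden, axis=-1))
    )
    return w_mlm * mlm + w_ce * distill + w_cos * cosine

# Toy example: batch of 4 masked positions, vocabulary of 5, hidden size of 8.
rng = np.random.default_rng(0)
print(distil_triple_loss(rng.normal(size=(4, 5)), rng.normal(size=(4, 5)),
                         np.array([1, 0, 3, 2]),
                         rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```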

3,877 citations

Posted Content
TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
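The 92 TOPS peak figure in the abstract follows directly from the size of the MAC array; the short calculation below assumes the 700 MHz clock reported in the TPU paper and counts each multiply-accumulate as two operations.

```python
# Peak throughput of the TPU's 256 x 256 8-bit MAC array, assuming a 700 MHz clock.
macs = 65_536          # 256 * 256 multiply-accumulate units
ops_per_mac = 2        # one multiply plus one add per cycle
clock_hz = 700e6       # clock rate reported in the TPU paper

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")   # ~91.8, quoted as 92 TeraOps/second
```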

3,067 citations

Journal Article
TL;DR: The performance requirements for computing with memristive devices are examined, along with how the outstanding challenges could be met.
Abstract: Memristive devices are electrical resistance switches that can retain a state of internal resistance based on the history of applied voltage and current. These devices can store and process information, and offer several key performance characteristics that exceed conventional integrated circuit technology. An important class of memristive devices are two-terminal resistance switches based on ionic motion, which are built from a simple conductor/insulator/conductor thin-film stack. These devices were originally conceived in the late 1960s and recent progress has led to fast, low-energy, high-endurance devices that can be scaled down to less than 10 nm and stacked in three dimensions. However, the underlying device mechanisms remain unclear, which is a significant barrier to their widespread application. Here, we review recent progress in the development and understanding of memristive devices. We also examine the performance requirements for computing with memristive devices and detail how the outstanding challenges could be met.

3,037 citations