Author

Kailash Gopalakrishnan

Other affiliations: Stanford University
Bio: Kailash Gopalakrishnan is an academic researcher from IBM. The author has contributed to research in the topics of deep learning and artificial neural networks, has an h-index of 31, and has co-authored 108 publications receiving 7,754 citations. Previous affiliations of Kailash Gopalakrishnan include Stanford University.


Papers
Posted Content
TL;DR: The results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy.
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
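The rounding scheme the abstract highlights is easy to illustrate. Below is a minimal sketch of stochastic rounding onto a fixed-point grid, assuming a hypothetical 16-bit word with a chosen number of fractional bits; the function name and parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def stochastic_round_fixed_point(x, frac_bits=12, word_bits=16):
    """Quantize x onto a signed fixed-point grid with `frac_bits` fractional bits,
    rounding each value down or up with probability equal to its distance from the
    lower grid point (stochastic rounding), then saturating to `word_bits` width."""
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # fractional remainder in [0, 1)
    rounded = lower + (np.random.random_sample(scaled.shape) < prob_up)
    max_int = 2 ** (word_bits - 1) - 1             # saturate to the signed range
    min_int = -2 ** (word_bits - 1)
    return np.clip(rounded, min_int, max_int) / scale

# On average, the rounded values are unbiased estimates of the inputs.
x = np.array([0.3333, -1.25006, 2.0 ** -14])
print(stochastic_round_fixed_point(x))
```

The key property is that the expected value of the rounded number equals the original, which is what keeps the small gradient updates from being systematically lost at 16-bit precision.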

1,234 citations

Proceedings Article
06 Jul 2015
TL;DR: In this article, the effect of limited-precision data representation and computation on neural network training was studied; it was shown that deep networks can be trained using only a 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in classification accuracy.
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.

1,142 citations

Journal Article
TL;DR: In this article, the authors survey the current state of phase change memory (PCM), a nonvolatile solid-state memory technology built around the large electrical contrast between the highly resistive amorphous and highly conductive crystalline states in so-called phase change materials.
Abstract: The authors survey the current state of phase change memory (PCM), a nonvolatile solid-state memory technology built around the large electrical contrast between the highly resistive amorphous and highly conductive crystalline states in so-called phase change materials. PCM technology has made rapid progress in a short time, having passed older technologies in terms of both sophisticated demonstrations of scaling to small device dimensions, as well as integrated large-array demonstrators with impressive retention, endurance, performance, and yield characteristics. They introduce the physics behind PCM technology, assess how its characteristics match up with various potential applications across the memory-storage hierarchy, and discuss its strengths including scalability and rapid switching speed. Challenges for the technology are addressed, including the design of PCM cells for low reset current, the need to control device-to-device variability, and undesirable changes in the phase change material that c...

921 citations

Journal Article
Geoffrey W. Burr, B. N. Kurdi, J. C. Scott, Chung H. Lam, Kailash Gopalakrishnan, R. S. Shenoy
TL;DR: In this article, the authors review the candidate solid-state nonvolatile memory technologies that potentially could be used to construct a storage-class memory (SCM) and compare the potential for practical scaling to ultrahigh effective areal density for each of these candidate technologies.
Abstract: Storage-class memory (SCM) combines the benefits of a solid-state memory, such as high performance and robustness, with the archival capabilities and low cost of conventional hard-disk magnetic storage. Such a device would require a solid-state nonvolatile memory technology that could be manufactured at an extremely high effective areal density using some combination of sublithographic patterning techniques, multiple bits per cell, and multiple layers of devices. We review the candidate solid-state nonvolatile memory technologies that potentially could be used to construct such an SCM. We discuss evolutionary extensions of conventional flash memory, such as SONOS (silicon-oxide-nitride-oxide-silicon) and nanotraps, as well as a number of revolutionary new memory technologies. We review the capabilities of ferroelectric, magnetic, phase-change, and resistive random-access memories, including perovskites and solid electrolytes, and finally organic and polymeric memory. The potential for practical scaling to ultrahigh effective areal density for each of these candidate technologies is then compared.
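As a rough illustration of how the three levers mentioned in the abstract (sublithographic patterning, multiple bits per cell, multiple device layers) combine into effective areal density, here is a back-of-the-envelope calculation; the feature size, bits per cell, and layer count are hypothetical values chosen only for the example.

```python
# Back-of-the-envelope effective areal density for a hypothetical SCM cell.
# Assumes a 4F^2 footprint; F, bits_per_cell, and layers are illustrative values.
F_nm = 20              # lithographic half-pitch (hypothetical)
bits_per_cell = 2      # multi-level cell
layers = 4             # stacked device layers

cell_area_cm2 = 4 * (F_nm * 1e-7) ** 2                 # 4F^2 footprint in cm^2
density = bits_per_cell * layers / cell_area_cm2       # effective bits per cm^2
print(f"{density / 1e9:.0f} Gbit/cm^2")                # ~500 Gbit/cm^2
```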

659 citations

Posted Content
TL;DR: It is shown, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets.
Abstract: Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training - that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $\alpha$ that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.
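A minimal sketch of the clipping-plus-quantization step the abstract describes, treating the clipping parameter alpha as a plain float rather than the trainable scalar used in the paper; the helper below is illustrative and is not the authors' code.

```python
import numpy as np

def pact_quantize(x, alpha, num_bits=4):
    """Illustrative forward pass of PACT-style activation quantization: clip the
    activation to [0, alpha] (alpha is a trainable scalar in the paper), then
    quantize uniformly to 2**num_bits - 1 steps on that range."""
    levels = 2 ** num_bits - 1
    clipped = np.clip(x, 0.0, alpha)      # bounded ReLU in place of an unbounded one
    step = alpha / levels
    return np.round(clipped / step) * step

# 4-bit quantization of a few activations with a clipping level of 6.0.
acts = np.array([-0.5, 0.3, 2.7, 7.2])
print(pact_quantize(acts, alpha=6.0))     # -> approximately [0., 0.4, 2.8, 6.0]
```

Because alpha bounds the activation range, the quantization step size shrinks with it, which is why learning alpha during training recovers accuracy at very low bit widths.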

645 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Posted Content
TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
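The abstract mentions a length-normalization procedure and a coverage penalty in beam search; the sketch below follows the scoring form reported in the GNMT paper, with alpha and beta as tunable strengths. The function name and example values are illustrative.

```python
import math

def gnmt_beam_score(log_prob, target_len, attention_sums, alpha=0.6, beta=0.2):
    """Length-normalized, coverage-penalized beam-search score in the style of GNMT.

    log_prob        -- sum of per-token log-probabilities of the candidate
    target_len      -- number of tokens in the candidate
    attention_sums  -- attention mass each source position has received so far
    alpha, beta     -- length-normalization and coverage-penalty strengths
    """
    length_penalty = ((5.0 + target_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    coverage_penalty = beta * sum(math.log(min(s, 1.0)) for s in attention_sums)
    return log_prob / length_penalty + coverage_penalty

# A hypothesis that ignores a source word (low attention sum) is penalized.
print(gnmt_beam_score(-7.5, target_len=9, attention_sums=[0.9, 1.1, 0.4, 1.0]))
```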

5,737 citations

Posted Content
TL;DR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
Abstract: As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
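A minimal sketch of the triple loss the abstract describes (masked-LM cross-entropy on gold labels, distillation against the teacher's softened outputs, and a cosine term aligning hidden states); the weighting coefficients, temperature, and tensor shapes below are hypothetical, not the published settings.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distil_triple_loss(student_logits, teacher_logits, labels,
                       student_hidden, teacher_hidden,
                       T=2.0, w_mlm=1.0, w_ce=1.0, w_cos=1.0):
    """Masked-LM cross-entropy on gold labels, distillation cross-entropy against
    the teacher's temperature-softened distribution, and a cosine loss aligning
    student and teacher hidden states, combined with illustrative weights."""
    p_student = softmax(student_logits)
    p_teacher_soft = softmax(teacher_logits, T)
    p_student_soft = softmax(student_logits, T)

    mlm = -np.mean(np.log(p_student[np.arange(len(labels)), labels]))
    distill = -np.mean(np.sum(p_teacher_soft * np.log(p_student_soft), axis=-1))
    cosine = 1.0 - np.mean(
        np.sum(student_hidden * teacher_hidden, axis=-1)
        / (np.linalg.norm(student_hidden, axis=-1)
           * np.linalg.norm(teacher_hidden, axis=-1))
    )
    return w_mlm * mlm + w_ce * distill + w_cos * cosine

# Toy example: batch of 4 masked positions, vocabulary of 5, hidden size of 8.
rng = np.random.default_rng(0)
print(distil_triple_loss(rng.normal(size=(4, 5)), rng.normal(size=(4, 5)),
                         np.array([1, 0, 3, 2]),
                         rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```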

3,877 citations

Posted Content
TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
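The 92 TOPS peak figure in the abstract follows directly from the size of the MAC array; the short calculation below assumes the 700 MHz clock reported in the TPU paper and counts each multiply-accumulate as two operations.

```python
# Peak throughput of the TPU's 256 x 256 8-bit MAC array, assuming a 700 MHz clock.
macs = 65_536          # 256 * 256 multiply-accumulate units
ops_per_mac = 2        # one multiply plus one add per cycle
clock_hz = 700e6       # clock rate reported in the TPU paper

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")   # ~91.8, quoted as 92 TeraOps/second
```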

3,067 citations

Journal Article
TL;DR: The performance requirements for computing with memristive devices are examined, along with how the outstanding challenges could be met.
Abstract: Memristive devices are electrical resistance switches that can retain a state of internal resistance based on the history of applied voltage and current. These devices can store and process information, and offer several key performance characteristics that exceed conventional integrated circuit technology. An important class of memristive devices are two-terminal resistance switches based on ionic motion, which are built from a simple conductor/insulator/conductor thin-film stack. These devices were originally conceived in the late 1960s and recent progress has led to fast, low-energy, high-endurance devices that can be scaled down to less than 10 nm and stacked in three dimensions. However, the underlying device mechanisms remain unclear, which is a significant barrier to their widespread application. Here, we review recent progress in the development and understanding of memristive devices. We also examine the performance requirements for computing with memristive devices and detail how the outstanding challenges could be met.

3,037 citations