Journal ArticleDOI

Charge-Trap Transistors for CMOS-Only Analog Memory

15 Aug 2019 - IEEE Transactions on Electron Devices (IEEE) - Vol. 66, Iss. 10, pp. 4183-4187
TL;DR: A comprehensive investigation of the programming behavior of CTTs, covering analog retention, intra- and inter-device variation, and fine-tuning, both for individual devices and for devices in an integrated array, reveals the promise of the CTT as a CMOS-only analog memory device.
Abstract: Since our demonstration of unsupervised learning using the CMOS-only charge-trap transistors (CTTs) as analog synapses, there has been an increasing interest in exploiting the device for various other neural network (NN) applications. However, most of these studies are limited to mere simulation due to the absence of detailed experimental device characterization. In this article, we provide a comprehensive investigation of the programming behavior of CTTs, including analog retention, intra- and inter-device variation, and fine-tuning of the device, both for individual devices and for devices in an integrated array. It is found that, after programming, the channel current gradually increases to a higher level, and the shift is larger when the device is programmed to a higher threshold voltage. With this postprogramming current increase appropriately accounted for, individual devices can be programmed to an equivalent precision of five bits, and three bits can be achieved for devices in an array. Our results reveal the promising future of using the CTT as a CMOS-only analog memory device.
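
The fine-tuning the abstract refers to is, in essence, a closed-loop program-and-verify procedure: program slightly past the target current so that the post-programming current increase settles the device onto it. Below is a minimal sketch of that idea against a toy device model; the device physics, pulse step, relaxation fraction, and drift margin are illustrative assumptions, not the paper's measured values.

    import random

    class ToyCTT:
        """Toy charge-trap transistor. Each programming pulse raises Vt a little
        (with some random intra-device variation); after programming, part of the
        shift relaxes, so the channel current creeps back up, mimicking the
        post-programming increase the paper reports."""
        def __init__(self):
            self.vt = 0.30      # threshold voltage (V), arbitrary starting point
            self.pending = 0.0  # Vt shift that will relax away after programming
        def program_pulse(self, dv=0.01):
            step = dv * (1.0 + 0.2 * random.random())  # intra-device variation
            self.vt += step
            self.pending += 0.1 * step   # larger programmed shift -> larger relaxation
        def read_current(self, vg=1.0, k=1e-4):
            return k * max(vg - self.vt, 0.0)          # toy linear I-V above Vt
        def settle(self):
            self.vt -= self.pending                    # relaxation -> current rises
            self.pending = 0.0

    def program_and_verify(dev, target_a, drift_margin=0.05):
        """Program past the target by a margin sized to the expected
        post-programming current increase, then let the device settle."""
        while dev.read_current() > target_a * (1.0 - drift_margin):
            dev.program_pulse()
        dev.settle()
        return dev.read_current()

    random.seed(0)
    dev = ToyCTT()
    final = program_and_verify(dev, target_a=5e-5)
    print(f"settled current: {final:.3e} A (target 5e-05 A)")
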
Citations
Journal ArticleDOI
TL;DR: A drain-erase scheme is proposed to enable program/erase/inhibition of individual cells, which is necessary for the individual weight updates of in situ training, and the VMM operation is simulated in a 3-D NAND-like FeFET array.
Abstract: Doped-HfO2-based ferroelectric field-effect transistors (FeFETs) are being actively explored as emerging nonvolatile memory (NVM) devices with the potential for in-memory computing. In this two-part article, we explore the feasibility of a FeFET-based 3-D NAND architecture for both in situ training and inference. To address the challenge of erase-by-block in a NAND-like structure, we propose and experimentally demonstrate a drain-erase scheme that enables program/erase/inhibition of individual cells, which is necessary for the individual weight updates of in situ training. Device characterization under different drain-erase conditions was presented in Part I; this Part II addresses the array-level design of the drain-erase scheme for both AND-type and NAND-type arrays. A 3-D vertical-channel FeFET array architecture is proposed to accelerate vector-matrix multiplication (VMM). The 3-D timing sequence of the weight-update rule is designed and verified through 3-D array-level SPICE simulation. Finally, the VMM operation is simulated in a 3-D NAND-like FeFET array.
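
The VMM itself happens in the analog domain: each cell stores a weight as a conductance, input voltages drive the rows, and each column current sums the products by Kirchhoff's current law, I[j] = sum_i G[i][j] * V[i]. A minimal numpy sketch of the ideal operation follows, ignoring the string and pass-gate details of the 3-D NAND organization; the full-scale conductance and the differential two-cell encoding of signed weights are our illustrative assumptions.

    import numpy as np

    def vmm(conductance, v_in):
        """Ideal analog vector-matrix multiply: column current
        I[j] = sum_i G[i, j] * V[i], i.e. Kirchhoff current summation."""
        return v_in @ conductance

    rng = np.random.default_rng(0)
    weights = rng.standard_normal((4, 3))           # 4 inputs, 3 outputs
    g_max = 1e-6                                    # assumed full-scale conductance (S)
    g_plus = np.clip(weights, 0.0, None) * g_max    # positive part of each weight
    g_minus = np.clip(-weights, 0.0, None) * g_max  # negative part of each weight
    v = rng.random(4) * 0.1                         # small read voltages (V)

    i_out = vmm(g_plus, v) - vmm(g_minus, v)        # differential column currents
    print(i_out / g_max)                            # matches the ideal product below
    print(v @ weights)
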

31 citations


Cites methods from "Charge-Trap Transistors for CMOS-On..."

  • ...Alternatively, there are approaches using charge-trap-transistor [8], 2-D NOR Flash [9], 2-D NAND Flash [10], or 3-D AND Flash [11] to implement DNNs leveraging their high density....


Journal ArticleDOI
TL;DR: The read disturb-induced conductance drift characteristic is statistically measured on a test vehicle based on 2-bit HfO2 RRAM array and a bipolar read scheme is proposed and tested to enhance the resilience against the read disturb.
Abstract: The multilevel resistive random access memory (RRAM)-based synaptic array can enable parallel computations of vector–matrix multiplication for machine learning inference acceleration; however, any conductance drift of the cell may induce an inference accuracy drop because the analog current is summed up along the column. In this article, the read disturb-induced conductance drift characteristic is statistically measured on a test vehicle based on 2-bit HfO2 RRAM array. The drift behavior of four states is empirically modeled by a vertical and lateral filament growth mechanism. Furthermore, a bipolar read scheme is proposed and tested to enhance the resilience against the read disturb. The modeled read disturb and proposed compensation scheme are incorporated into a VGG-like convolutional neural network for CIFAR-10 data set inference.
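
The intuition behind the bipolar read scheme fits in a few lines: if a positive read bias slowly grows the filament and a negative read bias slowly dissolves it, alternating the polarity cancels the drift to first order. The sketch below uses made-up drift rates, not the paper's fitted filament-growth model.

    def read(g, polarity, k_set=1e-4, k_reset=1e-4):
        """One read cycle: positive bias slightly grows the filament (conductance
        creeps up), negative bias slightly dissolves it. Rates are made up."""
        return g * (1.0 + k_set) if polarity > 0 else g * (1.0 - k_reset)

    def unipolar_reads(g, n):
        for _ in range(n):
            g = read(g, +1)          # conventional scheme: same polarity every read
        return g

    def bipolar_reads(g, n):
        for i in range(n):
            g = read(g, +1 if i % 2 == 0 else -1)  # alternate polarity each read
        return g

    g0 = 50e-6  # initial conductance (S), one of the four 2-bit levels
    print(f"after 10000 unipolar reads: {unipolar_reads(g0, 10000):.2e} S")
    print(f"after 10000 bipolar reads:  {bipolar_reads(g0, 10000):.2e} S")
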

27 citations


Cites background from "Charge-Trap Transistors for CMOS-On..."

  • ...(PCRAM) [7], [8], flash memory [9]–[12], as a synaptic device...


Journal ArticleDOI
TL;DR: A 2T-1FeFET synaptic cell design is presented that improves in situ training accuracy to approach the software baseline, and a FeFET drain-erase scheme for array-level operations is introduced to make in situ training feasible for FeFET-based hardware accelerators.
Abstract: The recent discovery of ferroelectricity in doped HfO2 has reignited research interest in the ferroelectric field-effect transistor (FeFET) as an emerging embedded nonvolatile memory with potential for neuro-inspired computing. This paper reviews two major aspects of its application in neuro-inspired computing: ferroelectric devices as multilevel synaptic devices, and circuit-primitive design with FeFETs for in-memory computing. First, the authors survey representative FeFET-based synaptic devices. They then introduce a 2T-1FeFET synaptic cell design that improves in situ training accuracy to approach the software baseline, followed by the FeFET drain-erase scheme for array-level operations, which makes in situ training feasible for FeFET-based hardware accelerators. Finally, the authors give an outlook on future 3-D-integrated 2T-1FeFET designs.

20 citations

Journal ArticleDOI
TL;DR: Efficient non-volatile memory devices are demonstrated using a hybrid organic-inorganic perovskite (CH3NH3PbI3) resistive switching layer on a glass/indium tin oxide (ITO) substrate; because the fabrication process is identical to that of perovskite solar cells, the device could be integrated into a photovoltaic array as a power-on-chip device, with generation and computation possible on the same substrate for memory and neuromorphic applications.
Abstract: Recent research shows that perovskite-based solar cells are among the most efficient, combining high power-conversion efficiency with low fabrication cost. Various perovskite materials display hysteresis in their current-voltage characteristics, which can serve as memory behaviour. In this paper, we demonstrate efficient non-volatile memory devices based on a hybrid organic-inorganic perovskite (CH3NH3PbI3) resistive switching layer on a glass/indium tin oxide (ITO) substrate. Our perovskite devices are built on a fully solution-processed electron transport layer (ETL) combining SnO2 and mesoporous (m)-TiO2 scaffold layers. Hysteresis was observed in the current-voltage analysis, achieving a high ON/OFF current ratio under dark, ambient conditions. The proposed glass/ITO/SnO2/m-TiO2/CH3NH3PbI3/Au device has a hole-transport-layer (HTL)-free structure, which is mainly responsible for the large ON/OFF current ratio. Voids in the scaffold m-TiO2 layer also lengthen the electron/hole path, which raises the recombination rate at the ETL/perovskite interface and results in large hysteresis in the I-V curve. The memristor operates at low energy owing to the SnO2 layer's high electron mobility and wide energy band gap. Our experimental results also show that the hysteresis depends on the voltage scan range and scan rate in dark conditions. The hysteresis loop of the proposed device drifts with the number of cycles, which would have a significant impact on neuromorphic applications. Moreover, because the fabrication process of the proposed perovskite-based memristor is identical to that of perovskite solar cells, the device could be integrated into a photovoltaic array as a power-on-chip device, where generation and computation are possible on the same substrate for memory and neuromorphic applications.

18 citations

Journal ArticleDOI
TL;DR: In this article, a physics-based phase-field multidomain switching model is used to understand the origin of ferroelectric partial switching, and a possible mitigation strategy is proposed.
Abstract: Doped-HfO2-based ferroelectric field-effect transistors (FeFETs) are being actively explored as emerging nonvolatile memory devices with the potential for in-memory computing. In this work, we identify a new challenge of ferroelectric partial switching, namely a "history effect" in minor-loop dynamics. We experimentally demonstrated the minor-loop dynamics in both a ferroelectric capacitor (FeCap) and a 28-nm FeFET in Part I. In this article, a physics-based phase-field multidomain switching model is used to understand the origin: even though a device may show the same externally observable polarization state, its internal domain configuration varies depending on its history. We incorporate this history effect into FeFET-based neural network simulation, analyze its negative impact on training accuracy, and then propose a possible mitigation strategy.
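
The history effect is easy to reproduce in a toy multidomain picture (far simpler than the paper's phase-field model): two pulse histories can end at the same externally observable polarization while leaving different sets of domains switched, so an identical subsequent pulse moves the two devices differently. The coercive-voltage spread below is an illustrative assumption.

    import random

    class ToyMultidomainFE:
        """N independent domains with a spread of coercive voltages. The paper's
        phase-field model is far richer; this only illustrates the bookkeeping."""
        def __init__(self, n=1000, seed=0):
            rng = random.Random(seed)
            self.vc = [rng.uniform(0.5, 1.5) for _ in range(n)]  # coercive voltages
            self.up = [False] * n                                # domain states
        def pulse(self, v):
            for i, vc in enumerate(self.vc):   # a pulse switches every domain whose
                if abs(v) >= vc:               # coercive voltage it exceeds
                    self.up[i] = v > 0
        def polarization(self):
            return (2 * sum(self.up) - len(self.up)) / len(self.up)

    # Two histories ending at (nearly) the same externally observable polarization:
    a = ToyMultidomainFE(); a.pulse(+1.5); a.pulse(-1.0)  # full set, partial reset
    b = ToyMultidomainFE(); b.pulse(-1.5); b.pulse(+1.0)  # full reset, partial set
    print(a.polarization(), b.polarization())             # both near 0

    # ...yet an identical next pulse moves them differently, because which domains
    # remain switchable depends on the hidden internal configuration:
    a.pulse(-0.7); b.pulse(-0.7)
    print(a.polarization(), b.polarization())             # now they diverge
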

17 citations


Cites methods from "Charge-Trap Transistors for CMOS-On..."

  • ...Alternatively, there are approaches using charge-trap transistor [10], 2-D NOR Flash [11], 2-D NAND Flash [12], or even 3-D NAND/AND Flash [13], [14] to implement DNNs leveraging their mature fabrication technology and high density....


References
Journal ArticleDOI
28 May 2015 - Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
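
As a concrete illustration of the backpropagation loop this abstract describes, the minimal numpy example below trains a one-hidden-layer network on XOR; each layer's gradient is its local derivative chained with the error signal propagated from the layer above. The architecture and hyperparameters are arbitrary choices, not from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets
    W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
    W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)             # forward pass, hidden layer
        out = sigmoid(h @ W2 + b2)           # forward pass, output layer
        d_out = (out - y) * out * (1 - out)  # error signal at the output...
        d_h = d_out @ W2.T * h * (1 - h)     # ...chained back through layer 2
        W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)  # gradient steps
        W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)
    print(out.round(2).ravel())              # approaches [0, 1, 1, 0]
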

46,982 citations

Journal ArticleDOI
08 Aug 2014 - Science
TL;DR: Inspired by the brain’s structure, an efficient, scalable, and flexible non–von Neumann architecture is developed that leverages contemporary silicon technology and is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification.
Abstract: Inspired by the brain’s structure, we have developed an efficient, scalable, and flexible non–von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.
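
The headline numbers are internally consistent: 4096 cores times 256 neurons per core gives the 1 million programmable neurons, and each core's 256-by-256 synaptic crossbar gives the 256 million configurable synapses. A quick check:

    cores, neurons_per_core = 4096, 256
    print(cores * neurons_per_core)                     # 1,048,576 neurons (~1 million)
    print(cores * neurons_per_core * neurons_per_core)  # 268,435,456 synapses (256 * 2**20)
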

3,253 citations


"Charge-Trap Transistors for CMOS-On..." refers background in this paper

  • ...synapses) in physical proximity to the processor, thereby making the computation local [10]–[13]....


Posted Content
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
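
The peak-throughput figure is easy to verify from the MAC count and the TPU's 700 MHz clock reported in the paper, with each MAC contributing a multiply and an add per cycle:

    macs, ops_per_mac_cycle, clock_hz = 65_536, 2, 700e6  # multiply + add per cycle
    peak = macs * ops_per_mac_cycle * clock_hz
    print(f"{peak / 1e12:.1f} TeraOps/s")                 # 91.8, quoted as 92 TOPS
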

3,067 citations

Proceedings ArticleDOI
24 Jun 2017
TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

2,679 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: In this paper, the authors propose an energy-efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. The previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9×, 19×, and 3× better throughput, energy efficiency, and area efficiency.
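
The core computation EIE accelerates looks roughly like the following: weights live in a compressed-sparse-column (CSC) structure whose entries are short codebook indices (weight sharing), and columns whose input activation is zero are skipped entirely. The CSC-plus-codebook layout follows the paper's description; the variable names and the tiny example matrix are ours.

    import numpy as np

    def eie_spmv(codebook, w_idx, row_idx, col_ptr, x, n_rows):
        """y = W @ x with W stored column-compressed; w_idx holds 4-bit
        codebook indices instead of real weight values (weight sharing)."""
        y = np.zeros(n_rows)
        for j, a in enumerate(x):
            if a == 0.0:                       # skip zero activations (ReLU sparsity)
                continue
            for k in range(col_ptr[j], col_ptr[j + 1]):
                y[row_idx[k]] += codebook[w_idx[k]] * a
        return y

    # Tiny example: a 3x4 matrix with four nonzeros and a 16-entry codebook.
    codebook = np.linspace(-1.0, 1.0, 16)  # the 16 shared weight values
    col_ptr = np.array([0, 1, 2, 2, 4])    # where each column starts in the arrays
    row_idx = np.array([0, 2, 1, 2])       # row of each stored nonzero
    w_idx = np.array([3, 9, 0, 15])        # 4-bit codebook index of each nonzero
    x = np.array([1.0, 0.0, 5.0, 2.0])     # x[1] = 0, so column 1 is skipped
    print(eie_spmv(codebook, w_idx, row_idx, col_ptr, x, n_rows=3))
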

2,445 citations