Journal ArticleDOI

Charge-Trap Transistors for CMOS-Only Analog Memory

15 Aug 2019 - IEEE Transactions on Electron Devices (IEEE) - Vol. 66, Iss. 10, pp. 4183-4187
TL;DR: A comprehensive investigation of the programming behavior of CTTs, covering analog retention, intra- and inter-device variation, and fine-tuning, both for individual devices and for devices in an integrated array, reveals the promise of the CTT as a CMOS-only analog memory device.
Abstract: Since our demonstration of unsupervised learning using the CMOS-only charge-trap transistors (CTTs) as analog synapses, there has been an increasing interest in exploiting the device for various other neural network (NN) applications. However, most of these studies are limited to mere simulation due to the absence of detailed experimental device characterization. In this article, we provide a comprehensive investigation of the programming behavior of CTTs, including analog retention, intra- and inter-device variation, and fine-tuning of the device, both for individual devices and for devices in an integrated array. It is found that, after programming, the channel current gradually increases to a higher level, and the shift is larger when the device is programmed to a higher threshold voltage. With this postprogramming current increase appropriately accounted for, individual devices can be programmed to an equivalent precision of five bits, and three bits can be achieved for devices in an array. Our results reveal the promising future of using the CTT as a CMOS-only analog memory device.
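
The fine-tuning the abstract refers to is, in essence, a closed-loop program-and-verify procedure: program slightly past the target current so that the post-programming current increase settles the device onto it. Below is a minimal sketch of that idea against a toy device model; the device physics, pulse step, relaxation fraction, and drift margin are illustrative assumptions, not the paper's measured values.

    import random

    class ToyCTT:
        """Toy charge-trap transistor. Each programming pulse raises Vt a little
        (with some random intra-device variation); after programming, part of the
        shift relaxes, so the channel current creeps back up, mimicking the
        post-programming increase the paper reports."""
        def __init__(self):
            self.vt = 0.30      # threshold voltage (V), arbitrary starting point
            self.pending = 0.0  # Vt shift that will relax away after programming
        def program_pulse(self, dv=0.01):
            step = dv * (1.0 + 0.2 * random.random())  # intra-device variation
            self.vt += step
            self.pending += 0.1 * step   # larger programmed shift -> larger relaxation
        def read_current(self, vg=1.0, k=1e-4):
            return k * max(vg - self.vt, 0.0)          # toy linear I-V above Vt
        def settle(self):
            self.vt -= self.pending                    # relaxation -> current rises
            self.pending = 0.0

    def program_and_verify(dev, target_a, drift_margin=0.05):
        """Program past the target by a margin sized to the expected
        post-programming current increase, then let the device settle."""
        while dev.read_current() > target_a * (1.0 - drift_margin):
            dev.program_pulse()
        dev.settle()
        return dev.read_current()

    random.seed(0)
    dev = ToyCTT()
    final = program_and_verify(dev, target_a=5e-5)
    print(f"settled current: {final:.3e} A (target 5e-05 A)")
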
Citations
Journal ArticleDOI
TL;DR: A drain-erase scheme is proposed to enable program/erase/inhibition of individual cells, which is necessary for the individual weight updates of in situ training, and the VMM operation is simulated in a 3-D NAND-like FeFET array.
Abstract: Doped-HfO2-based ferroelectric field-effect transistors (FeFETs) are being actively explored as emerging nonvolatile memory (NVM) devices with the potential for in-memory computing. In this two-part article, we explore the feasibility of a FeFET-based 3-D NAND architecture for both in situ training and inference. To address the challenge of erase-by-block in a NAND-like structure, we propose and experimentally demonstrate a drain-erase scheme that enables program/erase/inhibition of individual cells, which is necessary for the individual weight updates of in situ training. Device characterization under different drain-erase conditions was presented in Part I; this Part II addresses the array-level design of the drain-erase scheme for both AND-type and NAND-type arrays. A 3-D vertical-channel FeFET array architecture is proposed to accelerate vector-matrix multiplication (VMM). The 3-D timing sequence of the weight-update rule is designed and verified through 3-D array-level SPICE simulation. Finally, the VMM operation is simulated in a 3-D NAND-like FeFET array.
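
The VMM itself happens in the analog domain: each cell stores a weight as a conductance, input voltages drive the rows, and each column current sums the products by Kirchhoff's current law, I[j] = sum_i G[i][j] * V[i]. A minimal numpy sketch of the ideal operation follows, ignoring the string and pass-gate details of the 3-D NAND organization; the full-scale conductance and the differential two-cell encoding of signed weights are our illustrative assumptions.

    import numpy as np

    def vmm(conductance, v_in):
        """Ideal analog vector-matrix multiply: column current
        I[j] = sum_i G[i, j] * V[i], i.e. Kirchhoff current summation."""
        return v_in @ conductance

    rng = np.random.default_rng(0)
    weights = rng.standard_normal((4, 3))           # 4 inputs, 3 outputs
    g_max = 1e-6                                    # assumed full-scale conductance (S)
    g_plus = np.clip(weights, 0.0, None) * g_max    # positive part of each weight
    g_minus = np.clip(-weights, 0.0, None) * g_max  # negative part of each weight
    v = rng.random(4) * 0.1                         # small read voltages (V)

    i_out = vmm(g_plus, v) - vmm(g_minus, v)        # differential column currents
    print(i_out / g_max)                            # matches the ideal product below
    print(v @ weights)
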

31 citations


Cites methods from "Charge-Trap Transistors for CMOS-On..."

  • ...Alternatively, there are approaches using charge-trap-transistor [8], 2-D NOR Flash [9], 2-D NAND Flash [10], or 3-D AND Flash [11] to implement DNNs leveraging their high density....


Journal ArticleDOI
TL;DR: The read disturb-induced conductance drift characteristic is statistically measured on a test vehicle based on 2-bit HfO2 RRAM array and a bipolar read scheme is proposed and tested to enhance the resilience against the read disturb.
Abstract: The multilevel resistive random access memory (RRAM)-based synaptic array can enable parallel computations of vector–matrix multiplication for machine learning inference acceleration; however, any conductance drift of the cell may induce an inference accuracy drop because the analog current is summed up along the column. In this article, the read disturb-induced conductance drift characteristic is statistically measured on a test vehicle based on 2-bit HfO2 RRAM array. The drift behavior of four states is empirically modeled by a vertical and lateral filament growth mechanism. Furthermore, a bipolar read scheme is proposed and tested to enhance the resilience against the read disturb. The modeled read disturb and proposed compensation scheme are incorporated into a VGG-like convolutional neural network for CIFAR-10 data set inference.
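
The intuition behind the bipolar read scheme fits in a few lines: if a positive read bias slowly grows the filament and a negative read bias slowly dissolves it, alternating the polarity cancels the drift to first order. The sketch below uses made-up drift rates, not the paper's fitted filament-growth model.

    def read(g, polarity, k_set=1e-4, k_reset=1e-4):
        """One read cycle: positive bias slightly grows the filament (conductance
        creeps up), negative bias slightly dissolves it. Rates are made up."""
        return g * (1.0 + k_set) if polarity > 0 else g * (1.0 - k_reset)

    def unipolar_reads(g, n):
        for _ in range(n):
            g = read(g, +1)          # conventional scheme: same polarity every read
        return g

    def bipolar_reads(g, n):
        for i in range(n):
            g = read(g, +1 if i % 2 == 0 else -1)  # alternate polarity each read
        return g

    g0 = 50e-6  # initial conductance (S), one of the four 2-bit levels
    print(f"after 10000 unipolar reads: {unipolar_reads(g0, 10000):.2e} S")
    print(f"after 10000 bipolar reads:  {bipolar_reads(g0, 10000):.2e} S")
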

27 citations


Cites background from "Charge-Trap Transistors for CMOS-On..."

  • ...(PCRAM) [7], [8], flash memory [9]–[12], as a synaptic device...


Journal ArticleDOI
TL;DR: A 2T-1FeFET synaptic cell design is presented that improves in situ training accuracy to approach the software baseline, and a FeFET drain-erase scheme for array-level operations is introduced to make in situ training feasible for FeFET-based hardware accelerators.
Abstract: The recent discovery of ferroelectricity in doped HfO2 has reignited research interest in the ferroelectric field-effect transistor (FeFET) as an emerging embedded nonvolatile memory with potential for neuro-inspired computing. This paper reviews two major aspects of its application in neuro-inspired computing: ferroelectric devices as multilevel synaptic devices, and circuit-primitive design with FeFETs for in-memory computing. First, the authors survey representative FeFET-based synaptic devices. They then introduce a 2T-1FeFET synaptic cell design that improves in situ training accuracy to approach the software baseline, followed by the FeFET drain-erase scheme for array-level operations, which makes in situ training feasible for FeFET-based hardware accelerators. Finally, the authors give an outlook on future 3-D-integrated 2T-1FeFET designs.

20 citations

Journal ArticleDOI
TL;DR: Efficient non-volatile memory devices are demonstrated using a hybrid organic-inorganic perovskite (CH3NH3PbI3) resistive switching layer on a glass/indium tin oxide (ITO) substrate; because the fabrication process is identical to that of perovskite solar cells, the device could be integrated into a photovoltaic array as a power-on-chip device, with generation and computation possible on the same substrate for memory and neuromorphic applications.
Abstract: Recent research shows that perovskite-based solar cells are among the most efficient, combining high power-conversion efficiency with low fabrication cost. Various perovskite materials display hysteresis in their current-voltage characteristics, which can serve as memory behaviour. In this paper, we demonstrate efficient non-volatile memory devices based on a hybrid organic-inorganic perovskite (CH3NH3PbI3) resistive switching layer on a glass/indium tin oxide (ITO) substrate. Our perovskite devices are built on a fully solution-processed electron transport layer (ETL) combining SnO2 and mesoporous (m)-TiO2 scaffold layers. Hysteresis was observed in the current-voltage analysis, achieving a high ON/OFF current ratio under dark, ambient conditions. The proposed glass/ITO/SnO2/m-TiO2/CH3NH3PbI3/Au device has a hole-transport-layer (HTL)-free structure, which is mainly responsible for the large ON/OFF current ratio. Voids in the scaffold m-TiO2 layer also lengthen the electron/hole path, which raises the recombination rate at the ETL/perovskite interface and results in large hysteresis in the I-V curve. The memristor operates at low energy owing to the SnO2 layer's high electron mobility and wide energy band gap. Our experimental results also show that the hysteresis depends on the voltage scan range and scan rate in dark conditions. The hysteresis loop of the proposed device drifts with the number of cycles, which would have a significant impact on neuromorphic applications. Moreover, because the fabrication process of the proposed perovskite-based memristor is identical to that of perovskite solar cells, the device could be integrated into a photovoltaic array as a power-on-chip device, where generation and computation are possible on the same substrate for memory and neuromorphic applications.

18 citations

Journal ArticleDOI
TL;DR: In this article, a physics-based phase-field multidomain switching model is used to understand the origin of ferroelectric partial switching, and a possible mitigation strategy is proposed.
Abstract: Doped-HfO2-based ferroelectric field-effect transistors (FeFETs) are being actively explored as emerging nonvolatile memory devices with the potential for in-memory computing. In this work, we identify a new challenge of ferroelectric partial switching, namely a "history effect" in minor-loop dynamics. We experimentally demonstrated the minor-loop dynamics in both a ferroelectric capacitor (FeCap) and a 28-nm FeFET in Part I. In this article, a physics-based phase-field multidomain switching model is used to understand the origin: even though a device may show the same externally observable polarization state, its internal domain configuration varies depending on its history. We incorporate this history effect into FeFET-based neural network simulation, analyze its negative impact on training accuracy, and then propose a possible mitigation strategy.
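
The history effect is easy to reproduce in a toy multidomain picture (far simpler than the paper's phase-field model): two pulse histories can end at the same externally observable polarization while leaving different sets of domains switched, so an identical subsequent pulse moves the two devices differently. The coercive-voltage spread below is an illustrative assumption.

    import random

    class ToyMultidomainFE:
        """N independent domains with a spread of coercive voltages. The paper's
        phase-field model is far richer; this only illustrates the bookkeeping."""
        def __init__(self, n=1000, seed=0):
            rng = random.Random(seed)
            self.vc = [rng.uniform(0.5, 1.5) for _ in range(n)]  # coercive voltages
            self.up = [False] * n                                # domain states
        def pulse(self, v):
            for i, vc in enumerate(self.vc):   # a pulse switches every domain whose
                if abs(v) >= vc:               # coercive voltage it exceeds
                    self.up[i] = v > 0
        def polarization(self):
            return (2 * sum(self.up) - len(self.up)) / len(self.up)

    # Two histories ending at (nearly) the same externally observable polarization:
    a = ToyMultidomainFE(); a.pulse(+1.5); a.pulse(-1.0)  # full set, partial reset
    b = ToyMultidomainFE(); b.pulse(-1.5); b.pulse(+1.0)  # full reset, partial set
    print(a.polarization(), b.polarization())             # both near 0

    # ...yet an identical next pulse moves them differently, because which domains
    # remain switchable depends on the hidden internal configuration:
    a.pulse(-0.7); b.pulse(-0.7)
    print(a.polarization(), b.polarization())             # now they diverge
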

17 citations


Cites methods from "Charge-Trap Transistors for CMOS-On..."

  • ...Alternatively, there are approaches using charge-trap transistor [10], 2-D NOR Flash [11], 2-D NAND Flash [12], or even 3-D NAND/AND Flash [13], [14] to implement DNNs leveraging their mature fabrication technology and high density....


References
Journal ArticleDOI
28 May 2015 - Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
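
As a concrete illustration of the backpropagation loop this abstract describes, the minimal numpy example below trains a one-hidden-layer network on XOR; each layer's gradient is its local derivative chained with the error signal propagated from the layer above. The architecture and hyperparameters are arbitrary choices, not from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets
    W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
    W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)             # forward pass, hidden layer
        out = sigmoid(h @ W2 + b2)           # forward pass, output layer
        d_out = (out - y) * out * (1 - out)  # error signal at the output...
        d_h = d_out @ W2.T * h * (1 - h)     # ...chained back through layer 2
        W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)  # gradient steps
        W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)
    print(out.round(2).ravel())              # approaches [0, 1, 1, 0]
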

46,982 citations

Journal ArticleDOI
08 Aug 2014 - Science
TL;DR: Inspired by the brain’s structure, an efficient, scalable, and flexible non–von Neumann architecture is developed that leverages contemporary silicon technology and is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification.
Abstract: Inspired by the brain’s structure, we have developed an efficient, scalable, and flexible non–von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.
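
The headline numbers are internally consistent: 4096 cores times 256 neurons per core gives the 1 million programmable neurons, and each core's 256-by-256 synaptic crossbar gives the 256 million configurable synapses. A quick check:

    cores, neurons_per_core = 4096, 256
    print(cores * neurons_per_core)                     # 1,048,576 neurons (~1 million)
    print(cores * neurons_per_core * neurons_per_core)  # 268,435,456 synapses (256 * 2**20)
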

3,253 citations


"Charge-Trap Transistors for CMOS-On..." refers background in this paper

  • ...synapses) in physical proximity to the processor, thereby making the computation local [10]–[13]....


Posted Content
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
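
The peak-throughput figure is easy to verify from the MAC count and the TPU's 700 MHz clock reported in the paper, with each MAC contributing a multiply and an add per cycle:

    macs, ops_per_mac_cycle, clock_hz = 65_536, 2, 700e6  # multiply + add per cycle
    peak = macs * ops_per_mac_cycle * clock_hz
    print(f"{peak / 1e12:.1f} TeraOps/s")                 # 91.8, quoted as 92 TOPS
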

3,067 citations

Proceedings ArticleDOI
24 Jun 2017
TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

2,679 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: In this paper, the authors propose an energy-efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. The previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9×, 19×, and 3× better throughput, energy efficiency, and area efficiency.
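
The core computation EIE accelerates looks roughly like the following: weights live in a compressed-sparse-column (CSC) structure whose entries are short codebook indices (weight sharing), and columns whose input activation is zero are skipped entirely. The CSC-plus-codebook layout follows the paper's description; the variable names and the tiny example matrix are ours.

    import numpy as np

    def eie_spmv(codebook, w_idx, row_idx, col_ptr, x, n_rows):
        """y = W @ x with W stored column-compressed; w_idx holds 4-bit
        codebook indices instead of real weight values (weight sharing)."""
        y = np.zeros(n_rows)
        for j, a in enumerate(x):
            if a == 0.0:                       # skip zero activations (ReLU sparsity)
                continue
            for k in range(col_ptr[j], col_ptr[j + 1]):
                y[row_idx[k]] += codebook[w_idx[k]] * a
        return y

    # Tiny example: a 3x4 matrix with four nonzeros and a 16-entry codebook.
    codebook = np.linspace(-1.0, 1.0, 16)  # the 16 shared weight values
    col_ptr = np.array([0, 1, 2, 2, 4])    # where each column starts in the arrays
    row_idx = np.array([0, 2, 1, 2])       # row of each stored nonzero
    w_idx = np.array([3, 9, 0, 15])        # 4-bit codebook index of each nonzero
    x = np.array([1.0, 0.0, 5.0, 2.0])     # x[1] = 0, so column 1 is skipped
    print(eie_spmv(codebook, w_idx, row_idx, col_ptr, x, n_rows=3))
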

2,445 citations