Journal ArticleDOI

Distributed In-Memory Computing on Binary RRAM Crossbar

Leibin Ni1, Hantao Huang1, Zichuan Liu1, Rajiv V. Joshi2, Hao Yu1 
17 Mar 2017-ACM Journal on Emerging Technologies in Computing Systems (ACM)-Vol. 13, Iss: 3, pp 36
TL;DR: Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area compared to the same design in a CMOS-based ASIC.
Abstract: The recently emerging resistive random-access memory (RRAM) provides not only nonvolatile storage but also intrinsic computing of matrix-vector multiplication, which is ideal for low-power, high-throughput data-analytics accelerators that operate in memory. However, existing RRAM crossbar-based computing mostly assumes multilevel analog operation, whose result is sensitive to process nonuniformity and which incurs additional overhead from analog-to-digital conversion and I/O. In this article, we explore a matrix-vector multiplication accelerator on a binary RRAM crossbar with adaptive 1-bit-comparator-based parallel conversion. Moreover, a distributed in-memory computing architecture is developed along with the corresponding control protocol. Both the memory array and the logic accelerator are implemented on the binary RRAM crossbar, and the logic-memory pairs can be distributed using the control bus protocol. Experimental results show that, compared to the analog RRAM crossbar, the proposed binary RRAM crossbar achieves significant area savings with better calculation accuracy. Moreover, significant speedup is achieved for matrix-vector multiplication in neural-network-based machine learning, reducing both the overall training and testing time. In addition, large energy savings are achieved compared to the traditional CMOS-based out-of-memory computing architecture.
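As a behavioral illustration of the scheme described above, the following Python sketch (my own model with illustrative conductance values, not the paper's circuit) maps binary weights onto low/high-resistance states, sums bitline currents, and digitizes each column with a single 1-bit comparator swept over a ramp of reference levels in place of a multi-bit ADC; the paper's adaptive thresholds are approximated here by a fixed ramp.

```python
import numpy as np

def binary_crossbar_mvm(W, x, g_on=1.0, g_off=1e-3, n_levels=None):
    """Behavioral sketch of matrix-vector multiplication on a binary
    RRAM crossbar (illustrative model, not the paper's exact design).

    W : (n_in, n_out) binary weights; 1 -> low-resistance state (LRS),
        0 -> high-resistance state (HRS).
    x : (n_in,) binary input vector applied on the wordlines.
    """
    G = np.where(W == 1, g_on, g_off)   # device conductances
    i_col = x @ G                       # bitline currents (Kirchhoff sum)

    # 1-bit-comparator-based parallel conversion: instead of a multi-bit
    # ADC per column, one comparator is reused against a ramp of
    # reference currents; the digital output is the number of
    # references the column current exceeds.
    if n_levels is None:
        n_levels = int(x.sum())         # popcount(x) bounds the true result
    refs = g_on * (np.arange(n_levels) + 0.5)
    return (i_col[:, None] > refs[None, :]).sum(axis=1)

# With g_off << g_on, sneak contributions stay below half a level and
# the exact binary dot products are recovered.
rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(64, 8))
x = rng.integers(0, 2, size=64)
assert np.array_equal(binary_crossbar_mvm(W, x), x @ W)
```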
Citations
Journal ArticleDOI
TL;DR: This paper addresses data movement via an in-memory-computing accelerator that employs charged-domain mixed-signal operation for enhancing compute SNR and, thus, scalability in large-scale matrix-vector multiplications.
Abstract: Large-scale matrix-vector multiplications, which dominate in deep neural networks (DNNs), are limited by data movement in modern VLSI technologies. This paper addresses data movement via an in-memory-computing accelerator that employs charged-domain mixed-signal operation for enhancing compute SNR and, thus, scalability. The architecture supports analog/binary input-activation (IA)/weight first layer (FL) and binary/binary IA/weight hidden layers (HLs), with batch normalization and input-output (IO) buffering circuitry to enable cascading, if desired, for realizing different DNN layers. The architecture is arranged as 8 × 8 = 64 in-memory-computing neuron tiles, supporting up to 512 3×3×512-input HL neurons and 64 3×3×3-input FL neurons, configurable via tile-level clock gating. In-memory computing is achieved using an 8T bit cell with an overlaying metal-oxide-metal (MOM) capacitor, yielding a structure with 1.8× the area of a standard 6T bit cell. Implemented in 65-nm CMOS, the design achieves HL/FL energy efficiency of 866/1.25 TOPS/W and throughput of 18,876/43.2 GOPS (1,498/3.43 GOPS/mm²) when implementing convolution layers, and 658/0.95 TOPS/W, 9,438/10.47 GOPS (749/0.83 GOPS/mm²) when implementing convolution followed by batch normalization layers. Several large-scale neural networks are demonstrated, showing performance on standard benchmarks (MNIST, CIFAR-10, and SVHN) equivalent to ideal digital computing.
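Functionally, the binary-IA/binary-weight hidden layers that such an accelerator implements reduce to XNOR-popcount neurons; the short model below (an assumed functional model for illustration, with `binary_neuron` and its threshold being my names, not the chip's circuitry) shows the arithmetic that the capacitor columns and folded batch normalization realize.

```python
import numpy as np

def binary_neuron(x_pm1, w_pm1, threshold=0.0):
    """Functional model of one binary-input/binary-weight neuron:
    the columns accumulate products of {-1, +1} values, and batch
    normalization folds into a per-neuron threshold (illustrative)."""
    pre_activation = np.dot(x_pm1, w_pm1)   # equivalent to XNOR-popcount
    return 1 if pre_activation >= threshold else -1

# XNOR-popcount identity: dot(x, w) = 2 * popcount(XNOR(x, w)) - n
x = np.array([1, -1, 1, 1])
w = np.array([1, 1, -1, 1])
matches = (x == w).sum()                    # popcount of XNOR
assert np.dot(x, w) == 2 * matches - len(x)
```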

183 citations

Journal ArticleDOI
01 Jan 2021
TL;DR: This article defines the main figures of merit (FoMs) of analog RSM hardware including the basic device characteristics, hardware algorithms, and the corresponding mapping methods for device arrays, as well as the architecture and circuit design considerations for neural networks.
Abstract: In this article, we review the existing analog resistive switching memory (RSM) devices and their hardware technologies for in-memory learning, as well as their challenges and prospects. Since the characteristics of the devices are different for in-memory learning and digital memory applications, it is important to have an in-depth understanding across different layers from devices and circuits to architectures and algorithms. First, based on a top-down view from architecture to devices for analog computing, we define the main figures of merit (FoMs) and perform a comprehensive analysis of analog RSM hardware including the basic device characteristics, hardware algorithms, and the corresponding mapping methods for device arrays, as well as the architecture and circuit design considerations for neural networks. Second, we classify the FoMs of analog RSM devices into two levels. Level 1 FoMs are essential for achieving the functionality of a system (e.g., linearity, symmetry, dynamic range, level numbers, fluctuation, variability, and yield). Level 2 FoMs are those that make a functional system more efficient and reliable (e.g., area, operational voltage, energy consumption, speed, endurance, retention, and compatibility with back-end-of-line processing). By constructing a device-to-application simulation framework, we perform an in-depth analysis of how these FoMs influence in-memory learning and give a target list of the device requirements. Lastly, we evaluate the main FoMs of most existing devices with analog characteristics and review optimization methods from programming schemes to materials and device structures. The key challenges and prospects from the device to system level for analog RSM devices are discussed.
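To make the Level-1 FoMs concrete, here is a sketch of how nonlinearity, level number, and write fluctuation might enter a device-to-application simulation through the conductance-update model; the exponential-saturation form and all parameter names are assumptions chosen for illustration, not the article's specific framework.

```python
import numpy as np

def update_conductance(g, pulse, nl=3.0, g_min=0.0, g_max=1.0,
                       n_levels=64, sigma=0.01, rng=None):
    """Apply one potentiation (pulse=+1) or depression (pulse=-1) pulse
    to an analog synapse (assumed phenomenological model). Step size
    shrinks exponentially toward the rails (nonlinearity `nl`,
    asymmetric by construction), with about `n_levels` usable states
    and cycle-to-cycle fluctuation `sigma`."""
    rng = rng or np.random.default_rng()
    step = (g_max - g_min) / n_levels
    span = g_max - g_min
    if pulse > 0:   # potentiation: large steps near g_min, small near g_max
        dg = step * np.exp(-nl * (g - g_min) / span)
    else:           # depression: mirrored, hence generally asymmetric
        dg = -step * np.exp(-nl * (g_max - g) / span)
    g = g + dg + rng.normal(0.0, sigma * step)  # write fluctuation
    return float(np.clip(g, g_min, g_max))
```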

110 citations


Cites background from "Distributed In-Memory Computing on ..."

  • ...In addition, the advantages of high energy efficiency and high speed of these memory devices have led to the development of many designs of neural networks [41]–[43] and accelerators [44] in recent years....


Journal ArticleDOI
TL;DR: This work explores and consolidates the various approaches that have been proposed to address the critical challenges faced by analog accelerators, for both neural network inference and training, and highlights the key design trade-offs underlying these techniques.
Abstract: Analog hardware accelerators, which perform computation within a dense memory array, have the potential to overcome the major bottlenecks faced by digital hardware for data-heavy workloads such as deep learning. Exploiting the intrinsic computational advantages of memory arrays, however, has proven to be challenging principally due to the overhead imposed by the peripheral circuitry and due to the non-ideal properties of memory devices that play the role of the synapse. We review the existing implementations of these accelerators for deep supervised learning, organizing our discussion around the different levels of the accelerator design hierarchy, with an emphasis on circuits and architecture. We explore and consolidate the various approaches that have been proposed to address the critical challenges faced by analog accelerators, for both neural network inference and training, and highlight the key design trade-offs underlying these techniques.

92 citations

Proceedings ArticleDOI
19 Mar 2018
TL;DR: Transistors with integrated ferroelectrics offer device-level characteristics that open unique opportunities at the circuit, architectural, and system level; they are considered here from device, circuit/architecture, and foundry-level perspectives.
Abstract: In this paper, we consider devices, circuits, and systems comprised of transistors with integrated ferroelectrics. Said structures are actively being considered by various semiconductor manufacturers, as they can address a large and unique design space. Transistors with integrated ferroelectrics (i) could enable a better switch (i.e., offer steeper subthreshold swings), (ii) are CMOS compatible, (iii) have multiple operating modes (i.e., their I-V characteristics can also enable compact, 1-transistor, nonvolatile storage elements, as well as analog synaptic behavior), and (iv) have been experimentally demonstrated (with respect to all of the aforementioned operating modes). These device-level characteristics offer unique opportunities at the circuit, architectural, and system level, and are considered here from device, circuit/architecture, and foundry-level perspectives.
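For item (i), "steeper subthreshold swings" means beating the room-temperature thermionic limit; the standard MOSFET relation (textbook material, added here only for context, not from the paper) is

```latex
\[
  \mathrm{SS} \;=\; \frac{\partial V_G}{\partial(\log_{10} I_D)}
  \;=\; \underbrace{\left(1 + \frac{C_{\mathrm{dep}}}{C_{\mathrm{ox}}}\right)}_{m \,\ge\, 1}
  \frac{kT}{q}\,\ln 10
  \;\ge\; \sim 60\ \mathrm{mV/dec} \quad (T = 300\,\mathrm{K}).
\]
% A ferroelectric in the gate stack can present an effective negative
% capacitance, pushing the body factor m below 1 and hence SS below
% 60 mV/dec -- the "better switch" of item (i).
```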

70 citations

Journal ArticleDOI
TL;DR: In this paper, hybrid memristor-CMOS approaches have been proposed to implement large-scale neural networks with learning capabilities, offering a scalable and lower-cost alternative to existing CMOS systems.
Abstract: Inspired by biology, neuromorphic systems have been trying to emulate the human brain for decades, taking advantage of its massive parallelism and sparse information coding. Recently, several large-scale hardware projects have demonstrated the outstanding capabilities of this paradigm for applications related to sensory information processing. These systems allow for the implementation of massive neural networks with millions of neurons and billions of synapses. However, the realization of learning strategies in these systems consumes an important proportion of resources in terms of area and power. The recent development of nanoscale memristors that can be integrated with Complementary Metal–Oxide–Semiconductor (CMOS) technology opens a very promising solution to emulate the behavior of biological synapses. Therefore, hybrid memristor-CMOS approaches have been proposed to implement large-scale neural networks with learning capabilities, offering a scalable and lower-cost alternative to existing CMOS systems.

67 citations


Cites background from "Distributed In-Memory Computing on ..."

  • ...Energy savings on the order of 10³–10⁵ with respect to baseline CPU implementations have been reported [153,155]....


References
Journal ArticleDOI
TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
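A minimal sketch of that greedy layer-wise procedure, assuming binary RBM building blocks trained with one-step contrastive divergence (CD-1); hyperparameters and function names are illustrative, and the wake-sleep fine-tuning stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(V, n_hidden, lr=0.1, epochs=5):
    """CD-1 training of a binary RBM on data V (n_samples, n_visible)."""
    n_vis = V.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        p_h = sigmoid(V @ W + b_h)                     # positive phase
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b_v)                   # one Gibbs step back
        p_h2 = sigmoid(p_v @ W + b_h)
        W += lr * (V.T @ p_h - p_v.T @ p_h2) / len(V)  # CD-1 gradient
        b_v += lr * (V - p_v).mean(axis=0)
        b_h += lr * (p_h - p_h2).mean(axis=0)
    return W, b_h

def greedy_pretrain(X, layer_sizes):
    """Train RBMs one layer at a time: each layer's hidden activations
    become the 'data' for the next layer, as in the greedy algorithm."""
    weights, data = [], X
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(data, n_hidden)
        weights.append((W, b_h))
        data = sigmoid(data @ W + b_h)
    return weights
```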

15,055 citations


"Distributed In-Memory Computing on ..." refers methods in this paper

  • ...Future cyber-physical systems require efficient real-time data analytics [Kouzes et al. 2009; Wolpert 1996; Hinton et al. 2006; Müller et al. 2008; Glorot and Bengio 2010], with applications in robotics, the brain–computer interface, and autonomous vehicles....


Journal ArticleDOI
TL;DR: A new learning algorithm called the extreme learning machine (ELM) is proposed for single-hidden-layer feedforward neural networks (SLFNs); it randomly chooses hidden nodes and analytically determines the output weights, and tends to provide good generalization performance at extremely fast learning speed.
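The algorithm is simple enough to sketch directly: the hidden layer is random and frozen, so training reduces to one linear least-squares solve for the output weights. Function names, the tanh activation, and sizes below are my choices for illustration, not the paper's exact formulation.

```python
import numpy as np

def elm_train(X, T, n_hidden=200, rng=None):
    """ELM training sketch: random input weights, analytic output
    weights via least squares. X: (n, d) inputs, T: (n, c) targets."""
    rng = rng or np.random.default_rng(0)
    W_in = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W_in + b)                      # random hidden features
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)   # beta = pinv(H) @ T
    return W_in, b, beta

def elm_predict(X, W_in, b, beta):
    return np.argmax(np.tanh(X @ W_in + b) @ beta, axis=1)
```

The same pseudoinverse step is the "direct L2-norm solver" route contrasted with iterative backpropagation in the passage quoted below.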

10,217 citations


"Distributed In-Memory Computing on ..." refers methods in this paper

  • ...After feature extraction, one can perform various machine learning algorithms [Suykens and Vandewalle 1999; LeCun et al. 2012; Huang et al. 2006] for data analytics....


  • ...One can solve (4) either by the iterative backward propagation method [Werbos 1990] or the direct L2-norm solver method for the least-squares problem [Huang et al. 2006]....


Journal ArticleDOI
TL;DR: This work considers the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise, and proposes a general classification algorithm for (image-based) object recognition based on a sparse representation computed by ℓ1-minimization.
Abstract: We consider the problem of automatically recognizing human faces from frontal views with varying expression and illumination, as well as occlusion and disguise. We cast the recognition problem as one of classifying among multiple linear regression models and argue that new theory from sparse signal representation offers the key to addressing this problem. Based on a sparse representation computed by ℓ1-minimization, we propose a general classification algorithm for (image-based) object recognition. This new framework provides new insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. For feature extraction, we show that if sparsity in the recognition problem is properly harnessed, the choice of features is no longer critical. What is critical, however, is whether the number of features is sufficiently large and whether the sparse representation is correctly computed. Unconventional features such as downsampled images and random projections perform just as well as conventional features such as eigenfaces and Laplacianfaces, as long as the dimension of the feature space surpasses a certain threshold, predicted by the theory of sparse representation. This framework can handle errors due to occlusion and corruption uniformly by exploiting the fact that these errors are often sparse with respect to the standard (pixel) basis. The theory of sparse representation helps predict how much occlusion the recognition algorithm can handle and how to choose the training images to maximize robustness to occlusion. We conduct extensive experiments on publicly available databases to verify the efficacy of the proposed algorithm and corroborate the above claims.
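A compact sketch of the pipeline: code the probe over the dictionary of training samples with an ℓ1 penalty, then assign the class whose samples best reconstruct it. The ISTA solver here is a simple stand-in for the paper's ℓ1-minimization step, not the authors' actual solver, and the regularization weight is illustrative.

```python
import numpy as np

def ista_l1(A, y, lam=0.01, n_iter=500):
    """Solve min_x 0.5*||Ax - y||_2^2 + lam*||x||_1 by iterative
    soft-thresholding (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L   # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrink
    return x

def src_classify(A, labels, y):
    """Sparse-representation classification: columns of A are training
    samples; classify y by the smallest class-wise residual."""
    x = ista_l1(A, y)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - A[:, labels == c] @ x[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```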

9,658 citations


"Distributed In-Memory Computing on ..." refers methods in this paper

  • ...For example, the feature extraction can be achieved by multiplying by a Bernoulli matrix, as in [10]....


Proceedings Article
31 Mar 2010
TL;DR: The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Abstract: Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

1 Deep Neural Networks. Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g., in vision, language, and other AI-level tasks), one may need deep architectures. Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pretraining (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a "better" basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results.
So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pretraining is a particular form of initialization and it has a drastic impact).
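The proposed "normalized" (Xavier/Glorot) initialization is concrete enough to state directly: for a layer with fan-in n_j and fan-out n_{j+1}, weights are drawn uniformly from ±sqrt(6/(n_j + n_{j+1})), keeping activation and gradient variances roughly constant across layers. A one-function sketch:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Normalized initialization of Glorot & Bengio (2010):
    W ~ U[-a, a] with a = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))
```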

9,500 citations


"Distributed In-Memory Computing on ..." refers methods in this paper

  • ...Future cyber-physical systems require efficient real-time data analytics [Kouzes et al. 2009; Wolpert 1996; Hinton et al. 2006; Müller et al. 2008; Glorot and Bengio 2010], with applications in robotics, the brain–computer interface, and autonomous vehicles....


Journal ArticleDOI
01 May 2008-Nature
TL;DR: It is shown, using a simple analytical example, that memristance arises naturally in nanoscale systems in which solid-state electronic and ionic transport are coupled under an external bias voltage.
Abstract: Anyone who ever took an electronics laboratory class will be familiar with the fundamental passive circuit elements: the resistor, the capacitor and the inductor. However, in 1971 Leon Chua reasoned from symmetry arguments that there should be a fourth fundamental element, which he called a memristor (short for memory resistor). Although he showed that such an element has many interesting and valuable circuit properties, until now no one has presented either a useful physical model or an example of a memristor. Here we show, using a simple analytical example, that memristance arises naturally in nanoscale systems in which solid-state electronic and ionic transport are coupled under an external bias voltage. These results serve as the foundation for understanding a wide range of hysteretic current-voltage behaviour observed in many nanoscale electronic devices that involve the motion of charged atomic or molecular species, in particular certain titanium dioxide cross-point switches.
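The "simple analytical example" is the coupled ionic-drift model; restating it here for reference (with D the film thickness, w(t) the width of the doped region, and μv the ion mobility):

```latex
\begin{align}
  v(t) &= \left(R_{\mathrm{ON}}\frac{w(t)}{D}
          + R_{\mathrm{OFF}}\Bigl(1-\frac{w(t)}{D}\Bigr)\right) i(t), \\
  \frac{dw}{dt} &= \mu_v \frac{R_{\mathrm{ON}}}{D}\, i(t)
  \;\Longrightarrow\; w(t) = \mu_v \frac{R_{\mathrm{ON}}}{D}\, q(t).
\end{align}
% Substituting w(t) back (with R_ON << R_OFF) gives a resistance set by
% the total charge that has flowed through the device -- the memristance:
\[
  M(q) \;=\; R_{\mathrm{OFF}}\left(1 - \frac{\mu_v R_{\mathrm{ON}}}{D^{2}}\, q(t)\right).
\]
```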

8,971 citations


"Distributed In-Memory Computing on ..." refers background in this paper

  • ...Emerging resistive random-access memory (RRAM) [Akinaga and Shima 2010; Kim et al. 2011; Chua 1971; Williams 2008; Strukov et al. 2008; Shang et al. 2012; Fei et al. 2012; Wong et al. 2012] has shown great potential in being the solution for data-intensive applications....
